In this project, I developed a machine learning model to predict the risk of heart disease in patients based on various health-related variables.
In this article, I will share snippets of the Jupyter notebook, highlighting the key steps taken.
The complete Python Jupyter notebook is provided below.
Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading the Dataset
dataset = pd.read_csv('MGH_PredictionDataSet.csv')
The first step is to look at your dataset to familiarize yourself with it.
dataset.head() – This displays the first 5 rows of your dataset.
| | sex | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.10 | 85.0 | 85.0 | |
dataset.info() – This gives you a summary of your dataset: the number of rows, the number of columns, the datatype of each column, and so on. From this summary you can tell how many null values are in your dataset.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   sex              4240 non-null   int64
 1   age              4240 non-null   int64
 2   education        4135 non-null   float64
 3   currentSmoker    4240 non-null   int64
 4   cigsPerDay       4211 non-null   float64
 5   BPMeds           4187 non-null   float64
 6   prevalentStroke  4240 non-null   int64
 7   prevalentHyp     4240 non-null   int64
 8   diabetes         4240 non-null   int64
 9   totChol          4190 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4221 non-null   float64
 13  heartRate        4239 non-null   float64
 14  glucose          3852 non-null   float64
 15  TenYearCHD       4240 non-null   int64
dtypes: float64(9), int64(7)
memory usage: 530.1 KB
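To pinpoint exactly which columns the nulls are in, a per-column count reads more easily than the info() summary; a minimal sketch using the dataset loaded above:

# Count of missing values in each column
dataset.isnull().sum()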
Data Cleaning
Step 1 – Replacing nulls – After identifying the nulls, the next step is to replace them. In this case, nulls in the categorical columns were replaced with the mode, and nulls in the continuous columns were replaced with the mean.
Categorical Data – Replace nulls with Mode
dataset['education'].fillna(dataset['education'].mode()[0], inplace = True)
dataset['BPMeds'].fillna(dataset['BPMeds'].mode()[0], inplace = True)
Continuous Data – Replace nulls with Mean
dataset['cigsPerDay'].fillna(dataset['cigsPerDay'].mean(), inplace = True)
dataset['totChol'].fillna(dataset['totChol'].mean(), inplace = True)
dataset['BMI'].fillna(dataset['BMI'].mean(), inplace = True)
dataset['heartRate'].fillna(dataset['heartRate'].mean(), inplace = True)
dataset['glucose'].fillna(dataset['glucose'].mean(), inplace = True)
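After imputing, it is worth confirming that no nulls remain. A quick check (my addition, not part of the original notebook):

# Total missing values across the whole DataFrame – should now be 0
print(dataset.isnull().sum().sum())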
Step 2 – Removing duplicates – Any duplicate rows were identified and removed from the dataset.
dataset = dataset.drop_duplicates()
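If you want to know how many duplicate rows are actually being dropped, you can count them before calling drop_duplicates(); a small sketch of that check (my addition):

# Number of fully duplicated rows in the dataset
print(dataset.duplicated().sum())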
EXPLORATORY DATA ANALYSIS (EDA)
This section helps us understand the structure of the dataset, identify patterns, anomalies and outliers, develop hypotheses, check assumptions, and summarize the data using graphs.
Represent the correlation matrix as a heatmap
heatmap_data = dataset.corr()
fig, ax = plt.subplots(figsize = (15, 15))
sns.heatmap(heatmap_data, annot = True, cmap = 'BuPu')
Histogram Using Seaborn
sns.histplot(dataset['age'], bins = 15)
plt.title('Age in Years')
plt.ylabel('Frequency')
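Alongside the distributions, it is worth plotting the balance of the target variable during EDA, because TenYearCHD-positive patients are a small minority in this dataset, and that skews what an accuracy score means later. A sketch (my addition):

# Bar chart of the two target classes; positives (CHD within ten years) are the minority
sns.countplot(x = 'TenYearCHD', data = dataset)
plt.title('TenYearCHD Class Balance')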

Feature Selection
From the 16 columns, the first 15 – sex, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate and glucose – are selected as features.
These 15 features are used in our model to predict the last column – TenYearCHD.
features = ['sex', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
X = dataset[features]
y = dataset['TenYearCHD']
Splitting the Dataset
80% of our dataset is used to train the model and 20% is used to test the model for accuracy.
from sklearn.model_selection import train_test_split
# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
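One optional refinement: since the positive class is a minority, passing stratify = y keeps the class proportions identical in the train and test sets. A variant of the same split (my addition; the original notebook used the unstratified call above):

# Same 80/20 split, but preserving the TenYearCHD class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)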
MODEL SELECTION AND TRAINING
1. Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree model
tree_model = DecisionTreeClassifier()

# Fitting the model with the training data
tree_model.fit(X_train, y_train)

accuracy_score = tree_model.score(X_test, y_test)
accuracy_score
0.7547169811320755
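An unconstrained decision tree tends to memorize the training data, which likely explains why it scores lowest of the five models. Capping the tree depth is a common remedy; a sketch (my addition – max_depth = 5 is an arbitrary illustration, not a tuned value):

# A shallower tree is less prone to overfitting on a small tabular dataset
pruned_tree = DecisionTreeClassifier(max_depth = 5, random_state = 42)
pruned_tree.fit(X_train, y_train)
print(pruned_tree.score(X_test, y_test))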
2. Logistic Regression
# Suppress/ignore warnings
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
# Fitting the model with the training data
logistic_model.fit(X_train,y_train)
accuracy_score = logistic_model.score(X_test, y_test)
accuracy_score
0.8573113207547169
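A note on the warnings being suppressed above: with unscaled features, LogisticRegression's default lbfgs solver often raises a ConvergenceWarning because it hits its default 100-iteration limit. Instead of silencing it, you can give the solver more iterations; a sketch (my addition):

# Raise the iteration cap so the solver can actually converge
logistic_model = LogisticRegression(max_iter = 1000)
logistic_model.fit(X_train, y_train)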
3. KNN (K-Nearest Neighbors)
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors = 5)

# Fitting the model with the training data
knn_model.fit(X_train, y_train)

accuracy_score = knn_model.score(X_test, y_test)
accuracy_score
0.8419811320754716
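KNN measures raw Euclidean distances, so features on large scales (totChol, sysBP) dominate features on small scales (BMI, BPMeds) unless the inputs are standardized. A pipeline sketch (my addition; its score will differ from the one above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize every feature before the distance computation
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 5))
scaled_knn.fit(X_train, y_train)
print(scaled_knn.score(X_test, y_test))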
4. Support Vector Machine (SVM)
from sklearn.svm import SVC

svm_model = SVC(kernel = 'linear')

# Fitting the model with the training data
svm_model.fit(X_train, y_train)

accuracy_score = svm_model.score(X_test, y_test)
accuracy_score
0.8549528301886793
5. Random Forest
from sklearn.ensemble import RandomForestClassifier
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Fitting the model with the training data
forest_model.fit(X_train,y_train)
accuracy_score = forest_model.score(X_test, y_test)
accuracy_score
0.8549528301886793
From the 5 models tested in this project, Logistic Regression is selected as the model to use because it gives us the highest accuracy score of 86%.
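Given the class imbalance seen in the EDA, accuracy alone can flatter a classifier that mostly predicts the majority (no-CHD) class, so the chosen model deserves a per-class look as well. A closing sketch (my addition):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall and the confusion matrix for the selected model
y_pred = logistic_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))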