In this project, I developed a machine learning model to predict the risk of heart disease in patients based on various health-related variables.
In this article, I will share snippets of the Jupyter notebook, highlighting the key steps taken.
The complete Python Jupyter notebook is provided below.
Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading the Dataset
dataset = pd.read_csv('MGH_PredictionDataSet.csv')
The first step is to look at your dataset to familiarize yourself with it.
dataset.head() – This displays the first 5 rows of your dataset.
| | sex | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.10 | 85.0 | 85.0 | |
dataset.info() – This gives you a summary of your dataset: the number of rows, the number of columns, the datatype of each column, and so on. From this summary you can tell how many null values are in your dataset.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   sex              4240 non-null   int64
 1   age              4240 non-null   int64
 2   education        4135 non-null   float64
 3   currentSmoker    4240 non-null   int64
 4   cigsPerDay       4211 non-null   float64
 5   BPMeds           4187 non-null   float64
 6   prevalentStroke  4240 non-null   int64
 7   prevalentHyp     4240 non-null   int64
 8   diabetes         4240 non-null   int64
 9   totChol          4190 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4221 non-null   float64
 13  heartRate        4239 non-null   float64
 14  glucose          3852 non-null   float64
 15  TenYearCHD       4240 non-null   int64
dtypes: float64(9), int64(7)
memory usage: 530.1 KB
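To pinpoint exactly which columns the nulls are in, a per-column count reads more easily than the info() summary; a minimal sketch using the dataset loaded above:

# Count of missing values in each column
dataset.isnull().sum()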
Data Cleaning
Step 1 – Replacing nulls – After identifying the nulls, the next step is to replace them. In this case, nulls in the categorical columns were replaced with the mode, and nulls in the continuous columns were replaced with the mean.
Categorical Data – Replace nulls with Mode
dataset['education'].fillna(dataset['education'].mode()[0], inplace = True)
dataset['BPMeds'].fillna(dataset['BPMeds'].mode()[0], inplace = True)
Continuous Data – Replace nulls with Mean
dataset['cigsPerDay'].fillna(dataset['cigsPerDay'].mean(), inplace = True)
dataset['totChol'].fillna(dataset['totChol'].mean(), inplace = True)
dataset['BMI'].fillna(dataset['BMI'].mean(), inplace = True)
dataset['heartRate'].fillna(dataset['heartRate'].mean(), inplace = True)
dataset['glucose'].fillna(dataset['glucose'].mean(), inplace = True)
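After imputing, it is worth confirming that no nulls remain. A quick check (my addition, not part of the original notebook):

# Total missing values across the whole DataFrame – should now be 0
print(dataset.isnull().sum().sum())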
Step 2 – Removing duplicates – Any duplicate rows were identified and removed from the dataset.
dataset = dataset.drop_duplicates()
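If you want to know how many duplicate rows are actually being dropped, you can count them before calling drop_duplicates(); a small sketch of that check (my addition):

# Number of fully duplicated rows in the dataset
print(dataset.duplicated().sum())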
EXPLORATORY DATA ANALYSIS (EDA)
This section helps us understand the structure of the dataset, identify patterns, anomalies and outliers, develop hypotheses, check assumptions, and summarize the data using graphs.
Represent the correlation matrix as a heatmap
heatmap_data = dataset.corr()
fig, ax = plt.subplots(figsize = (15, 15))
sns.heatmap(heatmap_data, annot = True, cmap = 'BuPu')
Histogram Using Seaborn
sns.histplot(dataset['age'], bins = 15)
plt.title('Age in Years')
plt.ylabel('Frequency')
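Alongside the distributions, it is worth plotting the balance of the target variable during EDA, because TenYearCHD-positive patients are a small minority in this dataset, and that skews what an accuracy score means later. A sketch (my addition):

# Bar chart of the two target classes; positives (CHD within ten years) are the minority
sns.countplot(x = 'TenYearCHD', data = dataset)
plt.title('TenYearCHD Class Balance')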

Feature Selection
From the 16 columns, the first 15 – sex, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate and glucose – are selected as features.
These 15 features are used in our model to predict the last column – TenYearCHD.
features = ['sex', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
X = dataset[features]
y = dataset['TenYearCHD']
Splitting the Dataset
80% of our dataset is used to train the model and 20% is used to test the model for accuracy.
from sklearn.model_selection import train_test_split
# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
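One optional refinement: since the positive class is a minority, passing stratify = y keeps the class proportions identical in the train and test sets. A variant of the same split (my addition; the original notebook used the unstratified call above):

# Same 80/20 split, but preserving the TenYearCHD class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)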
MODEL SELECTION AND TRAINING
1. Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree model
tree_model = DecisionTreeClassifier()

# Fitting the model with the training data
tree_model.fit(X_train, y_train)

accuracy_score = tree_model.score(X_test, y_test)
accuracy_score
0.7547169811320755
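An unconstrained decision tree tends to memorize the training data, which likely explains why it scores lowest of the five models. Capping the tree depth is a common remedy; a sketch (my addition – max_depth = 5 is an arbitrary illustration, not a tuned value):

# A shallower tree is less prone to overfitting on a small tabular dataset
pruned_tree = DecisionTreeClassifier(max_depth = 5, random_state = 42)
pruned_tree.fit(X_train, y_train)
print(pruned_tree.score(X_test, y_test))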
2. Logistic Regression
# Suppress/ignore warnings
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
# Fitting the model with the training data
logistic_model.fit(X_train,y_train)
accuracy_score = logistic_model.score(X_test, y_test)
accuracy_score
0.8573113207547169
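A note on the warnings being suppressed above: with unscaled features, LogisticRegression's default lbfgs solver often raises a ConvergenceWarning because it hits its default 100-iteration limit. Instead of silencing it, you can give the solver more iterations; a sketch (my addition):

# Raise the iteration cap so the solver can actually converge
logistic_model = LogisticRegression(max_iter = 1000)
logistic_model.fit(X_train, y_train)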
3. KNN (K-Nearest Neighbors)
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors = 5)

# Fitting the model with the training data
knn_model.fit(X_train, y_train)

accuracy_score = knn_model.score(X_test, y_test)
accuracy_score
0.8419811320754716
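KNN measures raw Euclidean distances, so features on large scales (totChol, sysBP) dominate features on small scales (BMI, BPMeds) unless the inputs are standardized. A pipeline sketch (my addition; its score will differ from the one above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize every feature before the distance computation
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 5))
scaled_knn.fit(X_train, y_train)
print(scaled_knn.score(X_test, y_test))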
4. Support Vector Machine (SVM)
from sklearn.svm import SVC

svm_model = SVC(kernel = 'linear')

# Fitting the model with the training data
svm_model.fit(X_train, y_train)

accuracy_score = svm_model.score(X_test, y_test)
accuracy_score
0.8549528301886793
5. Random Forest
from sklearn.ensemble import RandomForestClassifier
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Fitting the model with the training data
forest_model.fit(X_train,y_train)
accuracy_score = forest_model.score(X_test, y_test)
accuracy_score
0.8549528301886793
From the 5 models tested in this project, Logistic Regression is selected as the model to use because it gives us the highest accuracy score of 86%.
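Given the class imbalance seen in the EDA, accuracy alone can flatter a classifier that mostly predicts the majority (no-CHD) class, so the chosen model deserves a per-class look as well. A closing sketch (my addition):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall and the confusion matrix for the selected model
y_pred = logistic_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))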