Predicting Coronary Heart Disease

This tutorial discusses how to build a classification model and how to evaluate it.

Topics covered in this tutorial

  • Basic exploration of data before building models
  • Encoding categorical features
  • Splitting datasets into train and test datasets
  • Build a Logistic Regression Model
  • How the logit function, log odds, odds and probabilities are related
  • How to find probabilities from the logistic model
  • Find overall accuracy of the model
  • Understand Confusion matrix
  • Understand TPR, FPR, Precision, Recall, Sensitivity & Specificity
  • Understand ROC and how it is used
  • Find optimal Cutoff probability

Here is an interesting problem: understanding which factors contribute to CHD, and whether CHD can be predicted by building an analytical model.

The next two sections introduce some basics of CHD, where the dataset comes from, and which attributes are available in it.

What is coronary heart disease?

Coronary heart disease (CHD) occurs when the coronary arteries (the arteries that supply the heart muscle with oxygen-rich blood) become narrowed by a gradual build-up of plaque within their walls. Plaque is made up of cholesterol and other fatty substances. Narrowed arteries can cause symptoms such as chest pain (angina), shortness of breath, and fatigue.

Dataset Description

Data is available at: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ The header information is available at: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.info.txt

The data is a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.

Import and load the dataset

In [1]:
import pandas as pd
import numpy as np
In [2]:
saheart_ds = pd.read_csv( "SAheart.data" )
In [3]:
saheart_ds.head()
Out[3]:
row.names sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
0 1 160 12.00 5.73 23.11 Present 49 25.30 97.20 52 1
1 2 144 0.01 4.41 28.61 Absent 55 28.87 2.06 63 1
2 3 118 0.08 3.48 32.28 Present 52 29.14 3.81 46 0
3 4 170 7.50 6.41 38.03 Present 51 31.99 24.26 58 1
4 5 134 13.60 3.50 27.78 Present 60 25.99 57.34 49 1
In [4]:
saheart_ds.columns
Out[4]:
Index(['row.names', 'sbp', 'tobacco', 'ldl', 'adiposity', 'famhist', 'typea',
     'obesity', 'alcohol', 'age', 'chd'],
    dtype='object')
In [5]:
saheart_ds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 11 columns):
row.names    462 non-null int64
sbp          462 non-null int64
tobacco      462 non-null float64
ldl          462 non-null float64
adiposity    462 non-null float64
famhist      462 non-null object
typea        462 non-null int64
obesity      462 non-null float64
alcohol      462 non-null float64
age          462 non-null int64
chd          462 non-null int64
dtypes: float64(5), int64(5), object(1)
memory usage: 39.8+ KB

The class label in the column chd indicates whether the person has coronary heart disease: negative (0) or positive (1).

Attributes description:

  • sbp: systolic blood pressure
  • tobacco: cumulative tobacco (kg)
  • ldl: low density lipoprotein cholesterol
  • adiposity: the size of the hips compared to the person's height
  • famhist: family history of heart disease (Present, Absent)
  • typea: type-A behavior
  • obesity: body mass index (BMI)
  • alcohol: current alcohol consumption
  • age: age at onset
In [6]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

import missingno

missingno.matrix( saheart_ds )

There is no missing information. This is good news, as we do not have to impute any data.
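The visual check above can also be done numerically. The following is a minimal sketch, assuming pandas only; `missing_counts` is a hypothetical helper name, and the small `sample` frame is just a stand-in built from a few columns of the rows printed by `saheart_ds.head()`. In the notebook, the helper would be called on `saheart_ds` itself.

```python
import pandas as pd

# Illustrative stand-in: a few columns from the five rows shown by saheart_ds.head().
# In the notebook, call missing_counts(saheart_ds) on the full dataset instead.
sample = pd.DataFrame({
    "sbp": [160, 144, 118, 170, 134],
    "tobacco": [12.00, 0.01, 0.08, 7.50, 13.60],
    "famhist": ["Present", "Absent", "Present", "Present", "Present"],
    "chd": [1, 1, 0, 1, 1],
})

def missing_counts(df):
    """Return the number of missing values in each column (hypothetical helper)."""
    return df.isnull().sum()

counts = missing_counts(sample)
print(counts)
print("Any missing values:", bool(counts.any()))
```

A count of zero for every column agrees with the non-null counts reported by `saheart_ds.info()`, where all 11 columns show 462 non-null entries.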

Exploratory Data Analysis

Number of observations available for people with CHD and without CHD

In [7]:
saheart_ds.chd.value_counts()
Out[7]:
0    302
1    160
Name: chd, dtype: int64
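The raw counts can also be expressed as class shares. The sketch below rebuilds the chd column from the counts printed above (an assumption made only so the example is self-contained) and uses `value_counts(normalize=True)`; in the notebook, the same calls would be applied to `saheart_ds.chd`.

```python
import pandas as pd

# Rebuild the chd column from the counts printed above (302 zeros, 160 ones)
chd = pd.Series([0] * 302 + [1] * 160, name="chd")

# Share of each class rather than raw counts
class_share = chd.value_counts(normalize=True)
print(class_share.round(3))

# Number of controls (chd == 0) per case (chd == 1)
controls_per_case = (chd == 0).sum() / (chd == 1).sum()
print(round(controls_per_case, 2))
```

About 65% of the observations are negative, which matches the "roughly two controls per case" in the dataset description. It also means that a model which always predicts 0 would already reach about 65% overall accuracy, a useful baseline to keep in mind when evaluating a classifier.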
In [8]:
chd_df = pd.DataFrame( saheart_ds.chd.value_counts() )
In [9]:
chd_df
Out[9]:
chd
0 302
1 160
In [10]:
chd_df['has_chd'] = chd_df.index
chd_df
Out[10]:
chd has_chd
0 302 0
1 160 1
In [11]:
sn.barplot(  x = 'has_chd', y = 'chd', data = chd_df  )
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1115a8a90>
In [12]:
famhist_chd = pd.crosstab( saheart_ds.famhist, saheart_ds.chd )
famhist_chd
Out[12]:
chd 0 1
famhist
Absent 206 64
Present 96 96
In [13]:
famhist_chd = famhist_chd.unstack().reset_index()
famhist_chd
Out[13]:
chd famhist 0
0 0 Absent 206
1 0 Present 96
2 1 Absent 64
3 1 Present 96
In [14]:
famhist_chd.columns = ['chd', 'famhist', 'total']
In [15]:
sn.barplot( x = 'famhist',
          y = 'total',
          hue = 'chd',
          data = famhist_chd )
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1145d8e10>

Note:

It can be observed that the chance of CHD is higher for people with a family history of heart disease than for people with no family history.
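The observation above can be checked with row-wise proportions. This is a minimal sketch that rebuilds the two columns from the crosstab counts shown earlier (done only to keep the example self-contained) and uses the `normalize="index"` option of `pd.crosstab`; in the notebook, `pd.crosstab( saheart_ds.famhist, saheart_ds.chd, normalize = "index" )` would be the equivalent call.

```python
import pandas as pd

# Cell counts taken from the famhist x chd crosstab printed above
cell_counts = {
    ("Absent", 0): 206, ("Absent", 1): 64,
    ("Present", 0): 96, ("Present", 1): 96,
}

# Expand the counts back into one row per person
rows = [(famhist, chd)
        for (famhist, chd), n in cell_counts.items()
        for _ in range(n)]
people = pd.DataFrame(rows, columns=["famhist", "chd"])

# normalize="index" divides each row by its row total,
# giving the share of chd = 0 / chd = 1 within each famhist group
rates = pd.crosstab(people.famhist, people.chd, normalize="index")
print(rates.round(3))
```

Within the Absent group, 64 of 270 people (about 24%) have CHD, while within the Present group 96 of 192 (50%) do.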

How are all the variables interrelated?

We can draw a pair plot and understand the relationship between variables.

In [16]:
saheart_ds_sub = saheart_ds[['sbp', 'tobacco', 'ldl'
                       , 'adiposity', 'typea', 'obesity'
                       , 'alcohol', 'age', 'chd']]
sn.pairplot( saheart_ds_sub
         , hue = "chd"
         , palette="husl")
Out[16]:
<seaborn.axisgrid.PairGrid at 0x1147742b0>
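A pair plot can be complemented with a table of pairwise correlations. The sketch below uses `DataFrame.corr()`, which computes Pearson correlations by default; `correlation_table` is a hypothetical helper name, and the `sample` frame holds only the five rows printed by `saheart_ds.head()`, so the numbers it produces are not meaningful. In the notebook, the helper would be called on `saheart_ds_sub`.

```python
import pandas as pd

# Illustrative stand-in: numeric columns from the five rows shown by saheart_ds.head().
# Five rows are far too few for meaningful correlations; use saheart_ds_sub in practice.
sample = pd.DataFrame({
    "sbp": [160, 144, 118, 170, 134],
    "tobacco": [12.00, 0.01, 0.08, 7.50, 13.60],
    "ldl": [5.73, 4.41, 3.48, 6.41, 3.50],
    "adiposity": [23.11, 28.61, 32.28, 38.03, 27.78],
    "age": [52, 63, 46, 58, 49],
    "chd": [1, 1, 0, 1, 1],
})

def correlation_table(df):
    """Return the pairwise Pearson correlation matrix (hypothetical helper)."""
    return df.corr()

corr = correlation_table(sample)
print(corr.round(2))
```

The result is a symmetric table with ones on the diagonal. Pairs of features with a correlation close to 1 or -1 carry largely overlapping information, which is worth knowing before fitting a regression model.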