Introduction

German credit rating dataset is provided by Prof. Hofmann, contains categorical/symbolic attributes of the persons who availed the credit and the current status of the credit (at the time when the dataset is prepared). The status of the credit is indicated by 1 for good credits and 2 for bad credits.

The dataset used in this exercise is downloaded from UCI Repository.

https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

The detailed description of variables can be found at the same link.

Objective

  • Load the dataset
  • Create dummy variables for the categorical features
  • Specify X and y features
  • Split datasets into train and test datasets
  • Build a logistic regression model
  • Find significant variables
  • Create confusion matrix to find out Sensitivity and specificity
  • Draw ROC Plot
  • Find optimal cutoff probability using yoden's index
  • Find optimal cutoff probability using cost

Loading the dataset

In [1]:
import pandas as pd
import numpy as np
In [2]:
credit_df = pd.read_csv( "german.data.txt", delim_whitespace = True, header = None )
In [3]:
credit_df.head()
Out[3]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

In [4]:
columns = ['checkin_acc', 'duration', 'credit_history', 'purpose', 'amount',
           'svaing_acc', 'present_emp_since', 'inst_rate', 'personal_status',
           'other_debtors', 'residing_since', 'property', 'age',
           'inst_plans', 'housing', 'num_credits',
           'job', 'dependents', 'telephone', 'foreign_worker', 'status']
In [5]:
credit_df.columns = columns
In [6]:
credit_df.head()
Out[6]:
checkin_acc duration credit_history purpose amount svaing_acc present_emp_since inst_rate personal_status other_debtors ... property age inst_plans housing num_credits job dependents telephone foreign_worker status
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

In [7]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
checkin_acc          1000 non-null object
duration             1000 non-null int64
credit_history       1000 non-null object
purpose              1000 non-null object
amount               1000 non-null int64
svaing_acc           1000 non-null object
present_emp_since    1000 non-null object
inst_rate            1000 non-null int64
personal_status      1000 non-null object
other_debtors        1000 non-null object
residing_since       1000 non-null int64
property             1000 non-null object
age                  1000 non-null int64
inst_plans           1000 non-null object
housing              1000 non-null object
num_credits          1000 non-null int64
job                  1000 non-null object
dependents           1000 non-null int64
telephone            1000 non-null object
foreign_worker       1000 non-null object
status               1000 non-null int64
dtypes: int64(8), object(13)
memory usage: 164.1+ KB

How many default and no default observations in the sample

In [8]:
credit_df.status.value_counts()
Out[8]:
1    700
2    300
Name: status, dtype: int64

There are about 300 defaults and 700 non-default observations

Creating Dummy Features

In [9]:
credit_df.columns
Out[9]:
Index(['checkin_acc', 'duration', 'credit_history', 'purpose', 'amount',
       'svaing_acc', 'present_emp_since', 'inst_rate', 'personal_status',
       'other_debtors', 'residing_since', 'property', 'age', 'inst_plans',
       'housing', 'num_credits', 'job', 'dependents', 'telephone',
       'foreign_worker', 'status'],
      dtype='object')
In [10]:
list( credit_df.columns )
Out[10]:
['checkin_acc',
 'duration',
 'credit_history',
 'purpose',
 'amount',
 'svaing_acc',
 'present_emp_since',
 'inst_rate',
 'personal_status',
 'other_debtors',
 'residing_since',
 'property',
 'age',
 'inst_plans',
 'housing',
 'num_credits',
 'job',
 'dependents',
 'telephone',
 'foreign_worker',
 'status']

Remove the response variable from the dataset

In [11]:
X_features = list( credit_df.columns )
X_features.remove( 'status' )
X_features
Out[11]:
['checkin_acc',
 'duration',
 'credit_history',
 'purpose',
 'amount',
 'svaing_acc',
 'present_emp_since',
 'inst_rate',
 'personal_status',
 'other_debtors',
 'residing_since',
 'property',
 'age',
 'inst_plans',
 'housing',
 'num_credits',
 'job',
 'dependents',
 'telephone',
 'foreign_worker']
In [12]:
credit_df_complete = pd.get_dummies( credit_df[X_features], drop_first = True )
In [13]:
len( credit_df_complete.columns )
Out[13]:
48

Change the reponse variable to 0 = No Default and 1 = Default

In [14]:
Y = credit_df.status - 1
X = credit_df_complete

Splitting Datasets into Train and Test Sets

In [15]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 42 )

Building a Logistic Regression Model

In [16]:
import statsmodels.api as sm
In [17]:
logit = sm.Logit( y_train, sm.add_constant( X_train ) )
In [18]:
lg = logit.fit()
Optimization terminated successfully.
         Current function value: 0.452608
         Iterations 7
In [19]:
lg.summary()
Out[19]:
Logit Regression Results
Dep. Variable: status No. Observations: 700
Model: Logit Df Residuals: 651
Method: MLE Df Model: 48
Date: Wed, 26 Jul 2017 Pseudo R-squ.: 0.2576
Time: 06:00:40 Log-Likelihood: -316.83
converged: True LL-Null: -426.75
LLR p-value: 7.805e-24
coef std err z P>|z| [95.0% Conf. Int.]
const 0.4672 1.289 0.362 0.717 -2.060 2.994
duration 0.0198 0.011 1.793 0.073 -0.002 0.042
amount 0.0001 5.04e-05 2.510 0.012 2.77e-05 0.000
inst_rate 0.3471 0.106 3.286 0.001 0.140 0.554
residing_since 0.0218 0.103 0.211 0.833 -0.181 0.224
age -0.0293 0.012 -2.520 0.012 -0.052 -0.007
num_credits 0.3377 0.228 1.483 0.138 -0.108 0.784
dependents 0.2143 0.321 0.668 0.504 -0.415 0.843
checkin_acc_A12 -0.2825 0.264 -1.070 0.285 -0.800 0.235
checkin_acc_A13 -0.9467 0.461 -2.055 0.040 -1.850 -0.044
checkin_acc_A14 -1.5046 0.277 -5.426 0.000 -2.048 -0.961
credit_history_A31 0.0720 0.635 0.113 0.910 -1.172 1.316
credit_history_A32 -0.4420 0.481 -0.919 0.358 -1.385 0.501
credit_history_A33 -1.0992 0.539 -2.039 0.041 -2.156 -0.042
credit_history_A34 -1.4632 0.492 -2.972 0.003 -2.428 -0.498
purpose_A41 -1.6565 0.436 -3.799 0.000 -2.511 -0.802
purpose_A410 -0.9521 0.847 -1.125 0.261 -2.611 0.707
purpose_A42 -1.0981 0.326 -3.365 0.001 -1.738 -0.459
purpose_A43 -1.0147 0.293 -3.458 0.001 -1.590 -0.440
purpose_A44 -0.5344 0.778 -0.687 0.492 -2.060 0.991
purpose_A45 -0.6317 0.649 -0.973 0.330 -1.904 0.640
purpose_A46 0.1820 0.498 0.365 0.715 -0.795 1.159
purpose_A48 -2.1140 1.228 -1.722 0.085 -4.520 0.292
purpose_A49 -0.5624 0.399 -1.408 0.159 -1.345 0.220
svaing_acc_A62 -0.3141 0.344 -0.912 0.362 -0.989 0.361
svaing_acc_A63 -0.6561 0.484 -1.356 0.175 -1.604 0.292
svaing_acc_A64 -1.0924 0.615 -1.777 0.076 -2.298 0.113
svaing_acc_A65 -0.6663 0.300 -2.219 0.027 -1.255 -0.078
present_emp_since_A72 -0.0647 0.509 -0.127 0.899 -1.063 0.934
present_emp_since_A73 -0.1742 0.485 -0.359 0.719 -1.124 0.776
present_emp_since_A74 -0.4882 0.525 -0.929 0.353 -1.518 0.541
present_emp_since_A75 -0.3425 0.494 -0.693 0.488 -1.311 0.626
personal_status_A92 -0.2695 0.505 -0.534 0.594 -1.259 0.720
personal_status_A93 -0.7277 0.498 -1.460 0.144 -1.704 0.249
personal_status_A94 -0.2190 0.587 -0.373 0.709 -1.370 0.932
other_debtors_A102 0.0075 0.499 0.015 0.988 -0.971 0.986
other_debtors_A103 -0.7804 0.476 -1.640 0.101 -1.713 0.152
property_A122 0.4397 0.303 1.450 0.147 -0.155 1.034
property_A123 0.2453 0.284 0.865 0.387 -0.311 0.801
property_A124 0.7832 0.513 1.526 0.127 -0.223 1.789
inst_plans_A142 -0.0715 0.540 -0.132 0.895 -1.130 0.988
inst_plans_A143 -0.5210 0.290 -1.795 0.073 -1.090 0.048
housing_A152 -0.4157 0.291 -1.431 0.153 -0.985 0.154
housing_A153 -0.2840 0.583 -0.487 0.626 -1.426 0.858
job_A172 0.5672 0.851 0.666 0.505 -1.101 2.236
job_A173 0.5941 0.816 0.729 0.466 -1.004 2.193
job_A174 0.6126 0.821 0.746 0.455 -0.996 2.221
telephone_A192 -0.3034 0.243 -1.247 0.213 -0.780 0.174
foreign_worker_A202 -1.3456 0.746 -1.804 0.071 -2.808 0.117

Find the significant variables

In [20]:
def get_significant_vars( lm ):
    var_p_vals_df = pd.DataFrame( lm.pvalues )
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    return list( var_p_vals_df[var_p_vals_df.pvals <= 0.05]['vars'] )
In [21]:
significant_vars = get_significant_vars( lg )
In [22]:
significant_vars
Out[22]:
['amount',
 'inst_rate',
 'age',
 'checkin_acc_A13',
 'checkin_acc_A14',
 'credit_history_A33',
 'credit_history_A34',
 'purpose_A41',
 'purpose_A42',
 'purpose_A43',
 'svaing_acc_A65']

Predict Test Data and Measure Accuracy

In [23]:
from sklearn import metrics
In [24]:
def get_predictions( y_test, model ):
    y_pred_df = pd.DataFrame( { 'actual': y_test,
                               "predicted_prob": lg.predict( sm.add_constant( X_test ) ) } )
    return y_pred_df
In [25]:
y_pred_df = get_predictions( y_test, lg )

Status_1 columns provides the predicted probability for the default classes

In [26]:
y_pred_df[0:10]
Out[26]:
actual predicted_prob
521 1 0.394631
737 0 0.540659
740 0 0.640590
660 0 0.299997
411 0 0.037954
678 0 0.363224
626 0 0.043453
513 0 0.462469
859 0 0.023631
136 0 0.031622
In [27]:
y_pred_df['predicted'] = y_pred_df.predicted_prob.map( lambda x: 1 if x > 0.5 else 0)
In [28]:
y_pred_df[0:10]
Out[28]:
actual predicted_prob predicted
521 1 0.394631 0
737 0 0.540659 1
740 0 0.640590 1
660 0 0.299997 0
411 0 0.037954 0
678 0 0.363224 0
626 0 0.043453 0
513 0 0.462469 0
859 0 0.023631 0
136 0 0.031622 0

Build a Confusion Matrix

In [29]:
import matplotlib.pylab as plt
import seaborn as sn
%matplotlib inline
In [30]:
def draw_cm( actual, predicted ):
    cm = metrics.confusion_matrix( actual, predicted, [1,0] )
    sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["Default", "No Default"] , yticklabels = ["Default", "No Default"] )
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
In [31]:
draw_cm( y_pred_df.actual, y_pred_df.predicted )

Note: the model could only predict very few default classes.

Overall accuracy of the model

In [33]:
print( 'Total Accuracy : ',np.round( metrics.accuracy_score( y_test, y_pred_df.predicted ), 2 ) )
print( 'Precision : ',np.round( metrics.precision_score( y_test, y_pred_df.predicted ), 2 ) )
print( 'Recall : ',np.round( metrics.recall_score( y_test, y_pred_df.predicted ), 2 ) )

cm1 = metrics.confusion_matrix( y_pred_df.actual, y_pred_df.predicted, [1,0] )

sensitivity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', round( sensitivity, 2) )

specificity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', round( specificity, 2 ) )
Total Accuracy :  0.78
Precision :  0.68
Recall :  0.49
Sensitivity :  0.49
Specificity :  0.9

Predicted Probability distribution Plots for Defaults and Non Defaults

In [34]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [35]:
sn.distplot( y_pred_df[y_pred_df.actual == 1]["predicted_prob"], kde=False, color = 'b' )
sn.distplot( y_pred_df[y_pred_df.actual == 0]["predicted_prob"], kde=False, color = 'g' )
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x11dc11f98>
In [36]:
auc_score = metrics.roc_auc_score( y_pred_df.actual, y_pred_df.predicted_prob  )
round( float( auc_score ), 2 )
Out[36]:
0.82
In [37]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(6, 4))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return fpr, tpr, thresholds
In [38]:
fpr, tpr, thresholds = draw_roc( y_pred_df.actual, y_pred_df.predicted_prob )

Finding Optimal Cutoff Probability

In [39]:
thresholds[0:10]
Out[39]:
array([ 0.94491505,  0.93212151,  0.91523649,  0.91246038,  0.87393746,
        0.84751505,  0.8468848 ,  0.84230738,  0.83966557,  0.82827232])
In [40]:
fpr[0:10]
Out[40]:
array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.00478469,  0.00478469,  0.00478469])
In [41]:
tpr[0:10]
Out[41]:
array([ 0.01098901,  0.02197802,  0.03296703,  0.04395604,  0.05494505,
        0.06593407,  0.07692308,  0.07692308,  0.08791209,  0.0989011 ])

Find optimal cutoff using youden's index

  • Youden's index is where (Sensitivity+Specificity - 1) is maximum.
  • That is when (TPR+TNR -1) is maximum.
    • max( TPR - (1 - TNR) )
    • max( TPR - FPR )
In [42]:
tpr_fpr = pd.DataFrame( { 'tpr': tpr, 'fpr': fpr, 'thresholds': thresholds } )
tpr_fpr['diff'] = tpr_fpr.tpr - tpr_fpr.fpr
tpr_fpr.sort_values( 'diff', ascending = False )[0:10]
Out[42]:
fpr thresholds tpr diff
112 0.220096 0.326331 0.736264 0.516168
119 0.244019 0.298348 0.758242 0.514223
116 0.234450 0.310227 0.747253 0.512803
113 0.224880 0.321486 0.736264 0.511383
120 0.248804 0.296035 0.758242 0.509438
117 0.239234 0.305681 0.747253 0.508018
114 0.229665 0.314444 0.736264 0.506599
124 0.263158 0.288240 0.769231 0.506073
111 0.220096 0.328648 0.725275 0.505179
121 0.253589 0.293635 0.758242 0.504653
In [43]:
y_pred_df['predicted_new'] = y_pred_df.predicted_prob.map( lambda x: 1 if x > 0.29 else 0)
In [44]:
draw_cm( y_pred_df.actual, y_pred_df.predicted_new )

Find optimal cutoff probability using cost

In [45]:
cm = metrics.confusion_matrix( y_pred_df.actual, y_pred_df.predicted_new, [1,0] )
In [46]:
cm_mat = np.array( cm )
In [47]:
cm_mat[1, 0]
Out[47]:
54
In [48]:
cm_mat[0, 1]
Out[48]:
22
In [49]:
def get_total_cost( actual, predicted ):
    cm = metrics.confusion_matrix( actual, predicted, [1,0] )
    cm_mat = np.array( cm )
    return cm_mat[0,1] * 2 + cm_mat[0,1] * 1
In [50]:
get_total_cost( y_pred_df.actual, y_pred_df.predicted_new )
Out[50]:
66
In [51]:
cost_df = pd.DataFrame( columns = ['prob', 'cost'])
In [52]:
idx = 0
for each_prob in range( 20, 50):
    cost = get_total_cost( y_pred_df.actual,
                          y_pred_df.predicted_prob.map(
            lambda x: 1 if x > (each_prob/100)  else 0) )
    cost_df.loc[idx] = [(each_prob/100), cost]
    idx += 1
In [53]:
cost_df.sort_values( 'cost', ascending = True )[0:5]
Out[53]:
prob cost
0 0.20 39.0
1 0.21 45.0
2 0.22 51.0
3 0.23 54.0
4 0.24 57.0
In [54]:
y_pred_df['predicted_final'] = y_pred_df.predicted_prob.map( lambda x: 1 if x > 0.20 else 0)
In [55]:
draw_cm( y_pred_df.actual, y_pred_df.predicted_final )

Note:

Predicting Defaults as No Defaults have been minimized.

In [56]:
print( 'Total Accuracy : ',np.round( metrics.accuracy_score( y_test, y_pred_df.predicted_final ), 2 ) )
print( 'Precision : ',np.round( metrics.precision_score( y_test, y_pred_df.predicted_final ), 2 ) )
print( 'Recall : ',np.round( metrics.recall_score( y_test, y_pred_df.predicted_final ), 2 ) )

cm1 = metrics.confusion_matrix( y_pred_df.actual, y_pred_df.predicted_final, [1,0] )

sensitivity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', round( sensitivity, 2) )

specificity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', round( specificity, 2 ) )
Total Accuracy :  0.67
Precision :  0.48
Recall :  0.86
Sensitivity :  0.86
Specificity :  0.59

Note:

Total accuracy of the model is 67%, becuase the objective is not to improve total accuracy but minimize the quadrants that contribute to the cost.