Analyzing German Credit Rating

German credit rating dataset is provided by Prof. Hofmann, contains categorical/symbolic attributes of the persons who availed the credit and the current status of the credit (at the time when the dataset is prepared). The status of the credit is indicated by 1 for good credits and 2 for bad credits.

The dataset used in this exercise is downloaded from UCI Repository

The detailed description of variables can be found at the same link.

In [2]:
import pandas as pd
import numpy as np

Read the dataset

The data file does not contain header and it delimited by white space.

In [3]:
credit_df = pd.read_csv( "german.data.txt", delim_whitespace = True, header = None )
In [4]:
credit_df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

Initialize the column headers

In [317]:
columns = ['checkin_acc', 'duration', 'credit_history', 'purpose', 'amount',
         'saving_acc', 'present_emp_since', 'inst_rate', 'personal_status',
         'other_debtors', 'residing_since', 'property', 'age',
         'inst_plans', 'housing', 'num_credits',
         'job', 'dependents', 'telephone', 'foreign_worker', 'status']
In [318]:
credit_df.columns = columns
In [319]:
credit_df.head()
Out[319]:
checkin_acc duration credit_history purpose amount saving_acc present_emp_since inst_rate personal_status other_debtors ... property age inst_plans housing num_credits job dependents telephone foreign_worker status
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 0
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 1
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 0
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 0
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 1

5 rows × 21 columns

The variable types

In [320]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
checkin_acc          1000 non-null object
duration             1000 non-null int64
credit_history       1000 non-null object
purpose              1000 non-null object
amount               1000 non-null int64
saving_acc           1000 non-null object
present_emp_since    1000 non-null object
inst_rate            1000 non-null int64
personal_status      1000 non-null object
other_debtors        1000 non-null object
residing_since       1000 non-null int64
property             1000 non-null object
age                  1000 non-null int64
inst_plans           1000 non-null object
housing              1000 non-null object
num_credits          1000 non-null int64
job                  1000 non-null object
dependents           1000 non-null int64
telephone            1000 non-null object
foreign_worker       1000 non-null object
status               1000 non-null int64
dtypes: int64(8), object(13)
memory usage: 164.1+ KB

Modifying status indicator

Changing the credit status variable to 0 and 1 for conventional classification represenations. 0 indicates good credit and 1 indicates bad credit.

In [9]:
credit_df.status = credit_df.status - 1
In [43]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

Analyzing Credit Amount Disbursed - Univariate Analysis

How does the distribution of credit amount disbursed look like?

In [177]:
sn.distplot( credit_df.amount, kde = False )
plt.title( "Histogram of Credit Amount Disbursed", fontsize = 15)
plt.ylabel( "Frequency")
Out[177]:
<matplotlib.text.Text at 0x12b35c2e8>

Note: Most of the credit amounts are less than 5000 with some higher credit amounts. The largest amount disbursed is as high as 18000+.

In [250]:
credit_df.amount.describe()
Out[250]:
count     1000.000000
mean      3271.258000
std       2822.736876
min        250.000000
25%       1365.500000
50%       2319.500000
75%       3972.250000
max      18424.000000
Name: amount, dtype: float64
In [181]:
sn.boxplot( credit_df.amount, orient = 'v' )
plt.title( "Boxplot of Credit Amount Disbursed", fontsize = 15)
Out[181]:
<matplotlib.text.Text at 0x1309710b8>

Note: The middle 50% of the population lies between 1300 to 3900.

In [242]:
sn.boxplot( x = 'status', y = 'amount', data = credit_df, orient = 'v' )
plt.title( "Boxplot of Credit Amount Disbursed by Credit Status", fontsize = 15)
Out[242]:
<matplotlib.text.Text at 0x1347f6c50>

Note: Lot of higher credit amounts seem to have been defaulted.

Distribution plot for credit amounts for differernt status

In [195]:
sn.distplot( credit_df[credit_df.status == 0].amount, color = 'g', hist = False )
sn.distplot( credit_df[credit_df.status == 1].amount, color = 'r', hist = False )
plt.title( "Distribution plot of Amount comparison for Different Credit Status", fontsize = 10 )
plt.ylabel( "Frequency")
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[195]:
<matplotlib.text.Text at 0x131c5b4a8>
In [46]:
g = sn.FacetGrid(credit_df, col="status", size = 6)
g.map(sn.distplot, "amount", kde = False, bins = 20 )
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x11c7b1dd8>

Note: Amounts higher than 10000 have been mostly defaulted.

What is the credit default rate?

In [253]:
d_rate_df = pd.DataFrame( credit_df.status.value_counts( normalize=True ) )
d_rate_df
Out[253]:
status
0 0.7
1 0.3
In [258]:
sn.barplot( x = d_rate_df.index, y = d_rate_df.status )
Out[258]:
<matplotlib.axes._subplots.AxesSubplot at 0x13358e898>
In [95]:
credit_df.amount.describe()
Out[95]:
count     1000.000000
mean      3271.258000
std       2822.736876
min        250.000000
25%       1365.500000
50%       2319.500000
75%       3972.250000
max      18424.000000
Name: amount, dtype: float64

Credit Amount - Outliers & it's impact on credit defaults

In [96]:
amount_desc = credit_df.amount.describe()
In [111]:
outliers = amount_desc['75%'] + 1.5 * ( amount_desc['75%'] - amount_desc['25%'] )
In [112]:
outliers
Out[112]:
7882.375
In [259]:
d_rate_outliers_df = pd.DataFrame( credit_df[credit_df.amount >
                      outliers ].status.
                   value_counts( normalize = True ) )
In [260]:
sn.barplot( x = d_rate_outliers_df.index, y = d_rate_outliers_df.status )
Out[260]:
<matplotlib.axes._subplots.AxesSubplot at 0x133436438>
In [114]:
extreme_outliers = amount_desc['75%'] + 3 * ( amount_desc['75%'] - amount_desc['25%'] )
In [261]:
extreme_outliers_df = pd.DataFrame( credit_df[credit_df.amount >
                      extreme_outliers ].status.
                    value_counts( normalize = True ) )
In [262]:
sn.barplot( x = extreme_outliers_df.index, y = extreme_outliers_df.status )
Out[262]:
<matplotlib.axes._subplots.AxesSubplot at 0x135853470>

Analyzing Interest Rates on Credit Defaults

In [47]:
credit_df.inst_rate.unique()
Out[47]:
array([4, 2, 3, 1])
In [48]:
rate_count = credit_df[['inst_rate', 'status']].groupby(['inst_rate', 'status']).size().reset_index()
rate_count.columns = ['inst_rate', 'status', 'count']
In [49]:
rate_count
Out[49]:
inst_rate status count
0 1 0 102
1 1 1 34
2 2 0 169
3 2 1 62
4 3 0 112
5 3 1 45
6 4 0 317
7 4 1 159
In [50]:
g = sn.factorplot(x="inst_rate", y = 'count', hue="status", data=rate_count,
                 size=6, kind="bar", palette="muted")

Credit Amount for different interest rates and their impact on bad credit

In [248]:
sn.boxplot( x = 'inst_rate', y = 'amount', hue = 'status', data = credit_df, orient = 'v' )
plt.title( "Boxplot of Credit Amount Disbursed by Credit Status", fontsize = 12)
Out[248]:
<matplotlib.text.Text at 0x1352f3c88>
In [263]:
sn.lmplot( x = 'inst_rate', y = 'amount', data = credit_df )
Out[263]:
<seaborn.axisgrid.FacetGrid at 0x13585fe80>

Does the average credit amount for interest rate = 1 different for good and bad credit ?

In [264]:
credit_inst_rate_1_df = credit_df[ credit_df.inst_rate == 1 ]
sn.distplot( credit_inst_rate_1_df[credit_inst_rate_1_df.status == 0 ].amount, color = 'g', hist = False )
sn.distplot( credit_inst_rate_1_df[credit_inst_rate_1_df.status == 1].amount, color = 'r', hist = False )
sn.plt.axvline( x = credit_inst_rate_1_df[credit_inst_rate_1_df.status == 0 ].amount.mean(), color = 'g' )
sn.plt.axvline( x = credit_inst_rate_1_df[credit_inst_rate_1_df.status == 1 ].amount.mean(), color = 'r' )
plt.title( "Distribution plot of Amount Disbured for inst_rate = 1", fontsize = 10 )
plt.ylabel( "Frequency")
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[264]:
<matplotlib.text.Text at 0x135a2e7f0>

Note: For interest rate = 1, the average credit amount for bad credits seems to be higher then good credits

Hypothesis Test

${H}_{0}$: average Credit amount for good credit = average credit amount for bad credit

${H}_{1}$: average Credit amount for good credit <> average credit amount for bad credit

In [245]:
stats.ttest_ind( credit_inst_rate_1_df[credit_inst_rate_1_df.status == 0 ].amount,
               credit_inst_rate_1_df[credit_inst_rate_1_df.status == 1 ].amount)
Out[245]:
Ttest_indResult(statistic=-4.529326080420299, pvalue=1.2948704790797532e-05)

Note: A p-value < 0.05 indicates average credit for good credit is less than average credit amount for bad credits

Average credit amount for bad credit

In [265]:
credit_inst_rate_1_df[credit_inst_rate_1_df.status == 1 ].amount.mean()
Out[265]:
6425.323529411765
In [275]:
credit_inst_rate_1_df[credit_inst_rate_1_df.status == 0 ].amount.mean()
Out[275]:
3619.823529411765

Can we find a cutoff for rejecting credit amount for inst_rate = 1? What is the reduction in default and what is opportunity cost?

In [274]:
stats.ttest_1samp( credit_inst_rate_1_df[credit_inst_rate_1_df.status == 1 ].amount, 4700 )
Out[274]:
Ttest_1sampResult(statistic=2.0723224768428312, pvalue=0.046122741983344887)
In [281]:
1 - stats.norm.cdf( 4700,
                  loc = credit_inst_rate_1_df[credit_inst_rate_1_df.status == 1 ].amount.mean() ,
                  scale = credit_inst_rate_1_df[credit_inst_rate_1_df.status == 1 ].amount.bstd() )
Out[281]:
0.63885516516948215
In [286]:
credit_inst_rate_1_df[ (credit_inst_rate_1_df.status == 1)
                    & (credit_inst_rate_1_df.amount > 4700)].amount.sum()
Out[286]:
177609
In [282]:
1 - stats.norm.cdf( 4700,
                  loc = credit_inst_rate_1_df[credit_inst_rate_1_df.status == 0 ].amount.mean() ,
                  scale = credit_inst_rate_1_df[credit_inst_rate_1_df.status == 0 ].amount.std() )
Out[282]:
0.31914568252537312
In [287]:
credit_inst_rate_1_df[ (credit_inst_rate_1_df.status == 0)
                    & (credit_inst_rate_1_df.amount > 4700)].amount.sum()
Out[287]:
166366

Note: Assuming there is 10% return on the credits, the bank is loosing only 16636.6, where as the loss becuase of bad credit is complete principal amount i.e. 177609, which is very high.

Analyzing impact of customer having checkin account on bad credits

In [288]:
sn.barplot( x = 'checkin_acc', y = 'amount', hue = 'status', data = credit_df )
plt.title( "Average credit amount by different checkin account holders")
plt.figtext(1, 0.5,"""A11 : < 0 DM \n A12 : 0 <= ... < 200 DM \n A13 : >= 200 DM \n A14 : no checking account 
""", wrap=True, horizontalalignment='left', fontsize=12)
Out[288]:
<matplotlib.text.Text at 0x135cce278>
In [289]:
sn.barplot( x = 'checkin_acc',
         y = 'amount',
         hue = 'status',
         data = credit_df,
         estimator = sum )
plt.title( "Total credit amount by different checkin account holders")
plt.figtext(1, 0.5,"""A11 : < 0 DM \n A12 : 0 <= ... < 200 DM \n A13 : >= 200 DM \n A14 : no checking account 
""", wrap=True, horizontalalignment='left', fontsize=12)
Out[289]:
<matplotlib.text.Text at 0x135d4dd30>
In [290]:
sn.countplot( y = 'checkin_acc', hue = 'status', data = credit_df )
Out[290]:
<matplotlib.axes._subplots.AxesSubplot at 0x135d7e128>

Note: Customers having no checkin account seems to have lesser chance of making a default where as customers having checkin accoun without any balance have higher chance of making a default.

Impact of credit history on bad credits

In [ ]:
figure_text = """A30 : no credits taken/ all credits paid back duly \n
A31 : all credits at this bank paid back duly \n
A32 : existing credits paid back duly till now \n
A33 : delay in paying off in the past \n
A34 : critical account/ other credits existing (not at this bank) """
In [292]:
sn.barplot( x = 'credit_history', y = 'amount', hue = 'status', data = credit_df, estimator = sum )
plt.figtext(1, 0.5,figure_text, wrap=True, horizontalalignment='left', fontsize=12)
Out[292]:
<matplotlib.text.Text at 0x136212908>
In [293]:
sn.countplot( y = 'credit_history', hue = 'status', data = credit_df )
plt.figtext(1, 0.5,figure_text, wrap=True, horizontalalignment='left', fontsize=12)
Out[293]:
<matplotlib.text.Text at 0x136302160>

Analyzing impact of credit purpose on bad credit

In [148]:
purpose_text = '''
A40 : car (new) \n
A41 : car (used) \n
A42 : furniture/equipment \n
A43 : radio/television \n
A44 : domestic appliances \n 
A45 : repairs \n
A46 : education \n
A47 : (vacation - does not exist?) \n
A48 : retraining \n
A49 : business \n
A410 : others '''
In [152]:
sn.barplot( x = 'purpose', y = 'amount', hue = 'status', data = credit_df )
plt.figtext(1, 0.3,purpose_text, wrap=True, horizontalalignment='left', fontsize=8)
Out[152]:
<matplotlib.text.Text at 0x12334de10>
In [153]:
sn.barplot( x = 'purpose', y = 'amount', hue = 'status', data = credit_df, estimator = sum )
plt.figtext(1, 0.3,purpose_text, wrap=True, horizontalalignment='left', fontsize=8)
Out[153]:
<matplotlib.text.Text at 0x1235ada20>
In [197]:
from scipy import stats

Is the average loan amount taken for used car purchse differnent for defaulters and non-defaulters?

In [228]:
credit_used_car_df = credit_df[ credit_df.purpose == 'A41' ]
sn.distplot( credit_used_car_df[credit_used_car_df.status == 0 ].amount, color = 'g', hist = False )
sn.distplot( credit_used_car_df[credit_used_car_df.status == 1].amount, color = 'r', hist = False )
sn.plt.axvline( x = credit_used_car_df[credit_used_car_df.status == 0 ].amount.mean(), color = 'g' )
sn.plt.axvline( x = credit_used_car_df[credit_used_car_df.status == 1 ].amount.mean(), color = 'r' )
plt.title( "Distribution plot of Amount Disbured for Used Car Purchase and Credit Status", fontsize = 10 )
plt.ylabel( "Frequency")
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[228]:
<matplotlib.text.Text at 0x133996748>
In [229]:
stats.ttest_ind( credit_used_car_df[credit_used_car_df.status == 0 ].amount,
               credit_used_car_df[credit_used_car_df.status == 1 ].amount)
Out[229]:
Ttest_indResult(statistic=-4.8740982077770427, pvalue=4.0684088629381788e-06)

Is the average loan amount taken for new car purchse differnent for defaulters and non-defaulters?

In [237]:
credit_new_car_df = credit_df[ credit_df.purpose == 'A40' ]
sn.distplot( credit_new_car_df[credit_new_car_df.status == 0 ].amount, color = 'g', hist = False )
sn.distplot( credit_new_car_df[credit_new_car_df.status == 1].amount, color = 'r', hist = False )
sn.plt.axvline( x = credit_new_car_df[credit_new_car_df.status == 0 ].amount.mean(), color = 'g' )
sn.plt.axvline( x = credit_new_car_df[credit_new_car_df.status == 1 ].amount.mean(), color = 'r' )
plt.title( "Distribution plot of Amount Disbured for Used Car Purchase and Credit Status", fontsize = 10 )
plt.ylabel( "Frequency")
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[237]:
<matplotlib.text.Text at 0x134540550>
In [239]:
stats.ttest_ind( credit_new_car_df[credit_new_car_df.status == 0 ].amount,
               credit_new_car_df[credit_new_car_df.status == 1 ].amount)
Out[239]:
Ttest_indResult(statistic=-1.6247519062706208, pvalue=0.10557317267686608)

Is the average loan amount taken for domestic appliances purchse differnent for defaulters and non-defaulters?

In [241]:
credit_appliances_df = credit_df[ credit_df.purpose == 'A44' ]
sn.distplot( credit_appliances_df[credit_appliances_df.status == 0 ].amount, color = 'g', hist = False )
sn.distplot( credit_appliances_df[credit_appliances_df.status == 1].amount, color = 'r', hist = False )
sn.plt.axvline( x = credit_appliances_df[credit_appliances_df.status == 0 ].amount.mean(), color = 'g' )
sn.plt.axvline( x = credit_appliances_df[credit_appliances_df.status == 1 ].amount.mean(), color = 'r' )
plt.title( "Distribution plot of Amount Disbured for Used Car Purchase and Credit Status", fontsize = 10 )
plt.ylabel( "Frequency")
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[241]:
<matplotlib.text.Text at 0x13468c5c0>
In [240]:
stats.ttest_ind( credit_appliances_df[credit_appliances_df.status == 0 ].amount,
               credit_appliances_df[credit_appliances_df.status == 1 ].amount)
Out[240]:
Ttest_indResult(statistic=-0.16862387467700832, pvalue=0.86945436741701787)
In [154]:
sn.countplot( x = 'purpose', hue = 'status', data = credit_df )
plt.figtext(1, 0.3,purpose_text, wrap=True, horizontalalignment='left', fontsize=8)
Out[154]:
<matplotlib.text.Text at 0x1238cbdd8>

Relationship betweeb Credit amount and Duration

In [88]:
sn.lmplot( x = 'duration', y = 'amount', fit_reg = False, data = credit_df )
Out[88]:
<seaborn.axisgrid.FacetGrid at 0x1211970f0>
In [83]:
sn.lmplot( x = 'duration', y = 'amount', fit_reg = True, data = credit_df )
Out[83]:
<seaborn.axisgrid.FacetGrid at 0x11ebe5470>
In [84]:
sn.lmplot( x = 'duration', y = 'amount', hue = 'status', fit_reg = False, data = credit_df )
Out[84]:
<seaborn.axisgrid.FacetGrid at 0x11dc66550>

Note: Customers taking credit for large amount for lesser duration are making default mostly. This kind of loans can be restricted.

Impact of age on bad credits

In [310]:
sn.lmplot( x = 'age', y = 'amount', hue = 'status', fit_reg = True, data = credit_df )
Out[310]:
<seaborn.axisgrid.FacetGrid at 0x1376e02b0>