This tutorial blog gives a quick overview of

  • Basic statistical operations using pandas DataFrames
  • Using the matplotlib and seaborn APIs to create basic plots such as bar plots, distribution plots, box plots, pair plots and scatter plots
In [1]:
import pandas as pd
import numpy as np

Original dataset is available here

Attributes Information

A data frame with 9 variables. The raw file read below has 397 rows; 392 observations remain once rows with a missing horsepower value are dropped.

  • mpg
    • miles per gallon
  • cylinders
    • Number of cylinders (between 3 and 8)
  • displacement
    • Engine displacement (cu. inches)
  • horsepower
    • Engine horsepower
  • weight
    • Vehicle weight (lbs.)
  • acceleration
    • Time to accelerate from 0 to 60 mph (sec.)
  • year
    • Model year (modulo 100)
  • origin
    • Origin of car (1. American, 2. European, 3. Japanese)
  • name
    • Vehicle name
In [2]:
autos = pd.read_csv( "Auto.csv")
In [3]:
autos.head( 5 )
Out[3]:
    mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin  name
0  18.0          8         307.0         130    3504          12.0    70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5    70       1  buick skylark 320
2  18.0          8         318.0         150    3436          11.0    70       1  plymouth satellite
3  16.0          8         304.0         150    3433          12.0    70       1  amc rebel sst
4  17.0          8         302.0         140    3449          10.5    70       1  ford torino
In [4]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      397 non-null object
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
In [5]:
autos[["mpg", "displacement","horsepower","weight","acceleration"]].describe()
Out[5]:
                mpg  displacement        weight  acceleration
count    397.000000    397.000000    397.000000    397.000000
mean      23.515869    193.532746   2970.261965     15.555668
std        7.825804    104.379583    847.904119      2.749995
min        9.000000     68.000000   1613.000000      8.000000
25%       17.500000    104.000000   2223.000000     13.800000
50%       23.000000    146.000000   2800.000000     15.500000
75%       29.000000    262.000000   3609.000000     17.100000
max       46.600000    455.000000   5140.000000     24.800000

Cleanup horsepower column

In [6]:
# horsepower has dtype object, which suggests it contains some non-numeric characters.
autos["horsepower"].isnull().values.any()
Out[6]:
False
  • There are no NULL values, yet the column is of object type.

  • That means it must contain some non-numeric characters. Coercing every value to numeric turns the non-numeric entries into NaN, which can then be filtered out.
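
As a minimal sketch of the coercion step, the toy series below uses made-up values (not taken from Auto.csv), with a "?" standing in for whatever non-numeric entry the real column holds. It shows how `errors="coerce"` exposes the offending rows before they are dropped.

```python
import pandas as pd

# Hypothetical horsepower-like column; "?" stands in for a non-numeric entry.
raw = pd.Series(["130", "165", "?", "95"], name="horsepower")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
coerced = pd.to_numeric(raw, errors="coerce")

# The rows that became NaN are the ones that held non-numeric text.
bad_rows = raw[coerced.isnull()]

# Dropping the NaNs leaves a purely numeric float column.
clean = coerced.dropna()
```

Inspecting `bad_rows` before calling `dropna()` is a cheap way to confirm what is being thrown away.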

In [7]:
autos["horsepower"] = pd.to_numeric( autos["horsepower"], errors = 'coerce' )
In [8]:
autos["horsepower"].isnull().values.any()
Out[8]:
True
In [9]:
autos = autos.dropna()

Basic statistics

In [10]:
autos[["mpg", "displacement","horsepower","weight","acceleration"]].describe()
Out[10]:
                mpg  displacement    horsepower        weight  acceleration
count    392.000000    392.000000    392.000000    392.000000    392.000000
mean      23.445918    194.411990    104.469388   2977.584184     15.541327
std        7.805007    104.644004     38.491160    849.402560      2.758864
min        9.000000     68.000000     46.000000   1613.000000      8.000000
25%       17.000000    105.000000     75.000000   2225.250000     13.775000
50%       22.750000    151.000000     93.500000   2803.500000     15.500000
75%       29.000000    275.750000    126.000000   3614.750000     17.025000
max       46.600000    455.000000    230.000000   5140.000000     24.800000

Using Matplotlib and seaborn for plotting graphs and charts

  • %matplotlib inline is a directive to the ipython notebook to render the plots here.
In [11]:
%matplotlib inline
import seaborn as sn
import matplotlib.pyplot as plt

Average mpg by cylinders

In [12]:
mpg_cylinders_df = autos.groupby('cylinders')['mpg'].mean().reset_index()
In [13]:
sn.barplot( y = 'mpg',
          x = 'cylinders',
          data = mpg_cylinders_df )
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x119a95080>
In [14]:
sn.barplot( y = 'mpg',
          x = 'cylinders',
          data = mpg_cylinders_df,
          order = mpg_cylinders_df.sort_values('mpg')['cylinders'])
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x119e0e630>
In [15]:
mpg_cylinders_origin_df = autos.groupby(['cylinders', 'origin'])['mpg'].mean().reset_index()
In [16]:
mpg_cylinders_origin_df
Out[16]:
   cylinders  origin        mpg
0          3       3  20.550000
1          4       1  28.013043
2          4       2  28.106557
3          4       3  31.595652
4          5       2  27.366667
5          6       1  19.645205
6          6       2  20.100000
7          6       3  23.883333
8          8       1  14.963107
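
The group-and-average steps used so far can be sketched on a small toy frame. The values below are made up for illustration; only the pattern (group, take the mean, reset the index, optionally sort to get a plotting order) mirrors the cells above.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from Auto.csv).
toy = pd.DataFrame({
    "cylinders": [4, 4, 6, 6, 8, 8],
    "origin":    [1, 3, 1, 3, 1, 1],
    "mpg":       [28.0, 30.0, 20.0, 22.0, 14.0, 16.0],
})

# One key: mean mpg per cylinder count, flattened back into columns.
by_cyl = toy.groupby("cylinders")["mpg"].mean().reset_index()

# Sorting by the mean gives a cylinder order that can be passed as `order=`.
bar_order = by_cyl.sort_values("mpg")["cylinders"].tolist()

# Two keys: one row per (cylinders, origin) combination that actually occurs.
by_cyl_origin = toy.groupby(["cylinders", "origin"])["mpg"].mean().reset_index()
```

Combinations that never occur in the data (here, 8-cylinder cars from origin 3) simply produce no row, which is why the table above has 9 rows rather than one for every possible pair.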

Average mpg by cylinders, grouped by origin

In [17]:
sn.barplot( y = 'mpg',
          x = 'cylinders',
          data = mpg_cylinders_origin_df,
          hue = 'origin');

Trend in average MPG by year for different origin cars

In [18]:
mpg_year_origin_df = autos.groupby(['year', 'origin'])['mpg'].mean().reset_index()
In [19]:
# catplot replaces the deprecated factorplot, and height replaces size (seaborn >= 0.9)
sn.catplot( x = 'year', y = 'mpg', hue = 'origin', kind = 'point', data = mpg_year_origin_df, height = 6 )
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x119dd0d30>

Creating a histogram - Distribution of mpg (miles per gallon)

In [20]:
plt.hist( autos.mpg )
Out[20]:
(array([ 13.,  78.,  73.,  58.,  53.,  48.,  37.,  22.,   4.,   6.]),
array([  9.  ,  12.76,  16.52,  20.28,  24.04,  27.8 ,  31.56,  35.32,
       39.08,  42.84,  46.6 ]),
<a list of 10 Patch objects>)
In [21]:
plt.hist( autos.mpg, bins = 50 );
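
The counts and bin edges that `plt.hist` prints in Out[20] can be reproduced without drawing anything. The sketch below uses a made-up sample and `np.histogram`, which returns the same kind of (counts, edges) pair. With 10 bins the edges are evenly spaced between the minimum and the maximum, which is why the edges in Out[20] are (46.6 - 9) / 10 = 3.76 mpg apart.

```python
import numpy as np

# Made-up sample; any 1-D numeric array works the same way.
values = np.array([9.0, 12.0, 15.0, 18.0, 18.0, 23.0, 29.0, 31.0, 40.0, 46.6])

# np.histogram returns a (counts, bin_edges) pair without plotting.
counts, edges = np.histogram(values, bins=10)

# Ten equal-width bins need eleven edges running from the minimum to the maximum.
bin_width = (values.max() - values.min()) / 10
```

Raising `bins` (as in the cell above with `bins = 50`) only narrows the bins; the counts still add up to the number of observations.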
In [22]:
plt.hist( autos.acceleration );
In [23]:
plt.hist( autos.weight );
In [24]:
sn.distplot( autos.mpg )
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a768940>

Comparing mpg distributions of cars by different origins

In [25]:
sn.distplot( autos[autos.origin == 1].mpg, hist = False, label= 'American' )
sn.distplot( autos[autos.origin == 2].mpg, hist = False, label= 'European' )
sn.distplot( autos[autos.origin == 3].mpg, hist = False, label= 'Japanese' )
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a79eac8>

Calculating Statistics

In [27]:
mpg_desc = autos.mpg.describe()
In [28]:
mpg_desc['mean']
Out[28]:
23.445918367346941

Setting image size and setting title in seaborn

In [29]:
sn.set(rc={"figure.figsize": (8, 6)});

Distribution of mpg for all cars

In [30]:
sn.boxplot( y = autos.mpg )
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab0f160>

MPG distribution for different number of cylinders.

In [31]:
# ndarray.sort() sorts in place and returns None, so build the order with sorted()
sn.boxplot( x = autos.cylinders,
          y = autos.mpg,
          order = sorted( autos.cylinders.unique() ) )
plt.title( "Box plots for various cylinder counts")
Out[31]:
<matplotlib.text.Text at 0x11a0d0518>

Horsepower distribution for different number of cylinders.

In [32]:
# ndarray.sort() sorts in place and returns None, so build the order with sorted()
sn.boxplot( x = autos.cylinders,
          y = autos.horsepower,
          order = sorted( autos.cylinders.unique() ) )
plt.title( "Box plots for various cylinder counts")
Out[32]:
<matplotlib.text.Text at 0x11a5e7390>
  • Note: mpg rises as the number of cylinders increases from 3 to 4, but falls beyond that, because additional cylinders are added to give more power at the cost of mpg.

Finding Outliers for 6 Cylinder Cars

In [33]:
autos[autos.cylinders == 6].mpg.quantile( 0.75 )
Out[33]:
21.0
In [34]:
import scipy.stats as sts
In [35]:
sts.iqr( autos[autos.cylinders == 6].mpg )
Out[35]:
3.0

Extreme outliers: 75th percentile + 3 * IQR

In [36]:
outlier = autos[autos.cylinders == 6].mpg.quantile( 0.75 ) + 3 * sts.iqr( autos[autos.cylinders == 6].mpg )
In [37]:
outlier
Out[37]:
30.0
In [38]:
autos[autos.cylinders == 6][ autos[autos.cylinders == 6].mpg > outlier ]
Out[38]:
      mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin  name
333  32.7          6         168.0       132.0    2910          11.4    80       3  datsun 280-zx
360  30.7          6         145.0        76.0    3160          19.6    81       2  volvo diesel
386  38.0          6         262.0        85.0    3015          17.0    82       1  oldsmobile cutlass ciera (diesel)
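
The extreme-outlier rule used above (75th percentile + 3 * IQR) can be sketched with pandas alone. The mpg values here are made up, with 40.0 planted as an obvious extreme; `Series.quantile` uses linear interpolation by default, which should agree with the default behaviour of `scipy.stats.iqr`.

```python
import pandas as pd

# Made-up mpg values for one group of cars; 40.0 is planted as an extreme value.
mpg = pd.Series([17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 40.0])

q1 = mpg.quantile(0.25)
q3 = mpg.quantile(0.75)
iqr = q3 - q1                      # interquartile range

# Upper fence for extreme outliers: 75th percentile + 3 * IQR.
upper_fence = q3 + 3 * iqr

# Keep only the observations above the fence.
extreme = mpg[mpg > upper_fence]
```

Replacing the factor 3 with 1.5 gives the milder fence that box plots conventionally use for their whiskers.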

Note: Across all cars, the IQR (interquartile range) of mpg spans 17 to 29 mpg, so 50% of the cars lie in this range.

Creating scatter plots - weight vs. mpg

In [39]:
plt.scatter( autos.weight, autos.mpg );

Setting titles, x label, y label & saving image to file

In [40]:
plt.scatter( autos.weight, autos.mpg )
plt.title("Autos Mpg Vs. Weight")
plt.xlabel('weight', fontsize=18)
plt.ylabel('mpg', fontsize=16)
plt.savefig('test.jpg')
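
The same labelling and saving steps can also be written with matplotlib's object-oriented interface, which keeps the figure and axes in explicit variables. This is only a sketch: the data points are made up, the `Agg` backend line is there so the code runs outside a notebook, and the image is written to an in-memory buffer rather than to a file.

```python
import io

import matplotlib
matplotlib.use("Agg")            # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Made-up weight (lbs) and mpg pairs, only to have something to plot.
weight = [2100, 2600, 3000, 3500, 4200]
mpg = [33.0, 27.5, 23.0, 18.5, 14.0]

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(weight, mpg)
ax.set_title("Autos Mpg Vs. Weight")
ax.set_xlabel("weight", fontsize=18)
ax.set_ylabel("mpg", fontsize=16)

# Write a PNG into an in-memory buffer; passing a file name works the same way.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Inside a notebook with `%matplotlib inline`, the `matplotlib.use("Agg")` line can be left out.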

Joint plots

In [41]:
# recent seaborn versions expect x and y as keywords, and height instead of size
sn.jointplot( x = autos.mpg, y = autos.weight, height = 6 );

Multivariate distribution plot

In [26]:
sn.jointplot(x="mpg", y="acceleration", data=autos, kind="kde");

Weight vs. mpg for different number of cylinders in cars

In [42]:
sn.lmplot(x="weight", y="mpg", data= autos, fit_reg = True, height = 6 )
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x11b2c6160>

Visualizing the correlation between more than 2 variables, pair-wise at same time.

In [43]:
autos_stats = autos[['mpg', 'displacement',
                   'weight', 'acceleration']]
In [44]:
plt.figure( figsize = (6,6));
sn.pairplot( autos_stats );
<matplotlib.figure.Figure at 0x11b2dceb8>

Finding correlations

In [45]:
autos_stats.corr()
Out[45]:
                       mpg  displacement        weight  acceleration
mpg               1.000000     -0.805127     -0.832244      0.423329
displacement     -0.805127      1.000000      0.932994     -0.543800
weight           -0.832244      0.932994      1.000000     -0.416839
acceleration      0.423329     -0.543800     -0.416839      1.000000
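
Each entry in the matrix above is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a sketch on made-up numbers, the coefficient is the covariance of two columns divided by the product of their standard deviations:

```python
import pandas as pd

# Made-up values in which mpg falls as weight rises.
toy = pd.DataFrame({
    "mpg":    [33.0, 27.5, 23.0, 18.5, 14.0, 16.0],
    "weight": [2100.0, 2600.0, 3000.0, 3500.0, 4200.0, 3900.0],
})

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample (ddof=1) estimates.
x, y = toy["mpg"], toy["weight"]
r_manual = x.cov(y) / (x.std() * y.std())

# The same coefficient read off the correlation matrix.
r_pandas = toy.corr().loc["mpg", "weight"]
```

A value close to -1, like the -0.83 between mpg and weight above, means that one variable tends to fall almost linearly as the other rises.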

Creating a heatmap to depict correlations

In [46]:
sn.heatmap( autos_stats.corr(), annot = True )
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bb085f8>

Comparing distribution of variables together

In [58]:
autos_subset_df = autos[['mpg', 'displacement', 'weight', 'acceleration', 'origin']]
auto_melt_df = pd.melt(autos_subset_df, "origin", var_name="measures")
auto_melt_df.sample( 10 )
Out[58]:
      origin  measures       value
 253       1  mpg             25.1
1003       1  weight        3880.0
1390       3  acceleration    18.5
 833       2  weight        2123.0
 743       3  displacement   108.0
  36       1  mpg             18.0
1466       1  acceleration    13.0
 869       1  weight        3672.0
 350       3  mpg             33.7
 348       1  mpg             29.9
In [63]:
sn.violinplot(x="measures", y="value", hue="origin", data=auto_melt_df)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ba88940>
In [67]:
autos_subset_df = autos[['mpg', 'displacement', 'weight', 'acceleration', 'origin']]
autos_subset_df = autos_subset_df.apply(lambda x:(x.astype(float) - min(x))/(max(x)-min(x)), axis = 0)
auto_melt_df = pd.melt(autos_subset_df, "origin", var_name="measures")
sn.swarmplot(x="measures", y="value", hue="origin", data=auto_melt_df);
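
The two transformations in the cell above, min-max scaling and reshaping from wide to long with `pd.melt`, can be sketched on a toy frame with made-up values. Note that in the cell above the lambda is applied to every column, so origin is rescaled too (1, 2, 3 become 0.0, 0.5, 1.0). The sketch below scales only the measure columns so that origin keeps its original codes, which is one possible variation rather than the original behaviour.

```python
import pandas as pd

# Toy stand-in for autos_subset_df (values are made up).
toy = pd.DataFrame({
    "mpg":    [18.0, 30.0, 24.0],
    "weight": [3500.0, 2100.0, 2800.0],
    "origin": [1, 3, 2],
})

measure_cols = ["mpg", "weight"]

# Min-max scaling maps each measure column onto [0, 1]: (x - min) / (max - min).
scaled = toy.copy()
scaled[measure_cols] = (toy[measure_cols] - toy[measure_cols].min()) / (
    toy[measure_cols].max() - toy[measure_cols].min()
)

# Wide to long: one row per (car, measure) pair, with origin kept as an id column.
long_df = pd.melt(scaled, id_vars="origin", var_name="measures")
```

Scaling first puts measures with very different units (mpg versus pounds) on a common axis, which is what makes a single swarm or violin plot readable.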

Do American cars have a different mpg than Japanese cars?

Hypothesis Test: 1

  • Null Hypothesis: average mpg for American cars = average mpg for Japanese cars.
  • Alternative Hypothesis: average mpg for American cars ≠ average mpg for Japanese cars.
In [47]:
from scipy import stats
In [48]:
stats.ttest_ind( autos[ autos.origin == 1]["mpg"],
              autos[ autos.origin == 3 ]["mpg"],
              equal_var=True)
Out[48]:
Ttest_indResult(statistic=-12.664889006229084, pvalue=4.1728371467655198e-30)

Conclusion: Since the p-value is less than 0.05, we reject the null hypothesis: the average mpg of American cars differs from the average mpg of Japanese cars.
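
For intuition about the statistic reported above, the pooled-variance (equal variance) two-sample t statistic can be computed by hand. The sketch below uses two small made-up samples in place of the mpg columns and plain numpy; it illustrates the formula behind the `equal_var=True` test and is not a replacement for `stats.ttest_ind`, which also returns the p-value.

```python
import numpy as np

# Two made-up samples standing in for the mpg values of two origins.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n_a, n_b = len(a), len(b)

# Pooled variance: the sample variances weighted by their degrees of freedom.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# t = difference in means / standard error of that difference.
t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Degrees of freedom used to look up the p-value in the t distribution.
dof = n_a + n_b - 2
```

A negative statistic, as in Out[48], simply means the first sample has the lower mean; the two-sided p-value depends only on its magnitude and the degrees of freedom.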