This tutorial gives a quick hands on overview of how to use pandas APIs to load and apply data munging operations on a dataset.

Importing pandas and numpy library

In [1]:
import pandas as pd

Read the dataset, which is in csv format

  • Pandas as many read_* functions to read data from multiples data sources or formats like json, jdbc, excel, pickel (python serialized objects) etc.
In [2]:
titanic_data = pd.read_csv("titanic.csv")
  • head() also takes an argument n, which specifies how many records will be printed.
In [3]:
titanic_data.head( 3 )
Out[3]:
row.names pclass survived name age embarked home.dest room ticket boat sex
0 1 1st 1 Allen, Miss Elisabeth Walton 29.0 Southampton St Louis, MO B-5 24160 L221 2 female
1 2 1st 0 Allison, Miss Helen Loraine 2.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female
2 3 1st 0 Allison, Mr Hudson Joshua Creighton 30.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN (135) male

Check dataset dimensions.. how many row and columns?

In [4]:
titanic_data.shape
Out[4]:
(1313, 11)

List column names

In [5]:
titanic_data.columns
Out[5]:
Index(['row.names', 'pclass', 'survived', 'name', 'age', 'embarked',
     'home.dest', 'room', 'ticket', 'boat', 'sex'],
    dtype='object')
In [6]:
titanic_data.dtypes
Out[6]:
row.names      int64
pclass        object
survived       int64
name          object
age          float64
embarked      object
home.dest     object
room          object
ticket        object
boat          object
sex           object
dtype: object

More information with info() command

How many total entries? What are the columns and their types. Each column has how many not-null values?

In [7]:
titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313 non-null int64
pclass       1313 non-null object
survived     1313 non-null int64
name         1313 non-null object
age          633 non-null float64
embarked     821 non-null object
home.dest    754 non-null object
room         77 non-null object
ticket       69 non-null object
boat         347 non-null object
sex          1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 112.9+ KB

Select Specific columns and rows

In [8]:
titanic_data['survived'][0:5]
## or titanic_data.survived
Out[8]:
0    1
1    0
2    0
3    0
4    1
Name: survived, dtype: int64
In [9]:
titanic_data.survived[0:5]
Out[9]:
0    1
1    0
2    0
3    0
4    1
Name: survived, dtype: int64

Selecting multiple columns. Multiple column names should be provided as a list

In [10]:
titanic_data[['survived','age']][0:5]
Out[10]:
survived age
0 1 29.0000
1 0 2.0000
2 0 30.0000
3 0 25.0000
4 1 0.9167

How many people survived and what is the percentage?

  • value_counts() provides occurance of each unique values in the column.
In [11]:
titanic_data['survived'].value_counts()
Out[11]:
0    864
1    449
Name: survived, dtype: int64
In [12]:
titanic_data['survived'].value_counts(normalize=True) * 100
Out[12]:
0    65.803503
1    34.196497
Name: survived, dtype: float64

Cross tabulation between gender and survived?

In [13]:
pd.crosstab( titanic_data.sex, titanic_data.survived )
Out[13]:
survived 0 1
sex
female 156 307
male 708 142

Normalize by rows

In [14]:
pd.crosstab( titanic_data.sex, titanic_data.survived, normalize = "index" )
Out[14]:
survived 0 1
sex
female 0.336933 0.663067
male 0.832941 0.167059

Normalize by columns

In [15]:
pd.crosstab( titanic_data.sex, titanic_data.survived, normalize = "columns" )
Out[15]:
survived 0 1
sex
female 0.180556 0.683742
male 0.819444 0.316258

Normalize all

In [16]:
pd.crosstab( titanic_data.sex, titanic_data.survived, normalize = "all" )
Out[16]:
survived 0 1
sex
female 0.118812 0.233816
male 0.539223 0.108149

Filtering records based on a condition

How many children survived who are less than 5 years old?

In [17]:
below_5_years = titanic_data[ titanic_data.age <= 5 ]
In [18]:
below_5_years[0:3]
Out[18]:
row.names pclass survived name age embarked home.dest room ticket boat sex
1 2 1st 0 Allison, Miss Helen Loraine 2.0000 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female
4 5 1st 1 Allison, Master Hudson Trevor 0.9167 Southampton Montreal, PQ / Chesterville, ON C22 NaN 11 male
86 87 1st 1 Dodge, Master Washington 4.0000 Southampton San Francisco, CA NaN NaN 5/7 male

How many children below 5 years age were on board the ship?

In [19]:
len( below_5_years )
Out[19]:
29
In [20]:
titanic_data[ titanic_data.age <= 5 ]["survived"].value_counts( )
Out[20]:
1    24
0     5
Name: survived, dtype: int64

How many of those children survive?

In [21]:
titanic_data[ titanic_data.age <= 5 ]["survived"].value_counts( normalize = True )
Out[21]:
1    0.827586
0    0.172414
Name: survived, dtype: float64
In [22]:
titanic_data.columns
Out[22]:
Index(['row.names', 'pclass', 'survived', 'name', 'age', 'embarked',
     'home.dest', 'room', 'ticket', 'boat', 'sex'],
    dtype='object')

Filtering records based on string comparison

In [23]:
titanic_data[titanic_data.name.str.contains( "Allen") ]
Out[23]:
row.names pclass survived name age embarked home.dest room ticket boat sex
0 1 1st 1 Allen, Miss Elisabeth Walton 29.0 Southampton St Louis, MO B-5 24160 L221 2 female
337 338 2nd 1 Becker, Mrs Allen Oliver (Nellie E. Baumgardner) 36.0 Southampton Guntur, India / Benton Harbour, MI NaN 230136 L39 11 female
620 621 3rd 0 Allen, Mr William Henry 35.0 Southampton Lower Clapton, Middlesex or Erdington, Birmingham NaN NaN NaN male

Get unique values for a column

How many embark points were there? dataframe.unique() lists unique values of the column

In [24]:
titanic_data.embarked.unique()
Out[24]:
array(['Southampton', 'Cherbourg', nan, 'Queenstown'], dtype=object)

Working with NA values

Count and drop NA Values

In [25]:
titanic_data.embarked.unique()
Out[25]:
array(['Southampton', 'Cherbourg', nan, 'Queenstown'], dtype=object)

How many NA Values?

In [26]:
titanic_data.embarked.value_counts(dropna=False)
Out[26]:
Southampton    573
NaN            492
Cherbourg      203
Queenstown      45
Name: embarked, dtype: int64
  • But lot of entries contain NA as embarked port. Can we drop the NA values? dropna() drops all the null values.
In [27]:
titanic_data.embarked.dropna().unique()
Out[27]:
array(['Southampton', 'Cherbourg', 'Queenstown'], dtype=object)

Dropping NA Values: Strategies?

Remove rows where there are NA values in any of the columns

In [28]:
clean_titanic_data = titanic_data.dropna()
In [29]:
len( clean_titanic_data )
Out[29]:
20

Remove rows where there are NA values in all columns

In [30]:
clean_titanic_data = titanic_data.dropna( how = "all" )
len( clean_titanic_data )
Out[30]:
1313

Remove rows where there are NA values in specific columns

In [31]:
clean_titanic_data = titanic_data.dropna( subset = ['age','pclass'], how = 'any' )
len( clean_titanic_data )
Out[31]:
633

Remove columns where all the values are NAs

In [32]:
clean_titanic_data = titanic_data.dropna( axis = 1, how = "any" )
clean_titanic_data.shape
Out[32]:
(1313, 5)
In [33]:
clean_titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 5 columns):
row.names    1313 non-null int64
pclass       1313 non-null object
survived     1313 non-null int64
name         1313 non-null object
sex          1313 non-null object
dtypes: int64(2), object(3)
memory usage: 51.4+ KB

Rename a column

  • The first column name is row.names. We can change it to rownum. As it is mostly a unique number.
In [34]:
titanic_data.head( 2)
Out[34]:
row.names pclass survived name age embarked home.dest room ticket boat sex
0 1 1st 1 Allen, Miss Elisabeth Walton 29.0 Southampton St Louis, MO B-5 24160 L221 2 female
1 2 1st 0 Allison, Miss Helen Loraine 2.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female
In [35]:
new_titanic = titanic_data.rename( columns = { 'row.names': 'rownum' },
                  inplace = True  )
In [36]:
titanic_data.head(2)
Out[36]:
rownum pclass survived name age embarked home.dest room ticket boat sex
0 1 1st 1 Allen, Miss Elisabeth Walton 29.0 Southampton St Louis, MO B-5 24160 L221 2 female
1 2 1st 0 Allison, Miss Helen Loraine 2.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female

Indexing and Selecting

  • Select first n rows and all the columns
In [37]:
first_10 = titanic_data[0:3]
In [38]:
first_10
Out[38]:
rownum pclass survived name age embarked home.dest room ticket boat sex
0 1 1st 1 Allen, Miss Elisabeth Walton 29.0 Southampton St Louis, MO B-5 24160 L221 2 female
1 2 1st 0 Allison, Miss Helen Loraine 2.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female
2 3 1st 0 Allison, Mr Hudson Joshua Creighton 30.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN (135) male
  • titanic_data[0:10] is same as titanic_data[:10]

Slicing a dataframe

  • Select only first 3 columns of first 4 rows
In [39]:
first_10_3 = titanic_data.iloc[0:4,0:3]
In [40]:
first_10_3
Out[40]:
rownum pclass survived
0 1 1st 1
1 2 1st 0
2 3 1st 0
3 4 1st 0

How to access last rows

In [41]:
## Accesising last row
titanic_data[-1:]
Out[41]:
rownum pclass survived name age embarked home.dest room ticket boat sex
1312 1313 3rd 0 Zimmerman, Leo NaN NaN NaN NaN NaN NaN male
In [42]:
last_10 = titanic_data[-4:]
In [43]:
last_10
Out[43]:
rownum pclass survived name age embarked home.dest room ticket boat sex
1309 1310 3rd 0 Zakarian, Mr Maprieder NaN NaN NaN NaN NaN NaN male
1310 1311 3rd 0 Zenn, Mr Philip NaN NaN NaN NaN NaN NaN male
1311 1312 3rd 0 Zievens, Rene NaN NaN NaN NaN NaN NaN female
1312 1313 3rd 0 Zimmerman, Leo NaN NaN NaN NaN NaN NaN male

Selecting rows and columns and applying a filtering criteria

Applying multiple filtering criteria

Only age, sex and pclass of passengers who have survived

  • Combining multiple filtering and slicing to dataframes simulteneously
In [44]:
titanic_data[ ( titanic_data.survived == 1 ) &
           ( titanic_data.age <= 5 ) ][['age',
                                        'sex',
                                        'pclass']][0:5]
Out[44]:
age sex pclass
4 0.9167 male 1st
86 4.0000 male 1st
338 4.0000 female 2nd
339 1.0000 male 2nd
358 0.8333 male 2nd

Only age, sex, survived and pclass of passengers whose age are not known

In [45]:
titanic_data[ titanic_data.age.isnull() ][['age',
                                         'survived',
                                         'sex',
                                         'pclass']][0:5]
Out[45]:
age survived sex pclass
12 NaN 1 female 1st
13 NaN 1 male 1st
14 NaN 0 male 1st
29 NaN 0 male 1st
32 NaN 1 male 1st

Only age, sex, survived and pclass of passengers whose age are known

In [46]:
titanic_data[ -titanic_data.age.isnull() ][['age',
                                          'survived',
                                          'sex',
                                          'pclass']][0:5]
Out[46]:
age survived sex pclass
0 29.0000 1 female 1st
1 2.0000 0 female 1st
2 30.0000 0 male 1st
3 25.0000 0 female 1st
4 0.9167 1 male 1st

Add a new column and map values of an existing column

In [47]:
titanic_data["gender"] = titanic_data.sex.map( lambda x:
                                            int( x == 'male') )
In [48]:
titanic_data.head( 3 )
Out[48]:
rownum pclass survived name age embarked home.dest room ticket boat sex gender
0 1 1st 1 Allen, Miss Elisabeth Walton 29.0 Southampton St Louis, MO B-5 24160 L221 2 female 0
1 2 1st 0 Allison, Miss Helen Loraine 2.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female 0
2 3 1st 0 Allison, Mr Hudson Joshua Creighton 30.0 Southampton Montreal, PQ / Chesterville, ON C26 NaN (135) male 1

Remove a column from dataframe

In [58]:
titanic_data.drop( "rownum", inplace = True, axis = 1 )

Grouping and Aggregating

In [50]:
titanic_data.groupby( 'pclass' )['age'].mean()
Out[50]:
pclass
1st    39.667773
2nd    28.300314
3rd    24.519658
Name: age, dtype: float64
In [51]:
pclass_age_mean_df = titanic_data.groupby( 'pclass' )['age'].mean().reset_index()
pclass_age_mean_df
Out[51]:
pclass age
0 1st 39.667773
1 2nd 28.300314
2 3rd 24.519658
In [52]:
pclass_gender_age_mean_df = titanic_data.groupby( ['pclass', 'sex'] )['age'].mean().reset_index()
pclass_gender_age_mean_df
Out[52]:
pclass sex age
0 1st female 37.772277
1 1st male 41.199334
2 2nd female 27.388235
3 2nd male 28.910761
4 3rd female 22.564328
5 3rd male 25.327294
In [53]:
pclass_gender_survival_count_df = titanic_data.groupby( ['pclass', 'sex'] )['survived'].sum().reset_index()
pclass_gender_survival_count_df
Out[53]:
pclass sex survived
0 1st female 134
1 1st male 59
2 2nd female 94
3 2nd male 25
4 3rd female 79
5 3rd male 58
In [54]:
pclass_age_gender_survival_df = pclass_gender_age_mean_df.merge(
  pclass_gender_survival_count_df, on = ['pclass', 'sex'])
In [55]:
pclass_age_gender_survival_df
Out[55]:
pclass sex age survived
0 1st female 37.772277 134
1 1st male 41.199334 59
2 2nd female 27.388235 94
3 2nd male 28.910761 25
4 3rd female 22.564328 79
5 3rd male 25.327294 58

Sorting dataframes by column values

In [56]:
pclass_age_gender_survival_df.sort_values( 'survived')
Out[56]:
pclass sex age survived
3 2nd male 28.910761 25
5 3rd male 25.327294 58
1 1st male 41.199334 59
4 3rd female 22.564328 79
2 2nd female 27.388235 94
0 1st female 37.772277 134
In [57]:
pclass_age_gender_survival_df.sort_values( 'survived', ascending = False)
Out[57]:
pclass sex age survived
0 1st female 37.772277 134
2 2nd female 27.388235 94
4 3rd female 22.564328 79
1 1st male 41.199334 59
5 3rd male 25.327294 58
3 2nd male 28.910761 25