Overview

  • This dataset contains a sample of beer brands. We analyze it to understand the types of beer being manufactured and the markets they target. This gives insight into the different market segments and which beers are present in each segment.
In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot, not the top-level package
import seaborn as sn
%matplotlib inline
In [11]:
beer = pd.read_csv( "beer.csv" )

Attribute Description

  • name - the beer brand
  • calories - calories per serving (values of 68 to 175 suggest a 12-ounce serving rather than a single ounce)
  • sodium
  • alcohol - alcohol percentage present
  • cost - in dollars
In [12]:
beer.head( 20 )
Out[12]:
name calories sodium alcohol cost
0 Budweiser 144 15 4.7 0.43
1 Schlitz 151 19 4.9 0.43
2 Lowenbrau 157 15 0.9 0.48
3 Kronenbourg 170 7 5.2 0.73
4 Heineken 152 11 5.0 0.77
5 Old_Milwaukee 145 23 4.6 0.28
6 Augsberger 175 24 5.5 0.40
7 Srohs_Bohemian_Style 149 27 4.7 0.42
8 Miller_Lite 99 10 4.3 0.43
9 Budweiser_Light 113 8 3.7 0.40
10 Coors 140 18 4.6 0.44
11 Coors_Light 102 15 4.1 0.46
12 Michelob_Light 135 11 4.2 0.50
13 Becks 150 19 4.7 0.76
14 Kirin 149 6 5.0 0.79
15 Pabst_Extra_Light 68 15 2.3 0.38
16 Hamms 139 19 4.4 0.43
17 Heilemans_Old_Style 144 24 4.9 0.43
18 Olympia_Goled_Light 72 6 2.9 0.46
19 Schlitz_Light 97 7 4.2 0.47

The k-means clustering algorithm must be told how many clusters, or segments, to create; k stands for the number of clusters. Techniques are available to estimate how many clusters might exist.

The sklearn library provides the KMeans algorithm. Initialize KMeans with the number of clusters (k) as an argument, then call fit() with a dataframe that contains the entities and the features to be clustered.
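
A minimal sketch of that call pattern on a made-up six-point array. The points, the variable names, and the random_state value are illustrative assumptions and are not part of the beer data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# n_clusters is k; random_state fixes the seed so repeated runs
# produce the same labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(points)

print(model.labels_)           # one cluster id per row
print(model.cluster_centers_)  # one centroid per cluster
```

The cluster ids themselves are arbitrary: which group is called 0 and which is called 1 can change with the seed, so compare memberships rather than id values between runs.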

For now, let's assume that 3 segments exist. A little later, we will discuss how to find the optimal number of clusters.

Segregate the brands into 3 segments

  • We will build 3 clusters using k-means
In [13]:
from sklearn.cluster import KMeans
In [14]:
beer.columns
Out[14]:
Index(['name', 'calories', 'sodium', 'alcohol', 'cost'], dtype='object')
In [15]:
X = beer[['calories', 'sodium', 'alcohol', 'cost']]
clusters = KMeans(3)  # 3 clusters
clusters.fit( X )
Out[15]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
  n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
  random_state=None, tol=0.0001, verbose=0)
In [16]:
clusters.labels_
Out[16]:
array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 2, 1, 1, 2, 0], dtype=int32)
In [18]:
beer["cluster_id"] = clusters.labels_
In [19]:
beer.head( 20 )
Out[19]:
name calories sodium alcohol cost cluster_id
0 Budweiser 144 15 4.7 0.43 1
1 Schlitz 151 19 4.9 0.43 1
2 Lowenbrau 157 15 0.9 0.48 1
3 Kronenbourg 170 7 5.2 0.73 1
4 Heineken 152 11 5.0 0.77 1
5 Old_Milwaukee 145 23 4.6 0.28 1
6 Augsberger 175 24 5.5 0.40 1
7 Srohs_Bohemian_Style 149 27 4.7 0.42 1
8 Miller_Lite 99 10 4.3 0.43 0
9 Budweiser_Light 113 8 3.7 0.40 0
10 Coors 140 18 4.6 0.44 1
11 Coors_Light 102 15 4.1 0.46 0
12 Michelob_Light 135 11 4.2 0.50 1
13 Becks 150 19 4.7 0.76 1
14 Kirin 149 6 5.0 0.79 1
15 Pabst_Extra_Light 68 15 2.3 0.38 2
16 Hamms 139 19 4.4 0.43 1
17 Heilemans_Old_Style 144 24 4.9 0.43 1
18 Olympia_Goled_Light 72 6 2.9 0.46 2
19 Schlitz_Light 97 7 4.2 0.47 0

Verifying the cluster centers

In [20]:
clusters.cluster_centers_
Out[20]:
array([[ 102.75      ,   10.        ,    4.075     ,    0.44      ],
     [ 150.        ,   17.        ,    4.52142857,    0.52071429],
     [  70.        ,   10.5       ,    2.6       ,    0.42      ]])

Conclusion:

  • It can be observed that the segments are mostly based on calories: high, medium, and low. This is because the scale of calories is much larger than the scale of the other parameters, so calories dominate the distances that k-means works with. We therefore need to scale all parameters before clustering.
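
To see how strongly calories dominate, here is a small worked example using two rows from the table above (Budweiser and Kronenbourg). It splits the squared Euclidean distance between them into per-feature shares; the variable names are illustrative:

```python
import numpy as np

features = ["calories", "sodium", "alcohol", "cost"]

# Rows 0 and 3 of the table above.
budweiser = np.array([144, 15, 4.7, 0.43])
kronenbourg = np.array([170, 7, 5.2, 0.73])

# Squared difference per feature, and each feature's share of the total.
squared_diff = (budweiser - kronenbourg) ** 2
share = squared_diff / squared_diff.sum()

for name, s in zip(features, share):
    print(f"{name:8s} {s:.3%}")
```

Calories account for roughly 91% of the squared distance. Cost contributes about 0.01%, even though Kronenbourg costs about 70% more than Budweiser.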
In [21]:
beer.drop( 'cluster_id', axis = 1, inplace = True )

Normalizing the features

In [22]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform( X )
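
As a rough check of what the scaler does, the sketch below standardizes the first five rows of the table by hand and compares the result with StandardScaler. Each column has its mean subtracted and is divided by its population standard deviation. The five-row sample is used only to keep the example short:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows of the table above: calories, sodium, alcohol, cost
sample = np.array([[144, 15, 4.7, 0.43],
                   [151, 19, 4.9, 0.43],
                   [157, 15, 0.9, 0.48],
                   [170,  7, 5.2, 0.73],
                   [152, 11, 5.0, 0.77]])

# z-score each column: subtract the column mean, then divide by the
# column's population standard deviation (numpy's default, ddof=0).
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

scaled = StandardScaler().fit_transform(sample)

print(np.allclose(manual, scaled))
print(manual.mean(axis=0).round(6))
print(manual.std(axis=0).round(6))
```

After this transformation every feature has mean 0 and unit variance, so no single column can dominate the distance calculation purely because of its units.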

Creating 3 Segments

In [23]:
clusters = KMeans(3)  # 3 clusters
clusters.fit( X_scaled )
Out[23]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
  n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
  random_state=None, tol=0.0001, verbose=0)
In [24]:
beer["cluster_new"] = clusters.labels_
In [25]:
beer
Out[25]:
name calories sodium alcohol cost cluster_new
0 Budweiser 144 15 4.7 0.43 1
1 Schlitz 151 19 4.9 0.43 1
2 Lowenbrau 157 15 0.9 0.48 0
3 Kronenbourg 170 7 5.2 0.73 2
4 Heineken 152 11 5.0 0.77 2
5 Old_Milwaukee 145 23 4.6 0.28 1
6 Augsberger 175 24 5.5 0.40 1
7 Srohs_Bohemian_Style 149 27 4.7 0.42 1
8 Miller_Lite 99 10 4.3 0.43 0
9 Budweiser_Light 113 8 3.7 0.40 0
10 Coors 140 18 4.6 0.44 1
11 Coors_Light 102 15 4.1 0.46 0
12 Michelob_Light 135 11 4.2 0.50 0
13 Becks 150 19 4.7 0.76 2
14 Kirin 149 6 5.0 0.79 2
15 Pabst_Extra_Light 68 15 2.3 0.38 0
16 Hamms 139 19 4.4 0.43 1
17 Heilemans_Old_Style 144 24 4.9 0.43 1
18 Olympia_Goled_Light 72 6 2.9 0.46 0
19 Schlitz_Light 97 7 4.2 0.47 0
In [26]:
beer.groupby('cluster_new').mean(numeric_only=True)  # skip the non-numeric 'name' column
Out[26]:
calories sodium alcohol cost
cluster_new
0 105.375 10.875 3.3250 0.4475
1 148.375 21.125 4.7875 0.4075
2 155.250 10.750 4.9750 0.7625

Even now the clusters are not very distinct. The clusters are mostly based on calories and alcohol percentage. This may be because we are looking for the wrong number of clusters.

Finding how many clusters might exist?

A dendrogram would help us determine the number of clusters.
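
As a rough sketch of the computation behind a dendrogram, the example below runs hierarchical (agglomerative) clustering with scipy on the 20 rows of the table. Average linkage with Euclidean distance is an assumption made here for illustration and is not taken from the notebook:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# calories, sodium, alcohol, cost for the 20 brands, in table order
data = np.array([
    [144, 15, 4.7, 0.43], [151, 19, 4.9, 0.43], [157, 15, 0.9, 0.48],
    [170,  7, 5.2, 0.73], [152, 11, 5.0, 0.77], [145, 23, 4.6, 0.28],
    [175, 24, 5.5, 0.40], [149, 27, 4.7, 0.42], [ 99, 10, 4.3, 0.43],
    [113,  8, 3.7, 0.40], [140, 18, 4.6, 0.44], [102, 15, 4.1, 0.46],
    [135, 11, 4.2, 0.50], [150, 19, 4.7, 0.76], [149,  6, 5.0, 0.79],
    [ 68, 15, 2.3, 0.38], [139, 19, 4.4, 0.43], [144, 24, 4.9, 0.43],
    [ 72,  6, 2.9, 0.46], [ 97,  7, 4.2, 0.47],
])
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

# Start with every brand as its own cluster and repeatedly merge
# the two closest clusters until only one remains.
merges = linkage(scaled, method="average", metric="euclidean")

# Each row of `merges` is (cluster_a, cluster_b, merge_distance, new_size).
# A large jump in merge_distance suggests a natural place to cut the tree.
print(np.round(merges[-6:, 2], 2))

# Cut the tree into at most 5 flat clusters (ids start at 1).
labels = fcluster(merges, t=5, criterion="maxclust")
print(labels)
```

Reading the last few merge distances from bottom to top is the numeric equivalent of looking for long vertical branches in the dendrogram.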

In [70]:
beer.drop( 'cluster_new', axis = 1, inplace = True )
In [71]:
cmap = sn.cubehelix_palette(as_cmap=True, rot=-.3, light=1)
g = sn.clustermap(X_scaled, cmap=cmap, linewidths=.5)

The dendrogram shows that there are 5 distinct clusters. So, we will create 5 clusters.
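
One way to cross-check a cluster count read off a dendrogram is to look at how the k-means inertia (the within-cluster sum of squared distances) falls as k grows, often called the elbow method. The sketch below is a swapped-in technique, not the notebook's own method. The range of k and the random_state value are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# calories, sodium, alcohol, cost for the 20 brands, in table order
data = np.array([
    [144, 15, 4.7, 0.43], [151, 19, 4.9, 0.43], [157, 15, 0.9, 0.48],
    [170,  7, 5.2, 0.73], [152, 11, 5.0, 0.77], [145, 23, 4.6, 0.28],
    [175, 24, 5.5, 0.40], [149, 27, 4.7, 0.42], [ 99, 10, 4.3, 0.43],
    [113,  8, 3.7, 0.40], [140, 18, 4.6, 0.44], [102, 15, 4.1, 0.46],
    [135, 11, 4.2, 0.50], [150, 19, 4.7, 0.76], [149,  6, 5.0, 0.79],
    [ 68, 15, 2.3, 0.38], [139, 19, 4.4, 0.43], [144, 24, 4.9, 0.43],
    [ 72,  6, 2.9, 0.46], [ 97,  7, 4.2, 0.47],
])
scaled = StandardScaler().fit_transform(data)

# Fit k-means for a range of k and record the inertia of each fit.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Inertia always shrinks as k increases, so the value of interest is the k after which the drop becomes small, not the minimum.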

In [72]:
clusters = KMeans(5)  # 5 clusters
clusters.fit( X_scaled )
beer["cluster_final"] = clusters.labels_
In [73]:
beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']]
Out[73]:
name calories sodium alcohol cost cluster_final
0 Budweiser 144 15 4.7 0.43 0
1 Schlitz 151 19 4.9 0.43 0
2 Lowenbrau 157 15 0.9 0.48 3
3 Kronenbourg 170 7 5.2 0.73 2
4 Heineken 152 11 5.0 0.77 2
5 Old_Milwaukee 145 23 4.6 0.28 0
6 Augsberger 175 24 5.5 0.40 0
7 Srohs_Bohemian_Style 149 27 4.7 0.42 0
8 Miller_Lite 99 10 4.3 0.43 1
9 Budweiser_Light 113 8 3.7 0.40 1
10 Coors 140 18 4.6 0.44 0
11 Coors_Light 102 15 4.1 0.46 1
12 Michelob_Light 135 11 4.2 0.50 1
13 Becks 150 19 4.7 0.76 2
14 Kirin 149 6 5.0 0.79 2
15 Pabst_Extra_Light 68 15 2.3 0.38 4
16 Hamms 139 19 4.4 0.43 0
17 Heilemans_Old_Style 144 24 4.9 0.43 0
18 Olympia_Goled_Light 72 6 2.9 0.46 4
19 Schlitz_Light 97 7 4.2 0.47 1
In [76]:
beer.groupby('cluster_final').mean(numeric_only=True)  # skip the non-numeric 'name' column
Out[76]:
calories sodium alcohol cost
cluster_final
0 148.375 21.125 4.7875 0.4075
1 109.200 10.200 4.1000 0.4520
2 155.250 10.750 4.9750 0.7625
3 157.000 15.000 0.9000 0.4800
4 70.000 10.500 2.6000 0.4200

Let's look at each segment one by one

In [55]:
beer_0 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 0]
In [56]:
beer_0
Out[56]:
name calories sodium alcohol cost cluster_final
0 Budweiser 144 15 4.7 0.43 0
1 Schlitz 151 19 4.9 0.43 0
5 Old_Milwaukee 145 23 4.6 0.28 0
6 Augsberger 175 24 5.5 0.40 0
7 Srohs_Bohemian_Style 149 27 4.7 0.42 0
10 Coors 140 18 4.6 0.44 0
16 Hamms 139 19 4.4 0.43 0
17 Heilemans_Old_Style 144 24 4.9 0.43 0
In [77]:
beer_1 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 1]
In [78]:
beer_1
Out[78]:
name calories sodium alcohol cost cluster_final
8 Miller_Lite 99 10 4.3 0.43 1
9 Budweiser_Light 113 8 3.7 0.40 1
11 Coors_Light 102 15 4.1 0.46 1
12 Michelob_Light 135 11 4.2 0.50 1
19 Schlitz_Light 97 7 4.2 0.47 1

Segment 1 appears to consist of light beers, probably made for people who drink regularly but are also health conscious and so want to keep calories and sodium low.

In [79]:
beer_1.mean(numeric_only=True)  # skip the non-numeric 'name' column
Out[79]:
calories         109.200
sodium            10.200
alcohol            4.100
cost               0.452
cluster_final      1.000
dtype: float64
In [80]:
beer_2 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 2]
In [81]:
beer_2
Out[81]:
name calories sodium alcohol cost cluster_final
3 Kronenbourg 170 7 5.2 0.73 2
4 Heineken 152 11 5.0 0.77 2
13 Becks 150 19 4.7 0.76 2
14 Kirin 149 6 5.0 0.79 2

Segment 2 consists of the expensive beers, with a mean cost of about 0.76 against 0.41 to 0.48 for the other segments. These are mostly aimed at the brand-sensitive segment.

In [82]:
beer_3 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 3]
In [83]:
beer_3
Out[83]:
name calories sodium alcohol cost cluster_final
2 Lowenbrau 157 15 0.9 0.48 3

Lowenbrau forms a cluster by itself. It has an extremely low alcohol content (0.9), which does not look right. It might be a simple recording error.

So, clustering is also a way to find anomalies in the data. We can remove this data point and create the clusters again.
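
A simple independent check on such a suspect point is a z-score on the feature in question. The sketch below uses the alcohol column from the table. The cut-off of 3 standard deviations is a common rule of thumb assumed here and does not come from the notebook:

```python
import numpy as np

# Alcohol percentage for the 20 brands, in table order (index 2 is Lowenbrau)
alcohol = np.array([4.7, 4.9, 0.9, 5.2, 5.0, 4.6, 5.5, 4.7, 4.3, 3.7,
                    4.6, 4.1, 4.2, 4.7, 5.0, 2.3, 4.4, 4.9, 2.9, 4.2])

# Distance of each value from the mean, in standard deviations.
z = (alcohol - alcohol.mean()) / alcohol.std()

# Assumed threshold: flag values more than 3 standard deviations out.
suspects = np.flatnonzero(np.abs(z) > 3)
print(suspects, z[suspects].round(2))
```

Only row 2 is flagged, which agrees with the single-member cluster found by k-means.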

In [86]:
beer_3.mean(numeric_only=True)  # skip the non-numeric 'name' column
Out[86]:
calories         157.00
sodium            15.00
alcohol            0.90
cost               0.48
cluster_final      3.00
dtype: float64
In [87]:
beer_4 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 4]
beer_4
Out[87]:
name calories sodium alcohol cost cluster_final
15 Pabst_Extra_Light 68 15 2.3 0.38 4
18 Olympia_Goled_Light 72 6 2.9 0.46 4

Segment 4 consists of extra light beers, which are low in calories and low in alcohol. These are made for people who do not drink regularly but may feel obliged to drink at social occasions or gatherings. This is not a very crowded market, but it also has fewer consumers.

Remove outlier

In [88]:
beer.drop( 2, axis = 0, inplace = True )
In [90]:
X = beer[['calories', 'sodium', 'alcohol', 'cost']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform( X )

Create new dendrogram

In [92]:
cmap = sn.cubehelix_palette(as_cmap=True, rot=-.3, light=1)
g = sn.clustermap(X_scaled, cmap=cmap, linewidths=.5)
In [93]:
clusters = KMeans(5)  # 5 clusters
clusters.fit( X_scaled )
beer["cluster_final"] = clusters.labels_
In [94]:
beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']]
Out[94]:
name calories sodium alcohol cost cluster_final
0 Budweiser 144 15 4.7 0.43 2
1 Schlitz 151 19 4.9 0.43 2
3 Kronenbourg 170 7 5.2 0.73 0
4 Heineken 152 11 5.0 0.77 0
5 Old_Milwaukee 145 23 4.6 0.28 4
6 Augsberger 175 24 5.5 0.40 4
7 Srohs_Bohemian_Style 149 27 4.7 0.42 4
8 Miller_Lite 99 10 4.3 0.43 1
9 Budweiser_Light 113 8 3.7 0.40 1
10 Coors 140 18 4.6 0.44 2
11 Coors_Light 102 15 4.1 0.46 1
12 Michelob_Light 135 11 4.2 0.50 1
13 Becks 150 19 4.7 0.76 0
14 Kirin 149 6 5.0 0.79 0
15 Pabst_Extra_Light 68 15 2.3 0.38 3
16 Hamms 139 19 4.4 0.43 2
17 Heilemans_Old_Style 144 24 4.9 0.43 4
18 Olympia_Goled_Light 72 6 2.9 0.46 3
19 Schlitz_Light 97 7 4.2 0.47 1
In [95]:
beer.groupby('cluster_final').mean(numeric_only=True)  # skip the non-numeric 'name' column
Out[95]:
calories sodium alcohol cost
cluster_final
0 155.25 10.75 4.975 0.7625
1 109.20 10.20 4.100 0.4520
2 143.50 17.75 4.650 0.4325
3 70.00 10.50 2.600 0.4200
4 153.25 24.50 4.925 0.3825

Let's look at each segment one by one

In [109]:
beer_0 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 0]
In [110]:
beer_0
Out[110]:
name calories sodium alcohol cost cluster_final
3 Kronenbourg 170 7 5.2 0.73 0
4 Heineken 152 11 5.0 0.77 0
13 Becks 150 19 4.7 0.76 0
14 Kirin 149 6 5.0 0.79 0
In [111]:
beer_1 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 1]
In [112]:
beer_1
Out[112]:
name calories sodium alcohol cost cluster_final
8 Miller_Lite 99 10 4.3 0.43 1
9 Budweiser_Light 113 8 3.7 0.40 1
11 Coors_Light 102 15 4.1 0.46 1
12 Michelob_Light 135 11 4.2 0.50 1
19 Schlitz_Light 97 7 4.2 0.47 1
In [113]:
beer_2 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 2]
beer_2
Out[113]:
name calories sodium alcohol cost cluster_final
0 Budweiser 144 15 4.7 0.43 2
1 Schlitz 151 19 4.9 0.43 2
10 Coors 140 18 4.6 0.44 2
16 Hamms 139 19 4.4 0.43 2
In [114]:
beer_3 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 3]
beer_3
Out[114]:
name calories sodium alcohol cost cluster_final
15 Pabst_Extra_Light 68 15 2.3 0.38 3
18 Olympia_Goled_Light 72 6 2.9 0.46 3
In [116]:
beer_4 = beer[['name', 'calories', 'sodium', 'alcohol', 'cost', 'cluster_final']][beer.cluster_final == 4]
beer_4
Out[116]:
name calories sodium alcohol cost cluster_final
5 Old_Milwaukee 145 23 4.6 0.28 4
6 Augsberger 175 24 5.5 0.40 4
7 Srohs_Bohemian_Style 149 27 4.7 0.42 4
17 Heilemans_Old_Style 144 24 4.9 0.43 4
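
The per-segment filtering above repeats the same expression once per cluster. A possible alternative, sketched here on a few rows copied from the final table, is to let groupby hand back one sub-frame per segment. The six-row frame and the `segments` name are only for illustration:

```python
import pandas as pd

# A few rows copied from the final table above, for illustration only.
beer = pd.DataFrame({
    "name": ["Kronenbourg", "Heineken", "Miller_Lite", "Coors_Light",
             "Pabst_Extra_Light", "Olympia_Goled_Light"],
    "calories": [170, 152, 99, 102, 68, 72],
    "sodium": [7, 11, 10, 15, 15, 6],
    "alcohol": [5.2, 5.0, 4.3, 4.1, 2.3, 2.9],
    "cost": [0.73, 0.77, 0.43, 0.46, 0.38, 0.46],
    "cluster_final": [0, 0, 1, 1, 3, 3],
})
numeric_cols = ["calories", "sodium", "alcohol", "cost"]

# groupby yields (cluster id, sub-frame) pairs, one per segment.
segments = {cid: frame for cid, frame in beer.groupby("cluster_final")}

for cid, frame in segments.items():
    print(f"Segment {cid}")
    print(frame.drop(columns="cluster_final").to_string(index=False))
    print(frame[numeric_cols].mean().round(3).to_dict())
```

This keeps working unchanged if the number of clusters changes, whereas the one-variable-per-segment approach needs a new cell for each cluster.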