Overview

Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups. (Source: Wikipedia)

Clustering is primarily used for exploratory data analysis and for business applications such as customer, product, and market segmentation.

In this tutorial we will explore a clustering technique called k-means and understand how it works.

  • Introduction to k-means clustering
  • Scaling of data before cluster analysis
  • Dendrogram to find out the optimal number of clusters
  • Other clustering techniques
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

Generate some random points

  • The points generated are clustered around 3 centers. This is to demonstrate clustering techniques.
In [2]:
from sklearn.datasets import make_blobs
In [3]:
X, y = make_blobs(n_samples=300, centers=3,
                random_state=0, cluster_std=0.60)
In [4]:
all_points = pd.concat( [pd.DataFrame( X ),
                       pd.DataFrame( y ) ],
                     axis = 1 )
In [5]:
all_points.columns = ["x1", "x2", "y"]
In [6]:
all_points.head()
Out[6]:
x1 x2 y
0 0.428577 4.973997 0
1 1.619909 0.067645 1
2 1.432893 4.376792 0
3 -1.578462 3.034458 2
4 -1.658629 2.267460 2

Draw the points on a graph to see how they are scattered

In [7]:
sn.lmplot( "x1", "x2", data=all_points, fit_reg=False, size = 5 )
Out[7]:
<seaborn.axisgrid.FacetGrid at 0xa56f278>

Can a clustering algorithm group them by how near they are to each other?

Using the k-means clustering technique

  • k-means computes the Euclidean distance between each point and the cluster centers, assigns each point to its nearest center, then recomputes each center as the mean of its assigned points, repeating until the assignments stabilize (a minimal sketch of this loop follows below).
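To make that loop concrete, here is a minimal NumPy sketch of the assign/update iteration (kmeans_sketch is an illustrative helper written for this tutorial, not part of sklearn; the real KMeans adds k-means++ seeding, multiple restarts, and convergence checks):

def kmeans_sketch(points, k, n_iter=10, seed=0):
    rng = np.random.RandomState(seed)
    # start from k randomly chosen points as the initial centers
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins its nearest center (Euclidean)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        centers = np.array([points[labels == j].mean(axis=0)
                            for j in range(k)])
    return centers, labels

For example, kmeans_sketch(all_points[["x1", "x2"]].values, 3) should recover centers close to the three blob centers.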
In [8]:
from sklearn.cluster import KMeans
In [9]:
X = all_points[["x1", "x2"]]
clusters = KMeans(3)  # 3 clusters
clusters.fit( X )
Out[9]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
  n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
  verbose=0)

k-means clustering has figured out the cluster centers and assigned each point to one of them.
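Besides the centers and labels, the fitted model also exposes inertia_, the within-cluster sum of squared distances that k-means minimizes; lower values mean tighter clusters:

clusters.inertia_   # within-cluster sum of squares of the fitted model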

In [10]:
clusters.cluster_centers_
Out[10]:
array([[-1.60811992,  2.85881658],
     [ 1.95159369,  0.83467497],
     [ 0.95625704,  4.37226546]])
In [11]:
clusters.labels_
Out[11]:
array([2, 1, 2, 0, 0, 0, 1, 2, 0, 0, 1, 1, 1, 2, 1, 0, 2, 2, 0, 1, 0, 2, 1,
     2, 0, 0, 2, 0, 1, 1, 0, 2, 2, 1, 1, 0, 1, 0, 2, 1, 0, 1, 2, 1, 1, 0,
     1, 0, 0, 1, 0, 1, 0, 0, 1, 2, 2, 0, 0, 2, 1, 1, 2, 0, 1, 0, 2, 1, 2,
     1, 0, 0, 0, 0, 1, 2, 1, 0, 2, 2, 0, 2, 1, 2, 2, 2, 1, 0, 2, 2, 0, 1,
     0, 2, 1, 1, 2, 1, 0, 2, 1, 0, 2, 1, 2, 2, 0, 2, 1, 1, 2, 0, 2, 2, 0,
     0, 2, 2, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 2, 0, 0, 1, 0, 2,
     0, 0, 1, 2, 1, 2, 0, 0, 2, 0, 0, 1, 2, 0, 2, 1, 0, 0, 1, 1, 2, 1, 2,
     2, 1, 2, 0, 2, 2, 2, 2, 0, 1, 2, 0, 1, 1, 1, 2, 1, 2, 2, 1, 0, 2, 2,
     2, 2, 1, 0, 2, 0, 2, 2, 1, 1, 0, 2, 1, 0, 2, 0, 1, 0, 2, 0, 1, 0, 2,
     0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 2, 0, 1, 2, 1, 1, 1, 2, 0, 0, 2, 2, 0,
     2, 1, 1, 2, 1, 0, 0, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 2, 1, 1, 2, 1,
     1, 0, 2, 1, 0, 2, 2, 0, 2, 0, 0, 2, 0, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2,
     0, 0, 1, 2, 2, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 2, 2, 0, 1, 2,
     0])
In [12]:
all_points["clusterid_1"] = clusters.labels_
In [13]:
all_points.head()
Out[13]:
x1 x2 y clusterid_1
0 0.428577 4.973997 0 2
1 1.619909 0.067645 1 1
2 1.432893 4.376792 0 2
3 -1.578462 3.034458 2 0
4 -1.658629 2.267460 2 0

We can verify whether the clustering was done properly by coloring the points according to the cluster they were assigned to.

In [14]:
sn.lmplot( "x1", "x2", data=all_points,
        hue = "clusterid_1",
        fit_reg=False, size = 5 )
Out[14]:
<seaborn.axisgrid.FacetGrid at 0xb0405c0>

How well were the points clustered?

In [36]:
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(all_points.y, all_points.clusterid_1)
Out[36]:
0.6451340375210516
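The adjusted Rand index compares the k-means labels against the true blob labels, correcting for chance: 1.0 is a perfect match and values near 0 indicate a random-looking assignment. The score only looks at the grouping, not the label values, so a relabelled but otherwise identical partition still scores perfectly:

# Same partition, different cluster ids: still a perfect score
adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])   # -> 1.0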

Does the scale of dimensions impact the clustering?

In [37]:
all_points["x1"] = all_points.x1 * 100
In [38]:
all_points.head()
Out[38]:
x1 x2 y clusterid_1
0 4285.767433 4.973997 0 2
1 16199.090944 0.067645 1 1
2 14328.927136 4.376792 0 2
3 -15784.624734 3.034458 2 0
4 -16586.286302 2.267460 2 0
In [39]:
X = all_points[["x1", "x2"]]
clusters = KMeans(3)  # 3 clusters
clusters.fit( X )
Out[39]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
  n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
  verbose=0)
In [40]:
all_points["clusterid_2"] = clusters.labels_
sn.lmplot( "x1", "x2", data=all_points,
        hue = "clusterid_2",
        fit_reg=False, size = 5 )
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x58f1160>

If the dimensions have different scales, the clustering can be distorted: the Euclidean distances will be dominated by the dimension with the larger scale.

So, before clustering, the variables need to be scaled or standardized.
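A quick numeric check with two made-up points shows why: once one dimension is inflated, the other contributes almost nothing to the Euclidean distance.

a = np.array([100.0, 0.0])
b = np.array([200.0, 5.0])
np.linalg.norm(a - b)   # ~100.12: the difference of 5 in x2 barely registers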

Scale the dimensions to remove the impact

In [41]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform( X )
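StandardScaler performs z-score standardization: each column is centered on its mean and divided by its standard deviation, so both dimensions carry equal weight in the distance. A hand-rolled sketch of the same transform (sklearn uses the population standard deviation, hence ddof=0):

X_manual = (X - X.mean()) / X.std(ddof=0)   # ddof=0 to match StandardScaler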
In [42]:
clusters = KMeans(3)  # 3 clusters
clusters.fit( X_scaled )
all_points["clusterid_3"] = clusters.labels_
sn.lmplot( "x1", "x2", data=all_points,
        hue = "clusterid_3",
        fit_reg=False, size = 5 )
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x59aad68>

Can k-means work if the clusters are not well segregated? What if the clusters are interspersed?

In [43]:
from sklearn import datasets
moon_data = datasets.make_moons(n_samples=1000, noise=.05)
In [44]:
X, y = moon_data
In [45]:
moon_points = pd.DataFrame( X )
In [46]:
moon_points.columns = ["x1", "x2"]
In [47]:
moon_points["y"] = y
In [48]:
moon_points.head()
Out[48]:
x1 x2 y
0 -0.047878 0.284204 1
1 1.898842 -0.081369 1
2 -0.220682 0.886735 0
3 0.061936 0.162546 1
4 1.873130 -0.041746 1
In [49]:
sn.lmplot( "x1", "x2", data=moon_points, fit_reg=False, size = 5 )
Out[49]:
<seaborn.axisgrid.FacetGrid at 0x5a1f588>
In [50]:
moon_clusters = KMeans(2)  # 2 clusters
moon_clusters.fit( moon_points[["x1", "x2"]] )
moon_points["clusterid_1"] = moon_clusters.labels_
sn.lmplot( x="x1", y="x2", data=moon_points,
        hue = "clusterid_1",
        fit_reg=False, height = 5 )
Out[50]:
<seaborn.axisgrid.FacetGrid at 0x5a98908>

Using DBSCAN for density-based clustering

  • k-means splits the moons with a straight boundary because it simply assigns each point to the nearest center. DBSCAN instead grows clusters from dense neighbourhoods of points, so it can follow arbitrarily shaped clusters.

In [51]:
from sklearn.cluster import DBSCAN
In [52]:
moon_clusters = DBSCAN( eps=.2 )
moon_clusters.fit( moon_points[["x1", "x2"]] )
moon_points["clusterid_1"] = moon_clusters.labels_
sn.lmplot( "x1", "x2", data=moon_points,
        hue = "clusterid_1",
        fit_reg=False, size = 5 )
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x5b08828>
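Note that DBSCAN does not need the number of clusters up front: eps sets the neighbourhood radius and min_samples (default 5) the density threshold, and any point that belongs to no dense region is labelled -1, i.e. noise. A quick check on the fitted model:

np.unique(moon_clusters.labels_)      # e.g. array([-1, 0, 1]) if noise is present
(moon_clusters.labels_ == -1).sum()   # number of points flagged as noise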

Using DBSCAN for points in concentric circles

In [53]:
circle_data = datasets.make_circles(n_samples=1000, factor=.5,
                                    noise=.05)
X, y = circle_data
circle_points = pd.DataFrame( X )
circle_points.columns = ["x1", "x2"]
circle_points["y"] = y
sn.lmplot( x="x1", y="x2", data=circle_points, fit_reg=False, height = 5 )
Out[53]:
<seaborn.axisgrid.FacetGrid at 0x5b43080>
In [54]:
circle_clusters = KMeans(2)  # 2 clusters
circle_clusters.fit( circle_points[["x1", "x2"]] )
circle_points["clusterid_1"] = circle_clusters.labels_
sn.lmplot( x="x1", y="x2", data=circle_points,
        hue = "clusterid_1",
        fit_reg=False, height = 5 )
Out[54]:
<seaborn.axisgrid.FacetGrid at 0x5b7ee80>
In [55]:
circle_clusters = DBSCAN( eps=.2 )
circle_clusters.fit( circle_points[["x1", "x2"]] )
circle_points["clusterid_1"] = circle_clusters.labels_
sn.lmplot( "x1", "x2", data=circle_points,
        hue = "clusterid_1",
        fit_reg=False, size = 5 )
Out[55]:
<seaborn.axisgrid.FacetGrid at 0x5c80860>