Finding the optimal number of clusters

  • This tutorial explains how to find the optimal number of clusters in a given dataset using various techniques.
  • The techniques discussed here are
    1. Dendrogram
    2. Elbow method
    3. Silhouette score analysis
  • We will first load the data into a dataframe, scale the features, and create clusters. Various metrics are then calculated to validate the clustering and determine the optimal number of clusters.
In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sn
%matplotlib inline
In [16]:
beer = pd.read_csv( "beer.csv" )

Attribute Description

  • name - the beer brand
  • calories - calories per ounce
  • sodium - sodium content
  • alcohol - alcohol percentage
  • cost - cost in dollars
In [17]:
beer.head( 10 )
Out[17]:
name calories sodium alcohol cost
0 Budweiser 144 15 4.7 0.43
1 Schlitz 151 19 4.9 0.43
2 Lowenbrau 157 15 0.9 0.48
3 Kronenbourg 170 7 5.2 0.73
4 Heineken 152 11 5.0 0.77
5 Old_Milwaukee 145 23 4.6 0.28
6 Augsberger 175 24 5.5 0.40
7 Srohs_Bohemian_Style 149 27 4.7 0.42
8 Miller_Lite 99 10 4.3 0.43
9 Budweiser_Light 113 8 3.7 0.40

Using Dendrogram

  • A dendrogram shows the distance between any two observations in a dataset. The vertical axis measures the distance: the longer the branch, the larger the distance between the groups being merged.
  • The clustermap function in seaborn draws dendrograms over both the observations and the features, together with a heatmap. Here we are mostly interested in the dendrogram over the observations.
In [45]:
from sklearn.cluster import KMeans
In [46]:
beer.columns
Out[46]:
Index(['name', 'calories', 'sodium', 'alcohol', 'cost'], dtype='object')
In [47]:
X = beer[['calories', 'sodium', 'alcohol', 'cost']]

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform( X )
In [48]:
cmap = sn.cubehelix_palette(as_cmap=True, rot=-.3, light=1)
In [49]:
sn.clustermap(X_scaled, cmap=cmap, linewidths=.5)
Out[49]:
<seaborn.matrix.ClusterGrid at 0xcf8898>

Note:

  • From the diagram it can be observed that the number of clusters, or groups of observations, lies somewhere between 3 and 6. The exact number of clusters can then be chosen based on the business context.
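
For a standalone view of just the observation dendrogram (without the feature dendrogram and heatmap), a minimal sketch using scipy's hierarchical clustering utilities could look like the following. It is not part of the original notebook; it assumes the same beer.csv file and feature columns used above.

# Sketch (not in the original notebook): draw only the observation dendrogram
# with scipy, using the same features and scaling as above.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

beer = pd.read_csv( "beer.csv" )
X_scaled = StandardScaler().fit_transform( beer[['calories', 'sodium', 'alcohol', 'cost']] )

# Ward linkage merges, at each step, the two clusters whose union gives the
# smallest increase in within-cluster variance.
Z = linkage( X_scaled, method="ward" )

plt.figure( figsize=(12, 5) )
dendrogram( Z, labels=beer['name'].values, leaf_rotation=90 )
plt.ylabel( "Merge distance" )
plt.show()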

Elbow Analysis

  • This section is adapted from Wikipedia.
  • The elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a dataset.
  • Explained variance: this method looks at the percentage of variance explained as a function of the number of clusters. One should choose a number of clusters such that adding another cluster does not give much better modeling of the data.
  • If one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters add much information (explain a lot of variance), but at some point the marginal gain in explained variance drops, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
  • Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point to the center of its cluster.
  • In other words, its objective is to find:
$${\underset {\mathbf {S} }{\operatorname {arg\,min} }}\sum _{i=1}^{k}\sum _{\mathbf {x} \in S_{i}}\left\|\mathbf {x} -{\boldsymbol {\mu }}_{i}\right\|^{2}$$
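
For concreteness, the WCSS in the expression above is exactly the quantity that scikit-learn exposes as the inertia_ attribute of a fitted KMeans model. The small sketch below is not part of the original notebook; it assumes the X_scaled matrix created earlier and computes the WCSS by hand for one value of k.

# Sketch: compute the within-cluster sum of squares by hand and compare it
# with KMeans.inertia_ (assumes X_scaled from the cells above).
import numpy as np
from sklearn.cluster import KMeans

km = KMeans( n_clusters=3, random_state=42 ).fit( X_scaled )

wcss = sum( np.sum( (X_scaled[km.labels_ == i] - center) ** 2 )
            for i, center in enumerate( km.cluster_centers_ ) )

print( wcss, km.inertia_ )   # the two values should agree up to float precision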
In [51]:
cluster_range = range( 1, 20 )
cluster_errors = []

# Fit KMeans for each candidate number of clusters and record the
# within-cluster sum of squares (exposed by scikit-learn as inertia_).
for num_clusters in cluster_range:
  clusters = KMeans( n_clusters=num_clusters )
  clusters.fit( X_scaled )
  cluster_errors.append( clusters.inertia_ )
In [36]:
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
In [42]:
clusters_df[0:10]
Out[42]:
cluster_errors num_clusters
0 80.000000 1
1 51.459153 2
2 27.849901 3
3 17.843595 4
4 12.388815 5
5 9.570665 6
6 7.232908 7
7 5.918853 8
8 4.923308 9
9 3.644205 10
In [54]:
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
Out[54]:
[<matplotlib.lines.Line2D at 0xba56d68>]

Note:

  • The elbow diagram shows that the gain in explained variance drops off sharply between 3 and 5 clusters, so the optimal number of clusters could be either 4 or 5. The actual number of clusters chosen can finally be based on the business context and the convenience of dealing with that many segments or clusters.
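
The visual judgement from the elbow plot can be backed up numerically by looking at how much each additional cluster reduces the error. A small sketch, not part of the original notebook, assuming the clusters_df dataframe from above:

# Sketch: the drop in error obtained by adding each extra cluster
# (assumes clusters_df from the cells above).
clusters_df["error_drop"] = -clusters_df["cluster_errors"].diff()
print( clusters_df.head( 8 ) )
# The drop shrinks from about 10 (3 -> 4 clusters) to about 5.5 (4 -> 5) and
# below 3 afterwards, which is the flattening that produces the elbow.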

Silhouette Analysis

  • This section is adapted from Wikipedia.
  • The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
  • The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
  • If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
  • The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.

The silhouette score of an observation i is given below, where a(i) is the mean distance from i to the other observations in its own cluster, and b(i) is the smallest mean distance from i to the observations of any other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),b(i)\}}$$

This can also be written as:

$$s(i) = \begin{cases} 1-a(i)/b(i), & \mbox{if } a(i) < b(i) \\ 0, & \mbox{if } a(i) = b(i) \\ b(i)/a(i)-1, & \mbox{if } a(i) > b(i) \\ \end{cases}$$

From the above definition it is clear that the silhouette score always satisfies:

$$-1 \le s(i) \le 1$$

A score close to 1 means the observation is assigned to the correct cluster, and a score close to -1 means it is assigned to the wrong cluster. A score close to 0 means the point lies almost on the boundary between two clusters.
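
To make the definition concrete, the sketch below (not part of the original notebook, and assuming X_scaled from above) computes a(i), b(i) and s(i) for a single observation and compares the result with scikit-learn's silhouette_samples:

# Sketch: compute s(i) for one observation directly from the definition
# (assumes X_scaled from the cells above).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, silhouette_samples

labels = KMeans( n_clusters=3, random_state=10 ).fit_predict( X_scaled )
D = pairwise_distances( X_scaled )        # Euclidean distances between all points

i = 0                                     # pick the first observation
others_in_own = (labels == labels[i]) & (np.arange( len(labels) ) != i)
a_i = D[i, others_in_own].mean()          # a(i): mean distance within own cluster
b_i = min( D[i, labels == k].mean()       # b(i): smallest mean distance to any
           for k in set(labels) if k != labels[i] )   # other cluster

s_i = (b_i - a_i) / max( a_i, b_i )
print( s_i, silhouette_samples( X_scaled, labels )[i] )   # the two should agree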

In [43]:
from sklearn.metrics import silhouette_samples, silhouette_score
In [44]:
cluster_range = range( 2, 6 )

for n_clusters in cluster_range:
  # Create a subplot with 1 row and 2 columns
  fig, (ax1, ax2) = plt.subplots(1, 2)
  fig.set_size_inches(18, 7)

  # The 1st subplot is the silhouette plot
  # The silhouette coefficient can range from -1, 1 but in this example all
  # lie within [-0.1, 1]
  ax1.set_xlim([-0.1, 1])
  # The (n_clusters+1)*10 is for inserting blank space between silhouette
  # plots of individual clusters, to demarcate them clearly.
  ax1.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])

  # Initialize the clusterer with n_clusters value and a random generator
  # seed of 10 for reproducibility.
  clusterer = KMeans(n_clusters=n_clusters, random_state=10)
  cluster_labels = clusterer.fit_predict( X_scaled )

  # The silhouette_score gives the average value for all the samples.
  # This gives a perspective into the density and separation of the formed
  # clusters
  silhouette_avg = silhouette_score(X_scaled, cluster_labels)
  print("For n_clusters =", n_clusters,
        "The average silhouette_score is :", silhouette_avg)

  # Compute the silhouette scores for each sample
  sample_silhouette_values = silhouette_samples(X_scaled, cluster_labels)

  y_lower = 10
  for i in range(n_clusters):
      # Aggregate the silhouette scores for samples belonging to
      # cluster i, and sort them
      ith_cluster_silhouette_values = \
          sample_silhouette_values[cluster_labels == i]

      ith_cluster_silhouette_values.sort()

      size_cluster_i = ith_cluster_silhouette_values.shape[0]
      y_upper = y_lower + size_cluster_i

      color = cm.nipy_spectral(float(i) / n_clusters)  # cm.spectral was removed in newer matplotlib
      ax1.fill_betweenx(np.arange(y_lower, y_upper),
                        0, ith_cluster_silhouette_values,
                        facecolor=color, edgecolor=color, alpha=0.7)

      # Label the silhouette plots with their cluster numbers at the middle
      ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

      # Compute the new y_lower for next plot
      y_lower = y_upper + 10  # leave a 10-row gap before the next cluster's plot

  ax1.set_title("The silhouette plot for the various clusters.")
  ax1.set_xlabel("The silhouette coefficient values")
  ax1.set_ylabel("Cluster label")

  # The vertical line for average silhoutte score of all the values
  ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

  ax1.set_yticks([])  # Clear the yaxis labels / ticks
  ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

  # 2nd Plot showing the actual clusters formed
  colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
  ax2.scatter(X_scaled[:, 0], X_scaled[:, 1], marker='.', s=30, lw=0, alpha=0.7,
              c=colors)

  # Labeling the clusters
  centers = clusterer.cluster_centers_
  # Draw white circles at cluster centers
  ax2.scatter(centers[:, 0], centers[:, 1],
              marker='o', c="white", alpha=1, s=200)

  for i, c in enumerate(centers):
      ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

  ax2.set_title("The visualization of the clustered data.")
  ax2.set_xlabel("Feature space for the 1st feature")
  ax2.set_ylabel("Feature space for the 2nd feature")

  plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                "with n_clusters = %d" % n_clusters),
               fontsize=14, fontweight='bold')

  plt.show()
For n_clusters = 2 The average silhouette_score is : 0.330715064698
For n_clusters = 3 The average silhouette_score is : 0.457774159109
For n_clusters = 4 The average silhouette_score is : 0.525464122522
For n_clusters = 5 The average silhouette_score is : 0.494065827016

Note:

  • With 2 clusters, the cluster sizes vary widely: cluster 0 has a large number of observations assigned to it, whereas cluster 1 has only a few.
  • With 3 clusters, the cluster sizes still vary, and cluster 0 has some observations assigned to the wrong cluster.
  • With 4 clusters, the cluster sizes are fairly homogeneous, no observations are assigned to the wrong cluster, and almost all clusters contain observations above the average silhouette score.
  • With 5 clusters, the cluster sizes are not homogeneous, and one cluster has all of its observations below the average silhouette score.
  • So, from the silhouette analysis it can be concluded that 4 is the optimal number of clusters for this dataset.
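
As a closing step, a short sketch (not part of the original notebook, assuming beer and X_scaled from above) fits the chosen 4-cluster solution and attaches the labels back to the original dataframe:

# Sketch: fit the final 4-cluster solution and label the original data
# (assumes beer and X_scaled from the cells above).
from sklearn.cluster import KMeans

final_km = KMeans( n_clusters=4, random_state=10 )
beer["cluster"] = final_km.fit_predict( X_scaled )
print( beer.sort_values( "cluster" )[["name", "cluster"]] )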