Finding Optimal Cluster Number
After normalizing our dataset, we will use KMeans Clustering to group/cluster similar education data. K-Means Algorithm works by grouping together two data points that have the least Euclidean Distance.
We don’t know the number of clusters beforehand. We will determine it in the case of the KMeans Clustering algorithm using Elbow Method. This is called Elbow Method because we will choose the cluster at that point of the graph from where it stops falling steeply(just like an elbow of hand). It is the point where the WCSS(Within Cluster Sum of Squares) decreases very slowly.
WCSS is the distance between points in a cluster.
From the above plot, we can see 2 or 3 could be the ideal number of clusters.
To choose specifically among 2 and 3, the ideal number of clusters we will use some metrics as discussed below:
- Silhouette Score- It ranges from -1 to 1. The higher the value better our clusters are. Closer to 1 means perfect clusters. 0 means the point lies at the border of its cluster. A negative value means that the point is classified into the wrong cluster.
- Calinski-Harabasz Index denotes how the data points are spread within a cluster. The higher the score, the denser is the cluster thus the cluster is better. It starts at 0 and has no upper limit.
- Davies Boulden Index measures the average similarity between cluster using the ratio of the distance between a cluster and its closest point & the average distance between each data point of a cluster and its cluster center. The closer the score is to 0, the better our clusters areas it indicates clusters are well separated.
Let’s check the values of these metrics to find out the ideal number of the cluster for our KMeans algorithm on the scaled data. We already concluded before that 2 or 3 would be the ideal number of clusters but we will also test with 4 or 5 just for the purpose of demonstration.
As expected from Elbow Method, 2 has the best Silhouette Score and Davies Bouldin Score, and second-best Calinski Harabasz Score. So 2 numbers of the cluster can be an ideal choice.
Though we have discussed that, we should always normalize our data to a similar range before applying a distance-based clustering algorithm, but let’s also check the metrics values using the KMeans algorithm on unnormalized data. Remember that it’s always good to perform experimentation for better understanding and results.
We can see that the best cluster number, in this case, is 3. But both the Silhouette Score and Davies Bouldin Score deteriorated than 2-clusters we evaluated before though Calinski Harabarz score improved a bit. Overall, the model performance deteriorated a bit. So as mentioned before, normalizing data points before clustering do give good results.
Next, we will use a Hierarchical Clustering technique called Agglomerative Clustering. It is a bottom-up clustering approach where each data point is first considered as an individual cluster and then merges closest points according to a distance metric until a single cluster is obtained.
The Hierarchical Clustering can be visualized using a dendrogram as shown below. The suitable number of clusters for this data is shown as 2 in the dendrogram(red & green).
From the dendrogram we saw that the ideal number of clusters for the dataset is 2, the KMeans algorithm also found the same. We will again use Silhouette Score, Calinski Harabarz Index, and Davis Bouldin Score to validate this.
Different Types of Linkage(the metric/criteria to merge two clusters to form a bigger cluster) Functions:
- Single Cluster: It combines clusters by taking into account the closest(minimum) points between two clusters. min( Dist-a – Dist-b). Cluster pairs having minimum distance between pair of points distance gets merged.
- Complete Cluster: cluster-based on the farthest(maximum) distance between a pair of points between two clusters. Cluster pairs having maximum distance between pair of points distance gets merged. max(Dist-a – Dist-b)
- Ward Linkage: Finds the minimum squared distance between a pair of points between two clusters. Cluster pairs having the lowest squared distance between pair of points distance get merged.
- Average Linkage: Merges clusters based on the average distance of all points in one cluster from the points in other clusters. Cluster pairs having the lowest average distance get merged.
Performing Agglomerative Clustering on normalized data:
As observed from the output table, the ideal number of clusters is indeed 2 with linkage methods ‘average‘ or ‘ward‘.
Now let’s use Agglomerative Clustering(ward linkage) for unnormalized data and check how it performs.
In the case of an unnormalized dataset, 2 clusters with complete linkage are the best. But in the case of a normalized dataset, the performance is better. There the Silhouette Score and Calinski Harabarz Score performance are better though the Davies Bouldin score performance is a bit lower.
2 Clusters for both algorithms KMeans and Agglomerative on normalized datasets have the same performance. Let’s see how the values for each feature/parameter vary across the two clusters for both the algorithms.