Clustering in Machine Learning

Clustering in machine learning is a set of unsupervised learning algorithms that divide objects into clusters based on similar characteristics.

What is Clustering in Machine Learning?

Clustering is used to group similar data points together based on their characteristics.

Clustering algorithms group similar elements so that the elements within a cluster are closer to each other than to the elements of any other cluster.


Example of Clustering Algorithms

Here are the 3 most popular clustering algorithms that we will cover in this article:

  • KMeans
  • Hierarchical Clustering
  • DBSCAN

Below is an overview of Scikit-learn’s other clustering methods.

Source: Scikit-learn (official documentation)

Examples of clustering problems

  • Recommender systems
  • Semantic clustering
  • Customer segmentation
  • Targeted marketing

How do clustering algorithms work?

Each clustering algorithm works differently, but the logic behind KMeans and hierarchical clustering is similar. Clustering machine learning algorithms work by:

  1. Selecting cluster centers
  2. Computing distances from data points to cluster centers, or between cluster centers
  3. Redefining cluster centers based on the resulting distances
  4. Repeating the process until the optimal clusters are reached

This is an overly simplified view of clustering, but we will dive deeper into how each algorithm works specifically in the next sections.

How does KMeans Clustering Work?

The KMeans clustering algorithm works by starting with a fixed number of clusters and moving the cluster centers until the optimal clustering is reached:

  1. Defining the number of clusters at the start
  2. Selecting random cluster centers
  3. Computing the distance from each point to the cluster centers
  4. Assigning each point to its nearest center and computing new cluster centers as the mean of the assigned points
  5. Repeating until convergence

Some examples of KMeans clustering algorithms are:

  • KMeans from Scikit-learn’s sklearn.cluster
  • kmeans from SciPy’s scipy.cluster.vq
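
As a minimal sketch, here is how KMeans could be applied with Scikit-learn. The synthetic blobs and the choice of 3 clusters are purely illustrative assumptions:

```python
# Minimal sketch: KMeans on synthetic 2D data (values chosen for illustration only)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 3 well-separated blobs of points
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit KMeans with the number of clusters defined up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final cluster centers after convergence
print(labels[:10])              # cluster assigned to each of the first 10 points
```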

How does Hierarchical Clustering Work?

The hierarchical clustering algorithm works by starting with one cluster per data point and merging clusters together until the optimal clustering is reached:

  1. Starting with one cluster for each data point
  2. Defining new cluster centers using the mean of the X and Y coordinates
  3. Combining the cluster centers closest to each other
  4. Finding new cluster centers based on the mean
  5. Repeating until the optimal number of clusters is reached

The image below is a dendrogram, which can be used to visualize hierarchical clustering. It starts with one cluster per data point at the bottom and merges the closest clusters at each iteration, ending with a single cluster for the entire dataset at the top.

Dendrogram

Some examples of hierarchical clustering algorithms are:

  • hierarchy from SciPy’s scipy.cluster
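
Below is a minimal sketch of bottom-up hierarchical clustering with SciPy. The small 2D dataset, the ward linkage method, and the cut into 3 clusters are illustrative assumptions:

```python
# Minimal sketch: hierarchical clustering with SciPy's hierarchy module
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small synthetic 2D dataset (values chosen for illustration only)
X = np.array([[1.0, 1.2], [1.1, 0.9], [5.0, 5.1], [5.2, 4.8], [9.0, 9.1], [8.8, 9.3]])

# Build the merge tree: 'ward' merges the pair of clusters that minimizes variance
Z = linkage(X, method="ward")

# Cut the tree into 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # cluster label for each data point
```

The linkage matrix returned by linkage can also be passed to scipy.cluster.hierarchy.dendrogram to draw a dendrogram like the one above.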

How does DBSCAN Clustering Work?

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

The DBSCAN clustering algorithm works by assuming that clusters are regions of high-density data points separated by regions of low density.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Some examples of DBSCAN clustering algorithms are:

  • DBSCAN from Scikit-learn sklearn.cluster
  • HDBSCAN
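
Here is a minimal sketch of DBSCAN with Scikit-learn. The half-moon dataset and the eps and min_samples values are illustrative assumptions; in practice these parameters need to be tuned to your data:

```python
# Minimal sketch: DBSCAN on synthetic data (eps and min_samples are illustrative values)
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape KMeans handles poorly but DBSCAN handles well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: points needed to form a dense region
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

print(set(labels))  # cluster labels; -1 marks points treated as noise
```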

How do Gaussian Mixture Clustering Models Work?

Gaussian Mixture Models, or GMMs, are probabilistic models that look at Gaussian distributions, also known as normal distributions, to cluster data points together.

The model assumes the data was generated from a fixed number of Gaussian distributions and treats each distribution as a separate cluster.

Source: towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95

Some examples of Gaussian mixture clustering algorithms are:

  • GaussianMixture from Scikit-learn’s sklearn.mixture
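
Below is a minimal sketch using Scikit-learn’s GaussianMixture. The synthetic blobs and the choice of 3 components are illustrative assumptions:

```python
# Minimal sketch: Gaussian Mixture Model clustering (n_components=3 is illustrative)
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic data drawn from 3 blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit a mixture of 3 Gaussians; each component acts as a cluster
gmm = GaussianMixture(n_components=3, random_state=42)
labels = gmm.fit_predict(X)

# Unlike KMeans, GMMs give soft assignments: a probability per cluster for each point
probabilities = gmm.predict_proba(X)
print(labels[:10])
print(probabilities[:3].round(3))
```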

Interesting Work from the Community

How to Master the Popular DBSCAN Clustering Algorithm for Machine Learning by Abhishek Sharma

Build Better and Accurate Clusters with Gaussian Mixture Models

Python Script: Automatically Cluster Keywords In Bulk For Actionable Insights V2 by Lee Foot

Polyfuzz auto-mapping + auto-grouping tests by Charly Wargnier

Conclusion

This concludes the introduction of clustering in machine learning. We have covered how clustering works and provided an overview of the most common clustering machine learning models.

The next step is to learn how to use Scikit-learn to train each clustering machine learning model on real data.