What Is the Curse of Dimensionality in machine learning

The curse of dimensionality refers the the various challenges of making meaningful machine learning generalizations from data containing large number of dimensions.

What is the Curse of Dimensionality?

The Curse of Dimensionality explains how as the number of dimension or features increases, the amount of data we need to make accurate generalizations grows exponentially.

The term was coined by Richard E. Bellman in 1966 in a paper name Dynamic programming refers to the challenges that arise when working with data in high-dimensionality. Objects tend to become more sparse and dissimilar, making it harder to gather meaningful insights from the data.

Subscribe to my Newsletter

Curse of Dimensionality Example

An example of the curse of dimensionality is when we try to find the k-nearest neighbours of a value. When you have a 1 dimensional line, and all the data spread uniformly across that line, the nearest neighbour will default to a very small neighbourhood of possible values.

  • 1D: each X represents 1/6th, n_neighbors=2, distance is small
  • 2D: each X represents 1/36th, n_neighbors=8, distance is medium
  • 3D: each X represents 1/216th, n_neighbors=26, distance is large

Why the Curse of Dimensionality Matters?

The curse of dimensionality matters because large number of dimensions tend to lead to underfitting or overfitting.

overfitting underfitting in curse of dimensionality

A lot of machine learning models (such as KNN, K-Means and Decision Trees) are heavily influenced by the distances between data points. By adding more data points, the relative distance increase exponentially. The more dimensions, the fewer data points per dimension that you have.

If you want to make simple predictions (e.g. ice cream sales and temperature). When you add too many dimensions (e.g. time of the day, ice cream brand, etc.), the model has a hard time to make sense of all that noise. The curse of dimensionality leads to underfitting the data.

If you have all this additional information, but do not increase the number of data points, the the model will have a hard time making sense of unseen data. This is when the curse of dimensionality leads to overfitting de data.

Finding the True Dimensionality of a Dataset

This works because datasets are generally high-dimensional because we choose to make them high dimensional.

When we have documents, we choose to perform things such as TF-IDF to make each word become its own attribute, further increasing the number of dimensions.

The dataset’s true dimensionality is generally much lower.

How to Overcome the Curse of Dimensionality?

There are three main ways to overcome the curse of dimensionality: dimensionality reduction, increase in the number of data points and feature engineering.

Dimensionality Reduction

You can perform dimension reduction using tools like principal component analysis on a dataset. This will reduce the number of features by selecting only the principal components that contribute the most to the variance in the data.

Increase in Data Points

Adding more data points (not features) to the dataset will make the models less subject to variations from the input training data.

Feature Engineering

By using domain knowledge to extract features from the dataset (feature engineering). For example, you could use the SIFT algorithm to select a part of an image and train the data only on the dimensions related to that part.

Other techniques to reduce high dimensionality can be:

  • Derivation: Evaluate each dimension independently
  • Smoothing: Spread class to neighbors
  • Symmetry: Reduce number of configurations of the same pattern
Enjoyed This Post?