What is Dimension Reduction in Machine Learning (with Python Example)

Dimensionality reduction, or dimension reduction, is a data transformation technique used in unsupervised machine learning to map data from a high-dimensional space into a low-dimensional space while retaining the meaningful properties of the original data.

In a nutshell, dimension reduction means representing data using fewer predictor variables (features).

It is a major component in making machine learning algorithms more efficient.

    Why Dimension Reduction?

    The goal of dimension reduction is to represent the data using fewer variables while preserving as much of the structure (variance) of the data as possible. Dimensionality reduction helps to:

    • simplify models and make them easier to interpret,
    • reduce modelling costs,
    • reduce training times,
    • avoid the curse of dimensionality.

    How Does Dimension Reduction Work?

    Dimension reduction can be separated into linear and non-linear approaches.

    Two techniques can be used for dimensionality reduction:

    1. feature selection
    2. feature extraction

    1. Feature Selection

    Feature selection, or variable selection, is used to remove redundant or irrelevant features without losing too much information from the data.

    Three techniques of feature selection are:

    • Wrapper methods: Train a model on different subsets of features and keep the subset that performs best. Computationally intensive.
    • Filter methods: Use a fast-to-compute proxy measure, such as information gain, to score features before any model is trained. Less intensive than wrapper methods.
    • Embedded methods: Perform feature selection as part of the model construction process, e.g. LASSO regression, Elastic Net regularization, RandomForestClassifier… (a minimal sketch follows this list).
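
    As an illustration of an embedded method, here is a minimal sketch using Scikit-learn's SelectFromModel with a LASSO estimator on the Iris dataset used later in this article; the alpha value is an illustrative assumption, not a recommendation from this article.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso
    
    # Load a small example dataset (the Iris data used later in this article)
    X, y = load_iris(return_X_y=True)
    
    # LASSO shrinks uninformative coefficients towards zero (alpha=0.01 is illustrative);
    # SelectFromModel keeps only the features whose coefficients survive
    selector = SelectFromModel(Lasso(alpha=0.01))
    selector.fit(X, y)
    
    # Boolean mask of the selected features
    print(selector.get_support())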

    Common feature selection techniques are to:

    1. Remove features with very low variance
    2. Remove features with high number of missing values
    3. Remove highly correlated features

    Remove Features with Low Variance

    To remove features with low variance in Scikit-learn, use the VarianceThreshold() class from the sklearn.feature_selection module. Make sure that you normalize the variance first, because the variance of a feature depends on its scale.

    from sklearn.feature_selection import VarianceThreshold
    
    # df is assumed to be a numeric pandas DataFrame of features
    # Define minimum threshold at variance=0.01
    sel = VarianceThreshold(threshold=0.01)
    
    # Fit Normalized variance
    sel.fit(df / df.mean())
    
    # Get True or False value
    # if each feature variance is above threshold
    mask = sel.get_support()
    
    # Apply Feature Selector
    selected_df = df.loc[:, mask]
    

    Remove Features with Many Missing Values

    To remove features that have too many missing values, you can compute the fraction of missing values in each column of the pandas DataFrame and drop the columns above a chosen threshold.

    # True for columns with fewer than 40% missing values
    mask = df.isna().sum() / len(df) < 0.4
    
    selected_df = df.loc[:, mask]
    

    Remove Highly Correlated Features

    Perfectly correlated features (with a correlation coefficient of -1 or +1) bring no additional information to a dataset. One feature of each highly correlated pair should be removed before training a machine learning algorithm. This reduces the complexity of the model and also helps prevent it from overfitting to the correlated features.
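
    Here is a minimal sketch of correlation-based removal with pandas; the 0.9 threshold is an illustrative assumption, and df is assumed to be a numeric pandas DataFrame of features as in the earlier snippets.

    import numpy as np
    
    # Absolute pairwise correlations between features
    corr = df.corr().abs()
    
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    
    # Columns correlated above the (illustrative) 0.9 threshold with another column
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    
    selected_df = df.drop(columns=to_drop)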

    Feature selection can also be performed with Recursive Feature Elimination (RFE), which repeatedly fits a model and removes the features whose coefficients (or importances) are too small to matter, until the desired number of features remains.
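
    A minimal RFE sketch with Scikit-learn is shown below; the estimator and n_features_to_select=2 are illustrative assumptions.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    
    X, y = load_iris(return_X_y=True)
    
    # Recursively drop the weakest feature until 2 remain (illustrative choice)
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
    rfe.fit(X, y)
    
    # Boolean mask and ranking of the features
    print(rfe.support_)
    print(rfe.ranking_)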

    2. Feature Extraction

    Feature extraction reduces the number of features by finding patterns in the data and combining the original variables into a smaller set of new features, a compressed representation of the data.

    There are two groups of features that require different extraction techniques:

    • Linear features
    • Non-linear features

    Linear feature extraction techniques

    The most popular linear feature extraction techniques used in machine learning are:
    
    • Principal component analysis (PCA)
    • Linear discriminant analysis (LDA)

    Non-linear feature extraction techniques

    The most popular non-linear feature extraction techniques used in machine learning are:

    • t-distributed stochastic neighbor embedding (t-SNE)
    • Generalized discriminant analysis (GDA)
    • Autoencoder
    • Kernel PCA
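
    Since the example later in this article only covers linear PCA, here is a minimal non-linear sketch using t-SNE on the same Iris dataset; the perplexity and random_state values are illustrative assumptions.

    from sklearn import datasets
    from sklearn.manifold import TSNE
    
    iris = datasets.load_iris()
    X = iris.data
    
    # Embed the 4 features into 2 dimensions (perplexity=30 is an illustrative choice)
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    X_embedded = tsne.fit_transform(X)
    
    print(X_embedded.shape)  # (150, 2)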

    Dimension Reduction Example in Python with Scikit-Learn

    In this Python example, we will see how to use PCA to perform dimensionality reduction on the Iris dataset from the Scikit-learn library.

    from sklearn import datasets
    from sklearn.decomposition import PCA
    
    # Load the Iris dataset
    iris = datasets.load_iris()
    X = iris.data  # Features
    
    # Create a PCA (Principal Component Analysis) instance to reduce dimensions to 2
    pca = PCA(n_components=2)
    
    # Fit the PCA model to the data and transform it
    X_reduced = pca.fit_transform(X)
    
    # Print the original and reduced dimensions
    print(f"Original dimensions: {X.shape}")
    print(f"Reduced dimensions: {X_reduced.shape}")
    

    As you can see from the output, we have reduced the data from 4 original features to 2 principal components.

    Original dimensions: (150, 4)
    Reduced dimensions: (150, 2)
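
    To check how much of the structure (variance) of the original data the two components preserve, you can inspect the fitted PCA's explained_variance_ratio_ attribute; this short follow-up assumes the pca object fitted above.

    # Fraction of the original variance captured by each principal component
    print(pca.explained_variance_ratio_)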

    Conclusion

    This concludes this article on dimensionality reduction. We covered why dimension reduction is useful, the difference between feature selection and feature extraction, and a PCA example with Scikit-learn.
