What is Dimension Reduction in Machine Learning (with Python Example)

Dimensionality reduction, or dimension reduction, is a data transformation technique used in unsupervised machine learning to map data from a high-dimensional space into a low-dimensional space while retaining the meaningful properties of the original data.

In a nutshell, dimension reduction means representing data using fewer predictor variables (features).

It is a major component in making machine learning algorithms more efficient.

    Why Dimension Reduction?

    The goal of dimension reduction is to represent the data using fewer variables while preserving as much of the structure (variance) of the data as possible. Dimensionality reduction helps to:

    • simplify models and make them easier to interpret,
    • reduce modelling costs,
    • reduce training times,
    • avoid the curse of dimensionality.

    How Does Dimension Reduction Work?

    Dimension reduction can be separated into linear and non-linear approaches.

    Two techniques can be used for dimensionality reduction:

    1. feature selection
    2. feature extraction

    1. Feature Selection

    Feature selection, or variable selection, is used to remove redundant or irrelevant features without losing too much information from the data.

    Three techniques of feature selection are:

    • Wrapper methods: Train a model on different subsets of features and keep the subset that performs best. Computationally intensive.
    • Filter methods: Use a fast-to-compute proxy measure, such as information gain, to score features before any model is trained. Less intensive than wrapper methods.
    • Embedded methods: Perform feature selection as part of the model construction process, e.g. LASSO regression, Elastic Net regularization, RandomForestClassifier… (a minimal sketch follows this list).
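
    As an illustration of an embedded method, here is a minimal sketch using Scikit-learn's SelectFromModel with a LASSO estimator on the Iris dataset used later in this article; the alpha value is an illustrative assumption, not a recommendation from this article.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso
    
    # Load a small example dataset (the Iris data used later in this article)
    X, y = load_iris(return_X_y=True)
    
    # LASSO shrinks uninformative coefficients towards zero (alpha=0.01 is illustrative);
    # SelectFromModel keeps only the features whose coefficients survive
    selector = SelectFromModel(Lasso(alpha=0.01))
    selector.fit(X, y)
    
    # Boolean mask of the selected features
    print(selector.get_support())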

    Common feature selection techniques are to:

    1. Remove features with very low variance
    2. Remove features with high number of missing values
    3. Remove highly correlated features

    Remove Features with Low Variance

    To remove features with low variance in Scikit-learn, use the VarianceThreshold() class from the sklearn.feature_selection module. Make sure that you normalize the variance first, because the variance of a feature depends on its scale.

    from sklearn.feature_selection import VarianceThreshold
    
    # df is assumed to be a numeric pandas DataFrame of features
    # Define minimum threshold at variance=0.01
    sel = VarianceThreshold(threshold=0.01)
    
    # Fit Normalized variance
    sel.fit(df / df.mean())
    
    # Get True or False value
    # if each feature variance is above threshold
    mask = sel.get_support()
    
    # Apply Feature Selector
    selected_df = df.loc[:, mask]
    

    Remove Features with Many Missing Values

    To remove features that have too many missing values, you can compute the fraction of missing values in each column of the pandas DataFrame and drop the columns above a chosen threshold.

    # True for columns with fewer than 40% missing values
    mask = df.isna().sum() / len(df) < 0.4
    
    selected_df = df.loc[:, mask]
    

    Remove Highly Correlated Features

    Perfectly correlated features (with a correlation coefficient of -1 or +1) bring no additional information to a dataset. One feature of each highly correlated pair should be removed before training a machine learning algorithm. This reduces the complexity of the model and also helps prevent it from overfitting to the correlated features.
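
    Here is a minimal sketch of correlation-based removal with pandas; the 0.9 threshold is an illustrative assumption, and df is assumed to be a numeric pandas DataFrame of features as in the earlier snippets.

    import numpy as np
    
    # Absolute pairwise correlations between features
    corr = df.corr().abs()
    
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    
    # Columns correlated above the (illustrative) 0.9 threshold with another column
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    
    selected_df = df.drop(columns=to_drop)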

    Feature selection can also be performed with Recursive Feature Elimination (RFE), which repeatedly fits a model and removes the features whose coefficients (or importances) are too small to matter, until the desired number of features remains.
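
    A minimal RFE sketch with Scikit-learn is shown below; the estimator and n_features_to_select=2 are illustrative assumptions.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    
    X, y = load_iris(return_X_y=True)
    
    # Recursively drop the weakest feature until 2 remain (illustrative choice)
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
    rfe.fit(X, y)
    
    # Boolean mask and ranking of the features
    print(rfe.support_)
    print(rfe.ranking_)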

    2. Feature Extraction

    Feature extraction reduces the number of features by finding patterns in the data and combining the original variables into a smaller set of new features, a compressed representation of the data.

    There are two groups of features that require different extraction techniques:

    • Linear features
    • Non-linear features

    Linear feature extraction techniques

    The most popular linear feature extraction techniques used in machine learning are:
    
    • Principal component analysis (PCA)
    • Linear discriminant analysis (LDA)

    Non-linear feature extraction techniques

    The most popular non-linear feature extraction techniques used in machine learning are:

    • t-distributed stochastic neighbor embedding (t-SNE)
    • Generalized discriminant analysis (GDA)
    • Autoencoder
    • Kernel PCA
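
    Since the example later in this article only covers linear PCA, here is a minimal non-linear sketch using t-SNE on the same Iris dataset; the perplexity and random_state values are illustrative assumptions.

    from sklearn import datasets
    from sklearn.manifold import TSNE
    
    iris = datasets.load_iris()
    X = iris.data
    
    # Embed the 4 features into 2 dimensions (perplexity=30 is an illustrative choice)
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    X_embedded = tsne.fit_transform(X)
    
    print(X_embedded.shape)  # (150, 2)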

    Dimension Reduction Example in Python with Scikit-Learn

    In this Python example, we will see how to use PCA to perform dimensionality reduction on the Iris dataset from the Scikit-learn library.

    from sklearn import datasets
    from sklearn.decomposition import PCA
    
    # Load the Iris dataset
    iris = datasets.load_iris()
    X = iris.data  # Features
    
    # Create a PCA (Principal Component Analysis) instance to reduce dimensions to 2
    pca = PCA(n_components=2)
    
    # Fit the PCA model to the data and transform it
    X_reduced = pca.fit_transform(X)
    
    # Print the original and reduced dimensions
    print(f"Original dimensions: {X.shape}")
    print(f"Reduced dimensions: {X_reduced.shape}")
    

    As you can see from the output, we have reduced the data from 4 original features to 2 principal components.

    Original dimensions: (150, 4)
    Reduced dimensions: (150, 2)
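
    To check how much of the structure (variance) of the original data the two components preserve, you can inspect the fitted PCA's explained_variance_ratio_ attribute; this short follow-up assumes the pca object fitted above.

    # Fraction of the original variance captured by each principal component
    print(pca.explained_variance_ratio_)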

    Conclusion

    This concludes this article on dimensionality reduction. We covered why dimension reduction is useful, the difference between feature selection and feature extraction, and a PCA example with Scikit-learn.
