What is Explained Variance in PCA?

As part of our series of tutorials on PCA in Python, we will learn what the explained variance is and what it means in Principal Component Analysis.

What is the Explained Variance in Principal Component Analysis?

The explained variance in Principal Component Analysis (PCA) represents the proportion of the total variance attributed (explained) by each principal component.

It helps us understand how much information is retained after dimensionality reduction. It is the portion of the original data’s variability that is captured by each principal component.

    Specifically, in scikit-learn it is an array of values in which each value is the variance of one principal component; the length of the array equals the number of components defined with n_components.

    Explained Variance in Python

    In scikit-learn, the explained variance is accessed using the explained_variance_ attribute of a fitted PCA object.

    pca.explained_variance_

    In this Python example, we load the iris dataset, scale its features, and apply PCA to reduce the original dataset to two dimensions. Then we fit and transform the data, and finally show the explained variance.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    
    # Load Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target
    
    # Standardize the data
    scaler = StandardScaler()
    X_standardized = scaler.fit_transform(X)
    
    # Apply PCA with two components
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_standardized)
    
    explained_variance = pca.explained_variance_
    explained_variance
    

    The result is the explained variance, expressed as an array with one value per principal component, here two.

    array([2.93808505, 0.9201649 ])

    Interpret the Explained Variance in PCA

    The explained variance array contains absolute values: the greater the value, the more variance the corresponding principal component captures. Above, PC1 accounts for about 2.94 units of variance in the standardized dataset, and PC2 for about 0.92 units.
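    These units can be verified directly: each value of explained_variance_ equals the sample variance of the corresponding column of the transformed data. A quick check, assuming the X_pca array from the snippet above is still in scope:

    import numpy as np

    # Sample variance (ddof=1) of each principal component column
    # equals pca.explained_variance_
    np.var(X_pca, axis=0, ddof=1)
    # array([2.93808505, 0.9201649 ])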

    To make it more useful, we generally use the explained variance ratio, which divides each component's explained variance by the total variance of the dataset, that is, the sum of the variances of all the original features (not just the retained components):

    explained variance ratio = explained variance of a component / total variance of the dataset
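    As a quick check (assuming X_standardized and pca from the snippet above), we can compute this ratio by hand and compare it with the attribute shown next. For the standardized iris data, the total variance is about 4.03:

    import numpy as np

    # Total variance = sum of the per-feature sample variances (ddof=1)
    total_variance = np.var(X_standardized, axis=0, ddof=1).sum()

    pca.explained_variance_ / total_variance
    # array([0.72962445, 0.22850762])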

    Here the explained variance ratio is accessed using the pca.explained_variance_ratio_ attribute.

    pca.explained_variance_ratio_
    
    array([0.72962445, 0.22850762])

    Now we can see that PC1 explains about 73% of the variance and PC2 about 23%; together, these two principal components explain roughly 96% of the variance in the data. The remaining 4% is what was “discarded” when reducing the dataset from four dimensions to two.

    pd.DataFrame({
        'Explained Variance': pca.explained_variance_,
        'Explained Variance Ratio': pca.explained_variance_ratio_,
    }, index=['PC1', 'PC2'])
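
    When choosing how many components to keep, the cumulative explained variance ratio is often more convenient than the individual ratios. A minimal sketch using NumPy:

    import numpy as np

    # Running total of the variance explained by the first k components
    np.cumsum(pca.explained_variance_ratio_)
    # array([0.72962445, 0.95813207])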
    

    How to Plot the Feature Explained Variance in Python?

    We can plot the PCA explained variance to see how much variance each principal component feature captures.

    import pandas as pd 
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt 
    import seaborn as sns
    sns.set()
    
    # load features and targets separately
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    
    # Data Scaling
    x_scaled = StandardScaler().fit_transform(X)
    
    # Reduce from 4 to 3 features with PCA
    pca = PCA(n_components=3)
    
    # Fit and transform data
    pca_features = pca.fit_transform(x_scaled)
    
    # Bar plot of explained_variance
    plt.bar(
        range(1, len(pca.explained_variance_) + 1),
        pca.explained_variance_
    )

    plt.xlabel('PCA Feature')
    plt.ylabel('Explained variance')
    plt.title('Feature Explained Variance')
    plt.show()
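
    A common companion plot shows the cumulative explained variance ratio, which makes it easy to see how many components are needed to retain a given share of the variance. A sketch reusing the pca object fitted above:

    import numpy as np
    import matplotlib.pyplot as plt

    # Cumulative share of total variance captured by the first k components
    cumulative = np.cumsum(pca.explained_variance_ratio_)

    plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.title('Cumulative Explained Variance')
    plt.show()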
    

    What is the Difference Between the Explained Variance and the Eigenvalue?

    Although the eigenvalues and the explained variance in Principal Component Analysis (PCA) are related concepts and are often used as synonyms, they are not exactly the same.

    Eigenvalues indicate the variance along each principal component. Explained variance is the proportion of total dataset variance captured by each principal component.

    | Eigenvalues | Explained Variance |
    | --- | --- |
    | Variance along each principal component | Proportion of the total dataset variance |
    | Larger eigenvalues capture more variance | Expressed as a percentage |
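
    To make the relationship concrete, the eigenvalues of the covariance matrix of the standardized data are exactly the values in pca.explained_variance_, and dividing them by their sum gives the explained variance ratio. A sketch, assuming X_standardized from the earlier snippets:

    import numpy as np

    # Eigenvalues of the covariance matrix (np.cov normalizes by n - 1)
    cov_matrix = np.cov(X_standardized.T)
    eigenvalues = np.linalg.eigvalsh(cov_matrix)[::-1]  # sorted largest first

    print(eigenvalues)                      # leading values match pca.explained_variance_
    print(eigenvalues / eigenvalues.sum())  # the explained variance ratio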

    What is an Eigenvector in PCA?

    The eigenvector in PCA is a unit vector (a vector of length 1) of the data's covariance matrix; it represents the direction of a principal component.
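    In scikit-learn, the eigenvectors of a fitted PCA object are stored row by row in its components_ attribute, and each row has unit length. A quick check on the pca object from above:

    import numpy as np

    # Each row of components_ is one eigenvector (a principal direction)
    print(pca.components_.shape)                    # (n_components, n_features)
    print(np.linalg.norm(pca.components_, axis=1))  # each norm is 1.0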

    What is an Eigenvalue in PCA?

    The eigenvalue is the coefficient applied to the eigenvector: it shows the variance that can be attributed to each principal component and gives the scaled eigenvector its length. The larger the eigenvalue, the more important the corresponding eigenvector is in explaining the variance of the data.

    In scikit-learn, the eigenvalues are exposed as an array in which each value equals the variance of the corresponding principal component; this is exactly the explained_variance_ attribute shown above.
