PCA Feature Importance: How to Find the Important Original Features (Python Example)

In this tutorial, you will learn how to identify which of the original features matter most after dimensionality reduction with PCA. If we had 5 features and reduced them to 3 components, which ones were the most important?

We assume that you have already identified the principal components (PCs) that explain most of the variance in the dataset. Now, we want to know which original features are important in the components that remain after dimensionality reduction.


How to Identify PCA Feature Importance in Python

To begin, identify how important each principal component (PC) is by looking at its explained variance ratio, available through the explained_variance_ratio_ attribute of the fitted pca object.


This example applies Principal Component Analysis to the Iris dataset using Scikit-learn.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
  
# Load Iris dataset 
iris = load_iris()
X = iris.data
y = iris.target
  
# Apply PCA with two components 
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Explained Variance Ratio
pca.explained_variance_ratio_

The higher the explained variance ratio, the more important the principal component is in explaining the variance of the data.

array([0.92461872, 0.05306648])

Here, we have an array where about 92% of the variance is explained by the first principal component (PC1) and about 5% is explained by PC2. Together, they explain roughly 98% of the variance of the data.
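If you want the cumulative total directly rather than adding the values by hand, you can sum the ratios with NumPy. A minimal sketch, assuming the pca object fitted above:

import numpy as np

# Cumulative share of variance explained by the first k components
print(np.cumsum(pca.explained_variance_ratio_))
# [0.92461872 0.97768521]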

To learn how PCA works in Python, see our PCA Sklearn example in Python.

How to Identify the Importance of Each Original Feature

To identify the contribution of each original feature to each principal component, use the components_ attribute. Taking the absolute value of the loadings makes their magnitudes easier to compare.

print(abs(pca.components_))

The result is an array of PCA loadings in which rows represent components and columns represent the original features.

[[0.36138659 0.08452251 0.85667061 0.3582892 ]
 [0.65658877 0.73016143 0.17337266 0.07548102]]

Here, the third original feature has the largest loading on the first principal component (about 0.86), and the second feature has the largest loading on the second principal component (about 0.73), so these are the features that contribute most to each component.
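To make the loadings easier to read, you can label them with the original feature names. A minimal sketch, assuming the pca object and iris dataset loaded above (pandas is an extra dependency, not used elsewhere in this tutorial):

import pandas as pd

# Absolute loadings: rows are components, columns are original features
loadings = pd.DataFrame(
    abs(pca.components_),
    columns=iris.feature_names,
    index=['PC1', 'PC2']
)
print(loadings)

# Most influential original feature for each component
print(loadings.idxmax(axis=1))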

The next steps in understanding the importance of each feature are to:

  1. Plot the Feature Explained Variance
  2. Plot a Scree Plot (a sketch follows this list)
  3. Make PCA Biplots
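For the scree plot, you need the explained variance of every component, not just the two kept above. A minimal sketch that refits PCA without limiting n_components (an assumption; adapt it to your own pipeline):

# Refit PCA keeping all components to get the full variance spectrum
pca_full = PCA().fit(X)

plt.plot(
    range(1, len(pca_full.explained_variance_ratio_) + 1),
    pca_full.explained_variance_ratio_,
    marker='o'
)
plt.xlabel('Principal Component')
plt.ylabel('Explained variance ratio')
plt.title('Scree Plot')
plt.show()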

What is the Explained Variance?

The explained variance, or eigenvalue, in PCA shows the variance that can be attributed to each of the principal components.

The larger the eigenvalue, the more important the corresponding eigenvector is in explaining the variance of the data.

In Scikit-learn, it is an array in which each value is the variance of one principal component, and the length of the array equals the number of components defined with n_components.

It can be accessed with the .explained_variance_ attribute.

pca.explained_variance_
array([2.93808505, 0.9201649 ])
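The two attributes are directly related: each explained variance ratio is the component's eigenvalue divided by the total variance across all components. A minimal sketch verifying this with the pca_full fit from the scree plot sketch above (the truncated pca object with n_components=2 does not store all eigenvalues, which is why the full fit is used):

# Each ratio equals an eigenvalue divided by the sum of all eigenvalues
ratios = pca_full.explained_variance_ / pca_full.explained_variance_.sum()
print(ratios)
print(pca_full.explained_variance_ratio_)  # same values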

How to Plot the Explained Variance in Python

The explained variance in PCA helps us understand how much information is retained after dimensionality reduction. It is the portion of the original data’s variability that is captured by each principal component.

We can plot the explained variance to see the variance captured by each principal component.

# Bar plot of explained_variance
plt.bar(
    range(1,len(pca.explained_variance_)+1),
    pca.explained_variance_
    )

plt.xlabel('PCA Feature')
plt.ylabel('Explained variance')
plt.title('Feature Explained Variance')
plt.show()

The output graph shows that the first principal component is clearly more significant than the second.

Read our tutorial on the best PCA plots in Python for more data visualization examples with PCA and Python.
