PCA: Principal Component Analysis in Python (Scikit-learn Examples)

In this tutorial, you will learn about the PCA machine learning algorithm using Python and Scikit-learn.

What is Principal Component Analysis (PCA)?

PCA, or Principal Component Analysis, is the main linear algorithm for dimension reduction, often used in unsupervised learning. In simple words, PCA tries to reduce the number of dimensions whilst retaining as much variation in the data as possible.

The algorithm projects the original features onto new components and discards the components that contribute the least variation, keeping a valid approximation of the dataset with fewer dimensions.


[Figure: PCA feature explained variance]

Introduction to PCA in Python

Principal Component Analysis (PCA) is a machine learning technique used to reduce the dimensionality of high-dimensional data while preserving the most important information.

Simply put, PCA makes complex data simpler by taking a lot of information and finding the most important parts. This helps to fight the curse of dimensionality.

Install Scikit-Learn to use PCA in Python

For this tutorial, you will need Python installed, along with the Scikit-learn library, which you can install from your command prompt or Terminal.

$ pip install scikit-learn

Simplest Example of PCA in Python

Here is a simple example of how to use the PCA algorithm in Scikit-learn to reduce the Iris dataset to two features and plot a 2D graph.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
 
# Load Iris dataset 
iris = load_iris()
X = iris.data
y = iris.target
 
# Apply PCA with two components (for 2D visualization)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
 
# Plot the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
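Note that this quick example runs PCA on the raw features. The step-by-step tutorial below standardizes the data first with StandardScaler, which is generally recommended because PCA is sensitive to the scale of each feature.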

Getting Started with Principal Component Analysis in Python

In this Python tutorial, we will perform principal component analysis on the built-in Iris dataset using Scikit-learn, following these steps:

  1. Explore the Iris Dataset
  2. Load the Dataset with Scikit-learn
  3. Perform Data Preprocessing in Python
  4. Perform Dimension Reduction using PCA in Python
  5. Map Target Names to the Principal Components
  6. Plot the 2D PCA Graph

1. Explore the Python Dataset (Iris)

The Iris dataset is useful to visualize how Principal Component Analysis works.

The Iris dataset contains 4 features (predictor variables) describing 3 species of flowers (targets).

[Figure: the Iris dataset as a pandas DataFrame, with features and targets shown (source: iris dataset in pandas)]
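
You can also inspect the feature and target names directly from Scikit-learn:

from sklearn.datasets import load_iris

# Print the 4 feature names and the 3 target species
iris = load_iris()
print(iris.feature_names)
print(iris.target_names)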

2. Load the Iris Dataset with Scikit-Learn

To load the Iris dataset from Scikit-learn, let’s load features and targets as arrays stored in their respective X and y variables.

from sklearn import datasets

# load features and targets separately
iris = datasets.load_iris()
X = iris.data
y = iris.target

3. Perform Data Preprocessing in Python

Before running PCA, let’s perform very basic data preprocessing and scale the data using StandardScaler.

# data scaling
from sklearn.preprocessing import StandardScaler
x_scaled = StandardScaler().fit_transform(X)

StandardScaler will standardize the features by removing the mean and scaling to unit variance so that each feature has μ = 0 and σ = 1.

Converting this:

[Figure: unscaled features]

Into this:

[Figure: scaled features]
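
To confirm that the standardization worked, you can check the column means and standard deviations of x_scaled, which should be approximately 0 and 1:

import numpy as np

# Each feature should now have mean ~0 and standard deviation ~1
print(np.round(x_scaled.mean(axis=0), 6))
print(np.round(x_scaled.std(axis=0), 6))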

4. Perform Dimension Reduction using PCA in Python

To perform dimension reduction in Python, import PCA from sklearn.decomposition and use the fit_transform() method on a PCA() object.

We have seen that the Iris dataset contains 4 features, making it a 4-dimensional dataset. Not all of that variation is necessarily useful for prediction, so we can project the data onto fewer components and train a faster model.

The n_components argument will define the number of components that we want to reduce the features to.

# Dimension Reduction
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_features = pca.fit_transform(x_scaled)

# Show PCA characteristics
print('Shape before PCA: ', x_scaled.shape)
print('Shape after PCA: ', pca_features.shape)
Shape before PCA:  (150, 4)
Shape after PCA:  (150, 2)
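
To quantify how much of the original variation those 2 components retain, inspect the explained_variance_ratio_ attribute of the fitted PCA object:

# Proportion of variance captured by each component
print('Explained variance ratio: ', pca.explained_variance_ratio_)
print('Total variance retained: ', pca.explained_variance_ratio_.sum())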

For better understanding, PCA converted this data with 4 features:

[Figure: the original 4-feature data]

Into this data with 2 principal components:

# Create PCA DataFrame
import pandas as pd

pca_df = pd.DataFrame(
    data=pca_features, 
    columns=[
        'Principal Component 1', 
        'Principal Component 2'
        ])

5. Map Target Names to the Principal Components

We will prepare the PCA DataFrame for visualization by mapping the target names to the numeric targets.

# Map target names to targets
target_names = {
    0:'setosa',
    1:'versicolor', 
    2:'virginica'
}

pca_df['target'] = y
pca_df['target'] = pca_df['target'].map(target_names)
pca_df.sample(10)
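
As a side note, the same mapping can be built directly from the dataset instead of hard-coding it:

# Equivalent mapping derived from the dataset itself
target_names = dict(enumerate(iris.target_names))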

6. Plot the 2D PCA Graph

To plot the 2D PCA graph, pass the principal component DataFrame to Seaborn’s lmplot() function.

import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()
# Plot 2D PCA Graph
sns.lmplot(
    x='Principal Component 1', 
    y='Principal Component 2', 
    data=pca_df, 
    hue='target', 
    fit_reg=False, 
    legend=True
    )

plt.title('2D PCA Graph of Iris Dataset')
plt.show()

Follow this tutorial if you want to plot a 3D PCA Graph instead.

What is Next?

  1. Identify Important PCA Features
  2. Plot the Feature Explained Variance
  3. Plot a Scree Plot
  4. Make PCA Biplots
  5. PCA Project: Clustering and De-duplication of web pages using KMeans and TF-IDF

Additional PCA Learning Materials

  1. Understand How the PCA Algorithm Works in Python
  2. Understand What the Explained Variance Is
  3. Understand the Different Types of PCA Plots
  4. Learn About the Common Methods used in PCA
  5. PCA using Python (scikit-learn) by Michael Galarnyk on Towardsdatascience
  6. Principal Component Analysis Visualization by Prasad Ostwal
  7. Performing and visualizing the Principal component analysis (PCA) from PCA function and scratch in Python by Renesh Bedre

How Many Dimensions Should You Reduce Your Data with PCA?

To decide the number of dimensions to reduce your data to, make a scree plot showing each individual component’s explained variance, with the cumulative explained variance overlaid.

For the Iris dataset, we could decide to keep 90%+ of the variance and thus keep 3 principal components.
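
Here is a minimal sketch of such a plot. It assumes the scaled features x_scaled from the preprocessing step above and fits a PCA that keeps all 4 components:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Keep all components to inspect the full variance breakdown
pca_all = PCA().fit(x_scaled)
explained = pca_all.explained_variance_ratio_
cumulative = np.cumsum(explained)
components = range(1, len(explained) + 1)

# Individual variance as bars, cumulative variance as a step line
plt.bar(components, explained, label='Individual')
plt.step(components, cumulative, where='mid', label='Cumulative')
plt.axhline(0.9, color='grey', linestyle='--', label='90% of variance')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot of the Iris Dataset')
plt.legend()
plt.show()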

Full PCA Code

import pandas as pd 
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()

# load features and targets separately
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Data Scaling
x_scaled = StandardScaler().fit_transform(X)

# Dimension Reduction
pca = PCA(n_components=2)
pca_features = pca.fit_transform(x_scaled)
 
# Show PCA characteristics
print('Shape before PCA: ', x_scaled.shape)
print('Shape after PCA: ', pca_features.shape)
print('PCA Explained variance:', pca.explained_variance_)

# Create PCA DataFrame 
pca_df = pd.DataFrame(
    data=pca_features, 
    columns=[
        'Principal Component 1', 
        'Principal Component 2'
        ])


# Map target names to targets
target_names = {
    0:'setosa',
    1:'versicolor', 
    2:'virginica'
}

pca_df['target'] = y
pca_df['target'] = pca_df['target'].map(target_names)
pca_df.sample(10)

# Plot 2D PCA Graph
sns.lmplot(
    x='Principal Component 1', 
    y='Principal Component 2', 
    data=pca_df, 
    hue='target', 
    fit_reg=False, 
    legend=True
    )

plt.title('2D PCA Graph of Iris Dataset')
plt.show()


# Bar plot of explained_variance
plt.bar(
    range(1,len(pca.explained_variance_)+1),
    pca.explained_variance_
    )

plt.xlabel('PCA Feature')
plt.ylabel('Explained variance')
plt.title('Feature Explained Variance')
plt.show()

Python PCA and Machine Learning Definitions

Principal Component Analysis: linear algorithm for dimension reduction
PCA Python library: sklearn.decomposition.PCA
Plot PCA in Python: Seaborn’s lmplot() can be used
PCA usage: speed and simplicity


Conclusion

Congratulations, you have now learned one of the most important dimension reduction techniques.

With principal component analysis (PCA) you can optimize machine learning models and create more insightful visualizations.

To go further, explore the relationship between each feature and the principal components by creating 2D and 3D loading plots and biplots, as covered in the follow-up tutorials listed above.
