As part of the series of tutorials on PCA with Python, we will learn how to plot a 2D PCA graph (scatter plot) on the Iris Dataset with Python, Scikit-learn and Matplotlib.
What is a 2D PCA Scatter Plot?
A 2D PCA (Principal Component Analysis) scatter plot is a PCA visualization that shows the distribution of data points in a two dimensional space after reducing a dataset to 2 PCA features.
How to Plot a 2D PCA Graph in Python?
To plot a 2D PCA scatter plot in Python, reduce the number of features to 2 principal components. Then use matplotlib to generate a two-dimensional scatter plot from the data.
Here are the detailed steps to plot a 2D PCA scatter plot in Python:
- Load the required Python Libraries
- Load your Dataset
- Scale and Reduce the Number of Features Using PCA
- Prepare the PCA DataFrame
- Plot the 2D Scatterplot with Seaborn’s lmplot
1. Loading the Required Python Libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
sns.set()
2. Loading the Iris Dataset in Python
To start, we load the Iris dataset in Python. In the next steps, we will do some preprocessing and use PCA to reduce the dataset to 2 features. To learn what this means, follow our tutorial on PCA with Python.
# load features and targets separately
iris = datasets.load_iris()
X = iris.data
y = iris.target
From this data, we will learn how to plot the 2D PCA graph with Python.
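Before reducing anything, it can help to confirm what was actually loaded. The snippet below is a minimal sketch, assuming the copy of Iris bundled with scikit-learn; the names `n_samples`, `n_features` and `class_counts` are illustrative and not part of the tutorial code.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Rows are flowers, columns are the measured features
n_samples, n_features = X.shape

# Number of samples per class label (0, 1, 2)
class_counts = np.bincount(y)

print(n_samples, n_features)
print(iris.feature_names)
print(list(iris.target_names))
print(class_counts)
```

The number of columns printed here is the feature count that PCA will reduce to 2 in the next step.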
3. Scale and Reduce the Number of Features Using PCA
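Next, scale the data before applying PCA, and set n_components to 2. Scaling matters because PCA is driven by variance, so a feature measured on a larger scale would otherwise dominate the components.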
# Data Scaling
x_scaled = StandardScaler().fit_transform(X)
# Reduce from 4 to 2 features with PCA
pca = PCA(n_components=2)
# Fit and transform data
pca_features = pca.fit_transform(x_scaled)
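A 2D plot is only as faithful as the share of variance the two components keep. As a rough check, the sketch below reads the `explained_variance_ratio_` attribute of the fitted PCA object; the variable names `ratios` and `retained` are illustrative and not part of the tutorial code.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
x_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
pca_features = pca.fit_transform(x_scaled)

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_

# Fraction kept by the 2D projection as a whole
retained = ratios.sum()

print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}, total: {retained:.1%}")
```

If the total is low for your own dataset, the 2D scatter plot may hide structure that only shows up in later components.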
4. Prepare the PCA DataFrame
Next, we will create a PCA dataframe from the principal component features and map the numeric targets to their species names for better legibility.
# Create dataframe
pca_df = pd.DataFrame(
    data=pca_features,
    columns=['PC1', 'PC2'])
# map target names to PCA features
target_names = {
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica'
}
pca_df['target'] = y
pca_df['target'] = pca_df['target'].map(target_names)
pca_df.head()
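The hand-written dictionary works well for Iris. As an alternative sketch, the species names can be read from the dataset itself through `iris.target_names`, which avoids retyping labels. This assumes the integer targets index directly into that array, which holds for the scikit-learn copy of Iris.

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x_scaled = StandardScaler().fit_transform(iris.data)
pca_features = PCA(n_components=2).fit_transform(x_scaled)

pca_df = pd.DataFrame(data=pca_features, columns=['PC1', 'PC2'])

# iris.target holds 0, 1, 2; indexing target_names with it yields the labels
pca_df['target'] = iris.target_names[iris.target]

print(pca_df.head())
```

Both approaches produce the same `target` column; the dictionary is more explicit, while this version follows whatever labels the dataset ships with.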
5. Plot the 2D Scatterplot with Seaborn’s lmplot
Finally, use seaborn’s lmplot function to plot the PCA dataframe as a two-dimensional scatter plot. Setting fit_reg=False turns off the regression line that lmplot draws by default.
sns.lmplot(
    x='PC1',
    y='PC2',
    data=pca_df,
    hue='target',
    fit_reg=False,
    legend=True
)
plt.title('2D PCA Graph')
plt.show()
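The same graph can also be drawn with matplotlib alone, which is handy when seaborn is not installed. The sketch below is one possible approach rather than the method used in this tutorial. `plot_pca_2d` is a hypothetical helper name, and labeling the axes with each component's share of variance is an optional addition.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def plot_pca_2d(pca_df, pca):
    """Scatter PC1 against PC2, one color per class (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 5))
    # One scatter call per class so each gets its own color and legend entry
    for name, group in pca_df.groupby('target'):
        ax.scatter(group['PC1'], group['PC2'], label=name, alpha=0.8)
    pc1, pc2 = pca.explained_variance_ratio_
    ax.set_xlabel(f'PC1 ({pc1:.1%} of variance)')
    ax.set_ylabel(f'PC2 ({pc2:.1%} of variance)')
    ax.set_title('2D PCA Graph')
    ax.legend(title='target')
    return fig, ax


iris = datasets.load_iris()
x_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2)
pca_df = pd.DataFrame(pca.fit_transform(x_scaled), columns=['PC1', 'PC2'])
pca_df['target'] = iris.target_names[iris.target]

fig, ax = plot_pca_2d(pca_df, pca)
```

Call `plt.show()` afterwards to display the figure, or `fig.savefig('pca_2d.png')` to write it to disk (the file name is only an example).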
Next Steps
After plotting a 2D PCA scatter plot, a natural next step is to learn how to plot a 3D PCA scatter plot and how to plot a 2D PCA biplot.
Full Code
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
sns.set()
# load features and targets separately
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Data Scaling
x_scaled = StandardScaler().fit_transform(X)
# Reduce from 4 to 2 features with PCA
pca = PCA(n_components=2)
# Fit and transform data
pca_features = pca.fit_transform(x_scaled)
# Create dataframe
pca_df = pd.DataFrame(
    data=pca_features,
    columns=['PC1', 'PC2'])
# map target names to PCA features
target_names = {
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica'
}
pca_df['target'] = y
pca_df['target'] = pca_df['target'].map(target_names)
# Plot the 2D PCA Scatterplot
sns.lmplot(
    x='PC1',
    y='PC2',
    data=pca_df,
    hue='target',
    fit_reg=False,
    legend=True
)
plt.title('2D PCA Graph')
plt.show()