How to Use a Confusion Matrix in Scikit-Learn

The confusion matrix is often used in machine learning to evaluate how well a classification algorithm performs.

It can be used in binary classifications as well as multiclass classification problems.
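As a quick illustration of the multiclass case, here is a minimal sketch that builds a 3×3 matrix with scikit-learn's confusion_matrix function. The animal labels and predictions are made up for the example; they do not come from any dataset in this article.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical three-class labels, invented for illustration
y_true_multi = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred_multi = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

# Fix the row/column order explicitly so the matrix is easy to read
labels_multi = ["bird", "cat", "dog"]
cm_multi = confusion_matrix(y_true_multi, y_pred_multi, labels=labels_multi)

# Rows are true classes, columns are predicted classes
print(cm_multi)
```

With three classes there is one row and one column per class, and the diagonal still counts the correct predictions.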

What Does the Confusion Matrix Measure?

It measures the quality of predictions from a classification model by looking at how many predictions are correct and how many are incorrect.



Specifically, it computes:

  • True positives (TP)
  • False positives (FP)
  • True negatives (TN)
  • False negatives (FN)

Understand the Confusion Matrix

Here, we will try to make sense of what the true positive, true negative, false positive and false negative values mean.

True Positive

The model predicted true and it is true.

The model predicted that someone is sick and the person is sick.

True Negative

The model predicted false and it is false.

The model predicted that someone is not sick and the person is not sick.

False Positive

The model predicted true and it is false.

The model predicted that someone is sick and the person is not sick.

False Negative

The model predicted false and it is true.

The model predicted that someone is not sick and the person is sick.
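The four definitions above can be checked by counting them by hand. This is a minimal sketch in plain Python; the `actual` and `predicted` lists are hypothetical values where True means "sick".

```python
# Hypothetical ground truth and predictions: True means "sick"
actual    = [True, True, False, False, True, False, False, True]
predicted = [True, False, False, True, True, False, False, True]

pairs = list(zip(actual, predicted))

tp = sum(1 for a, p in pairs if a and p)          # sick, predicted sick
tn = sum(1 for a, p in pairs if not a and not p)  # not sick, predicted not sick
fp = sum(1 for a, p in pairs if not a and p)      # not sick, predicted sick
fn = sum(1 for a, p in pairs if a and not p)      # sick, predicted not sick

print(tp, tn, fp, fn)
```

Every prediction falls into exactly one of the four buckets, so the four counts always add up to the number of samples.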

How to Create a Confusion Matrix in Scikit-learn?

To create the confusion matrix, we will:

  • Run a classification algorithm
  • Create a confusion matrix
  • Plot the confusion matrix
  • Inspect the classification report

Run a classification algorithm

In a previous article, we classified breast cancers using the k-nearest neighbors algorithm from scikit-learn.

I will not explain this part of the code, but you can look at the detail in the article on the k-nearest neighbors.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def to_target(x):
    """Map targets to target names"""    
    return list(dataset.target_names)[x]

# Load data
dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data,columns=dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df['target_names'] = df['target'].apply(to_target)


# Define predictor and predicted datasets
X = df.drop(['target','target_names'], axis=1).values
y = df['target_names'].values

# split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# train the model
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# compute accuracy of the model
knn.score(X_test, y_test)

The result is an accuracy score of the model.

0.9239766081871345
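For a classifier, the score method returns the mean accuracy, so it should agree with accuracy_score computed from the predictions. The sketch below checks this on the same dataset. It is a simplified variant, not the code above: it uses the numeric targets directly, and the variable names (`X_demo`, `model`, and so on) are introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features and numeric targets (assumption: 0/1 labels instead of names)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

model = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
preds = model.predict(X_te)

score_from_model = model.score(X_te, y_te)      # mean accuracy
score_from_preds = accuracy_score(y_te, preds)  # same metric from predictions
score_manual = (preds == y_te).mean()           # fraction of correct predictions

print(score_from_model, score_from_preds, score_manual)
```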

Create a confusion matrix

Use the confusion_matrix function from sklearn.metrics to compute the confusion matrix.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,y_pred)
cm

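The result is a 2×2 array. Rows are the true classes and columns are the predicted classes, both in sorted label order. If we treat the first class as the positive one, the positions map to the four values described earlier as follows.

array([[ 57,   7],
       [  5, 102]])
  • cm[0][0] = TP
  • cm[1][1] = TN
  • cm[0][1] = FN
  • cm[1][0] = FP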
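Because rows are true classes and columns are predicted classes, the diagonal of the matrix holds the correct predictions, and the accuracy can be recovered from it. Here is a minimal sketch on a small hypothetical set of labels (the `_toy` values are invented for illustration, not taken from the model above).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, invented for illustration
y_true_toy = ["benign", "benign", "malignant", "benign",
              "malignant", "malignant", "benign", "benign"]
y_pred_toy = ["benign", "malignant", "malignant", "benign",
              "benign", "malignant", "benign", "benign"]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=["benign", "malignant"])

correct = np.trace(cm_toy)         # sum of the diagonal = correct predictions
total = cm_toy.sum()               # all predictions
acc_from_cm = correct / total

row_totals = cm_toy.sum(axis=1)    # how many samples truly belong to each class
col_totals = cm_toy.sum(axis=0)    # how many samples were predicted as each class

print(cm_toy)
print(acc_from_cm, accuracy_score(y_true_toy, y_pred_toy))
```

Passing `labels` explicitly is optional, but it makes the row and column order visible in the code instead of relying on the sorted order.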

Plot the confusion matrix

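You can use ConfusionMatrixDisplay.from_estimator to visualize the confusion matrix. (The older plot_confusion_matrix function was deprecated in scikit-learn 1.0 and removed in 1.2.)

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# White text suits a dark background; use 'black' on a light one
color = 'white'
matrix = ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test, cmap=plt.cm.Blues)
matrix.ax_.set_title('Confusion Matrix', color=color)
matrix.ax_.set_xlabel('Predicted Label', color=color)
matrix.ax_.set_ylabel('True Label', color=color)
matrix.figure_.axes[0].tick_params(colors=color)  # matrix axes
matrix.figure_.axes[1].tick_params(colors=color)  # colorbar axes
plt.show()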

The result is your confusion matrix plot. Rows show the true label and columns show the predicted label. With benign treated as the positive class:

  • Top left quadrant = True Positives = Number of benign labelled as benign
  • Bottom right quadrant = True Negatives = Number of malignant labelled as malignant
  • Top right quadrant = False Negatives = Number of benign labelled as malignant
  • Bottom left quadrant = False Positives = Number of malignant labelled as benign
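When the classes are imbalanced, raw counts can be hard to compare between rows. One option is the `normalize` argument of confusion_matrix, which the plotting function accepts as well. The sketch below uses hypothetical labels and normalizes each row, so every cell becomes the fraction of that true class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical imbalanced labels: 6 benign, 2 malignant
y_true_imb = ["benign"] * 6 + ["malignant"] * 2
y_pred_imb = ["benign"] * 5 + ["malignant"] + ["benign"] + ["malignant"]

# normalize="true" divides each row by the number of true samples in that class
cm_norm = confusion_matrix(
    y_true_imb, y_pred_imb, labels=["benign", "malignant"], normalize="true"
)

print(cm_norm)
```

Each row now sums to 1, and the diagonal shows the share of each class that was labelled correctly.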

Run the classification report

With data from the confusion matrix, you can interpret the results by looking at the classification report.

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

If you don’t understand the result above, make sure that you read the article that I wrote on the classification report.
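The precision and recall shown in the report for a given class can be derived from the confusion matrix. The following is a minimal sketch for the class "malignant", using hypothetical labels (the `_rep` values are invented for illustration), checked against scikit-learn's precision_score and recall_score.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, invented for illustration
y_true_rep = ["benign", "benign", "benign", "benign",
              "malignant", "malignant", "malignant", "benign"]
y_pred_rep = ["benign", "malignant", "benign", "benign",
              "malignant", "malignant", "benign", "malignant"]

cm_rep = confusion_matrix(y_true_rep, y_pred_rep, labels=["benign", "malignant"])

# Counts for the class "malignant" (row/column index 1)
hits = cm_rep[1, 1]          # malignant predicted as malignant
false_alarms = cm_rep[0, 1]  # benign predicted as malignant
misses = cm_rep[1, 0]        # malignant predicted as benign

precision_malignant = hits / (hits + false_alarms)
recall_malignant = hits / (hits + misses)

print(precision_malignant, recall_malignant)
print(precision_score(y_true_rep, y_pred_rep, pos_label="malignant"),
      recall_score(y_true_rep, y_pred_rep, pos_label="malignant"))
```

Precision reads down a column (of everything predicted malignant, how much was right), while recall reads across a row (of everything truly malignant, how much was found).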

Conclusion

This article covered a lot of ground.

All I want you to leave with is that it is super important to look at the confusion matrix to help you fine-tune your machine learning models.

Acting on the kinds of errors it reveals can change the accuracy score quite heavily in some cases.

Good work on building your first confusion matrix in scikit-learn.