How to Find Outliers (Python Example)

Outliers can have a big impact on measures of variability, so statisticians and data scientists need to understand them and deal with them accordingly. In this tutorial, we will use summary statistics and Python to find outliers in a dataset.

Outliers on a box plot

What are Outliers in Statistics

In statistics, outliers are extreme data points that differ significantly from the rest of the data.

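A data point is considered an outlier if it satisfies either of these rules, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range: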

  • data point < Q1 – (1.5 * IQR)
  • data point > Q3 + (1.5 * IQR)

Simply put, outliers are the data points in a dataset that fall outside the box plot limits.
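
To make the rule concrete, here is a minimal sketch on a small hand-made dataset. The values are made up for illustration, and the quartiles rely on NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical toy data: eight ordinary values and one extreme value
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 50])

q1, q3 = np.percentile(toy, [25, 75])  # 3.0 and 7.0
iqr = q3 - q1                          # 4.0

lower_limit = q1 - 1.5 * iqr           # -3.0
upper_limit = q3 + 1.5 * iqr           # 13.0

# Points outside the limits are the outliers
toy_outliers = toy[(toy < lower_limit) | (toy > upper_limit)]
print(toy_outliers)  # [50]
```

Only the value 50 lies above the upper limit of 13, so it is the single outlier in this toy example.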

How to Find Outliers in a Dataset with Python

To find outliers in a dataset, we first compute the first quartile (Q1) and the third quartile (Q3). Then we compute the interquartile range (IQR = Q3 - Q1) and check whether each data point is smaller than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR.

First, we generate a custom dataset:

import numpy as np

# Generate a random dataset with outliers
data = np.concatenate([
    np.random.normal(0, 1, 100),  # 100 typical points centered on 0
    np.random.normal(10, 1, 10),  # 10 extreme points centered on 10
])

We can calculate the quartiles and the IQR with either np.percentile or np.quantile. Alternatively, the scipy.stats.iqr function computes the IQR directly.
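
As a quick check that the two NumPy functions agree, the sketch below computes the same quartiles both ways. The difference is only in the units: np.percentile takes percentages from 0 to 100, while np.quantile takes fractions from 0 to 1. The sample array and the seed are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical sample, seeded so the run is repeatable
rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 50)

# np.percentile expects percentages in the range 0 to 100
p_q1, p_q3 = np.percentile(sample, [25, 75])

# np.quantile expects fractions in the range 0 to 1
f_q1, f_q3 = np.quantile(sample, [0.25, 0.75])

print(np.isclose(p_q1, f_q1), np.isclose(p_q3, f_q3))
print("IQR:", p_q3 - p_q1)
```
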

# Calculate quartiles and IQR for outlier detection
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1 # equivalent to scipy.stats.iqr(data)

# Define the lower and upper bounds for outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]
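
The steps above can also be wrapped into a reusable function. The following is a sketch rather than part of the original tutorial: the name find_outliers_iqr and its factor parameter are hypothetical, and it uses a NumPy boolean mask in place of the list comprehension.

```python
import numpy as np

def find_outliers_iqr(values, factor=1.5):
    """Return the values outside [Q1 - factor*IQR, Q3 + factor*IQR].

    Hypothetical helper, for illustration only.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - factor * iqr
    upper_bound = q3 + factor * iqr
    # Boolean mask: True where a value falls outside the bounds
    mask = (values < lower_bound) | (values > upper_bound)
    return values[mask]

# Same kind of dataset as above, seeded so the run is repeatable
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 10)])

outliers = find_outliers_iqr(data)
print(f"Found {len(outliers)} outliers out of {len(data)} points")
```

Exposing the multiplier as a parameter makes it easy to try a stricter threshold, such as 3, when only the most extreme points should be flagged.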