Measures of Spread

In this guide, we introduce one family of summary statistics, the measures of variability, and provide Python examples that show how to compute each one.

What are the Measures of Variability (Spread) in Statistics

The measures of variability, also known as measures of spread, are the summary statistics used to understand the variability of the data (how close together or spread apart the data points are).

Plots showing the spread with measures of variability and the Probability Density Function

The 8 measures of variability in summary statistics are the range, the interquartile range (IQR), the variance, the standard deviation, the Coefficient of Variation (CV), the Mean Absolute Deviation, the Root Mean Square (RMS) and the Percentile Ranges.


    1. Range: Difference between the maximum and minimum values in a dataset
    2. Interquartile Range (IQR): Difference between the third and first quartiles (Q3 and Q1). Focuses on the middle 50% to reduce the impact of outliers.
    3. Variance: Average squared distance of each data point from the mean
    4. Standard Deviation: Square root of the variance
    5. Coefficient of Variation (CV): Percentage ratio of the standard deviation to the mean
    6. Mean Absolute Deviation: Average absolute difference between data points and the mean
    7. Root Mean Square (RMS): Square root of the mean of the squared values.
    8. Percentile Ranges: Ranges between specific percentiles to provide insights into the central portion of the data, which is less influenced by extreme values.

    Overview of Measures of Variability in Python

    | Name | Description | When to Use | Python Function |
    |---|---|---|---|
    | Range | Difference between max and min. | Quick overview of spread. | max(data) - min(data) |
    | Interquartile Range (IQR) | Range of middle 50% data. | Robust to outliers. | np.quantile(data, 0.75) - np.quantile(data, 0.25) |
    | Variance | Average squared deviations. | Sensitive to outliers. | np.var(data) |
    | Standard Deviation | Square root of variance. | Interpretable, in same units. | np.std(data) |
    | Coefficient of Variation (CV) | Standard deviation relative to mean. | Comparing datasets with different scales. | (np.std(data) / np.mean(data)) * 100 |
    | Mean Absolute Deviation (MAD) | Average absolute deviations. | Less sensitive to outliers than the variance. | np.mean(np.abs(data - np.mean(data))) |
    | Root Mean Square (RMS) | Square root of mean of squared values. | Used in signal processing. | np.sqrt(np.mean(np.square(data))) |
    | Percentile Ranges | Ranges between specific percentiles. | Highlight central data range. | np.percentile(data, q) - np.percentile(data, p) |
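
The one-liners in the table above can be gathered into a single helper. The sketch below is illustrative only: the function name `spread_summary` and its `lower`/`upper` parameters are hypothetical, and it assumes the sample formulas (`ddof=1`) for the variance and standard deviation.

```python
import numpy as np


def spread_summary(data, lower=10, upper=90):
    """Return the eight measures of variability for a 1-D dataset."""
    x = np.asarray(data, dtype=float)
    mean = x.mean()
    q1, q3 = np.percentile(x, [25, 75])
    p_low, p_high = np.percentile(x, [lower, upper])
    std = x.std(ddof=1)  # sample standard deviation (divides by n - 1)
    return {
        "range": x.max() - x.min(),
        "iqr": q3 - q1,
        "variance": x.var(ddof=1),
        "std": std,
        "cv_percent": std / mean * 100,  # not meaningful when the mean is 0
        "mean_abs_dev": np.mean(np.abs(x - mean)),
        "rms": np.sqrt(np.mean(x ** 2)),
        "percentile_range": p_high - p_low,
    }


summary = spread_summary(range(1, 10))
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Each measure is covered in its own section below.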

    What is the Range in Statistics

    The Range in statistics represents the difference between the maximum and minimum values of a dataset.

    How to Calculate the Range

    The range is calculated by subtracting the minimum value from the maximum value.

    range = max - min

    How to Calculate the Range in Python

    To calculate the range of a set of values in Python, use the max() and min() functions. Note that Python's built-in range() function creates a sequence of integers; it does not calculate the statistical range.

    values = range(1, 10) # create range
    rg = max(values) - min(values) # calculate range
    
    print('Values:', list(values))
    print('Range:', rg)
    
    Values: [1, 2, 3, 4, 5, 6, 7, 8, 9]
    Range: 8 
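
If the values are already in a NumPy array, `np.ptp()` ("peak to peak") returns the same max minus min difference in one call. The sketch below also shows the `axis` argument on a small table; the `readings` array is made up for illustration.

```python
import numpy as np

values = np.arange(1, 10)
print("Range:", np.ptp(values))  # peak to peak: max - min

# Hypothetical table: one row per day, one column per sensor
readings = np.array([[1, 20, 300],
                     [4, 25, 310],
                     [9, 22, 305]])
print("Range per column:", np.ptp(readings, axis=0))
```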

    What is the Interquartile Range (IQR) in Statistics

    The Interquartile Range, or IQR, in summary statistics represents the difference between the third quartile (Q3) and the first quartile (Q1). By focusing on the middle 50% of the data, the interquartile range allows an analysis that is less influenced by extreme values.

    Simply put, the interquartile range is the height of the box in a boxplot.

    Interquartile range (IQR) on a box plot

    To understand IQR, we need to introduce the concept of quantiles.

    What are Quantiles (Percentiles) in Statistics

    Quantiles are the cut points that split sorted data into equal-sized parts. Percentiles are quantiles expressed on a 0 to 100 scale. For example, the three cut points that split the data into 4 equal parts are called quartiles.

    We can use the np.quantile() function in Python to find the value at any quantile.

    import numpy as np
    
    # Example numerical dataset
    data = np.array(range(1,10))
    
    # Calculate the median (50th percentile)
    median = np.quantile(data, 0.5)
    print("Median (50th percentile):", median)
    
    print(np.median(data) == np.quantile(data, 0.5))
    
    Median (50th percentile): 5.0
    True

    As the code above shows, the 0.5 quantile (50th percentile) is the same as the median.

    You can also calculate the 25th, 75th and 90th percentiles of your dataset.

    # Calculate the 25th percentile (1st quartile)
    q1 = np.quantile(data, 0.25)
    print("25th Percentile (1st Quartile):", q1)
    
    # Calculate the 75th percentile (3rd quartile)
    q3 = np.quantile(data, 0.75)
    print("75th Percentile (3rd Quartile):", q3)
    
    # Calculate a custom quantile (e.g., the 90th percentile)
    custom_quantile = np.quantile(data, 0.9)
    print("90th Percentile (Custom Quantile):", custom_quantile)
    
    # Calculate all quartiles at once
    quartiles = np.quantile(data, [0,0.25,0.5,0.75,1])
    print("Quartiles:", quartiles)
    
    25th Percentile (1st Quartile): 3.0
    75th Percentile (3rd Quartile): 7.0
    90th Percentile (Custom Quantile): 8.2
    Quartiles: [1. 3. 5. 7. 9.]
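
The 90th percentile above falls between two data points, which is why it is not one of the original values. A minimal sketch of the linear interpolation that NumPy applies by default is shown below; the helper name `quantile_linear` is hypothetical and the function is only meant to illustrate the idea.

```python
import numpy as np


def quantile_linear(data, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    x = np.sort(np.asarray(data, dtype=float))
    position = (len(x) - 1) * q          # fractional index into the sorted data
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    fraction = position - lower
    return x[lower] + fraction * (x[upper] - x[lower])


data = np.array(range(1, 10))
manual = quantile_linear(data, 0.9)
print("Manual 90th percentile:", manual)
print("Matches np.quantile:", np.isclose(manual, np.quantile(data, 0.9)))
```

With 9 values, the 0.9 quantile sits at index 7.2, so the result is 20% of the way from the 8th value to the 9th.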

    Visualize Quantiles with a Box Plot in Python

    A common way to visualize quantiles in Python is to create a box plot from your data.

    import matplotlib.pyplot as plt
    import numpy as np
    
    # Example numerical dataset
    data = np.array(range(10))
    
    # Create a box plot
    plt.boxplot(data)
    
    # Add labels and a title
    plt.xlabel("Dataset")
    plt.ylabel("Values")
    plt.title("Box Plot of Quartiles")
    
    # Show the plot
    plt.show()
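
To see which numbers the box plot is drawing, the quartiles and whisker limits can be computed directly. This is a sketch that assumes Matplotlib's default whisker setting (`whis=1.5`, so whiskers reach the most extreme points within 1.5 × IQR of the box); the variable names are illustrative.

```python
import numpy as np

data = np.array(range(10))

q1, median, q3 = np.quantile(data, [0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]

print("Box:", q1, "to", q3, "with median", median)
print("Whiskers:", inside.min(), "to", inside.max())
```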
    

    How to Calculate the Interquartile Range

    The interquartile range is calculated by subtracting the first quartile (Q1) from the third quartile (Q3).

    IQR = Q3(data) - Q1(data)

    How to Calculate the Interquartile Range in Python

    The interquartile range (IQR) can be calculated in Python by using np.quantile() to subtract the first quartile from the third, or by using the iqr() function from the scipy.stats module.

    With np.quantile

    import numpy as np
    
    data = np.array(range(1, 10))
    np.quantile(data, 0.75) - np.quantile(data, 0.25)
    
    4.0

    With scipy.stats

    from scipy.stats import iqr
    
    iqr(range(1, 10))
    
    4.0

    What is the Variance in Statistics

    The variance in statistics is a measure of the spread or variability of a dataset.

    The variance quantifies how far individual data points in a dataset differ from the mean (average) of the dataset.

    Simply put, the variance shows how dispersed or scattered the data points are around the mean.

    On a scatter plot, we can easily visualize the dispersion of the data by modifying the variance.

    Scatterplots showing data with different variance

    How to Calculate the Variance

    To calculate the variance, subtract the mean from each data point, square the distances, sum all the squared values and divide by the number of data points minus 1.

    Formula of the variance:

    s² = Σ(xᵢ - x̄)² / (n - 1)

    where x̄ is the mean of the data and n is the number of data points.

    How to Calculate the Variance in Python

    To compute the variance in Python, either use the var() function from numpy or compute the sum of the squared distances from the mean and divide by the number of values minus 1.

    Calculate the variance with Python NumPy

    To calculate the variance in Python with NumPy, use the var() function with the ddof argument set to 1. The ddof ("delta degrees of freedom") argument sets the divisor to n - ddof: the default ddof=0 divides by n (population variance), while ddof=1 divides by n - 1 (sample variance).

    # Variance with np.var()
    import numpy as np
    
    data_points = [1, 2, 3, 4, 5]
    print(np.var(data_points, ddof=1))  # 2.5
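
To make the effect of `ddof` concrete, the sketch below computes both versions on the same data and compares them with the `statistics` module from the standard library, which exposes the two formulas under separate function names. The variable names are illustrative.

```python
import statistics

import numpy as np

data_points = [1, 2, 3, 4, 5]

population_var = np.var(data_points)       # ddof=0: divides by n
sample_var = np.var(data_points, ddof=1)   # ddof=1: divides by n - 1

print("Population variance:", population_var)
print("Sample variance:", sample_var)

# The standard library makes the same distinction with two function names
print("statistics.pvariance:", statistics.pvariance(data_points))
print("statistics.variance:", statistics.variance(data_points))
```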
    
    Calculate the variance manually with Python

    Follow these steps to calculate the variance manually in Python:

    1. Calculate distances from the mean
    2. Square the distances
    3. Sum the squared distances
    4. Count the number of data points
    5. Divide the sum of the squared distances by n-1

    import numpy as np
    
    # Use a NumPy array so the mean can be subtracted element-wise
    data_points = np.array([1, 2, 3, 4, 5])
    
    
    # 1. Calculate distances from the mean
    distances = data_points - np.mean(data_points)
    print('1. Mean distances:', distances)
    
    # 2. Square the distances
    sqr_distances = distances ** 2
    print('2. Squared distances:', sqr_distances)
    
    # 3. Sum the squared distances
    summed_distances = sum(sqr_distances)
    print('3. Sum of the squared distances:', summed_distances)
    
    # 4. Find Number of data points
    n = len(data_points)
    print('4. Number of data points:', n)
    
    # 5. Divide the sum of the squared distances by n-1
    variance = summed_distances / (n - 1)
    print('5. Variance:', variance)
    

    Output

    1. Mean distances: [-2. -1.  0.  1.  2.]
    2. Squared distances: [4. 1. 0. 1. 4.]
    3. Sum of the squared distances: 10.0
    4. Number of data points: 5
    5. Variance: 2.5

    Note that the variance is written s² because it is expressed in squared units. Taking its square root gives the standard deviation.

    What is the Standard Deviation in Statistics

    The standard deviation in statistics is a measure of the spread or variability of a dataset calculated using the square-root of the variance.

    Simply put, the standard deviation is the square-root of the variance.

    The advantage of the standard deviation compared to the variance is that the standard deviation is in the same units as your data points (seconds, minutes, days, etc.).

    How to Calculate the Standard Deviation

    To calculate the standard deviation, calculate the variance and compute the square-root of the variance.

    Formula of the standard deviation:

    s = sqrt(Σ(xᵢ - x̄)² / (n - 1))

    How to Calculate the Standard Deviation in Python

    To compute the standard deviation in Python, either use the std() function from numpy or use the np.sqrt() function on the computed variance.

    np.std(data_points)
    np.sqrt(np.var(data_points))

    Note that, by default, np.std() and np.var() divide by n (ddof=0), not n-1. Pass ddof=1 to get the sample standard deviation.

    import numpy as np
    
    data_points = [1,2,3,4,5] 
    
    # Population standard deviation with np.std (divides by n)
    print('std():',np.std(data_points))
    
    # Same result from the square root of np.var() (divides by n)
    variance = np.var(data_points)
    print('np.sqrt(np.var()):',np.sqrt(variance))
    
    # Sample standard deviation with np.std (ddof=1 divides by n-1)
    print('std(ddof=1):',np.std(data_points, ddof=1))
    
    # Same result from the square root of np.var(ddof=1)
    variance = np.var(data_points, ddof=1)
    print('np.sqrt(np.var(ddof=1)):',np.sqrt(variance))
    

    Output

    std(): 1.4142135623730951
    np.sqrt(np.var()): 1.4142135623730951
    
    std(ddof=1): 1.5811388300841898
    np.sqrt(np.var(ddof=1)): 1.5811388300841898
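
The gap between the two results depends only on the number of data points: the ratio of the n-1 version to the n version is sqrt(n / (n - 1)). The sketch below checks this on randomly generated data; the seed and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

for n in (5, 50, 5000):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    pop_std = np.std(sample)             # divides by n
    sample_std = np.std(sample, ddof=1)  # divides by n - 1
    # The ratio depends only on n: sqrt(n / (n - 1))
    print(n, round(sample_std / pop_std, 4), round(np.sqrt(n / (n - 1)), 4))
```

For small datasets the choice of ddof matters noticeably; for large ones the two values are nearly identical.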

    What is the Coefficient of Variation (CV) in Statistics

    The Coefficient of Variation, or CV, in statistics is a relative measure of variability represented by the percentage ratio of the standard deviation to the mean.

    How to Calculate the Coefficient of Variation

    To calculate the coefficient of variation (CV), you need to calculate the standard deviation and the mean of a dataset. Then, find the coefficient of variation by computing the ratio of the standard deviation to the mean and multiplying by 100.

    cv = (std/mean) * 100

    How to Calculate the Coefficient of Variation in Python

    To calculate the coefficient of variation in Python, use the mean() and std() functions of the numpy library, then divide the standard deviation by the mean.

    import numpy as np
    
    # Sample data
    data_points = [1, 2, 3, 4, 5]
    
    # Calculate the mean (average) of the data
    mean = np.mean(data_points)
    
    # Calculate the standard deviation of the data (np.std divides by n by default)
    std_dev = np.std(data_points)
    
    # Calculate the Coefficient of Variation (CV)
    cv = (std_dev / mean) * 100
    
    # Print the result
    print(f"Coefficient of Variation (CV): {cv:.2f}%")
    
    
    Coefficient of Variation (CV): 47.14%
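
Because the CV is unitless, it is mostly useful for comparing datasets measured on different scales. The sketch below wraps the calculation in a hypothetical helper, `coefficient_of_variation`, and applies it to two made-up datasets; it also guards against a mean of 0, for which the ratio is undefined.

```python
import numpy as np


def coefficient_of_variation(data, ddof=0):
    """Return the CV as a percentage; ddof=0 matches np.std's default."""
    x = np.asarray(data, dtype=float)
    mean = x.mean()
    if mean == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return x.std(ddof=ddof) / mean * 100


# Hypothetical measurements on very different scales
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

print(f"Heights CV: {coefficient_of_variation(heights_cm):.2f}%")
print(f"Weights CV: {coefficient_of_variation(weights_kg):.2f}%")
```

Here the weights vary more relative to their mean than the heights do, even though the two standard deviations are not directly comparable.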

    What is the Mean Absolute Deviation in Statistics

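    The Mean Absolute Deviation, or MAD, in statistics is the average absolute difference between data points and the mean. It is often used as an alternative to the standard deviation because it does not square the deviations, so large deviations are weighted less heavily. Note that the abbreviation MAD is also commonly used for the median absolute deviation, which is a different statistic.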

    How to Calculate the Mean Absolute Deviation

    To calculate the mean absolute deviation, calculate the differences between each data point and the mean. Then, get the absolute values. Finally, compute the mean.

    mad = mean(absolute(data_points - mean(data_points)))

    How to Calculate the Mean Absolute Deviation in Python

    To calculate the mean absolute deviation (MAD) in Python, subtract the mean from each data point, take the absolute values of the differences, then compute their mean.

    import numpy as np
    
    # Use a NumPy array so the mean can be subtracted element-wise
    data_points = np.array([1, 2, 3, 4, 5])
    
    # 1. Calculate distances from the mean
    distances = data_points - np.mean(data_points)
    print('1. Mean distances:', distances)
    
    # 2. Calculate the mean absolute deviation
    mad = np.mean(np.abs(distances))
    print('2. Mean absolute deviation:', mad)
    
    1. Mean distances: [-2. -1.  0.  1.  2.]
    2. Mean absolute deviation: 1.2
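
Since the deviations are not squared, an extreme value inflates the mean absolute deviation less than it inflates the standard deviation. The sketch below illustrates this with a hypothetical helper, `mean_abs_dev`, and a made-up outlier.

```python
import numpy as np


def mean_abs_dev(data):
    """Mean absolute deviation around the mean."""
    x = np.asarray(data, dtype=float)
    return np.mean(np.abs(x - x.mean()))


clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]  # hypothetical extreme value

for label, values in (("clean", clean), ("with outlier", with_outlier)):
    print(label, "MAD:", round(mean_abs_dev(values), 2),
          "std:", round(np.std(values), 2))
```

The mean absolute deviation never exceeds the standard deviation computed with n, and the gap widens as the data becomes more extreme.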

    What is the Root Mean Square (RMS) in Statistics

    The Root Mean Square, or RMS, in statistics is the square root of the mean of the squared values. It measures the typical magnitude of the values themselves, so it equals the standard deviation (computed with n) only when the mean is 0.

    How to Calculate the Root Mean Square

    Mathematically, the RMS can be represented as:

    RMS = sqrt((x₁² + x₂² + ... + xₙ²) / n)

    How to Calculate the Root Mean Square in Python

    To calculate the root mean square in Python:

    1. Square each data point in your dataset.
    2. Calculate the mean of the squared values.
    3. Take the square root of the mean of the squared values.

    import numpy as np
    
    # Sample data
    data_points = [1, 2, 3, 4, 5]
    
    # 1. Square each value
    squared_data = [x**2 for x in data_points]
    
    # 2. Find the mean of the squared values
    mean_squared = np.mean(squared_data)
    
    # 3. Take the square root
    rms = np.sqrt(mean_squared)
    
    print(f"Root Mean Square (RMS): {rms}")
    
    Root Mean Square (RMS): 3.3166247903554
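
The RMS is measured from zero rather than from the mean, which is why the result above is larger than the standard deviation of the same data. The two are linked by the identity RMS² = mean² + variance (with the variance computed using n). The sketch below checks this numerically; the variable names are illustrative.

```python
import numpy as np

data_points = np.array([1, 2, 3, 4, 5])

rms = np.sqrt(np.mean(data_points ** 2))
mean = np.mean(data_points)
population_var = np.var(data_points)  # ddof=0, divides by n

print("RMS squared:", rms ** 2)
print("mean^2 + variance:", mean ** 2 + population_var)

# After centering the data, the RMS equals the population standard deviation
centered = data_points - mean
print("RMS of centered data:", np.sqrt(np.mean(centered ** 2)))
print("np.std:", np.std(data_points))
```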

    What are Percentile Ranges in Statistics

    A percentile range is a summary statistic computed as the difference between two specific percentiles (e.g. the 10th and 90th percentiles). Percentile ranges describe the central X% of the data and are less influenced by outliers than the full range.

    How to Calculate the Percentile Ranges in Python

    Calculating percentile ranges in Python is similar to calculating interquartile ranges, but with custom percentiles specified. You can do so by using the percentile() function of the numpy library.

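    import numpy as np
    
    data = np.array(range(1, 10))
    
    # Calculate the lower percentile (here the 25th, i.e. Q1)
    q1 = np.percentile(data, 25)
    
    # Calculate the upper percentile (here the 75th, i.e. Q3)
    q3 = np.percentile(data, 75)
    
    # Calculate the percentile range
    percentile_range = q3 - q1
    print('Percentile range:', percentile_range)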
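
The snippet above uses the 25th and 75th percentiles, which reproduces the IQR. A small helper makes it easy to pick other bounds, such as the 10th and 90th percentiles. This is a sketch: the function name `percentile_range` and its default bounds are assumptions made for illustration.

```python
import numpy as np


def percentile_range(data, lower=10, upper=90):
    """Width of the interval between two percentiles (given on a 0-100 scale)."""
    if not 0 <= lower < upper <= 100:
        raise ValueError("expected 0 <= lower < upper <= 100")
    p_low, p_high = np.percentile(data, [lower, upper])
    return p_high - p_low


data = np.array(range(1, 10))
print("10th-90th percentile range:", percentile_range(data))
print("25th-75th percentile range (IQR):", percentile_range(data, 25, 75))
```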
    

    Outliers

    Outliers in statistics are extreme data points that differ significantly from the rest of the data. They can have a big impact on the measures of variability and should be understood, and dealt with accordingly, by statisticians and data scientists. Check out our tutorial to understand how to identify outliers in a dataset.
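
To illustrate that impact, the sketch below adds a single made-up extreme value to a small dataset and compares three measures before and after. The helper name `describe` and the outlier value are hypothetical.

```python
import numpy as np

clean = np.arange(1, 10)               # 1, 2, ..., 9
with_outlier = np.append(clean, 100)   # hypothetical extreme value added


def describe(x):
    """Return the range, sample standard deviation and IQR of an array."""
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "range": x.max() - x.min(),
        "std": x.std(ddof=1),
        "iqr": q3 - q1,
    }


for label, x in (("clean", clean), ("with outlier", with_outlier)):
    stats = describe(x)
    print(label, {name: round(float(value), 2) for name, value in stats.items()})
```

The range and the standard deviation grow by an order of magnitude, while the IQR barely moves.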
