The article Introduction to Summary Statistics for Data Science appeared first on JC Chouinard.

Summary statistics are used in descriptive statistics to summarize and describe observations in a dataset.

Summary statistics are generally used for data exploration to condense large amounts of data into their simplest patterns.

Summary statistics include measures such as:

- mean (average),
- median (middle value),
- mode (most frequent value),
- standard deviation (measure of data dispersion),
- quartiles (dividing data into four equal parts).
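As a quick sketch, all of these measures can be computed with pandas in a few lines (the dataset below is purely illustrative):

```python
import pandas as pd

# Illustrative dataset
data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(data.mean())    # mean (average): 5.0
print(data.median())  # median (middle value): 4.5
print(data.mode()[0]) # mode (most frequent value): 4
print(data.std())     # standard deviation (dispersion around the mean)
print(data.quantile([0.25, 0.5, 0.75]))  # quartiles
```

`data.describe()` returns most of these measures in a single call.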

Summary Statistics summarize and describe observations from a dataset by looking at:

- **Measures of central tendency**: mean, median, mode
- **Measures of the shapes of the distributions**: skewness, kurtosis
- **Measures of variability (spread, dispersion)**: variance, standard deviation
- **Measures of statistical dependence**: correlation

In statistics, measures of central tendency are used to summarize the data by finding where the center of the data is. The 3 measures of center are the *mean*, the *median*, and the *mode*.

- **Mean:** Average value of a dataset.
- **Median:** Middle value in a dataset.
- **Mode:** Most frequently occurring value in a dataset.

Here are general guidelines that will help you choose the right measure of central tendency.

- Mean: More sensitive to outliers. Better for symmetrical data (normally distributed).
- Median: Less sensitive to outliers. Better for non-symmetrical (skewed) data.
- Mode: More appropriate for categorical data
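A minimal example of this outlier sensitivity (the values are made up):

```python
import numpy as np

values = [30, 32, 34, 35, 36]
with_outlier = values + [200]  # add one extreme value

# The mean shifts considerably, the median barely moves
print(np.mean(values), np.mean(with_outlier))      # 33.4 vs ~61.2
print(np.median(values), np.median(with_outlier))  # 34.0 vs 34.5
```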

Read our tutorial on the measure of central tendency to learn more about the topic.

Data may be distributed in different ways. Sometimes it is symmetrical (e.g. normal distribution), sometimes it is non-symmetrical (e.g. right/left skewed), and sometimes it is narrower and steeper than others (high kurtosis).

To identify and describe these different shapes of distributions, statisticians mainly use two kinds of summary statistics:

- **Skewness**: measure of the asymmetry of a distribution
- **Kurtosis**: measure of the tailedness of a distribution

When evaluating the skewness in the data, we are evaluating the asymmetry of the distribution and whether it is normal, left or right skewed.

- **Zero skew:** mean = median (normal distribution)
- **Left skew:** mean < median
- **Right skew:** mean > median

When evaluating kurtosis in the data, we are evaluating the tailedness of the distribution and how prone the data is to extreme outliers. The three types of distributions are:

- **Leptokurtic**: Large tails, more extreme outliers, positive kurtosis
- **Mesokurtic**: Medium tails, kurtosis equal to zero
- **Platykurtic**: Thin tails, less extreme outliers, negative kurtosis
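Both skewness and kurtosis can be computed with `scipy.stats`; the random samples below are purely illustrative:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
symmetric = rng.normal(0, 1, 10_000)       # roughly zero skew
right_skewed = rng.exponential(1, 10_000)  # long right tail

print(skew(symmetric))      # close to 0
print(skew(right_skewed))   # clearly positive
print(kurtosis(symmetric))  # Fisher definition: close to 0 for normal data
```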

To learn more, read our article on the measures of shapes and distributions in summary statistics.

To understand the distribution of the data, it is important to understand the variability (or spread) of the data. The variability describes how close or spread apart the data points are.

The 8 measures of variability in summary statistics are the range, the interquartile range (IQR), the variance, the standard deviation, the Coefficient of Variation (CV), the Mean Absolute Deviation, the Root Mean Square (RMS) and the Percentile Ranges.

- **Range:** Difference between the maximum and minimum values in a dataset
- **Interquartile Range (IQR):** Difference between the third and first quartiles (Q3 and Q1). Focuses on the middle 50% to reduce the impact of outliers.
- **Variance:** Average squared distance of each data point from the mean
- **Standard Deviation:** Square-root of the variance
- **Coefficient of Variation (CV):** Percentage ratio of the standard deviation to the mean
- **Mean Absolute Deviation:** Average absolute difference between data points and the mean
- **Root Mean Square (RMS):** Square root of the mean of the squared values
- **Percentile Ranges:** Ranges between specific percentiles, providing insights into the central portion of the data that are less influenced by extreme values

To learn more about this topic, read our article on the measures of variability in statistics.

The 8 measures of statistical dependence used to evaluate the correlation between multiple variables are:

- **Covariance:** How much two random variables change together
- **Correlation Coefficient:** Linear relationship of two continuous variables
- **Spearman’s Rank Correlation:** Strength/direction of the monotonic relationship between two variables
- **Kendall’s Tau (τ):** Strength/direction of ordinal association between two variables
- **Point-Biserial Correlation:** Relationship between a continuous and a binary variable
- **Phi Coefficient (φ):** Association between two binary variables
- **Contingency Tables / Chi-Square Tests:** Association between two categorical variables
- **Cramér’s V:** Association for categorical variables based on the chi-square statistic

Read our tutorial on the topic to understand what the measures of statistical dependence are, how they work, and how to use them in Python.

Summary statistics are a crucial component of data preprocessing for machine learning.

As we have seen, summary statistics are used by data scientists to gain insight into the dataset’s central tendencies, variations, and outliers.

Using this information, data scientists can prepare the data for machine learning through data scaling, normalization, and handling missing values.

Summary statistics are also used by popular machine learning algorithms like decision trees, random forests, and k-nearest neighbors to make split decisions and/or determine feature importance.

For example, decision tree algorithms use measures like Gini impurity or entropy, which rely on summary statistics to evaluate feature splits. In regression models, summary statistics help identify relationships between variables, aiding in feature selection and model interpretation.
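As an illustration of the Gini impurity mentioned above, here is a generic sketch (not code from any particular library; the class labels are made up):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity = 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0: pure node, ideal split
print(gini_impurity([0, 0, 1, 1]))  # 0.5: maximally mixed two-class node
```

A decision tree then favors the split whose child nodes have the lowest combined impurity.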

At their core, summary statistics help to enhance data quality and facilitate machine learning model training.

In conclusion, summary statistics in Python play an important role in data understanding.

Summary statistics provide insights for data scientists and analysts into the central tendency, spread, and distribution of data.

Common summary statistics include:

- measures of central tendency (mean, median, mode),
- measures of variability (variance, standard deviation, range),
- measures of the shapes of the distributions (skewness, kurtosis),
- measures of statistical dependence (correlation coefficients like Pearson’s r, Spearman’s ρ, and Kendall’s τ).

Python has libraries such as NumPy, pandas, and SciPy to calculate and visualize summary statistics.

Understanding summary statistics is essential for data exploration, hypothesis testing, and model building, making it a fundamental skill for data scientists and analysts working with Python.

The article Measures of Central Tendency in Summary Statistics (Python Examples) appeared first on JC Chouinard.

In statistics and data science, measures of central tendency are used to summarize the data by finding where the center of the data is. The 3 measures of center are the *mean*, the *median*, and the *mode*.

- **Mean:** Average value of a dataset.
- **Median:** Middle value in a dataset.
- **Mode:** Most frequently occurring value in a dataset.

In statistics, the mean, also known as the average, is a measure of central tendency.

The mean is calculated by summing up all the values in a dataset and then dividing that sum by the total number of values. The formula for calculating the mean (μ) of a dataset is:

`Mean (μ) = Sum of all values / Number of values (n) `

For example, if we have 3 people aged 5, 7 and 8 years old, then the mean is

`Mean (μ) = (5 + 7 + 8) / 3 ~= 6.67`

And can be calculated in Python using the `np.mean` function of the numpy library.

```
import numpy as np
np.mean([5,7,8])
# 6.666666666666667
```

In statistics, the median is a measure of central tendency where 50% of the data is lower than it and 50% of the data is higher.

The median is calculated by sorting all the values in a dataset and then selecting the middle one.

In Python, the median can be calculated using the `np.median` function of the numpy library.

```
import numpy as np
np.median([1,2,3,4,5,6,7])
# 4.0
```

In statistics, the mode is a measure of central tendency representing the most frequently occurring value in a dataset.

When we use the `value_counts()` method on a Pandas DataFrame, we see the occurrences of values sorted by most frequent. The top value is the mode.

The mode can be calculated in Python using the `scipy.stats.mode()` or the `statistics.mode()` functions.

```
from scipy import stats
import statistics
data = [1,2,2,3,4,5,5,5]
print(stats.mode(data).mode)
print(statistics.mode(data))
```

The mode is often used on categorical variables since they are often unordered and generally don’t have a numeric representation.
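For instance, the mode of a categorical variable can be read straight from the frequency counts (the colour values below are made up):

```python
import pandas as pd

colours = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

print(colours.value_counts())  # occurrences sorted by frequency
print(colours.mode()[0])       # 'red', the most frequent category
```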

Choosing the right measure of central tendency (mean, median, or mode) depends on your data and the information you want to draw from it. While the mean is the most often used measure, it may not always be the best. Follow these quick guidelines to understand which measure to choose:

- Mean: More sensitive to outliers. Better for symmetrical data (normally distributed).
- Median: Less sensitive to outliers. Better for non-symmetrical (skewed) data.
- Mode: More appropriate for categorical data

The article How to Find Outliers (Python Example) appeared first on JC Chouinard.

Outliers in statistics represent the extreme data points that are significantly different from the others.

Outliers are defined by one of these rules:

- data point < Q1 - (1.5 * IQR)
- data point > Q3 + (1.5 * IQR)

Simply put, outliers are data points in a dataset that fall outside of the box plot limits.

To find outliers in a dataset, we need to compute the quartiles and find the first and the third quartiles. Then, we need to compute the interquartile range (IQR) and verify whether each data point is smaller than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR.

So, if we take our custom dataset:

```
import numpy as np
# Generate a random dataset with outliers
data = np.concatenate(
    [
        np.random.normal(0, 1, 100),
        np.random.normal(10, 1, 10)
    ]
)
```

We can calculate the quartiles and the IQR using either the `np.percentile` or the `np.quantile` functions, or alternatively, the `scipy.stats.iqr` function.

```
# Calculate quartiles and IQR for outlier detection
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1 # equivalent to scipy.stats.iqr(data)
# Define the lower and upper bounds for outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]
```

The article Measures of Statistical Dependence “Correlation” in Summary Statistics (Python Examples) appeared first on JC Chouinard.

The measures of statistical dependence, also known as the measures of correlation, are the summary statistics used to evaluate the relationships between variables.

The 8 measures of statistical dependence used to evaluate the correlation between multiple variables are:

- **Covariance:** How much two random variables change together
- **Correlation Coefficient:** Linear relationship of two continuous variables
- **Spearman’s Rank Correlation:** Strength/direction of the monotonic relationship between two variables
- **Kendall’s Tau (τ):** Strength/direction of ordinal association between two variables
- **Point-Biserial Correlation:** Relationship between a continuous and a binary variable
- **Phi Coefficient (φ):** Association between two binary variables
- **Contingency Tables / Chi-Square Tests:** Association between two categorical variables
- **Cramér’s V:** Association for categorical variables based on the chi-square statistic

In this tutorial, we will focus on the most common measures: Covariance, Pearson’s Correlation Coefficient (linear correlation), Spearman’s Rank Correlation (monotonic correlation), and Kendall’s Tau (ordinal correlation).

The covariance measures how much two variables change together.

It shows how much a variation in one variable is associated with a variation in another.

The downside of using the covariance in establishing the correlation is that it is sensitive to the scale of the variables.

To calculate the covariance, get the average of each variable, subtract it from each of that variable’s values, multiply the paired deviations together, and sum them. Finally, divide the result by the number of values.

The formula of the covariance is

`Cov(X, Y) = Σ(Xi - μx)(Yi - μy) / n`
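The formula can be sketched step by step in Python; note that this computes the population covariance (dividing by n), while `np.cov` divides by n - 1 by default:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])

# Subtract each mean, multiply the paired deviations, then average
cov_population = np.sum((x - np.mean(x)) * (y - np.mean(y))) / len(x)

print(cov_population)      # 2.0 (divided by n)
print(np.cov(x, y)[0, 1])  # 2.5 (divided by n - 1)
```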

To calculate the covariance between two array variables in Python, use the `cov()` function from the `numpy` library.

```
import numpy as np
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
# Calculate the covariance matrix
np.cov(x, y)
```

The function returns a 2×2 covariance matrix where the diagonal values measure each variable’s own variability and the off-diagonal values show the relationship between the two variables.

- Positive means they tend to increase together
- Negative means one goes up when the other goes down.

Here are examples of different kinds of covariance matrices.

```
import numpy as np
# Sample data
x = np.array([1, 2, 3, 4, 5])
x2 = np.array([6, 5, 4, 3, 2])
y = np.array([2, 3, 4, 5, 6])
x3 = np.array([1, 2, 3, 4, 5])
y3 = np.array([1, 2, 3, 2, 1]) # Example with no covariance
# Calculate the covariance matrices
cov_matrix = np.cov(x, y)
cov_matrix2 = np.cov(x2, y)
cov_matrix3 = np.cov(x3, y3)
print('Positive variation:\n', cov_matrix)
print('Negative variation:\n', cov_matrix2)
print('No covariance:\n', cov_matrix3)
```

And here is what it looks like on a graph.

The Pearson’s r correlation coefficient quantifies the linear relationship between two *continuous* variables.

The result ranges from -1 to 1:

- **Perfect negative correlation:** -1
- **Perfect positive correlation:** 1
- **No linear correlation:** 0

To calculate the Pearson’s r correlation coefficient, use the `pearsonr` function from the `scipy.stats` library.

```
import numpy as np
from scipy.stats import pearsonr
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
# Calculate Pearson's correlation coefficient
correlation_coefficient, _ = pearsonr(x, y)
print("Pearson's Correlation Coefficient:", correlation_coefficient)
```

The output here shows a perfect positive correlation, where the two variables increase together in a perfectly linear relationship.

`Pearson's Correlation Coefficient: 1.0`

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import seaborn as sns
# Create data for scenarios
np.random.seed(0)
# Negative correlation
x_neg = np.linspace(0, 10, 50)
y_neg = -2 * x_neg + 10 + np.random.normal(0, 2, 50)
# Positive correlation
x_pos = np.linspace(0, 10, 50)
y_pos = 2 * x_pos + np.random.normal(0, 2, 50)
# No correlation
x_no_corr = np.linspace(0, 10, 50)
y_no_corr = np.random.normal(0, 2, 50)
# Calculate Pearson correlation coefficients
corr_coeff_neg, _ = pearsonr(x_neg, y_neg)
corr_coeff_pos, _ = pearsonr(x_pos, y_pos)
corr_coeff_no_corr, _ = pearsonr(x_no_corr, y_no_corr)
# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Scatter plot 1 (Negative Correlation)
sns.regplot(x=x_neg, y=y_neg, ax=axes[0], color='red', scatter_kws={'s': 15}, line_kws={'color': 'blue'}, ci=95)
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].set_title(f"Negative Correlation (r = {corr_coeff_neg:.2f})")
# Scatter plot 2 (Positive Correlation)
sns.regplot(x=x_pos, y=y_pos, ax=axes[1], color='green', scatter_kws={'s': 15}, line_kws={'color': 'blue'}, ci=95)
axes[1].set_xlabel('X')
axes[1].set_ylabel('Y')
axes[1].set_title(f"Positive Correlation (r = {corr_coeff_pos:.2f})")
# Scatter plot 3 (No Correlation)
sns.regplot(x=x_no_corr, y=y_no_corr, ax=axes[2], color='blue', scatter_kws={'s': 15}, line_kws={'color': 'blue'}, ci=95)
axes[2].set_xlabel('X')
axes[2].set_ylabel('Y')
axes[2].set_title(f"No Correlation (r = {corr_coeff_no_corr:.2f})")
# Adjust layout
plt.tight_layout()
# Show all plots
plt.show()
```

The Spearman’s rank correlation, also known as Spearman’s rho, evaluates the strength and direction of the monotonic relationship between two variables.

A monotonic relationship between two variables is one where the value of one variable consistently increases (or consistently decreases) as the other variable increases.

Spearman’s rho uses the ranks of the data instead of their actual values. This makes it less impacted by outliers and suitable for ordinal data.

To calculate the Spearman’s Rank Correlation, use the `spearmanr` function from the `scipy.stats` library.

```
from scipy.stats import spearmanr
# Example data
x = [10, 20, 30, 40, 50]
y = [5, 15, 25, 35, 45]
# Calculate Spearman's rank correlation
rho, p_value = spearmanr(x, y)
# Print the result
print(f"Spearman's Rank Correlation Coefficient: {rho}")
print(f"P-value: {p_value}")
```

When interpreting the Spearman’s rho number, check this general guideline:

- **Positive rho**: As one variable increases, the other tends to increase.
- **Negative rho**: As one variable increases, the other tends to decrease.
- **Rho = 0:** No monotonic relationship.
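To illustrate why ranks matter, a monotonic but non-linear relationship gives a perfect Spearman’s rho even though Pearson’s r stays below 1 (the data is illustrative):

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

x = np.array([1, 2, 3, 4, 5])
y = x ** 3  # monotonic, but not linear

rho, _ = spearmanr(x, y)
r, _ = pearsonr(x, y)

print(rho)  # 1.0: perfectly monotonic
print(r)    # below 1: the relationship is not linear
```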

In statistics, Kendall’s Tau (τ) measures the strength and direction of the ordinal association between two variables.

To calculate the Kendall’s Tau, use the `kendalltau` function from the `scipy.stats` library.

```
import numpy as np
from scipy.stats import kendalltau
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 1, 5, 4])
# Calculate Kendall's Tau
tau, p_value = kendalltau(x, y)
print(f"Kendall's Tau (τ): {tau:.2f}")
print(f"P-value: {p_value:.4f}")
```

When interpreting the Kendall’s Tau (τ) number, check this general guideline:

- **τ close to 1**: Strong positive correlation
- **τ close to -1**: Strong negative correlation
- **τ close to 0**: No correlation

Refer to this table to evaluate which correlation algorithm to choose to evaluate the relationship between variables.

| Correlation Measure | Best for Data Type | Robust to Outliers | Type of Relationship |
|---|---|---|---|
| Covariance | Interval Data, Ratio Data | No | Linear |
| Pearson’s Correlation Coefficient (r) | Interval Data, Ratio Data | No | Linear |
| Spearman’s Rank Correlation (ρ) | Ordinal Data, Interval Data | Yes | Monotonic |
| Kendall’s Tau (τ) | Ordinal Data, Data with Tied Ranks | Yes | Concordance or Discordance |
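For a quick side-by-side comparison, the four main measures can be computed on the same (illustrative) data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

print("Covariance:    ", np.cov(x, y)[0, 1])
print("Pearson's r:   ", pearsonr(x, y)[0])
print("Spearman's rho:", spearmanr(x, y)[0])
print("Kendall's tau: ", kendalltau(x, y)[0])
```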

The article Measures of Variability “Spread” in Summary Statistics (Python Examples) appeared first on JC Chouinard.

The measures of variability, also known as measures of spread, are the summary statistics used to understand the variability of the data (how close or spread apart the data points are).

The 8 measures of variability in summary statistics are the range, the interquartile range (IQR), the variance, the standard deviation, the Coefficient of Variation (CV), the Mean Absolute Deviation, the Root Mean Square (RMS) and the Percentile Ranges.

- **Range:** Difference between the maximum and minimum values in a dataset
- **Interquartile Range (IQR):** Difference between the third and first quartiles (Q3 and Q1). Focuses on the middle 50% to reduce the impact of outliers.
- **Variance:** Average squared distance of each data point from the mean
- **Standard Deviation:** Square-root of the variance
- **Coefficient of Variation (CV):** Percentage ratio of the standard deviation to the mean
- **Mean Absolute Deviation:** Average absolute difference between data points and the mean
- **Root Mean Square (RMS):** Square root of the mean of the squared values
- **Percentile Ranges:** Ranges between specific percentiles, providing insights into the central portion of the data that are less influenced by extreme values

| Name | Description | When to Use | Python Function |
|---|---|---|---|
| Range | Difference between max and min. | Quick overview of spread. | `max(data) - min(data)` |
| Interquartile Range (IQR) | Range of middle 50% of data. | Robust to outliers. | `np.quantile(data, 0.75) - np.quantile(data, 0.25)` |
| Variance | Average squared deviations. | Sensitive to outliers. | `np.var(data)` |
| Standard Deviation | Square root of variance. | Interpretable, in same units. | `np.std(data)` |
| Coefficient of Variation (CV) | Standard deviation relative to mean. | Comparing datasets with different scales. | `(np.std(data) / np.mean(data)) * 100` |
| Mean Absolute Deviation (MAD) | Average absolute deviations. | Robust to outliers. | `np.mean(np.abs(data - np.mean(data)))` |
| Root Mean Square (RMS) | Square root of mean of squared values. | Used in signal processing. | `np.sqrt(np.mean(np.square(data)))` |
| Percentile Ranges | Ranges between specific percentiles. | Highlight central data range. | `np.percentile(data, q) - np.percentile(data, p)` |

The Range in statistics represents the difference between the maximum and minimum values of a dataset.

The range is calculated by subtracting the minimum value from the maximum value.

`range = max - min`

To calculate the range of values in Python, use the `max()` and the `min()` functions. Note that the `range()` function is used to create a range, not to calculate it.

```
values = range(1, 10) # create range
rg = max(values) - min(values) # calculate range
print('Values:', list(values))
print('Range:', rg)
```

```
Values: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Range: 8
```

The Interquartile Range, or IQR, in summary statistics represents the difference between the third quartile (Q3) and the first quartile (Q1). By focusing on the middle 50% of the data, the interquartile range allows an analysis that is less influenced by extreme values.

Simply put, the interquartile range is the height of the box in a boxplot.

To understand IQR, we need to introduce the concept of quantiles.

Quantiles are the cut points obtained when we split the data into equal parts. For example, when we split the data into 4 equal parts, we call the cut points *quartiles*; split into 100 parts, they are *percentiles*.

We can use the `np.quantile()` function in Python to split the data into equal parts.

```
import numpy as np
# Example numerical dataset
data = np.array(range(1,10))
# Calculate the median (50th percentile)
median = np.quantile(data, 0.5)
print("Median (50th percentile):", median)
np.median(data) == np.quantile(data, 0.5)
```

```
Median (50th percentile): 5.0
True
```

You can see how, in the code above, splitting at the 50% quantile is the same as computing the median value.

You can also calculate the 25th, 75th and 90th percentiles of your dataset.

```
# Calculate the 25th percentile (1st quartile)
q1 = np.quantile(data, 0.25)
print("25th Percentile (1st Quartile):", q1)
# Calculate the 75th percentile (3rd quartile)
q3 = np.quantile(data, 0.75)
print("75th Percentile (3rd Quartile):", q3)
# Calculate a custom quantile (e.g., the 90th percentile)
custom_quantile = np.quantile(data, 0.9)
print("90th Percentile (Custom Quantile):", custom_quantile)
# Calculate all quartiles at once
quartiles = np.quantile(data, [0,0.25,0.5,0.75,1])
print("Quartiles:", quartiles)
```

```
25th Percentile (1st Quartile): 3.0
75th Percentile (3rd Quartile): 7.0
90th Percentile (Custom Quantile): 8.2
Quartiles: [1. 3. 5. 7. 9.]
```

The best way to visualize quantiles in Python is to create a box plot from your data.

```
import matplotlib.pyplot as plt
import numpy as np
# Example numerical dataset
data = np.array(range(10))
# Create a box plot
plt.boxplot(data)
# Add labels and a title
plt.xlabel("Dataset")
plt.ylabel("Values")
plt.title("Box Plot of Quartiles")
# Show the plot
plt.show()
```

The interquartile range is calculated by subtracting quartile 1 data from quartile 3 data.

`IQR = Q3(data) - Q1(data)`

The interquartile range (IQR) can be calculated in Python by using `np.quantile()` to subtract the first quartile from the third, or by using the `iqr()` function from the `scipy.stats` module.

`np.quantile(data, 0.75) - np.quantile(data, 0.25)`

```
from scipy.stats import iqr
iqr(range(1,10))
```

`4.0`

The variance in statistics is a measure of the spread or variability of a dataset.

The variance quantifies how far individual data points in a dataset differ from the mean (average) of the dataset.

Simply put, the variance shows how dispersed or scattered the data points are around the mean.

On a scatter plot, we can easily visualize the dispersion of the data by modifying the variance.

To calculate the variance, subtract the mean from each data point, square the distances, sum all the squared values and divide by the number of data points minus 1.

The formula of the sample variance is:

`s² = Σ(xᵢ - x̄)² / (n - 1)`

To compute the variance in Python, either use the `var()` function from `numpy` or sum the squared distances from the mean and divide by the number of values minus 1.

To calculate the variance in Python with `numpy`, use the `var()` function with the `ddof` argument set to 1. The `ddof` argument specifies that the sample formula (dividing by n - 1) should be used.

```
# Variance with np.var()
import numpy as np
data_points = [1,2,3,4,5]
np.var(data_points, ddof=1)
```

Follow these steps to calculate the variance manually in Python:

- Calculate distances from the mean
- Square the distances
- Sum the squared distances
- Find Number of data points
- Divide the sum of the squared distances by n-1

```
import numpy as np
data_points = [1,2,3,4,5]
# 1. Calculate distances from the mean
distances = data_points - np.mean(data_points)
print('1. Mean distances:', distances)
# 2. Square the distances
sqr_distances = distances ** 2
print('2. Squared distances:', sqr_distances)
# 3. Sum the squared distances
summed_distances = sum(sqr_distances)
print('3. Sum of the squared distances:', summed_distances)
# 4. Find Number of data points
n = len(data_points)
print('4. Number of data points:', n)
# 5. Divide the sum of the squared distances by n-1
variance = summed_distances / (n - 1)
print('5. Variance:', variance)
```

Output

```
1. Mean distances: [-2. -1. 0. 1. 2.]
2. Squared distances: [4. 1. 0. 1. 4.]
3. Sum of the squared distances: 10.0
4. Number of data points: 5
5. Variance: 2.5
```

Note that the S² notation shows how the result of the variance is a squared value, on which the square-root can be applied to compute the standard deviation.

The standard deviation in statistics is a measure of the spread or variability of a dataset calculated using the square-root of the variance.

Simply put, the standard deviation is the square-root of the variance.

The advantage of the standard deviation compared to the variance is that the standard deviation is in the same units as your data points (seconds, minutes, days, etc.).

To calculate the standard deviation, calculate the variance and compute the square-root of the variance.

The formula of the standard deviation is:

`s = √(Σ(xᵢ - x̄)² / (n - 1))`

To compute the standard deviation in Python, either use the `std()` function from `numpy` or use the `np.sqrt()` function on the computed variance.

`np.std(data_points)`

`np.sqrt(np.var(data_points))`

Just note that using the `np.std()` function without `ddof=1` computes the square root of the variance calculated from a division by n, not n-1.

```
import numpy as np
data_points = [1,2,3,4,5]
```

```
# Standard deviation with np.std
print('std():',np.std(data_points))
# Standard deviation with np.sqrt(np.var()) n
variance = np.var(data_points)
print('np.sqrt(np.var()):',np.sqrt(variance))
```

```
# Standard deviation with np.std
print('std(ddof=1):',np.std(data_points, ddof=1))
# Standard deviation with np.sqrt(np.var()) n-1
variance = np.var(data_points, ddof=1)
print('np.sqrt(np.var(ddof=1)):',np.sqrt(variance))
```

Output

```
std(): 1.4142135623730951
np.sqrt(np.var()): 1.4142135623730951
std(ddof=1): 1.5811388300841898
np.sqrt(np.var(ddof=1)): 1.5811388300841898
```

The Coefficient of Variation, or CV, in statistics is a relative measure of variability represented by the percentage ratio of the standard deviation to the mean.

To calculate the coefficient of variation (CV), you need to calculate the standard deviation and the mean of a dataset. Then, find the coefficient of variation by computing the ratio of the standard deviation to the mean.

`cv = (std/mean) * 100`

To calculate the coefficient of variation in Python, use the `mean()` and the `std()` functions of the `numpy` library, and then divide the standard deviation by the mean.

```
import numpy as np
# Sample data
data_points = [1, 2, 3, 4, 5]
# Calculate the mean (average) of the data
mean = np.mean(data_points)
# Calculate the standard deviation of the data
std_dev = np.std(data_points)
# Calculate the Coefficient of Variation (CV)
cv = (std_dev / mean) * 100
# Print the result
print(f"Coefficient of Variation (CV): {cv:.2f}%")
```

`Coefficient of Variation (CV): 47.14%`

The Mean Absolute Deviation, or MAD, in statistics is the average absolute difference between data points and the mean. It is often used as an alternative to the standard deviation since it does not require squaring the deviations.

To calculate the mean absolute deviation, calculate the differences between each data point and the mean. Then, get the absolute values. Finally, compute the mean.

`mad = mean(absolute(data_points - mean(data_points)))`

To calculate the mean absolute deviation (MAD) in Python, get the absolute values from the subtraction of the mean from each data point, then find out the mean of the absolute values.

```
import numpy as np
data_points = [1,2,3,4,5]
# 1. Calculate distances from the mean
distances = data_points - np.mean(data_points)
print('1. Mean distances:', distances)
# 2. Calculate the mean absolute deviation
mad = np.mean(np.abs(distances))
print('2. Mean absolute deviation:', mad)
```

```
1. Mean distances: [-2. -1. 0. 1. 2.]
2. Mean absolute deviation: 1.2
```

The Root Mean Square, or RMS, in statistics represents the square root of the mean of the squared values used to measure the magnitude of variations.

Mathematically, the RMS can be represented as:

`RMS = sqrt((x₁² + x₂² + ... + xₙ²) / n)`

To calculate the root mean square in Python,

- Square each data point in your dataset.
- Calculate the mean of the squared values
- Take the square root of the mean of the squared values.

```
import numpy as np
# Sample data
data_points = [1, 2, 3, 4, 5]
# 1. Square each value
squared_data = [x**2 for x in data_points]
# 2. Find the mean of the squared values
mean_squared = np.mean(squared_data)
# 3. Take the square root
rms = np.sqrt(mean_squared)
print(f"Root Mean Square (RMS): {rms}")
```

`Root Mean Square (RMS): 3.3166247903554`

Percentile ranges are summary statistics extracted using the ranges between specific percentiles (e.g. the 10th and 90th percentiles). Percentile ranges provide insights into the central X% of the data, which is less influenced by outliers.

Calculating percentile ranges in Python is similar to calculating the interquartile range, but with custom percentiles specified. You can do so by using the `percentile()` function of the `numpy` library.

```
import numpy as np
# Example data (illustrative)
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Calculate the first quartile (Q1)
q1 = np.percentile(data, 25)
# Calculate the third quartile (Q3)
q3 = np.percentile(data, 75)
# Calculate the percentile range
percentile_range = q3 - q1
```

Outliers in statistics are the extreme data points that are significantly different from the others. They can have a big impact on the measures of variability and should be understood, and dealt with accordingly, by statisticians and data scientists. Check out our tutorial to understand how to identify outliers in a dataset.

The article Measures of Variability “Spread” in Summary Statistics (Python Examples) appeared first on JC Chouinard.

The article Measures of the Shapes of the Distributions in Summary Statistics (Python Examples) appeared first on JC Chouinard.

The measures of the shapes of the distributions are summary statistics used to understand data characteristics, identify outliers, and improve modelling decisions.

Data may be distributed in different ways. Sometimes it is symmetrical (e.g. normal distribution), sometimes it is non-symmetrical (e.g. right/left skewed), and sometimes it is narrower and steeper, or flatter, than others (described by kurtosis).

To identify and describe these different shapes of distributions, statisticians mainly use two kinds of summary statistics:

- **Skewness**: measure of the asymmetry of a distribution
- **Kurtosis**: measure of the tailedness of a distribution

In summary statistics, the skewness is the measure of the asymmetry of a distribution.

Simply put, it shows how symmetrical both sides of the peak of a curve are.

A distribution can be:

- Zero skew
- Left Skewed (negative skew)
- Right Skewed (positive skew)

When we think of the bell-shaped normal distribution, we say that it has zero skew. Zero skew means that the left side and right sides are mirror images.

The normal distribution is not the only distribution that has zero skew. The uniform distribution, for example, also has zero skew.

A distribution is zero skew when the mean and the median are equal:

```
# zero skew
mean = median
```

Skewness can be understood in terms of tails. A distribution is left skewed when it is longer on the left side of its peak than on its right.

A distribution is left skewed when the mean is smaller than the median:

```
# left skew
mean < median
```

A distribution is positively, or right, skewed when it is longer on the right side of its peak than on its left.

A distribution is right skewed when the mean is greater than the median:

```
# right skew
mean > median
```
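These mean/median rules can be checked numerically. Below is a small sketch using a sample drawn from an exponential distribution, a standard example of a right-skewed distribution, together with `scipy.stats.skew`:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# The exponential distribution has a long right tail
data = rng.exponential(scale=1.0, size=10_000)

print("mean:", np.mean(data))    # greater than the median
print("median:", np.median(data))
print("skewness:", skew(data))   # positive for a right skew
```

The mean is pulled above the median by the long right tail, and the skewness comes out positive, as expected.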

In summary statistics, the kurtosis is the measure of the tailedness of a distribution.

Simply put, it shows whether data in a distribution are more or less extreme (outliers) than in a normal distribution.

The three types of distributions with kurtosis are:

- **Leptokurtic**: large tails, more extreme outliers, positive kurtosis
- **Mesokurtic**: medium tails, kurtosis equal to zero
- **Platykurtic**: thin tails, less extreme outliers, negative kurtosis

The formula for the kurtosis is the average of the fourth powers of the differences of each data point from the mean, divided by the standard deviation to the fourth power:

`kurtosis = Σ(x - µ)^4 / (n * σ^4)`

To calculate the kurtosis of a dataset in Python, use the `kurtosis` function from the `scipy.stats` library. Note that it returns the excess kurtosis (Fisher's definition), where a normal distribution scores 0.

```
import numpy as np
from scipy.stats import kurtosis
# Sample dataset from a normal distribution
data = np.random.normal(0, 1, 1000)
# Calculate kurtosis
kurtosis_value = kurtosis(data)
print(f"Kurtosis: {kurtosis_value:.2f}")
```

As a general guideline, when evaluating the result of a kurtosis:

- a positive value indicates a leptokurtic distribution that is more peaked than normal (more extreme outliers)
- a negative value indicates a platykurtic distribution that is flatter than normal (less extreme outliers)
- a value of 0 indicates a mesokurtic distribution that follows the normal distribution
- values beyond −2 and +2 are considered indicative of excessive flatness or peakedness
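As a sketch of these guidelines, you can compare samples from a heavy-tailed and a thin-tailed distribution; the Laplace and uniform distributions are standard examples of each:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
laplace_data = rng.laplace(size=10_000)  # heavy tails: leptokurtic
uniform_data = rng.uniform(size=10_000)  # thin tails: platykurtic

print("Laplace kurtosis:", kurtosis(laplace_data))  # positive
print("Uniform kurtosis:", kurtosis(uniform_data))  # negative
```

The Laplace sample scores a clearly positive excess kurtosis, while the uniform sample scores a negative one.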


The article Create a Python Virtual Environment With Venv appeared first on JC Chouinard.

```
$ python3 -m venv venv
```

This command creates a virtual environment. Note that you can also use Anaconda Environments instead.

Activate it by sourcing the `activate` script.

```
$ . venv/bin/activate
```

To deactivate the environment, use the `deactivate` command.

```
$ deactivate
```
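Putting the commands together, a typical workflow looks like this (a sketch; the file names are just examples):

```shell
python3 -m venv venv              # create the environment in ./venv
. venv/bin/activate               # activate it (your prompt gains a "(venv)" prefix)
python -c 'import sys; print(sys.prefix)'  # now points inside ./venv
pip freeze > requirements.txt     # record installed versions for reproducibility
deactivate                        # return to the system Python
```

Keeping a `requirements.txt` alongside the environment makes the project reproducible with `pip install -r requirements.txt` later.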


The article Impact of Swearwords on SEO and Social Media Reach (Case Study) appeared first on JC Chouinard.

I wrote an article where I wanted to show the impact of blacklisted terms on Google and Social Media.

In an article on SEO jokes, I created an initial version of the article that contained a lot of blacklisted words to prove that Google wouldn’t index it. After 16 hours of the profanity article being live and posted on LinkedIn and Twitter, Google still had not indexed the article.

On top of that, the article performed horribly on LinkedIn.

I removed the blacklisted terms and reposted a new identical post on LinkedIn.

48 minutes later, the article was indexed on Google and it reached many more users than its bad-mouthed counterpart, proving that both LinkedIn and Google assess the quality of your content before showing it to their users.

This is why you don’t block Twitter and LinkedIn in robots.txt.

Anytime you post a link on these platforms, their bots send a request to view your page.

It is mentioned in their documentation that this is meant for them to fetch the metadata, including the thumbnail, from your post. However, this post shows that they also use it to assess the quality of your content.

After 23 Hours, this Twitter post had 277 Views.

I just deleted and reposted that one right now at 3:57PM on March 26. Let’s see.

After 20 minutes, this post already had more views than the previous one.

Twitter also seems to check the content before pushing it.


The article SEO Jokes to Get Us Through Core Updates appeared first on JC Chouinard.

It started as a practical joke, but when SEO pranksters realized that people actually believed that SEO was a job (*what a gullible crowd*), they pushed the joke further. One fake news after the other, they’ve tried to convince people that SEO was not a job, but a person, and that it was now dead.

This post will not really be funny at all (nor contain any jokes). It is only stupid and offensive. Be aware.

It is mostly written for the fun of adding a ton of inaccurate facts to my website, while adding a clickbait title that targets keywords with the wrong search intent. I also wanted to make sure that I had a few profanity blacklisted words on my blog to f**king offend Googlebot as much as possible.

In reality, I never swear (only in French… all the time).

This story all started when a spoiled little brat started whining about becoming a has-been, after being kind of big in something that no-one cares about.

While people were very supportive about me being a drama gender-neutral queen/king, the search relations team was more practical.

It took me some time to provide a GOOD ANSWER to John Mueller, because they are all on PAGE 3 nowadays.

While searching for the good answer, I had to browse through the entire web.

After reading through all the websites, all the webpages, all the ChatGPT prompts, I couldn’t overcome Aleyda Solis’ tweets and retweets.

After reading all of Google patents, I started getting headaches trying to understand Koray Gubur’s sheer depth of knowledge and inventions of words. No Koray, information retrieval and microsemantics are not words, they’re cat names.

I realized something there though.

Call me anthropomorphist, but some websites carry some human-like features, emotions and preferred activities.

Google is a scrap-booking grandma in love with Taylor Swift.

While Reddit is a horny teenager lying in a brothel.

Some websites also like to take vacations, travel and challenge themselves with adventurous, yet potentially dangerous situations. For instance, Johnmu.com always dreamt of visiting the dark web and finally managed to go through with it.

It was not a good vacation. He did not come back as his cheery uplifting self but with a somber brown-socks-wearing dull name on X. He even moved to a new fantasy land. (just know that you are loved John).

OK, enough with that.

I told you that SEO is not a real job. I stand by it.

Any job that did not exist when your parents were born is not a real job, it’s a hobby.

When I told my mom I wanted to be an SEO, I was 15 years old. She asked me what kind of job this was. I told her that anytime someone searched for something on Altavista, Yahoo! or Google, an SEO had to go through all the websites, sort the results and save the results to a file, this way we could do this work only for NEW queries and serve the same search results again for the queries that we had already worked on. And this is why, Mom, I watch so much porn.

Now that the back story is set, Brian Dean tells me in a SEMRush sponsored article that I need a BUCKET BRIGADE.

Bucket brigades are soldiers that will beat you up if you bounce from my page without reading it through to the end.

Ok, I promised you SEO jokes, and I delayed the answer with SEO content enough.

Now, I need to write good helpful AI-generated content to optimize for the failed HCU and Core Updates. But before I do so, go easy on Google. You know, Google is only human, you can’t make good algorithm updates, play Monopoly, AND find new clever names for Google My Business and Google Data Studio at the same time. (“Bank Error in Your Favor: Collect $200”).

Speaking of names, did you notice how Google uses the algorithms’ names to give us hints about what we have to fix on our website when we are penalized by an update?

It is true, here is what they mean:

- **Helpful Content Update** – To fix your rank loss, simply remove all helpful content on your website
- **Spam Update** – To remove the algorithmic penalty, just add spam to your website
- **Core Update** – Just hit the gym and work on your core; you won’t improve ranks but you’ll have a fantastic beach body
- **Product Review Update** – Go on G2 Crowd and write good reviews about Google products, you’ll get better rankings
- **Penguin Algorithm Update** – Just write about penguins, you’ll be fine.
- **Mobilegeddon** – What, you never watched Armageddon?

OK, here are my SEO jokes, right after my next annoying Ad.

**SEO Joke:**

What does SGE, Google+ and Google Buzz have in common?

… They’re all good products.

**Yet another SEO Joke:**

Google is like a Book. I never managed to read past Page 3.

**SEO Riddle:**

You are like the lemmatized version of this URL’s slug.

You’re an _____________.

**SEO Insult:**

You are like People Also Ask questions on Google search, nobody likes you.

AHAHAH! Pfewwwww! That was very clever AND good, wasn’t it? Pure comedy right there.

Let me keep up with more non-satisfying-unrelated-to-the-search-query-and-title content. Here are some SEO tips.

SEOs are struggling to make sure that Google does not crawl, index or serve content in search results. Google is a stubborn SOB: if you put a robots.txt in place, it will not crawl, but it will index, though blocked by robots.txt. If you add a noindex meta-tag, it will crawl your page like spiders in your bed at night.

Let me give you a live example with real tips on how to make sure Google does not crawl, index or serve content in its search results:

First and foremost, I will add good human-written content on my page, that will keep Google at bay.

Next, I will write about anything that Google can make money on: Hotels, Flights, Jobs, Shopping. That way even if Google indexes my content and rank me #1, it will be buried 20 scrolls down, right below its own helpful features.

But to make sure, I will trigger soft 404s by talking about how “Page Not Found”, “Product no longer available”, “No matching search results”, “This product is out of stock”, “We couldn’t find any jobs that matched your search”, “Sorry, there are no Hotels available”.

To seal the deal, if you really don’t want to be indexed, add some of Google’s blacklisted words. The initial version of this post had so many of them that Google, LinkedIn, Twitter and Facebook all decided to prevent this seriously wrong post from reaching your beautiful eyes.

Most of my LinkedIn posts reach 1000 impressions in the first day. This one really shows how good it is.

After I reposted the not-as-fun article without all the juicy blacklisted words, LinkedIn decided to show the article.

Disclaimer

If people reading this post have not realized yet that this article is just meant for comedy, let me state the politically correct thing upfront before going on. If you are not the kind to be offended by these kinds of things, just skip that part.

This article is for FUN. I do understand that SEO is a real livelihood, that people and companies are going through hard times now. Just trying to lighten the mood here.

I know that people working in SEO, Google search engineers (and search relations team members) are good people that work hard to make the web better while doing what they are paid to do. Businesses like Google are still businesses trying to make profits, improve their products and pay their employees (which is a good thing).

I totally agree with Barry Schwartz on this post on X and believe we should be kinder to one another.

This is why none of these posts are meant as a personal attack on anyone, nor any company; they really are only meant for fun. I only joke about people that I thoroughly respect and will be happy to take down anything that one may find offensive.

End of disclaimer

Sorry everyone.


The article Scrape Linkedin Jobs with Python (Example) appeared first on JC Chouinard.

Doing this manually is fine, but it is error-prone, tedious, and not scalable. In this tutorial, we will scrape LinkedIn jobs using Python.

Further in this tutorial, I will explain how you can scale this method, bypassing the limitations of scraping LinkedIn jobs with Python. Later, we will store this data in a CSV file for analysis purposes.

For this tutorial, we will use a LinkedIn Job Scraping API, which will help us scrape job data from LinkedIn without getting blocked.

We will use Python 3.x for this tutorial. You can install Python or use Google Colab. See how to install Python. After this, create a folder in which you will keep the Python scripts. I am naming this folder `jobs`. In the terminal, type:

`$ mkdir jobs`

Then install the following libraries.

- requests: This will be used for creating an HTTP connection with the LinkedIn Jobs Scraping API.
- Pandas: This will be used to convert JSON data into a CSV file.

Finally, you have to sign up for the free pack of Scrapingdog. With the free pack, you get 1000 credits, which is enough for testing purposes. This tutorial will take less than 40 credits.

Before we start scraping the jobs, let’s look at the documentation first. We can see that there are four **required** parameters that need to be passed when making the GET request.

To scrape LinkedIn jobs with Python, you will need a proxy service API key, a LinkedIn geoId and a job query to fill the LinkedIn search bar.

For this tutorial, we are going to scrape jobs offered by *Google* in the USA, using the Scrapingdog proxy service.

To find the geoId inside LinkedIn jobs, open the LinkedIn jobs search page and check the geoId parameter inside the URL.

As you can see, the geoId of the USA is **103644278**. The geoId for Canada is **101174742**.
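If you prefer to grab the geoId programmatically rather than copying it by hand, you can parse it out of the search URL with the standard library (the URL below is a made-up example):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical LinkedIn jobs search URL copied from the browser
url = "https://www.linkedin.com/jobs/search/?keywords=google&geoId=103644278"

query_params = parse_qs(urlparse(url).query)
geo_id = query_params["geoId"][0]
print(geo_id)  # 103644278
```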

To scrape your first LinkedIn jobs page, define the API endpoint and create a parameter dictionary that contains the API key, the job search term, the LinkedIn geoid and the page number from the paginated results. Then, use Python requests to fetch the Scraping Dogs API.

```
import requests

# Define Scrapingdog API Key
api_key = 'your-API-key'

# Define the URL and parameters
url = "https://api.scrapingdog.com/linkedinjobs/"
geo_id = '103644278'  # USA. Canada: '101174742'
params = {
    "api_key": api_key,
    "field": "Google",
    "geoid": geo_id,
    "page": "1"
}

# Send a GET request with the parameters
response = requests.get(url, params=params)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Access the response content
    google_jobs = response.json()
    print(google_jobs)
else:
    print("Request failed with status code:", response.status_code)
```

This code will return all the jobs offered by Google in the USA. Once you run this code you will find this response on your console.

The API response contains the following fields:

*job_position*, *job_link*, *job_id*, *company_name*, *company_profile*, *job_location*, and *job_posting_date*.
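Assuming the response is a list of job dictionaries with those fields, individual values can be read like this (the record below is a hypothetical example, not real API output):

```python
# Hypothetical record mirroring the documented response fields
jobs = [{
    "job_position": "SEO Specialist",
    "job_link": "https://www.linkedin.com/jobs/view/123",
    "job_id": "123",
    "company_name": "Google",
    "company_profile": "https://www.linkedin.com/company/google",
    "job_location": "New York, NY",
    "job_posting_date": "2024-01-01",
}]

for job in jobs:
    print(job["job_position"], "at", job["company_name"], "-", job["job_location"])
```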

It is not necessary to pass the company name all the time. You can even pass the job title as the value of the **field** parameter.

```
import requests

api_key = 'your-API-key'
url = "https://api.scrapingdog.com/linkedinjobs/"
geo_id = '101174742'  # Canada
params = {
    "api_key": api_key,
    "field": "Product Manager",  # job title here
    "geoid": geo_id,
    "page": "1"
}

# Send a GET request with the parameters
response = requests.get(url, params=params)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Access the response content
    pm_jobs = response.json()
    print(pm_jobs)
else:
    print("Request failed with status code:", response.status_code)
```

For every page number, you will get a maximum of **25 jobs**. Now, the question is how to scale this solution to collect all the jobs from the API.

To extract all the jobs on LinkedIn at scale for a specific geoId and search term, we will run an infinite `while` loop. In this example, we will extract all the jobs at Google in the USA found on LinkedIn jobs by following these steps:

- Create a function to store the LinkedIn data
- Create an infinite while loop to Scrape Each Search Page
- Extract and store the LinkedIn data

The first step is to create a function that will store the data. Thus, if something breaks, you will still have data stored for each loop.

This will create a Pandas DataFrame from the data we return. If a file already exists, it will read it, and append the data to the existing file. If not, it will create a new one.

```
import pandas as pd

def process_json_data(data, csv_filename="linkedin_jobs.csv"):
    # Convert the list of dictionaries to a Pandas DataFrame
    df = pd.DataFrame(data)
    try:
        # Try to read an existing CSV file
        existing_df = pd.read_csv(csv_filename)
        # Append the new data to the existing DataFrame
        updated_df = pd.concat([existing_df, df], ignore_index=True)
        # Save the updated DataFrame back to the CSV file
        updated_df.to_csv(csv_filename, index=False)
        print(f"Data appended to {csv_filename}")
    except FileNotFoundError:
        # If the CSV file does not exist, create a new one
        df.to_csv(csv_filename, index=False)
        print(f"Data saved to a new CSV file: {csv_filename}")
    return df
```
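To see the append-or-create behavior in isolation, the same pattern can be sketched with two small hypothetical batches (the file name `demo_jobs.csv` and the records are just examples):

```python
import os
import pandas as pd

csv_filename = "demo_jobs.csv"
if os.path.exists(csv_filename):
    os.remove(csv_filename)  # start fresh for the demo

batch1 = [{"job_position": "SEO Analyst", "company_name": "Acme"}]
batch2 = [{"job_position": "Data Scientist", "company_name": "Beta"}]

for batch in (batch1, batch2):
    df = pd.DataFrame(batch)
    try:
        existing_df = pd.read_csv(csv_filename)  # append if the file exists
        df = pd.concat([existing_df, df], ignore_index=True)
    except FileNotFoundError:
        pass  # first batch: the file does not exist yet
    df.to_csv(csv_filename, index=False)

print(len(pd.read_csv(csv_filename)))  # 2 rows after both batches
```

Each batch ends up appended to the same CSV, which is exactly what keeps progress safe if the scraping loop breaks midway.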

Now, we will create an infinite while loop to scrape each search page. This while loop will run until the length of the data array comes to zero. Because we don’t want to spend all your API credits on this tutorial, I will add a `sys.exit()` line at the end, stopping after the first loop. Remove that line to scrape all of the LinkedIn jobs.

```
import sys

while True:
    # We will add Python code here
    # Stop after first loop
    sys.exit()  # remove to fetch everything
```

To extract and store the LinkedIn data, add the API request and the Python storing function inside the `while` loop.

```
import pandas as pd
import requests
import sys

# Example usage with the function defined above
page = 0
full_list = []

while True:
    l = []
    page += 1
    url = "https://api.scrapingdog.com/linkedinjobs/"
    params = {
        "api_key": api_key,
        "field": "SEO",
        "geoid": geo_id,
        "page": str(page)
    }
    print('Running page:', page)
    # Send a GET request with the parameters
    response = requests.get(url, params=params)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Access the response content
        data = response.json()
        for item in data:
            l.append(item)
        if len(data) == 0:
            break
        # Store data to CSV
        df = process_json_data(l)
        full_list.extend(l)
    else:
        print("Request failed with status code:", response.status_code)
    if page == 2:
        sys.exit()  # Comment out to run the entire loop

full_df = pd.DataFrame(full_list)
```

This code will keep running until the length of `data` is zero; the `break` statement will then stop the infinite while loop.

- At each iteration, it fetches the API for a new page of 25 results, until all the pages are scraped.
- After each loop, it creates a Pandas DataFrame and stores the data into a CSV file. This way, no progress is lost.
- There is a list (`l`) for each loop, and a full list (`full_list`) to store all the data.

The benefit of scraping LinkedIn job listings is that it helps developers create more efficient, targeted job boards that cater to the specific needs of various industries and job seekers. The collected job data not only enhances the job search experience for individuals but also offers companies a more nuanced platform to find the right talent.

Scaling this operation through an API extends the utility of your initial scraping project, enabling you to handle larger volumes of data with greater efficiency.

This scalability is crucial to building a comprehensive job board that remains current with the ever-evolving job market on LinkedIn. An API-based approach allows for real-time updates and integration with other software tools, making your job board more dynamic and responsive to the needs of both job seekers and employers.

In this blog post, we learned how to scrape LinkedIn jobs at scale using Python.

I hope you liked the tutorial, I will be writing more LinkedIn Tutorials shortly.

Happy Scraping!

