What is Logistic Regression in Machine Learning (with Python Example)

Logistic regression is a supervised machine learning algorithm used for classification problems, where the goal is to predict the label of each data point.

In logistic regression, the predicted value is the class with the highest estimated probability.

This post is an overview of the logistic regression algorithm and is in no way complete.

    If you want to learn more about logistic regression algorithms, a few Datacamp tutorials helped me along the way.

    What is the Logistic Regression Algorithm?

    Logistic regression is one of several regression models used in machine learning.

    To understand how logistic regression works, picture a scatter plot with two classes of points separated by a line: the line represents the linear decision boundary learned by the model.

    Given a new value:

    • If the estimated probability of belonging to class 1 is below 0.5, the point is assigned the value 0 (blue).
    • If the probability is above 0.5, the point is assigned the value 1 (orange).
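
    For example, here is a minimal sketch of this decision rule in NumPy, with made-up probabilities:

    import numpy as np

    # hypothetical predicted probabilities for five data points
    probs = np.array([0.12, 0.47, 0.50, 0.68, 0.91])

    # the 0.5 decision rule: above the threshold -> 1, otherwise 0
    labels = (probs > 0.5).astype(int)
    print(labels)  # [0 0 0 1 1]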

    Logistic Regression Formula

    Logistic regression is based on the sigmoid function (also called the logistic function), a mathematical function with an S shape:

    sigmoid(x) = 1 / (1 + e^(-x))
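
    As a quick sketch, here is the sigmoid in NumPy (the helper name is my own):

    import numpy as np

    def sigmoid(x):
        """Map any real number to a value between 0 and 1."""
        return 1 / (1 + np.exp(-x))

    # large negative inputs approach 0, large positive inputs approach 1
    print(sigmoid(-4), sigmoid(0), sigmoid(4))  # ~0.018, 0.5, ~0.982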

    Linear Regression VS Logistic Regression

    • Linear regression is estimated using the Ordinary Least Squares (OLS) approach.
    • Logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach.

    While linear regression is a fantastic technique when the data follows a normal distribution, it is less useful when it doesn’t.

    Consider, for example, data that does not follow a normal distribution, such as a binary outcome that only takes the values 0 and 1.

    In that case, logistic regression fits the data better: its predictions stay between 0 and 1, while the straight line from linear regression does not.
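
    As an illustrative sketch of this point (synthetic data, not the post's original figure), fit both models to a binary outcome:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # synthetic binary outcome: mostly 1 when x is large, 0 when x is small
    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, 200).reshape(-1, 1)
    y = (x.ravel() + rng.normal(0, 0.5, 200) > 0).astype(int)

    lin = LinearRegression().fit(x, y)
    log = LogisticRegression().fit(x, y)

    # linear regression can predict values outside [0, 1]...
    print(lin.predict([[4.0]]))
    # ...while logistic regression always returns a valid probability
    print(log.predict_proba([[4.0]])[:, 1])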

    3 Types of Logistic Regression

    Binary Logistic Regression

    When the response variable can only belong to two categories.

    Example: predict whether a product will be sold or not: Sold or Not Sold.

    Multinomial Logistic Regression

    When the response variable can belong to multiple nominal categories (3 or more).

    Example: predict which of the 3 varieties of flowers a sample belongs to, based on sepal features, in the Iris dataset.
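
    A minimal sketch of this example (scikit-learn's bundled Iris dataset, sepal features only):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # three flower varieties; keep only the two sepal features
    X_iris, y_iris = load_iris(return_X_y=True)
    X_sepal = X_iris[:, :2]

    # with the default lbfgs solver, scikit-learn fits a
    # multinomial model when there are more than two classes
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_sepal, y_iris)
    print(clf.predict(X_sepal[:3]))  # one of the 3 varieties per sample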

    Ordinal Logistic Regression

    When the response variable can belong to multiple ordinal categories (3 or more).

    Example: predict the rating of a business from social media ratings on a scale of 1 to 5.


    Build a Logistic Regression with Scikit-learn on a Binary Classification Problem

    To better understand logistic regression, we will build a logistic regression model for a binary classification problem using Python and Scikit-learn.

    In this tutorial, I will look at online shoppers' behaviour and try to predict whether or not a product will be sold.

    Load Libraries

    Here, we will use matplotlib, numpy, pandas, seaborn and sklearn.

    # Load standard libraries
    import matplotlib.pyplot as plt
    import numpy as np 
    import pandas as pd 
    import seaborn as sns
    
    # Load sklearn libraries
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import ConfusionMatrixDisplay, classification_report, roc_curve
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.preprocessing import StandardScaler
    
    # Set graph style to default Seaborn style
    sns.set()
    

    Load dataset

    Let’s read the online_shoppers_intention dataset from a URL on the UCI Machine Learning Repository.

    db = 'https://archive.ics.uci.edu/ml/machine-learning-databases'
    url = db + '/00468/online_shoppers_intention.csv'
    df = pd.read_csv(url)
    

    Exploratory data analysis (EDA)

    Inspect dataframe

    df.head(3)
    

    Check dataframe info

    df.info()
    

    Here we see that there are no missing values, and we can see each column's data type.
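
    If you want to confirm there are no missing values explicitly, a quick check is:

    df.isna().sum()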

    Visualize the target

    First, let’s look at the variable that we will try to predict.

    df.Revenue.value_counts().plot(kind='bar')
    plt.xlabel('Revenue')
    plt.ylabel('# of visitors')
    plt.title('Number of visitors with revenue')
    plt.show()
    

    Visualize some of the features

    Now, let’s try to understand the features of our dataset by plotting them.

    cols = ['VisitorType', 'Month', 'Weekend']
    fig, axes = plt.subplots(1, len(cols), figsize=(12, 3))
    for i, col in enumerate(cols):
        sns.barplot(ax=axes[i], x=col, y='Revenue', data=df)
        axes[i].set_title(f'Revenue - {col}')
    fig.tight_layout()
    plt.show()
    

    Here, we see that although there were no missing values, the 'Other' category of VisitorType looks like a disguised missing value.

    The 'Other' category makes no sense: you are either a new visitor or a returning visitor. We will handle that later.

    Otherwise, we can see that we can expect more sales in November and on weekends.

    We can also plot the impact of bounce rate on revenues.

    sns.violinplot(x='Revenue', y='BounceRates', data=df)
    plt.title('Impact of bounce rate on revenues')
    plt.show()
    

    Here we see that the bounce rate needs to be close to 0 to lead to a sale. That makes sense: how can you purchase a product without clicking anywhere?

    Understand the correlation between variables

    columns = df.select_dtypes(include=['int64', 'float64', 'int32']).columns
    subset = df[columns]
    corr = subset.corr()
    
    sns.heatmap(round(corr,2), cmap='RdBu', annot=True, 
                annot_kws={'size': 7}, vmin=-1, vmax=1)
    
    plt.title('Correlation matrix of numeric features')
    plt.show()
    

    We can see that the two most correlated features are BounceRates and ExitRates.

    Define the Independent and Dependent Variables

    Classification algorithms will try to classify targets (dependent variables) using the features (independent variables) as predictors.

    • target = Revenue
    • features = every column except Revenue

    X = df.drop('Revenue', axis=1)
    y = df['Revenue']
    

    Data Preprocessing

    Data preprocessing helps improve the accuracy of the LogisticRegression predictions.

    Remove irrelevant features

    The PageValues feature should not be used to predict sales.

    It is, in a way, too good a predictor: it is caused by the number of sales rather than the other way around.

    As the number of sales increases, the page value increases, not the reverse. To have a relevant model, we shouldn't use this feature as a predictor.

    X.drop('PageValues', axis=1, inplace=True)
    

    Impute missing values

    Now, let’s take care of the visitor types classified as other.

    As we have seen earlier, these should be considered missing since you are either a new or a returning visitor.

    import numpy as np 
    from sklearn.impute import SimpleImputer
    
    # replace 'Other' with np.nan
    X['VisitorType'] = X['VisitorType'].replace('Other', np.nan)
    
    # replace nan values with the most frequent value.
    imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    X['VisitorType'] = imp.fit_transform(X[['VisitorType']])
    

    Handle categorical data

    Scikit-learn requires categorical features to be converted to numeric features.

    We’ll first convert the True and False to 1s and 0s.

    # convert bool to int
    X['Weekend'] = X['Weekend'].astype(int)
    

    Then convert all other categorical features to numeric using pandas get_dummies.

    import pandas as pd
    X = pd.get_dummies(X, drop_first=True)
    X.head(3)
    

    Scale numeric data

    To improve the accuracy of the model, preprocess the numeric features so that they are all on the same scale.

    from sklearn.preprocessing import StandardScaler
    
    num_cols = [
        'Administrative_Duration',
        'Informational_Duration',
        'ProductRelated_Duration',
        'BounceRates',
        'ExitRates',
        'SpecialDay'
        ]
    
    scaler = StandardScaler()
    X[num_cols] = scaler.fit_transform(X[num_cols])
    

    Split into Training and Testing sets

    To be able to evaluate the quality of the model, we need to split the data into training and testing sets.

    The model will be trained on the training set.

    Predictions will then be made with the trained model on the testing set and compared to the true labels to evaluate how accurate the model is.

    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    

    Apply the Logistic Regression

    The LogisticRegression can take multiple parameters.

    GridSearchCV is used to identify and use the best possible parameters for the LogisticRegression.

    The .fit() method is used to train the model.

    %%time
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    
    # define log_reg parameters to try
    params = {
        'C': [0.001, 0.01, 0.1, 1.],
        'penalty': ['l1', 'l2']
    }
    
    # initiate the logistic regression
    log_reg = LogisticRegression(
        random_state=42, 
        class_weight='balanced',
        solver='liblinear'
        )
    
    # find and use the best parameters
    # of the logistic regression
    log_reg_cv = GridSearchCV(
        log_reg, 
        param_grid=params, 
        cv=5,
        scoring='accuracy',
    )
    
    # train the model on the training set
    log_reg_cv.fit(X_train, y_train)
    
    # make prediction
    y_pred = log_reg_cv.predict(X_test)
    print(f'Prediction: {y_pred[:5]}')
    
    # compute the score and the print parameters used
    print('Best score: ', log_reg_cv.best_score_)
    print('Best params: ', log_reg_cv.best_params_)
    print('Score: ', log_reg_cv.score(X_test, y_test))
    

    Fine-tune the model

    Above, we saw that the accuracy of the model was around 63%, which is not a great score, depending on what you intend to do with the data.

    You can improve that score by fine-tuning the model.
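
    For example, one option is to search a wider parameter grid; the values and scoring choice below are illustrative, not tuned:

    # an illustrative, wider grid to search over
    params = {
        'C': [0.001, 0.01, 0.1, 1., 10., 100.],
        'penalty': ['l1', 'l2'],
        'class_weight': [None, 'balanced'],
    }

    log_reg_cv = GridSearchCV(
        LogisticRegression(random_state=42, solver='liblinear'),
        param_grid=params,
        cv=5,
        scoring='f1',  # often more informative than accuracy on imbalanced data
    )
    log_reg_cv.fit(X_train, y_train)
    print('Best params: ', log_reg_cv.best_params_)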

    Evaluate the feature importance

    Let’s look at which features influenced the model the most.

    # estimate feature importance
    feature_imp = log_reg_cv.best_estimator_.coef_[0]
    
    # Set fig size
    f = plt.figure()
    f.set_figheight(7)
    
    # set label names
    labels = list(X_train.columns)
    
    # plot graph
    plt.barh(range(len(feature_imp)), feature_imp)
    plt.yticks(range(len(labels)), labels)
    plt.title('Most important features for Online Shoppers (LogisticRegression)')
    
    plt.show()
    

    Compute the Classification Report

    Evaluate the precision, recall and F1-score of your model with the classification report.

    from sklearn.metrics import classification_report
    
    # make prediction
    y_pred = log_reg_cv.predict(X_test)
    
    # compute the classification report
    print(classification_report(y_test, y_pred))
    

    Plot the confusion matrix

    Visualize the model's predictions against the true labels in a confusion matrix.

    from sklearn.metrics import ConfusionMatrixDisplay

    # plot_confusion_matrix was removed in scikit-learn 1.2;
    # ConfusionMatrixDisplay.from_estimator is the current equivalent
    matrix = ConfusionMatrixDisplay.from_estimator(
        log_reg_cv, X_test, y_test, cmap=plt.cm.Blues)
    matrix.ax_.set_title('Confusion Matrix')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()
    

    Plot the ROC Curve

    Plot the receiver operating characteristic curve (ROC curve) to check the impact of varying the threshold on the true positive rates and false-positive rates.

    from sklearn.metrics import roc_curve
    
    y_pred_prob = log_reg_cv.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label='Logistic Regression')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Logistic Regression ROC curve')
    plt.show()
    

    The larger the area under the curve, the better the model.

    from sklearn.metrics import roc_auc_score
    y_pred_prob = log_reg_cv.predict_proba(X_test)[:,1]
    roc_auc_score(y_test, y_pred_prob)
    # 0.7392785169515115
    
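    Since the ROC curve is built by varying the decision threshold, you can also use it to pick a threshold other than the default 0.5. Here is a minimal sketch, reusing fpr, tpr and thresholds from above and choosing the threshold with Youden's J statistic (my choice of criterion):

    import numpy as np

    # pick the threshold that maximizes TPR - FPR (Youden's J statistic)
    best_threshold = thresholds[np.argmax(tpr - fpr)]
    print('Best threshold: ', best_threshold)

    # re-label predictions using the chosen threshold
    y_pred_tuned = (y_pred_prob >= best_threshold).astype(int)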

    Conclusion

    Huge congratulations for completing this tutorial on logistic regression with Scikit-learn!
