Classification Machine Learning Project in Scikit-Learn (Python Example)

In this article, we will learn Scikit-learn in Python by working through a typical machine learning classification problem.

We will:

  1. Load the dataset
  2. Explore the dataset
  3. Split data into features and targets (independent and dependent variables)
  4. Create new features (feature engineering)
  5. Preprocess the data
  6. Split data into training and testing datasets
  7. Run several classification algorithms and find their ideal hyperparameters
  8. Train the model
  9. Evaluate each model

Load the Titanic dataset with Scikit-learn

Most machine learning models need preprocessing of untidy data.

    The Titanic dataset is a great one to use to practice machine learning classification.

    We will load the data using the fetch_openml method available with the scikit-learn library.

    from sklearn.datasets import fetch_openml
    
    # load dataset
    titanic = fetch_openml('titanic', version=1, as_frame=True)
    df = titanic['data']
    df['survived'] = titanic['target']
    

    Exploratory data analysis

    Exploratory data analysis (EDA) is an important step of the machine learning process.

    It helps understand, clean and validate what data is important, missing or in the wrong format for your machine learning model to understand.

    Load Packages for exploratory data analysis

    To perform EDA, we will use Matplotlib, Pandas and Seaborn.

    import matplotlib.pyplot as plt
    import pandas as pd 
    import seaborn as sns
    
    from sklearn.datasets import fetch_openml
    sns.set()
    

    Preview the dataset

    Let’s start by understanding our dataset.

    df.head(3)
    

    Generate descriptive statistics

    Preview the descriptive statistics of the numeric features.

    df.describe()
    
    Looking at the std row, you can see a great deal of variation in how the numeric features are spread.

    Show column types and non-null counts

    df.info()
    

    Some columns have fewer non-null values than the total row count, which means they contain missing data.

    Show missing values

    df.isnull().sum()
    

    A lot of columns are missing values.

    Let’s visualize that.

    miss_vals = pd.DataFrame(df.isnull().sum() / len(df) * 100)
    miss_vals.plot(kind='bar',
        title='Missing values in percentage',
        ylabel='percentage'
        )
    
    plt.show()
    

    Visualize the target variable

    Since we will try to predict survival, let’s visualize the survival column.

    df.survived.value_counts().plot(kind='bar')
    
    plt.xlabel('Survival')
    plt.ylabel('# of passengers')
    plt.title('Number of passengers based on their survival')
    plt.show()
    

    Survival by age

    fig, ax = plt.subplots()
    
    # the first histogram shows all passengers; survivors are overlaid on top
    ax.hist(df.age.dropna(), label='All passengers')
    ax.hist(df['age'][df.survived == '1'].dropna(), label='Survived')
    
    plt.xlabel('Age')
    plt.ylabel('# of passengers')
    plt.title('Survival by age')
    plt.legend()
    plt.show()
    

    Survival by gender

    # convert the target to int so seaborn can compute the mean survival rate
    df['survived'] = df.survived.astype('int')
    
    sns.barplot(
        x='sex',
        y='survived',
        data=df
    )
    
    plt.title('Survival by gender')
    plt.show()
    

    Survivors by class

    sns.countplot(x='pclass', data=df)
    plt.title('Number of passengers by class')
    plt.show()
    
    sns.barplot(x='pclass', y='survived', data=df)
    plt.title('Survival rate by class')
    plt.show()
    

    Survivors by port of embarkation

    sns.barplot(x='embarked', y='survived', data=df)
    plt.title('Survival rate by port of embarkation')
    plt.show()
    

    Assign Independent and Dependent Variables

    Classification algorithms will try to classify targets (dependent variables) using the features (independent variables) as predictors.

    Here, the column that we want to make a prediction on is the column that states whether or not the passenger survived.

    We can assign the features to X by using the drop method to keep all columns except the target and assigning the target to y.

    from sklearn.datasets import fetch_openml
    
    # load dataset
    titanic = fetch_openml('titanic', version=1, as_frame=True)
    df = titanic['data']
    df['survived'] = titanic['target']
    
    # Assign Dependent and Independent variables
    X = df.drop('survived', axis=1)
    y = df['survived']
    

    Better still, fetch_openml() does that for you with the return_X_y keyword.

    from sklearn.datasets import fetch_openml
    
    X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)
    

    Feature Engineering

    Some features don’t have much meaning when used alone. However, we can give them meaning by looking at the context.

    For example, the sibsp and parch columns tell you if the passenger was travelling with siblings, parents or children. By combining these features, you can infer if the passenger was travelling alone, and see if that impacted the chances of survival.

    X['family'] = X['sibsp'] + X['parch']
    X.loc[X['family'] > 0, 'travelled_alone'] = 0
    X.loc[X['family'] == 0, 'travelled_alone'] = 1
    X.drop(['family', 'sibsp', 'parch'], axis=1, inplace=True)
    sns.countplot(x='travelled_alone', data=X)
    plt.title('Number of passengers travelling alone')
    plt.show()
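    
    To check whether travelling alone actually impacted the chances of survival, we can plot the mean survival rate for each group. Here is a quick sketch, assuming X and y were created as above so the survived target still lives in y as strings:

    # survival rate for passengers travelling alone vs. with family
    sns.barplot(x=X['travelled_alone'], y=y.astype(int))
    plt.title('Survival rate by travelled_alone')
    plt.show()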
    

    Preprocess Data with Scikit-learn

    There are two main reasons why you want to do data preprocessing before training your machine learning model:

    • To satisfy the requirements of the Scikit-learn API
    • To clean erroneous and missing data from datasets

    We will:

    1. Remove features that we don’t want
    2. Fill missing values
    3. Convert categorical data features to numeric format
    4. Scale numeric features

    Remove features that we don’t want

    First, we remove the columns that have too many missing values.

    # remove high missing value columns
    X.drop(['cabin', 'boat', 'body'], axis=1, inplace=True)
    
    # remove less interesting features
    X.drop(['name','ticket','home.dest'], axis=1, inplace=True)
    

    Fill Missing Values (Imputation)

    Most Scikit-learn estimators cannot handle missing values, so we need to deal with them before training.

    Thus, using SimpleImputer, we will fill the missing values using the mean for numeric data, and the most_frequent value for categorical data.

    from sklearn.impute import SimpleImputer
    
    def get_parameters(df):
        # Build an imputation strategy for every column that has missing values:
        # 'mean' for numeric columns, 'most_frequent' for everything else.
        parameters = {}
        for col in df.columns[df.isnull().any()]:
            if df[col].dtype in ('float64', 'int64', 'int32'):
                strategy = 'mean'
            else:
                strategy = 'most_frequent'
            # grab an example of the missing value (NaN) to pass to SimpleImputer
            missing_values = df[col][df[col].isnull()].values[0]
            parameters[col] = {'missing_values': missing_values, 'strategy': strategy}
        return parameters
    
    parameters = get_parameters(X)
    
    for col, param in parameters.items():
        missing_values = param['missing_values']
        strategy = param['strategy']
        imp = SimpleImputer(missing_values=missing_values, strategy=strategy)
        X[col] = imp.fit_transform(X[[col]])
    
    X.isnull().sum()
    

    Handle Categorical Data

    Scikit-learn requires categorical data to be converted into a numeric format. Using the pandas get_dummies method, we will convert each categorical feature into columns of 0s and 1s.

    # handle categorical data
    cat_cols = X.select_dtypes(include=['object','category']).columns
    dummies = pd.get_dummies(X[cat_cols], drop_first=True)
    X[dummies.columns] = dummies
    X.drop(cat_cols, axis=1, inplace=True)
    X.head()
    

    Scale Numeric Data

    To improve model performance, we will scale the numeric features so that they all have a mean of 0 and a standard deviation of 1.

    # Scale numeric data
    from sklearn.preprocessing import StandardScaler
    
    # Select numerical columns
    num_cols = X.select_dtypes(include=['int64', 'float64', 'int32']).columns
    
    # Apply StandardScaler
    scaler = StandardScaler()
    X[num_cols] = scaler.fit_transform(X[num_cols])
    

    Split into Training and Testing Sets

    We could apply the machine learning model to the entire dataset, but that wouldn’t be useful to evaluate the model performance. Instead, we split the dataset into training and testing datasets.

    from sklearn.model_selection import train_test_split
    
    RAND_STATE = 42
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RAND_STATE)
    

    Then, we will apply the model to the training set, and compare the result to the actual data in the test set.
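    
    As a quick illustration of that fit / predict / compare loop, here is a minimal sketch that uses a DummyClassifier baseline as a stand-in for the real models we train below:

    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score
    
    # fit a naive "most frequent class" baseline on the training set
    baseline = DummyClassifier(strategy='most_frequent')
    baseline.fit(X_train, y_train)
    
    # predict on the unseen test set and compare to the actual labels
    y_pred_baseline = baseline.predict(X_test)
    print(accuracy_score(y_test, y_pred_baseline))
    

    Any model worth keeping should beat this baseline score.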

    Choose the best classifier algorithm

    There are many classification machine learning models in Scikit-learn.

    We will need to compare them and choose the best model to predict the data.

    The models that we will look at are:

    1. LogisticRegression
    2. KNeighborsClassifier
    3. SVC
    4. RandomForestClassifier
    5. DecisionTreeClassifier

    Create a dictionary to store results

    cross_val_scores = {}
    
    models = [
        'LogisticRegression', 
        'KNeighborsClassifier',
        'SVC',
        'RandomForestClassifier',
        'DecisionTreeClassifier'
        ]
    
    # give each model its own results dict (a single shared dict would be
    # overwritten by every model's results)
    empty_dict = {
        'best_score':'',
        'best_params':'',
        'score':''
        }
    
    for m in models:
        cross_val_scores[m] = empty_dict.copy()
    
    cross_val_scores
    

    LogisticRegression

    Create a logistic regression with the LogisticRegression algorithm.

    Below, you will see that we will execute the hyperparameter tuning for each of the models using GridSearchCV. This will allow us not only to compare the models, but to identify the best parameters for each model.

    %%time
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    
    params = {
        'C': [0.001, 0.01, 0.1, 1.],
        'penalty': ['l1', 'l2']
    }
    
    log_reg = LogisticRegression(
        random_state=RAND_STATE, 
        class_weight='balanced',
        solver='liblinear'
        )
    
    log_reg_cv = GridSearchCV(
        log_reg, 
        param_grid=params, 
        cv=5,
        scoring='accuracy',
    )
    
    log_reg_cv.fit(X_train, y_train)
    
    cross_val_scores['LogisticRegression']['best_score'] = log_reg_cv.best_score_
    cross_val_scores['LogisticRegression']['best_params'] = log_reg_cv.best_params_
    cross_val_scores['LogisticRegression']['score'] = log_reg_cv.score(X_test, y_test)
    

    KNeighborsClassifier

    Create the classification using the K-Nearest neighbors algorithm.

    %%time
    import numpy as np 
    
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import GridSearchCV
    
    params = {'n_neighbors': np.arange(1, 50)}
    
    knn = KNeighborsClassifier()
    
    knn_cv = GridSearchCV(
        knn, 
        param_grid=params,
        cv=5,
        scoring='accuracy'
        )
    
    knn_cv.fit(X_train, y_train)
    
    cross_val_scores['KNeighborsClassifier']['best_score'] = knn_cv.best_score_
    cross_val_scores['KNeighborsClassifier']['best_params'] = knn_cv.best_params_
    cross_val_scores['KNeighborsClassifier']['score'] = knn_cv.score(X_test, y_test)
    

    SVC

    Create the classification using the SVC algorithm.

    %%time
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    
    params = {
        'C': [0.001, 0.01, 0.1, 1.],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'gamma': ['scale', 'auto'],
    }
    
    svc = SVC(
        random_state=RAND_STATE,
        class_weight='balanced',
        probability=True,
    )
    
    svc_cv = GridSearchCV(
        svc, 
        param_grid=params, 
        cv=5,
        scoring='accuracy',
    )
    
    svc_cv.fit(X_train, y_train)
    
    cross_val_scores['SVC']['best_score'] = svc_cv.best_score_
    cross_val_scores['SVC']['best_params'] = svc_cv.best_params_
    cross_val_scores['SVC']['score'] = svc_cv.score(X_test, y_test)
    

    RandomForestClassifier

    Create the classification using the popular ensemble learning algorithm: the RandomForestClassifier.

    %%time
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    
    params = {
        'n_estimators': [5, 10, 15, 20, 25], 
        'max_depth': [3, 5, 7, 9, 11, 13],
    }
    rand_forest = RandomForestClassifier(
        random_state=RAND_STATE,
        class_weight='balanced',
    )
    
    rf_cv = GridSearchCV(
        rand_forest, 
        param_grid=params, 
        cv=5,
        scoring='accuracy',
    )
    
    rf_cv.fit(X_train, y_train)
    
    cross_val_scores['RandomForestClassifier']['best_score'] = rf_cv.best_score_
    cross_val_scores['RandomForestClassifier']['best_params'] = rf_cv.best_params_
    cross_val_scores['RandomForestClassifier']['score'] = rf_cv.score(X_test, y_test)
    

    DecisionTreeClassifier

    Create the classification using the DecisionTreeClassifier algorithm.

    %%time
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import GridSearchCV
    
    params = {
        'max_depth': [3, 5, 7, 9, 11, 13],
    }
    
    decision_tree = DecisionTreeClassifier(
        random_state=RAND_STATE,
        class_weight='balanced',
    )
    
    dt_cv = GridSearchCV(
        decision_tree, 
        param_grid=params, 
        cv=5,
        scoring='accuracy',
    )
    
    dt_cv.fit(X_train, y_train)
    
    cross_val_scores['DecisionTreeClassifier']['best_score'] = dt_cv.best_score_
    cross_val_scores['DecisionTreeClassifier']['best_params'] = dt_cv.best_params_
    cross_val_scores['DecisionTreeClassifier']['score'] = dt_cv.score(X_test, y_test)
    

    Compare the Results

    Now, let’s compare the machine learning classification algorithms against each other.

    pd.DataFrame(cross_val_scores).T.sort_values(by='best_score',ascending=False)
    

    This report tells us that the best algorithm to choose in this case is the DecisionTreeClassifier with the max_depth parameter set to 3.
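
    If you want to reuse that tuned model directly, the fitted GridSearchCV object exposes it through its best_estimator_ attribute. Here is a short sketch, assuming dt_cv from the DecisionTreeClassifier step above is still in memory:

    # retrieve the refitted DecisionTreeClassifier with the best parameters
    best_tree = dt_cv.best_estimator_
    print(dt_cv.best_params_)               # e.g. {'max_depth': 3}
    print(best_tree.score(X_test, y_test))  # same test score as in the table above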

    Build the model into a Pipeline

    Let’s put everything together by building a Pipeline to run each of these steps together.

    import pandas as pd 
    import matplotlib.pyplot as plt
    
    from sklearn.datasets import fetch_openml
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report, ConfusionMatrixDisplay
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    
    # Set random state for reproducibility
    RAND_STATE = 42
    
    # load data
    X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)
    
    # preprocessing
    X['family'] = X['sibsp'] + X['parch']
    X.loc[X['family'] > 0, 'travelled_alone'] = 0
    X.loc[X['family'] == 0, 'travelled_alone'] = 1
    X.drop(['family', 'sibsp', 'parch'], axis=1, inplace=True)
    X.drop(['cabin', 'boat', 'body'], axis=1, inplace=True)
    X.drop(['name','ticket','home.dest'], axis=1, inplace=True)
    
    # handle numeric features
    numeric_features = ['age','fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())])
    
    # handle categorical features
    categorical_features = ['embarked', 'sex', 'pclass', 'travelled_alone']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    # Create a transformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    
    # Run the classifier
    classifier = DecisionTreeClassifier(
        random_state=RAND_STATE,
        class_weight='balanced',
        max_depth=3
        )
      
    # Split into training and testing 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RAND_STATE)
    
    # Set the Pipeline
    model = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', classifier)])
    
    # Fit the pipeline
    model.fit(X_train, y_train)
    
    # Predict
    y_pred = model.predict(X_test)
    
    # Evaluate
    print(f'Model score: {model.score(X_test, y_test)}')
    
    # compute the classification report
    print(classification_report(y_test, y_pred))
    

    This is it. You can see that the model score is the same as what we computed earlier.

    We also used Scikit-learn’s metrics module to compute the classification report and evaluate the model.

    Now, we’ll plot the confusion matrix to look at the counts of true and false positives and negatives.

    from sklearn.metrics import ConfusionMatrixDisplay
    
    # plot confusion matrix (plot_confusion_matrix was removed in recent
    # scikit-learn releases, so we use ConfusionMatrixDisplay instead)
    color = 'black'
    matrix = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues)
    matrix.ax_.set_title('Confusion Matrix', color=color)
    plt.xlabel('Predicted Label', color=color)
    plt.ylabel('True Label', color=color)
    plt.gcf().axes[0].tick_params(colors=color)
    plt.gcf().axes[1].tick_params(colors=color)
    plt.show()
    

    Feature Importance

    Let’s look at which features influence the model the most.

    %%time
    from sklearn.tree import DecisionTreeClassifier
    
    print(cross_val_scores['DecisionTreeClassifier']['best_params'])
    
    # Note: this reuses the manually preprocessed X_train (imputed, one-hot
    # encoded and scaled) from the preprocessing section, not the raw X_train
    # created in the pipeline example above.
    decision_tree = DecisionTreeClassifier(
        random_state=RAND_STATE,
        class_weight='balanced',
        max_depth=3
    )
    
    decision_tree.fit(X_train, y_train)
    
    feature_imp = decision_tree.feature_importances_
    
    labels = list(X_train.columns)
    plt.barh([x for x in range(len(feature_imp))], feature_imp)
    plt.title('DecisionTreeClassifier Feature Importance')
    plt.yticks(range(len(labels)), labels)
    plt.show()
    

    Above, we can see that we could potentially get rid of the port of embarkation and the travelled_alone features since they don’t seem to impact the results of the DecisionTreeClassifier algorithm.

    Beware: that isn’t necessarily true for every algorithm.

    For example, the port of embarkation has some impact on the LogisticRegression algorithm.
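    
    One way to check that is to look at the magnitude of the fitted coefficients. Here is a quick sketch, assuming log_reg_cv from the LogisticRegression step above is still available and that X_train is the manually preprocessed training set it was fitted on:

    # inspect the LogisticRegression coefficients, largest magnitude first
    best_log_reg = log_reg_cv.best_estimator_
    coefs = pd.Series(best_log_reg.coef_[0], index=X_train.columns)
    print(coefs.abs().sort_values(ascending=False))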

    Conclusion

    Wow. This was a lot.

    We compared some of the most popular classification machine learning algorithms in Scikit-learn on the Titanic dataset.
