sklearn GridSearchCV with Python

GridSearchCV is a hyperparameter tuning tool used in machine learning to optimize models. More specifically, it is a class from Scikit-learn's model_selection module that uses cross-validation to find the best parameters for a given model and a chosen performance metric.

Hyperparameter tuning is the process of choosing the hyperparameter values (the settings fixed before training) that give a machine learning algorithm its best performance.

The main objective of GridSearchCV is to evaluate every combination of the parameter values you specify for an estimator and keep the one that performs best. Simply put, GridSearchCV tests different hyperparameter values to find the best ones for your model.


What is Cross-Validation

Cross-validation, also known as CV, is a method used to select model parameters in a way that does not rely too heavily on a single split of the training data. It holds out a portion of the training data to use as a validation set.

Credit Stanford University

One of the most commonly used CV methods is k-fold cross-validation. GridSearchCV uses k-fold cross-validation to find the best combination of hyperparameters.

K-Fold Cross-Validation

According to Stanford University, K-Fold Cross-Validation works by splitting the data into K subsets (also known as folds). For every i = 1, ..., K, the model is trained on all folds except the i-th, and the test error is computed on the i-th fold. The K test errors are then averaged.

To better illustrate this process, the visualization below shows that, for a given number of splits, cross-validation holds out one fold, trains on the remaining folds with a candidate parameter value, and calculates a metric on the held-out fold (e.g. accuracy, precision, F1-score). It then holds out the next fold and makes the same calculation, and so on. Finally, it averages the metrics. Repeating this for each candidate value reveals the best-performing parameters.
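
The procedure described above can be sketched by hand with Scikit-learn's KFold splitter. This is a minimal illustration, not part of the original tutorial: the choice of n_neighbors=5, the shuffling and the random seed are assumptions made only so the example runs reproducibly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# K = 5 folds; shuffle and random_state are illustrative choices
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_errors = []
for train_idx, val_idx in kf.split(X):
    # Train on every fold except the held-out one
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    # Test error on the held-out fold = 1 - accuracy
    accuracy = model.score(X[val_idx], y[val_idx])
    fold_errors.append(1 - accuracy)

# Average the K test errors
mean_error = float(np.mean(fold_errors))
print(f"Error per fold: {np.round(fold_errors, 3)}")
print(f"Average CV error: {mean_error:.3f}")
```

Each sample lands in the validation set exactly once, so the averaged error uses all of the data without ever scoring a model on samples it was trained on.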

cross-validation basics

Cross-Validation in GridSearchCV

GridSearchCV performs cross-validation by following these steps:

  • GridSearchCV splits the training data into k parts (folds) of roughly equal size.
  • Each fold is used once as a validation set while the remaining folds are used for training.
  • This cross-validation (e.g., with KFold) is performed for each combination of hyperparameters.
  • The best hyperparameters are returned based on the average performance across the cross-validation folds.
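
To make these steps concrete, here is a rough hand-written equivalent that loops over candidate values and scores each one with cross_val_score. It is a simplified sketch of the idea rather than what GridSearchCV does internally, and the candidate list is only an example. Note that GridSearchCV additionally refits the best model on the full training data by default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Example candidate values for the n_neighbors hyperparameter
candidate_values = [3, 5, 7, 9]

mean_scores = {}
for k in candidate_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation for this single candidate
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

# Keep the candidate with the highest average score across folds
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(f"Best n_neighbors: {best_k}")
```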

The image below shows the best n_neighbors parameter value for a K-Nearest Neighbors algorithm.

gridsearchcv on knn

What are the Best GridSearchCV Performance Metrics?

The right performance metrics to evaluate in GridSearchCV depend on the model you are using, the dataset, and the context of the machine learning project. GridSearchCV supports a number of metrics through its scoring parameter, such as accuracy, precision and recall. Each metric, or combination of metrics, has its own use case and should be chosen with your data in mind.
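
As an illustration, GridSearchCV can track several metrics at once when scoring is given a list, with refit naming the metric used to pick the final model. The sketch below is an assumption-laden example: choosing recall as the deciding metric is arbitrary, and in load_breast_cancer the label 1 means benign, so the default precision and recall scorers treat benign as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

metrics = ["accuracy", "precision", "recall"]

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7, 9]},
    cv=5,
    scoring=metrics,   # evaluate all three metrics for every candidate
    refit="recall",    # illustrative choice: select the best model by recall
)
search.fit(X, y)

# One mean score per candidate, for each metric
for metric in metrics:
    print(metric, search.cv_results_[f"mean_test_{metric}"].round(3))
print("Best by recall:", search.best_params_)
```

Comparing the rows side by side shows whether the candidate that maximizes one metric is also the best for the others.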

GridSearchCV With Python Example

In this GridSearchCV with Python example, we will perform a classification task on the Breast Cancer Dataset available in Scikit-learn. We will use KNN to predict whether a tumor is malignant or benign, using GridSearchCV to find the best n_neighbors parameter value for our dataset.

Getting Started

$ pip3 install -U scikit-learn pandas

Load and Explore the dataset

from sklearn.datasets import load_breast_cancer
import pandas as pd

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()

Show features and targets

print('Features:', dataset.feature_names)
print('Targets:', dataset.target_names)

Get data as arrays

X = dataset.data
y = dataset.target

Split the dataset into training and test sets

We will use train_test_split from sklearn.model_selection to split the data into training and testing data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Instantiate the KNN Machine Learning Model

We will instantiate the KNeighborsClassifier class from sklearn.neighbors with its default parameters.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

Defining the Parameter Grid

# Defining the parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9],   # Trying different 'k' n_neighbors values
}

Set up GridSearchCV

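# Set up GridSearchCV
from sklearn.model_selection import GridSearchCV

knn_cv = GridSearchCV(
    knn,
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy'   # metric used to rank the candidates
)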

Training the Models

knn_cv.fit(X_train, y_train)

Find the Best Parameters and Score

print(f"Best parameters: {knn_cv.best_params_}")
print(f"Best cross-validated accuracy: {knn_cv.best_score_:.2f}")
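
Beyond the single best score, the fitted search object keeps the scores of every candidate in its cv_results_ attribute. The following self-contained sketch repeats the setup from the steps above and prints one line per candidate. The formatting of the printout is just an illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

knn_cv = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
)
knn_cv.fit(X_train, y_train)

# cv_results_ is a dict of arrays with one entry per candidate
results = knn_cv.cv_results_
for params, mean, std, rank in zip(
    results["params"],
    results["mean_test_score"],
    results["std_test_score"],
    results["rank_test_score"],
):
    print(f"{params} mean={mean:.3f} std={std:.3f} rank={rank}")
```

Looking at the standard deviation next to the mean helps judge whether the gap between two candidates is meaningful or within fold-to-fold noise.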

Evaluate the Best Model on the Test Set

best_knn = knn_cv.best_estimator_ # model that performed best
y_pred = best_knn.predict(X_test)
test_accuracy = best_knn.score(X_test, y_test)

print(f"Test set accuracy: {test_accuracy:.2f}")

Scalability of GridSearchCV

GridSearchCV does not scale very well. It has to loop through every combination in the pre-defined grid and train the model once per fold for each of them.

This creates an issue because the number of fits grows exponentially with the number of hyperparameters and linearly with the number of folds. The calculation below shows how the number of fits required for 10-fold CV becomes massive with only 7 hyperparameters of 5 candidate values each.

number_values_per_parameter ** number_parameters * number_of_folds
5 ** 7 * 10 = 781250 fits
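
The same count can be checked in code with Scikit-learn's ParameterGrid, which enumerates the combinations of a grid. The parameter names below (param_0 to param_6) are hypothetical placeholders, used only to build a grid of 7 hyperparameters with 5 values each.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 7 hyperparameters with 5 candidate values each
grid = {f"param_{i}": [1, 2, 3, 4, 5] for i in range(7)}
n_folds = 10

# Number of distinct combinations in the grid (5 ** 7)
n_combinations = len(ParameterGrid(grid))

# Each combination is fitted once per fold
n_fits = n_combinations * n_folds

print(f"{n_combinations} combinations x {n_folds} folds = {n_fits} fits")
# prints: 78125 combinations x 10 folds = 781250 fits
```

Running this kind of check on a real grid before calling fit is a cheap way to estimate how long a search may take.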

RandomizedSearchCV to the Rescue

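An alternative is to use RandomizedSearchCV from sklearn.model_selection. Instead of trying every combination, it evaluates only a fixed number of randomly sampled candidates, set with n_iter.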

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Instantiate
knn = KNeighborsClassifier()

# Set parameter grid
param_grid = {
    'n_neighbors': np.arange(1, 50)   # candidate values from 1 to 49
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    knn, 
    param_grid, 
    n_iter=10,  # Limit the number of iterations (10 combinations)
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',  # You can use 'precision', 'recall', or 'f1' here for different use cases
    random_state=42
)

# Fit the model
random_search.fit(X_train, y_train)

# Find the best parameters and score
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validated accuracy: {random_search.best_score_:.2f}")

# Evaluate the best model on the test set
best_knn = random_search.best_estimator_
test_accuracy = best_knn.score(X_test, y_test)

print(f"Test set accuracy: {test_accuracy:.2f}")
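
RandomizedSearchCV also accepts probability distributions instead of explicit lists of values. The variant below is a sketch that swaps the np.arange grid for scipy.stats.randint. That substitution is an assumption of this example and not something the tutorial above relies on.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# randint(1, 50) draws integers from 1 to 49 (upper bound excluded)
param_distributions = {"n_neighbors": randint(1, 50)}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=10,          # only 10 sampled candidates are evaluated
    cv=5,
    scoring="accuracy",
    random_state=42,    # makes the sampled candidates reproducible
)
search.fit(X_train, y_train)

print(f"Candidates tried: {len(search.cv_results_['params'])}")
print(f"Best parameters: {search.best_params_}")
```

With n_iter=10 and cv=5 the search performs 50 cross-validation fits, regardless of how wide the sampled range is.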

Benefits and Challenges of GridSearchCV

  • The main benefit of using GridSearchCV is improved model performance. It fine-tunes parameters, which can lead to better predictions on unseen data.
  • The main challenge is cost: the effectiveness of GridSearchCV depends on the size of the dataset, the number of hyperparameters, and the available computational resources.

When to use GridSearchCV instead of KFold and cross_val_score?

You should use GridSearchCV instead of KFold and cross_val_score when you’re not only looking to evaluate a model but also want to tune hyperparameters to find the best combination for your model.
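
The difference can be sketched side by side. In the example below, cross_val_score only evaluates a model whose n_neighbors is fixed, while GridSearchCV evaluates several values on the same folds and keeps the best one. The fixed value of 5, the grid and the seed are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same folds for both approaches so the scores are comparable
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluation only: the hyperparameter is fixed in advance
fixed_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=folds)
print(f"Fixed n_neighbors=5, mean accuracy: {fixed_scores.mean():.3f}")

# Evaluation plus tuning: several values are compared on those folds
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, cv=folds)
search.fit(X, y)
print(f"Tuned {search.best_params_}, mean accuracy: {search.best_score_:.3f}")
```

Because the fixed value is also part of the grid, the tuned score can never be lower than the fixed one on these folds.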

That's it for our tutorial on GridSearchCV with Sklearn.
