GridSearchCV is a hyperparameter tuning technique used in machine learning to optimize model performance. More specifically, it is a class from Scikit-learn's model_selection module that performs cross-validation to find the best parameters for a given model and a defined performance metric.
Hyperparameter tuning is the process of choosing the hyperparameter values that optimize a machine learning algorithm.
The main objective of GridSearchCV is to evaluate all possible combinations of the specified parameter values for an estimator and find the combination that optimizes its performance. Simply put, GridSearchCV tests different hyperparameters to find the best ones for your model.
What is Cross-Validation?
Cross-validation, also known as CV, is a method used to select model parameters in a way that does not rely too heavily on the initial training set. It holds out a portion of the training data to use as a validation set.
One of the most commonly used CV methods is k-fold cross-validation, which GridSearchCV leverages to discover the best combination of hyperparameters.
K-Fold Cross-Validation
According to Stanford University, k-fold cross-validation works by splitting the data into K subsets (also known as folds). For every i = 1, …, K, the model is trained on all folds except the i-th fold, the test error is computed on the i-th fold, and the K test errors are then averaged.
To better illustrate this process: for a given number of splits, cross-validation holds out one fold, trains with a candidate parameter value on the remaining folds, and calculates a test error metric (e.g., accuracy, precision, f1-score) on the held-out fold. It then does the same with the next fold, and so on. Finally, it averages the metrics to find the optimal parameters.
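Here is a minimal sketch of that loop using Scikit-learn's KFold, with the breast cancer dataset and an arbitrary n_neighbors=5 as stand-ins for illustration:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_accuracies = []
for train_idx, test_idx in kf.split(X):
    # Train on all folds except the held-out one
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    # Compute the test metric on the held-out fold
    fold_accuracies.append(model.score(X[test_idx], y[test_idx]))

print('Accuracy per fold:', fold_accuracies)
print('Average accuracy:', np.mean(fold_accuracies))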
Cross-Validation in GridSearchCV
GridSearchCV performs cross-validation by following these steps:
- GridSearchCV splits the training data into k equal parts (folds).
- Each fold is used once as a validation set while the remaining folds are used for training.
- It performs cross-validation (e.g., with KFold) for each combination of hyperparameters.
- It returns the best hyperparameters based on the average performance across the cross-validation folds (a rough sketch of this loop follows the list).
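Conceptually, the search is just a loop over the grid with cross-validation inside. A rough, simplified sketch of what GridSearchCV does (ignoring parallelism and the final refit), assuming X_train and y_train are the training arrays we create later in this tutorial:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

candidate_values = [3, 5, 7, 9]  # hypothetical grid of n_neighbors values
best_score, best_k = -1.0, None

for k in candidate_values:
    # Cross-validate this hyperparameter combination on the training data
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5, scoring='accuracy')
    if scores.mean() > best_score:
        best_score, best_k = scores.mean(), k

print(f"Best n_neighbors: {best_k} (mean CV accuracy {best_score:.2f})")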
The image below shows the best n_neighbors parameter value for a KNeighborsClassifier algorithm.
What are the Best GridSearchCV Performance Metrics?
The right performance metrics to evaluate in GridSearchCV depend on the model you are using, the dataset, and the context of the machine learning project. GridSearchCV supports a number of metrics, such as accuracy, precision, and recall. Each metric, or combination of metrics, has its own specific use case and should be considered when modeling your data.
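For instance, the scoring parameter controls which metric GridSearchCV optimizes, and you can track several metrics at once by passing a list and telling refit which one should pick the final model. A minimal sketch (the grid and estimator here are placeholders):
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Track several metrics at once; 'refit' decides which one selects the best model
search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [3, 5, 7, 9]},
    scoring=['accuracy', 'precision', 'recall'],
    refit='accuracy',
    cv=5
)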
GridSearchCV With Python Example
In this GridSearchCV with Python example, we will perform a classification task on the Breast Cancer dataset available in Scikit-learn. We will use KNN to predict whether a cancer is malignant or benign, using GridSearchCV to find the best n_neighbors parameter value for our dataset.
Getting Started
$ pip3 install -U scikit-learn
Load and Explore the dataset
from sklearn.datasets import load_breast_cancer
import pandas as pd
dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()
Show features and targets
print('Features:', dataset.feature_names)
print('Targets:', dataset.target_names)
Get data as arrays
X = dataset.data
y = dataset.target
Split the dataset into training and test sets
We will use train_test_split from sklearn.model_selection to split the data into training and testing data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Instantiate the KNN Machine Learning Model
We will instantiate the KNN class from sklearn.neighbors without any parameters.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
Defining the Parameter Grid
# Defining the parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9],  # Trying different 'k' n_neighbors values
}
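Because GridSearchCV evaluates every combination, the grid can also hold several parameters at once. As a hypothetical extension, adding the weights parameter of KNeighborsClassifier would multiply the number of candidates:
# A richer (hypothetical) grid: GridSearchCV evaluates every combination,
# so 4 n_neighbors values x 2 weights values = 8 candidates
param_grid_extended = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
}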
Set up GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set up GridSearchCV with 5-fold cross-validation
knn_cv = GridSearchCV(
    knn,
    param_grid,
    cv=5,
    scoring='accuracy'
)
Training the Models
knn_cv.fit(X_train, y_train)
Find the Best Parameters and Score
print(f"Best parameters: {knn_cv.best_params_}")
print(f"Best cross-validated accuracy: {knn_cv.best_score_:.2f}")
Evaluate the Best Model on the Test Set
best_knn = knn_cv.best_estimator_ # model that performed best
y_pred = best_knn.predict(X_test)
test_accuracy = best_knn.score(X_test, y_test)
print(f"Test set accuracy: {test_accuracy:.2f}")
Scalability of GridSearchCV
GridSearchCV does not scale very well. It has to loop through every predefined combination in the grid and train the model on each one.
This creates an issue because the number of fits to be performed grows exponentially with the number of parameters, multiplied by the number of folds. The formula below shows how, with 10-fold CV, the number of required fits explodes with only 7 hyperparameters (5 values each):
number_values_per_parameter ** number_parameters * number_of_folds
5 ** 7 * 10 = 781250 fits
RandomizedSearchCV to the Rescue
An alternative is to use RandomizedSearchCV from sklearn.model_selection, which samples a fixed number of hyperparameter combinations from the grid instead of trying them all.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Instantiate
knn = KNeighborsClassifier()
# Set the parameter grid
param_grid = {
    'n_neighbors': np.arange(1, 50)
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    knn,
    param_grid,
    n_iter=10,           # Limit the number of sampled combinations to 10
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',  # You can use 'precision', 'recall', or 'f1' here for different use cases
    random_state=42
)
# Fit the model
random_search.fit(X_train, y_train)
# Find the best parameters and score
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validated accuracy: {random_search.best_score_:.2f}")
# Evaluate the best model on the test set
best_knn = random_search.best_estimator_
test_accuracy = best_knn.score(X_test, y_test)
print(f"Test set accuracy: {test_accuracy:.2f}")
Benefits and Challenges of GridSearchCV
- The main benefit of using GridSearchCV is to improve model performance. It fine-tunes parameters, which leads to better predictions on unseen data.
- The main challenge is computational cost: the effectiveness of GridSearchCV depends on the size of the dataset, the number of hyperparameters, and the available computational resources.
When to use GridSearchCV instead of KFold and cross_val_score?
You should use GridSearchCV instead of KFold and cross_val_score when you are not only looking to evaluate a model but also want to tune hyperparameters to find the best combination for your model.
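In other words, if you only need a cross-validated estimate for one fixed configuration, cross_val_score alone is enough. A minimal sketch, reusing the training data from this tutorial and an arbitrary n_neighbors value:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# No tuning: just estimate how one fixed configuration performs
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                         X_train, y_train, cv=5, scoring='accuracy')
print(f"Mean CV accuracy: {scores.mean():.2f}")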
That's it for our tutorial on GridSearchCV with Sklearn.