k-Nearest Neighbors (KNN) is a supervised machine learning algorithm that predicts the label of a data point by looking at the majority label among its closest neighbors.

This is a classification approach.

Given a number of neighbors `k`, the k-Nearest Neighbors algorithm looks at the `k` closest labelled data points, finds which label is in the majority, and assigns that label to the new data point.

## Learn k-Nearest Neighbors

This post is an overview of the k-Nearest Neighbors algorithm and is in no way complete.

If you want to learn more about the k-Nearest Neighbors algorithms, here are a few Datacamp tutorials that helped me.

## Understand the k-Nearest Neighbors algorithm visually

This visualization helps to understand how k-Nearest Neighbors works. Given a `k` value, what will be the prediction?

- In the `k=3` circle, `green` is the majority, so new data points will be predicted as `green`;
- In the `k=6` circle, `blue` is the majority, so new data points will be predicted as `blue`.
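The vote described above can be sketched in a few lines of NumPy. The points, the labels and the `predict_majority` helper are made up for illustration (they are not the data behind the visualization); they are arranged so that `k=3` and `k=6` give different answers.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-D training points and their colours (illustration only)
points = np.array([
    [1.0, 0.0], [0.0, 1.2],                              # green, close to the origin
    [2.0, 0.0], [0.0, 2.2], [-2.4, 0.0], [0.0, -2.6],    # blue, a bit further away
    [3.0, 3.0],                                          # green, far away
])
labels = np.array(['green', 'green', 'blue', 'blue', 'blue', 'blue', 'green'])


def predict_majority(query, k):
    """Return the most common label among the k points closest to `query`."""
    distances = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                 # indices of the k closest points
    return str(Counter(labels[nearest]).most_common(1)[0][0])


new_point = np.array([0.0, 0.0])
print(predict_majority(new_point, k=3))  # green (2 green vs 1 blue)
print(predict_majority(new_point, k=6))  # blue (4 blue vs 2 green)
```

The same new data point gets a different label depending on `k`, which is why choosing `k` matters.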

## Advantages and Disadvantages of the KNN approach

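**Advantages:** The k-Nearest Neighbors algorithm is simple to implement and robust to noisy training data.

**Disadvantages:** Prediction is computationally expensive compared to other algorithms, because distances to the training points have to be computed for every new data point. The model is memory-based: it has to store the training set, which makes it less efficient on large datasets. You also need to define which `k` value to use.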

## When to use the KNN algorithm?

- Image and video recognition
- Filtering of recommender systems

## Run the k-Nearest Neighbors with Scikit-learn

Let’s run the k-Nearest Neighbors algorithm with Scikit-learn.

- Load data
- Split data into training and test sets
- Train the classifier model on the training set and make predictions on the test set
- Evaluate the model looking at the known labels.
- Fine-tune the model

### Load data

Sklearn has a set of built-in datasets that we can use. Here we will load the breast cancer dataset.

```
import pandas as pd
from sklearn import datasets
dataset = datasets.load_breast_cancer()
df = pd.DataFrame(dataset.data,columns=dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()
```

Let’s look at the labels that we will try to predict.

Based on all the features that we have:

```
print(dataset.feature_names)
# [
# 'mean radius',
# 'mean texture',
# 'mean perimeter',
# ...,
# 'worst fractal dimension'
# ]
```

we will try to predict whether the tumor is malignant or benign.

```
print(dataset.target_names)
# ['malignant', 'benign']
```

### Split Data into Training and Test sets

Whenever we build a machine learning model, we want to check its accuracy.

You will need to split your data into **training and test datasets** using the `train_test_split` function.

**The training dataset** is used to fit (or train) the model. **The test dataset** is excluded from training. It is labelled data that will be used to compare against the predictions made by the model.

```
from sklearn.model_selection import train_test_split
# Define independent (features) and dependent (targets) variables
X = dataset['data']
y = dataset['target']
# split training and test sets (30% test, class proportions preserved)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
```
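As a quick sanity check, the sketch below repeats the split (so that it runs on its own) and looks at two things: how many rows ended up in each set, and whether `stratify=y` kept the share of benign cases roughly the same in both sets. The `train_share` and `test_share` names are only used for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_breast_cancer()
X, y = dataset['data'], dataset['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Number of rows and columns in each set
print(X_train.shape, X_test.shape)

# Share of class 1 (benign) in each set; stratification keeps them close
train_share = np.mean(y_train)
test_share = np.mean(y_test)
print(round(train_share, 3), round(test_share, 3))
```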

### Train the Model and make predictions

Now, to make predictions based on the labelled data, we will:

- Instantiate `KNeighborsClassifier`, the machine learning model
- Use the `.fit()` method to train the model
- Use the `.predict()` method to make predictions

```
from sklearn.neighbors import KNeighborsClassifier
# train the model
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```
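To connect the prediction back to the idea of a majority vote, the sketch below uses the classifier's `kneighbors()` method to look up the 8 training points closest to the first test sample, then recounts the vote by hand. It repeats the loading and splitting steps so that it runs on its own; `manual_prediction` is a name introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = datasets.load_breast_cancer()
X, y = dataset['data'], dataset['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)

# Distances to, and indices of, the 8 nearest training points for one test sample
distances, indices = knn.kneighbors(X_test[:1])
neighbor_labels = y_train[indices[0]]
print(neighbor_labels)

# Count the votes per class (0 = malignant, 1 = benign) and keep the majority
votes = np.bincount(neighbor_labels, minlength=2)
manual_prediction = np.argmax(votes)
print(manual_prediction, knn.predict(X_test[:1])[0])
```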

### Evaluate the model

It is very important to evaluate the accuracy of the model. We can do this using the `.score()` method on the `knn` object.

```
# compute accuracy of the model
knn.score(X_test, y_test)
```

The accuracy of the model is

```
0.9239766081871345
```

Which is a pretty good result in this case.
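For a classifier, `.score()` returns the mean accuracy: the fraction of test samples whose predicted label matches the known label. A minimal sketch that recomputes it by hand, repeating the earlier steps so that it runs on its own (`manual_accuracy` is a name introduced here):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = datasets.load_breast_cancer()
X, y = dataset['data'], dataset['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Fraction of test samples where the prediction equals the known label
manual_accuracy = np.mean(y_pred == y_test)
print(manual_accuracy)
print(knn.score(X_test, y_test))
```

Both lines should print the same number.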

## Test Different K Values

We can also look at the model accuracy for multiple `k` values.

```
import numpy as np
import matplotlib.pyplot as plt

# k values to try: 1 to 24
neighbors = np.arange(1, 25)
accuracy = np.empty(len(neighbors))

# fit one model per k value and store its accuracy on the test set
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy[i] = knn.score(X_test, y_test)

plt.title('k-NN accuracy by number of Neighbors')
plt.plot(neighbors, accuracy)
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
```

Here we can see that the accuracy of the model starts decreasing as we increase the `k` value.

## Select the best K Value for the Model

Selecting the best hyperparameter is critical to selecting the best model.

The plot above is great, but should you choose `8` or `13`, or even bigger `n_neighbors` values?

Using `GridSearchCV` from the `model_selection` module, you can check the best parameter for your model.

```
import numpy as np
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors':np.arange(1, 50)}
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X_train, y_train)
print(knn_cv.best_params_)
print(knn_cv.best_score_)
```

The result tells you which `k` value gave the best cross-validated accuracy on the training data. In this case, you should set `n_neighbors` to `6`.

```
{'n_neighbors': 6}
0.9498417721518987
```
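By default (`refit=True`), `GridSearchCV` refits the best model on the whole training set and exposes it as `best_estimator_`. The sketch below, which repeats the setup so that it runs on its own, scores that model on the held-out test set. Keep in mind that `best_score_` is a mean cross-validated accuracy computed on the training data, so the two numbers do not have to match.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = datasets.load_breast_cancer()
X, y = dataset['data'], dataset['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Same grid as above: k from 1 to 49, 5-fold cross-validation
param_grid = {'n_neighbors': np.arange(1, 50)}
knn_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_cv.fit(X_train, y_train)

# The best model, already refitted on the full training set
best_knn = knn_cv.best_estimator_
test_accuracy = best_knn.score(X_test, y_test)
print(best_knn.n_neighbors)
print(test_accuracy)
```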

## Check the Confusion Matrix

It is possible that the accuracy is not fully representative. We will now try to see how many predictions are True and how many are False.

We will do this using:

- Confusion matrix
- Classification report

Quick reminder: in the coming plots we will plot the targets (`0`s and `1`s) and not the target names. Remember that:

- `0` = `malignant`
- `1` = `benign`

### Confusion matrix

To plot the confusion matrix, we will use the `confusion_matrix` function and the `ConfusionMatrixDisplay` class from the `sklearn.metrics` module.

```
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# raw counts: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
print(cm)

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_predictions plots the same y_pred as printed above
color = 'white'  # colour of titles, labels and ticks
matrix = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap=plt.cm.Blues)
matrix.ax_.set_title('Confusion Matrix', color=color)
matrix.ax_.set_xlabel('Predicted Label', color=color)
matrix.ax_.set_ylabel('True Label', color=color)
plt.gcf().axes[0].tick_params(colors=color)
plt.gcf().axes[1].tick_params(colors=color)
plt.show()
```

If you don’t know how to interpret this, just read my post on the confusion matrix.
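As a short reminder of how the cells are laid out, the sketch below unpacks the four counts of the binary matrix with `.ravel()`. Rows are true labels and columns are predicted labels, in the order `0` (`malignant`) then `1` (`benign`). The variable names are introduced here for readability, and the setup is repeated so that the block runs on its own.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = datasets.load_breast_cancer()
X, y = dataset['data'], dataset['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

# Row-by-row: true malignant first, then true benign
(malignant_as_malignant, malignant_as_benign,
 benign_as_malignant, benign_as_benign) = cm.ravel()

correct = malignant_as_malignant + benign_as_benign
wrong = malignant_as_benign + benign_as_malignant
print(correct, wrong)

# The diagonal divided by the total gives the accuracy
accuracy_from_cm = correct / cm.sum()
print(accuracy_from_cm)
```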

### Classification Report

Let’s compute the classification report to assess the quality of the predictions.

```
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```

If you don’t know how to interpret this, just read my post on the classification report.
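The precision and recall values shown in the report can be recomputed from the predictions. Here is a sketch for class `1` (`benign`), with `precision_score` and `recall_score` from `sklearn.metrics` used as a cross-check. The setup is repeated so that the block runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = datasets.load_breast_cancer()
X, y = dataset['data'], dataset['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Counts for class 1 (benign)
true_pos = np.sum((y_pred == 1) & (y_test == 1))   # predicted benign, is benign
false_pos = np.sum((y_pred == 1) & (y_test == 0))  # predicted benign, is malignant
false_neg = np.sum((y_pred == 0) & (y_test == 1))  # predicted malignant, is benign

precision = true_pos / (true_pos + false_pos)  # how many predicted benign are right
recall = true_pos / (true_pos + false_neg)     # how many actual benign were found

print(precision, precision_score(y_test, y_pred))
print(recall, recall_score(y_test, y_pred))
```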

## Additional Parameters for KNN

KNN offers some flexibility through additional parameters that can be set when training a model. For instance, along with `n_neighbors`, it is possible to modify the distance metric and the weighting scheme.

### KNN Distance Metrics

By default KNN calculates the Euclidean distance between two points, but it is also possible to calculate the Manhattan distance instead.

```
# Example modifying the metric for KNN
KNeighborsClassifier(metric='manhattan', n_neighbors=5)
```

- **Euclidean Distance** (in red) is the straight-line distance between A and B.
- **Manhattan Distance** (in blue) is the sum of the horizontal and vertical distances between A and B.
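To make the difference concrete, here is a small NumPy sketch that computes both distances for two made-up points `a` and `b` (the coordinates are chosen only so that the arithmetic is easy to follow).

```python
import numpy as np

# Two hypothetical points A and B
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean: square root of the sum of squared differences -> sqrt(3^2 + 4^2)
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of the absolute differences -> 3 + 4
manhattan = np.sum(np.abs(a - b))

print(euclidean, manhattan)  # 5.0 7.0
```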

### KNN Weighting Schemes

Weighting schemes in KNN help determine how much influence each neighbor has on a prediction. KNN allows for different weighting schemes: uniform and distance. By default, KNN uses uniform weights, meaning all neighbors contribute equally to the prediction.

```
# Example modifying the weights metric for KNN
KNeighborsClassifier(weights='distance')
```

- `'uniform'`: all neighbors have equal weight. Distance from the classified data point is not important.
- `'distance'`: closer neighbors have more weight than those further from the data point.
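The two schemes can disagree on the same data. In the toy sketch below (the 1-D points `X_toy`, the labels `y_toy` and the `query` are made up for illustration), one class-0 point sits close to the query while two class-1 points sit further away. With `weights='distance'`, scikit-learn weights each neighbor by the inverse of its distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training data: one class-0 point near the query,
# two class-1 points further away
X_toy = np.array([[0.0], [2.0], [2.2]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform: class 1 wins with 2 votes against 1
uniform_pred = uniform_knn.predict(query)[0]

# Distance: class 0 gets 1/0.5 = 2.0, class 1 gets 1/1.5 + 1/1.7 (about 1.25)
distance_pred = distance_knn.predict(query)[0]

print(uniform_pred)   # 1
print(distance_pred)  # 0
```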

## Conclusion

This project is now done. We have implemented and checked the accuracy of our k-Nearest Neighbors algorithm in Scikit-learn.

SEO Strategist at Tripadvisor, ex- Seek (Melbourne, Australia). Specialized in technical SEO. Writer in Python, Information Retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.