Learn Ensemble Learning Algorithms in Machine Learning (with Python Examples)

Ensemble learning is a supervised learning technique used in machine learning to improve overall performance by combining the predictions from multiple models.

Each model in the ensemble can be a different algorithm, such as a logistic regression, a decision tree, or a support vector machine.

How Does Ensemble Learning Work?

Ensemble learning works on the principle of the “wisdom of the crowd”: by combining multiple models, we can improve the accuracy of the predictions.
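
To see why this can work, here is a minimal sketch (under the strong assumption that the classifiers err independently, which real models rarely do) computing the accuracy of a majority vote of five classifiers that are each correct 70% of the time:

from math import comb

n, p = 5, 0.70  # 5 independent classifiers, each correct 70% of the time

# Probability that a strict majority (3 or more out of 5) votes correctly
majority_accuracy = sum(
    comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1)
)
print(f"Single model: {p:.0%}, majority vote: {majority_accuracy:.1%}")
# Single model: 70%, majority vote: 83.7%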


Types of Ensemble Methods

  • Voting
  • Bootstrap aggregation (bagging)
  • Random Forests
  • Boosting
  • Stacked Generalization (Blending)

Voting

Voting is an ensemble machine learning algorithm that combines the predictions of multiple models, taking the average prediction for regression or the most frequent class (majority vote) for classification.

  • Same training sets
  • Different algorithms
  • sklearn.ensemble: VotingRegressor, VotingClassifier

Example of Voting Classifier in Python (Sklearn)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual classifiers
logistic_classifier = LogisticRegression(random_state=42)
tree_classifier = DecisionTreeClassifier(random_state=42)
svm_classifier = SVC(random_state=42)

# Create a VotingClassifier with majority rule
voting_classifier = VotingClassifier(
    estimators=[
        ('logistic', logistic_classifier), 
        ('tree', tree_classifier), 
        ('svm', svm_classifier)],
    voting='hard'  # 'hard' counts class votes; 'soft' averages predicted probabilities
)

# Fit the ensemble classifier to the training data
voting_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = voting_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Bootstrap Aggregation (Bagging)

Bagging, or bootstrap aggregation, is an ensemble method that reduces the variance of a model by fitting several copies of the same base estimator (often a decision tree) on different bootstrap samples of the training set and aggregating their predictions.

  • Different training sets
  • Same algorithm
  • Two models from sklearn.ensemble: BaggingClassifier, BaggingRegressor

Example of Bagging Classifier in Python (Sklearn)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a base classifier (Decision Tree)
base_classifier = DecisionTreeClassifier(random_state=42)

# Create a BaggingClassifier
bagging_classifier = BaggingClassifier(
    estimator=base_classifier,  # Base classifier fit on each bootstrap sample ('base_estimator' before scikit-learn 1.2)
    n_estimators=10,  # Number of base classifiers (decision trees)
    random_state=42,
)

# Fit the bagging classifier to the training data
bagging_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bagging_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Random Forests

Random forests use decision trees as the base estimator and improve performance by taking the majority vote (classification) or average prediction (regression) of many decision trees. In addition to fitting each tree on a bootstrap sample, random forests consider only a random subset of features at each split, which further decorrelates the trees.

Random forest is both a supervised learning algorithm and an ensemble algorithm.

  • Base estimator is a decision tree
  • Each estimator uses a different bootstrap sample of the training set
  • Two models from sklearn.ensemble: RandomForestClassifier, RandomForestRegressor

Example of Random Forests in Python (Sklearn)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest Classifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model to the training data
random_forest.fit(X_train, y_train)

# Make predictions on the test data
y_pred = random_forest.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Boosting

Boosting is an ensemble method that converts weak learners into a strong learner by having each predictor correct the errors of its predecessor.

Boosting can be used in classification and regression problems.

Boosting machine learning algorithms work by:

  • Instantiating a weak learner (e.g. a CART tree with a max_depth of 1)
  • Making predictions and passing information about the errors to the next predictor
  • Paying more attention at each iteration to the observations with prediction errors (see the sketch after this list)
  • Adding new predictors until the estimator limit is reached or accuracy stops improving
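
To illustrate the reweighting idea, here is a minimal sketch of a single AdaBoost-style weight update on toy data (a simplified version of the real algorithm, which also down-weights correct predictions):

import numpy as np

# Toy setup: 6 observations with uniform weights and one weak learner's predictions
weights = np.full(6, 1 / 6)
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 0, 0])  # the weak learner misclassifies samples 3 and 5

miss = y_pred != y_true
error = np.sum(weights[miss])                # weighted error rate (1/3 here)
alpha = 0.5 * np.log((1 - error) / error)    # the learner's "say" in the final vote

# Misclassified observations get heavier weights for the next learner
weights[miss] *= np.exp(alpha)
weights /= weights.sum()                     # renormalize to sum to 1
print(np.round(weights, 3))                  # misclassified samples now weigh more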

Multiple Boosting Algorithms

  • Gradient Boosting: gradient boosting machines, gradient boosted regression trees (sketched below)
  • AdaBoost
  • XGBoost
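
Unlike AdaBoost, gradient boosting fits each new tree to the residual errors of the ensemble built so far rather than reweighting observations. A minimal sketch with scikit-learn's GradientBoostingClassifier on the same Iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each shallow tree corrects the residual errors of the ensemble built so far
gb_classifier = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=1, random_state=42
)
gb_classifier.fit(X_train, y_train)
y_pred = gb_classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")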

Example of Boosting with AdaBoost in Python (Sklearn)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a base classifier (e.g., Decision Tree)
base_classifier = DecisionTreeClassifier(max_depth=1)

# Create an AdaBoost Classifier
adaboost_classifier = AdaBoostClassifier(
    estimator=base_classifier,  # 'base_estimator' before scikit-learn 1.2
    n_estimators=50,  # Number of weak learners (you can adjust this)
    random_state=42
)

# Fit the model to the training data
adaboost_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = adaboost_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Stacked Generalization (Blending)

Stacking, also known as stacked generalization, is an ensemble technique that improves accuracy by training a second-level model to combine the predictions of multiple classification or regression models.

Stacking machine learning algorithms work by:

  • Using multiple first-level models to predict on a training set
  • Combining (stacking) their predictions to generate a new training set
  • Fitting a second-level model on the generated training set and using it for the final predictions
  • Two models from sklearn.ensemble: StackingClassifier, StackingRegressor

Example of Stacking in Python (Sklearn)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual classifiers
estimators = [
    ('logistic', LogisticRegression(random_state=42)),
    ('tree', DecisionTreeClassifier(random_state=42)),
    ('rf', RandomForestClassifier(random_state=42))
]

# Create the StackingClassifier
stacking_classifier = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()  # You can choose any final estimator
)

# Fit the StackingClassifier to the training data
stacking_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = stacking_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy * 100:.2f}%")

Conclusion

This concludes our introduction to ensemble machine learning algorithms. We covered how ensemble learning works and surveyed the most common ensemble methods: voting, bagging, random forests, boosting, and stacking.

The next step is to practice using Scikit-learn to train each of these ensemble models on real data.
