Learn Ensemble Learning Algorithms in Machine Learning (with Python Examples)

Ensemble learning is a supervised learning technique used in machine learning to improve overall performance by combining the predictions from multiple models.

Each model in the ensemble can be a different classifier, for example logistic regression, a decision tree, or a support vector machine.

How Does Ensemble Learning Work?

Ensemble learning works on the principle of the “wisdom of the crowd”: by combining the predictions of multiple models, we can often achieve higher accuracy than any single model would on its own.
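
To see why, consider a toy calculation: if three classifiers are each correct 70% of the time and make independent errors, a majority vote is right whenever at least two of them agree on the correct answer. A minimal sketch (the 70% accuracy figure and the independence assumption are illustrative, not measured on any dataset):

    from math import comb
    
    p = 0.70  # accuracy of each individual classifier (illustrative assumption)
    n = 3     # number of classifiers voting
    
    # Probability that a majority (at least 2 of 3) is correct,
    # assuming the classifiers' errors are independent
    majority_acc = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(2, n + 1))
    print(f"Single model: {p:.0%}, majority vote: {majority_acc:.1%}")  # ~78.4%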


    Types of Ensemble Methods

    • Voting
    • Bootstrap aggregation (bagging)
    • Random Forests
    • Boosting
    • Stacked Generalization (Blending)

    Voting

    Voting is an ensemble machine learning algorithm that combines the predictions of multiple models: the average of the predictions for regression, or the majority vote (most frequent class) for classification.

    • Same training sets
    • Different algorithms
    • sklearn.ensemble: VotingRegressor, VotingClassifier

    Example of Voting Classifier in Python (Sklearn)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score
    
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create individual classifiers
    logistic_classifier = LogisticRegression(random_state=42)
    tree_classifier = DecisionTreeClassifier(random_state=42)
    svm_classifier = SVC(random_state=42)
    
    # Create a VotingClassifier with majority rule
    voting_classifier = VotingClassifier(
        estimators=[
            ('logistic', logistic_classifier), 
            ('tree', tree_classifier), 
            ('svm', svm_classifier)],
        voting='hard'  # 'hard' takes a majority vote; 'soft' averages predicted probabilities
    )
    
    # Fit the ensemble classifier to the training data
    voting_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = voting_classifier.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy * 100:.2f}%")
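
    If you prefer soft voting, every estimator must expose predicted probabilities. A minimal variation on the classifiers above (note that SVC only provides predict_proba when created with probability=True):

    # Soft voting averages predicted class probabilities instead of counting votes
    soft_voting_classifier = VotingClassifier(
        estimators=[
            ('logistic', logistic_classifier),
            ('tree', tree_classifier),
            ('svm', SVC(probability=True, random_state=42))],  # enables predict_proba
        voting='soft'
    )
    soft_voting_classifier.fit(X_train, y_train)
    y_pred_soft = soft_voting_classifier.predict(X_test)
    print(f"Soft voting accuracy: {accuracy_score(y_test, y_pred_soft) * 100:.2f}%")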
    

    Bootstrap Aggregation (Bagging)

    Bagging, or bootstrap aggregation, is an ensemble method that reduces the variance of individual models by fitting the same base estimator (typically a decision tree) on different bootstrap samples of the training set and aggregating their predictions.

    • Different training sets
    • Same algorithm
    • Two models from sklearn.ensemble: BaggingClassifier, BaggingRegressor

    Example of Bagging Classifier in Python (Sklearn)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create a base classifier (Decision Tree)
    base_classifier = DecisionTreeClassifier(random_state=42)
    
    # Create a BaggingClassifier
    bagging_classifier = BaggingClassifier(
        estimator=base_classifier,  # Base classifier to be used (called base_estimator in scikit-learn < 1.2)
        n_estimators=10,  # Number of base classifiers (decision trees)
        random_state=42,
    )
    
    # Fit the bagging classifier to the training data
    bagging_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = bagging_classifier.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy * 100:.2f}%")
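
    Because each bootstrap sample leaves some observations out, bagging can estimate its generalization accuracy for free through the out-of-bag (OOB) score. A minimal variation on the classifier above (more estimators make the OOB estimate more stable):

    # oob_score=True evaluates each tree on the observations left out of its bootstrap sample
    oob_bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        oob_score=True,
        random_state=42,
    )
    oob_bagging.fit(X_train, y_train)
    print(f"OOB score: {oob_bagging.oob_score_ * 100:.2f}%")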
    

    Random Forests

    Random forests use decision trees as the base estimator and improve performance by aggregating the predictions of many trees: majority vote for classification, average prediction for regression.

    Random forest is both a supervised learning algorithm and an ensemble algorithm.

    • Base estimator is a decision tree
    • Each estimator uses a different bootstrap sample of the training set
    • Each split considers only a random subset of the features, which further decorrelates the trees
    • Two models from sklearn.ensemble: RandomForestClassifier, RandomForestRegressor

    Example of Random Forests in Python (Sklearn)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create a Random Forest Classifier
    random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Fit the model to the training data
    random_forest.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = random_forest.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy * 100:.2f}%")
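
    A fitted random forest also exposes per-feature importances, which is often a reason to choose it. A quick follow-up using the model fitted above:

    # Impurity-based feature importances, averaged over all trees in the forest
    for name, importance in zip(iris.feature_names, random_forest.feature_importances_):
        print(f"{name}: {importance:.3f}")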
    

    Boosting

    Boosting is an ensemble method that converts weak learners into strong learners by having each predictor fix the errors of its predecessor.

    Boosting can be used in classification and regression problems.

    Boosting machine learning algorithms work by:

    • Instantiating a weak learner (e.g. CART with max_depth of 1)
    • Making predictions and passing the misclassified observations on to the next predictor
    • Paying more attention at each iteration to the observations with prediction errors (see the weight-update sketch after this list)
    • Making new predictions until the limit of estimators is reached or accuracy stops improving
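
    For intuition, here is a minimal NumPy sketch of a single AdaBoost weight update; the labels and predictions are invented purely for illustration:

    import numpy as np
    
    # Made-up labels and one weak learner's predictions (illustration only)
    y_true = np.array([1, -1, 1, 1, -1])
    y_pred = np.array([1, -1, -1, 1, 1])  # this learner gets two observations wrong
    
    w = np.full(len(y_true), 1 / len(y_true))  # start with uniform sample weights
    err = np.sum(w[y_pred != y_true])          # weighted error rate of the learner
    alpha = 0.5 * np.log((1 - err) / err)      # the learner's say in the final vote
    w = w * np.exp(-alpha * y_true * y_pred)   # misclassified samples are up-weighted
    w = w / w.sum()                            # renormalize to a distribution
    print(np.round(w, 3))  # the two mistakes now carry more weight than the rest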

    Common Boosting Algorithms

    • Gradient Boosting: gradient boosting machines, gradient boosted regression trees
    • AdaBoost
    • XGBoost

    Example of Boosting with AdaBoost in Python (Sklearn)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create a base classifier (e.g., Decision Tree)
    base_classifier = DecisionTreeClassifier(max_depth=1)
    
    # Create an AdaBoost Classifier
    adaboost_classifier = AdaBoostClassifier(
        estimator=base_classifier,  # called base_estimator in scikit-learn < 1.2
        n_estimators=50,  # Number of weak learners (you can adjust this)
        random_state=42
    )
    
    # Fit the model to the training data
    adaboost_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = adaboost_classifier.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy * 100:.2f}%")
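
    Gradient boosting, listed above, follows the same scikit-learn API. A minimal sketch with GradientBoostingClassifier on the same train/test split:

    from sklearn.ensemble import GradientBoostingClassifier
    
    # Each new tree fits the residual errors of the ensemble built so far
    gb_classifier = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=1,
        random_state=42
    )
    gb_classifier.fit(X_train, y_train)
    y_pred_gb = gb_classifier.predict(X_test)
    print(f"Gradient boosting accuracy: {accuracy_score(y_test, y_pred_gb) * 100:.2f}%")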
    
    

    Stacked Generalization (Blending)

    Stacking, also known as Stacked Generalization, is an ensemble technique that improves the accuracy of the models by combining predictions of multiple classification or regression machine learning models.

    Stacking machine learning algorithms work by:

    • Using multiple first level models to predict on a training set
    • Combining (stacking) the predictions to generate a new training set
    • Fitting and predicting a second level model on the generated training set
    • From sklearn.ensemble: StackingClassifier, StackingRegressor

    Example of Stacking in Python (Sklearn)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.metrics import accuracy_score
    
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create individual classifiers
    estimators = [
        ('logistic', LogisticRegression(random_state=42)),
        ('tree', DecisionTreeClassifier(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42))
    ]
    
    # Create the StackingClassifier
    stacking_classifier = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression()  # You can choose any final estimator
    )
    
    # Fit the StackingClassifier to the training data
    stacking_classifier.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = stacking_classifier.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Stacking Classifier Accuracy: {accuracy * 100:.2f}%")
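
    To see what StackingClassifier does internally, here is a minimal manual sketch of the same idea: out-of-fold predictions from the first-level models become the training features for the second-level model. cross_val_predict is used so the meta-model never sees predictions made on data the first-level models were trained on:

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    
    # Build the level-2 training set from out-of-fold class probabilities
    meta_features = np.column_stack([
        cross_val_predict(model, X_train, y_train, cv=5, method='predict_proba')
        for _, model in estimators
    ])
    
    # Fit the second-level (meta) model on the stacked predictions
    meta_model = LogisticRegression(max_iter=1000)
    meta_model.fit(meta_features, y_train)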
    
    


    Conclusion

    This concludes the introduction to ensemble machine learning algorithms. We have covered how ensemble learning works and provided an overview of the most common ensemble methods.

    The next step is to learn how to use Scikit-learn to train each of these ensemble machine learning models on real data.
