Scikit-learn: How to Install, Import and Run Sklearn for Machine Learning (with Python Example)

Scikit-learn, or sklearn, is a machine learning library widely used in the data science community for supervised learning and unsupervised learning.



In this post, we will cover the basics of Scikit-learn: how to install and import it, how to load datasets, and how to train and evaluate a first machine learning model.

If you want to deepen your knowledge of sklearn, Datacamp has a fantastic set of tutorials to help you.



    What is Scikit-learn?

    Scikit-learn, or sklearn, is a machine learning package for Python built on top of SciPy, Matplotlib and NumPy.


    Why Use Sklearn?

    One of the major advantages of Scikit-learn is that it can be used for many different applications such as Classification, Regression, Clustering, NLP and more.

    Sklearn not only has a vast range of functionalities but also has very thorough documentation, making the package easy to learn and use.


    Install Scikit-learn

    $ pip install -U scikit-learn
    

    Import sklearn

    Scikit-learn has many modules and many methods and classes.

    The structure of the import goes like this:

    from sklearn.module_name import method_name, ClassName
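
    As an illustration of this pattern, here are a few concrete imports. The modules and names below are real parts of scikit-learn, but which ones you need depends on your project, so treat the selection as an example only.

```python
# Functions and classes are imported from the submodule that defines them.
from sklearn.datasets import load_iris                 # function
from sklearn.model_selection import train_test_split   # function
from sklearn.neighbors import KNeighborsClassifier     # class
from sklearn.preprocessing import StandardScaler       # class

# The top-level package is mostly useful for checking the installed version.
import sklearn
print(sklearn.__version__)
```

    Note that the package is installed as scikit-learn but imported as sklearn.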
    

    Load a dataset with Scikit-learn

    Scikit-learn has a variety of built-in datasets that you can load like this:

    import pandas as pd
    from sklearn.datasets import load_iris
    
    dataset = load_iris()
    df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
    df['target'] = pd.Series(dataset.target)
    df.head()
    

    Sklearn Built-in datasets

    Here are some of the functions that you can use to load more datasets.

    load_boston()  # removed in scikit-learn 1.2, only available in older versions
    load_breast_cancer()
    load_diabetes()
    load_iris()
    load_wine()
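
    The loaders share the same options. A minimal sketch with load_wine(), assuming pandas is installed for the as_frame option:

```python
from sklearn.datasets import load_wine

# Default: a Bunch object, a dictionary-like container
wine = load_wine()
print(wine.data.shape)     # (178, 13)
print(wine.target_names)

# return_X_y=True skips the Bunch and returns the two arrays directly
X, y = load_wine(return_X_y=True)

# as_frame=True exposes the data as pandas objects;
# .frame holds the features and the target in one DataFrame
wine_df = load_wine(as_frame=True).frame
print(wine_df.shape)       # (178, 14)
```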
    

    Fetch openml

    You can also use fetch_openml() to query open datasets.

    from sklearn.datasets import fetch_openml
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
    X.head()
    

    Data Representation in Sklearn

    In scikit-learn, data is represented by two objects, X and y.

    • X = Independent variable = feature = predictor variable
    • y = Dependent variable = target = response variable

    In supervised learning, the scikit-learn tabular dataset has both independent and dependent (X and y) variables.

    In unsupervised learning, the dependent (y) variable is unknown.
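
    To make this concrete, here is a small sketch using the iris dataset. KMeans is used only as one example of an unsupervised estimator, and the n_clusters value is an assumption made for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# X is 2-dimensional: one row per observation, one column per feature
print(X.shape)  # (150, 4)
# y is 1-dimensional: one target value per observation
print(y.shape)  # (150,)

# An unsupervised estimator is fitted on X only, y is not passed
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
print(kmeans.labels_.shape)  # (150,)
```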


    Data Formatting Requirements

    When using the Scikit-learn API, the data should follow certain requirements:

    1. Data should be stored as NumPy arrays or pandas DataFrames
    2. Features should be numeric: categorical variables must be encoded as numbers before they are passed to most estimators
    3. Missing values should be filled or removed
    4. Each X column should be a unique independent variable
    5. Each row should be an observation
    6. y should contain exactly one label per observation (row) of X

    Thus, preprocessing is critical when using Scikit-learn.
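
    As a rough sketch of what such preprocessing can look like, the example below fills a missing value and encodes a categorical column. The toy DataFrame and its column names (age, city) are hypothetical, and mean imputation with one-hot encoding is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a numeric column with a missing value
# and a categorical column stored as text
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Tokyo", "Paris", "Lima"],
})

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="mean"), ["age"]),           # fill missing values
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # encode categories
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X = preprocess.fit_transform(raw)
print(X.shape)  # (4, 4): 1 imputed numeric column + 3 one-hot columns
print(X)
```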


    Generate Dummy Data

    Libraries used: matplotlib, pandas, seaborn, sklearn

    import matplotlib.pyplot as plt
    import pandas as pd 
    import seaborn as sns
    
    from sklearn.datasets import make_classification 
    
    sns.set()
    
    # generate dataset for classification
    X, y = make_classification(n_samples=100, n_classes=2, n_features=5, n_redundant=0, n_repeated=0, n_clusters_per_class=1,class_sep=2,random_state=42)
    
    # prepare data for plot
    df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
    data = df.groupby('label')
    
    # plot classification into a scatter plot
    fig, ax = plt.subplots()
    params = {0:'blue',1:'green'}
    for key, value in data:
        value.plot(ax=ax,kind='scatter',x='x',y='y',label=key,color=params[key])
    
    plt.title('Scatterplot using make_classification()')
    plt.show()
    
    

    Understand Scikit-learn machine learning models

    The basic steps to run a machine learning model in Scikit-learn are:

    1. Load the dataset
    2. Define the machine learning model object*
    3. Define the feature variables and the target
    4. Split into training and testing data
    5. Use .fit() to train the model. The process of training the model is also called “fitting”.
    6. Use .predict() to predict outcomes from the learned model
    7. Evaluate the model

    *Machine learning models in Scikit-learn are all defined as Python classes (see: Object Oriented Programming).

    A lot more steps can be added to make the model better, but we will focus on these as part of this tutorial.
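
    Because every estimator exposes the same .fit(), .predict() and .score() methods, the steps above look the same whatever model you choose. The sketch below runs them on the iris dataset with two classifiers picked arbitrarily for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Steps 1 and 3: load the dataset as features X and target y
X, y = load_iris(return_X_y=True)

# Step 4: split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 2: define the model objects
models = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=42),
]

scores = {}
for model in models:
    model.fit(X_train, y_train)       # Step 5: train
    y_pred = model.predict(X_test)    # Step 6: predict
    # Step 7: evaluate (mean accuracy on the test set)
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)
```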


    Run your First Machine Learning Model in Scikit-learn

    Now, let's train a basic machine learning model, a k-nearest neighbors classifier, with Scikit-learn. Its performance is evaluated later on using the classification report of sklearn.metrics.

    import pandas as pd
    
    from sklearn.datasets import load_breast_cancer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report
    
    # Load data
    dataset = load_breast_cancer()
    df = pd.DataFrame(dataset.data,columns=dataset.feature_names)
    df['target'] = pd.Series(dataset.target)
    
    # Define predictor and predicted datasets
    X = df.drop('target', axis=1).values
    y = df['target'].values
    
    # Choose the machine learning model
    knn = KNeighborsClassifier(n_neighbors=8)
    
    # Train the model
    knn.fit(X, y)
    
    # Compute the prediction
    y_pred = knn.predict(X)
    
    # Print first 10 prediction labels
    print(['Malignant' if x==0 else 'Benign' for x in y_pred[:10]])
    

    Congratulations, you have run your first machine learning model in Scikit-learn.

    This model, however, isn’t perfect.

    Since we used the entire dataset to predict, it is hard to evaluate its quality on unseen data.

    It would be best to split the dataset into training and testing data and evaluate the accuracy of the prediction on the test data.


    Split the Data into testing and training data

    To make sure our model is good when used on unseen data, it is good practice to split the dataset into training and testing datasets.

    We will:

    • split the data into training and testing sets
    • train the model on the training set, leaving the testing set as is
    • make a prediction using the test features
    • compare the predictions with the true labels of the untouched test set

    Building on the code above, we will use the train_test_split function to create the training and testing splits.

    On top of the X and y data, we pass two parameters to train_test_split: test_size and random_state.

    The test_size parameter sets the proportion of the data held out for testing, as a float between 0 and 1. With test_size=0.3, 30% of the observations go to the test set and the remaining 70% are used for training.

    The random_state parameter is used for reproducibility. It seeds the random number generator that shuffles the data, so that another user can run your code and get the same split, and therefore the same results.
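
    Before the full example, here is a short sketch that isolates these two parameters on the breast cancer dataset (569 observations).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # 569 observations

# test_size=0.3 holds out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 398 171

# Reusing the same random_state reproduces exactly the same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print((X_test == X_test_again).all())  # True
```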

    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    
    # Load data
    dataset = load_breast_cancer()
    df = pd.DataFrame(dataset.data,columns=dataset.feature_names)
    df['target'] = pd.Series(dataset.target)
    
    # Define predictor and predicted datasets
    X = df.drop('target', axis=1).values
    y = df['target'].values
    
    # Split into training and testing datasets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Choose the machine learning model
    knn = KNeighborsClassifier(n_neighbors=8)
    # Train the model
    knn.fit(X_train, y_train)
    
    # Compute the prediction
    y_pred = knn.predict(X_test)
    
    # Compute the accuracy of the model on the test set
    print(knn.score(X_test, y_test))
    
    

    This tells us that the accuracy of the model on the test set is:

    0.9649122807017544
    

    Evaluate the Model

    It is always important to evaluate the quality of the model.

    To do so, we will look at the classification report and the confusion matrix.

    Classification Report

    The classification report summarizes the quality of the predictions: it shows the precision, recall, F1-score and support for each class, along with the overall accuracy.

    from sklearn.metrics import classification_report
    
    print(classification_report(y_test, y_pred))
    

    Read this article if you don’t know how to interpret the classification report.
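
    To see where the numbers in the report come from, here is a small sketch on hand-made labels. The y_true and y_pred lists are hypothetical and chosen so that the values can be checked by hand.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 4 observations of class 0, 6 of class 1
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# output_dict=True returns the report as a dictionary instead of text
report = classification_report(y_true, y_pred, output_dict=True)

# Class 1: 4 true positives, 1 false positive, 2 false negatives
print(report["1"]["precision"])  # 4 / (4 + 1) = 0.8
print(report["1"]["recall"])     # 4 / (4 + 2) = 0.666...
print(report["1"]["support"])    # 6 true observations of class 1
print(report["accuracy"])        # 7 correct out of 10 = 0.7
```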

    Confusion Matrix

    The confusion matrix counts the true positives, false positives, true negatives and false negatives of your model's predictions.

    import matplotlib.pyplot as plt
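    from sklearn.metrics import ConfusionMatrixDisplay
    
    # plot_confusion_matrix was removed in scikit-learn 1.2;
    # ConfusionMatrixDisplay.from_estimator (scikit-learn 1.0+) replaces it
    color = 'white'  # suits a dark background; use 'black' on a light one
    matrix = ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test, cmap=plt.cm.Blues)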
    matrix.ax_.set_title('Confusion Matrix', color=color)
    plt.xlabel('Predicted Label', color=color)
    plt.ylabel('True Label', color=color)
    plt.gcf().axes[0].tick_params(colors=color)
    plt.gcf().axes[1].tick_params(colors=color)
    plt.show()
    

    Read this article if you don’t know how to interpret the confusion matrix.
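
    If you only need the counts rather than a plot, the confusion_matrix function returns them as an array. A minimal sketch on hypothetical hand-made labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 observations of class 0, 6 of class 1
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [2 4]]

# For a binary problem, ravel() unpacks the four counts
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 2 4
```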


    Advanced Scikit-learn

    Machine learning models can be improved in many ways. This article was just an overview of Scikit-learn, so I have created a series of tutorials to help you improve your Machine learning skills.

    1. Generate dummy data with Scikit-learn
    2. Data preprocessing
    3. Train test split
    4. Pipeline
    5. Classification with Sklearn
    6. Regression in scikit-learn
    7. Evaluation of the model


    Most common Scikit-Learn Modules

    • sklearn.datasets
      • fetch_openml
      • load_*
      • make_classification
    • sklearn.decomposition
    • sklearn.model_selection
    • sklearn.neighbors
    • sklearn.linear_model
    • sklearn.tree
    • sklearn.ensemble
      • VotingClassifier
      • RandomForestClassifier
      • RandomForestRegressor
      • BaggingClassifier
      • BaggingRegressor
      • AdaBoostClassifier
      • AdaBoostRegressor
      • GradientBoostingClassifier
      • GradientBoostingRegressor
      • StackingClassifier
    • sklearn.svm
      • LinearSVC
      • SVC
      • SVR
    • sklearn.naive_bayes
      • GaussianNB
    • sklearn.cluster
    • sklearn.impute
      • SimpleImputer
    • sklearn.feature_extraction.text
      • CountVectorizer
      • TfidfTransformer
      • TfidfVectorizer
    • sklearn.preprocessing
      • OneHotEncoder
      • MinMaxScaler
      • StandardScaler
    • sklearn.metrics
    • sklearn.pipeline
      • make_pipeline
      • Pipeline
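
    Several of these modules are typically combined. As a hedged sketch, the example below chains StandardScaler and KNeighborsClassifier with make_pipeline on the breast cancer dataset. Scaling the features before k-nearest neighbors is a common choice, not something this tutorial's earlier model did, so the accuracy may differ from the one reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline behaves like a single estimator: fit() scales the
# training data, then trains the classifier on the scaled features
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
model.fit(X_train, y_train)

# score() applies the scaling learned on the training set to the
# test set before predicting
accuracy = model.score(X_test, y_test)
print(accuracy)
```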

    Conclusion

    This is the end of the introduction to Scikit-learn. We have seen what Scikit-learn is and how to use it in your machine learning projects.
