SK Part 1: Basic Modeling#

This tutorial’s topic is basic model fitting using a train-test-split approach (also known as “hold-out sampling”).

Learning Objectives#

  • Illustrate three examples of supervised machine learning:

    • Binary classification

    • Regression

    • Multinomial (a.k.a. multiclass) classification (as an exercise with solutions provided)

  • Split the data into a training set and a test set

  • Fit and evaluate a nearest neighbor model

  • Fit and evaluate a decision-tree model

  • Fit and evaluate a Gaussian Naive Bayes model

Supervised Learning Tasks #

In line with our textbook’s notation, supervised learning is a machine learning task that uses a set of descriptive features \(D\) to predict a target feature \(t\). Note that the Scikit-Learn documentation and many machine learning books use \(X\) and \(y\) to denote the input dataset and the target feature, respectively.

Three Common Types of Supervised Learning Tasks #

The three common types of target feature \(t\) are as follows:

  1. Continuous targets. For example, house prices; loan amounts.

  2. Binary targets. For instance, whether a patient has Type 2 diabetes or not; whether a loan will default or not.

  3. Multinomial (a.k.a. multiclass) targets. For example, five-level Likert items such as “very poor”, “poor”, “average”, “good” and “very good”.

Let’s get familiar with some terminology. When the target feature is continuous, we call it a “regression problem”, and the predictive model is called a “regressor”. If the target feature is binary or multinomial, we call it a “classification problem”, and the model is called a “classifier”. In fact, a binary target is a special case of a multinomial target with only two classes.

Other Types of Supervised Learning Tasks #

Before we proceed further, it is worth mentioning other types of target features that we shall not cover:

  • Count targets, such as number of road accidents in Victoria.

  • Multilabel targets. Suppose we conduct a survey asking RMIT students “Why do you love Melbourne?”. Possible answers include “coffee”, “nice weather”, “nice food”, or “friendly people”. The answers to this survey are an example of a “multilabel” target: the labels are not mutually exclusive, as a participant could select more than one answer, for example (“coffee”, “nice weather”), (“coffee”), (“nice food”, “friendly people”), or all of the above.

  • Proportional targets, which are continuous, but strictly between 0 and 1, or equivalently between 0% and 100%. For example, loan default probability, or probability of a customer buying a certain product.

Overview of Examples #

To reiterate, we shall focus on continuous, binary, and multinomial targets in this and upcoming tutorials using the sample datasets below:

  1. Breast Cancer Wisconsin Data. The target feature is binary, i.e., whether a diagnosis is “malignant” or “benign”.

  2. California Housing Data. The target feature is continuous: the median house value in each California district.

  3. Wine Data. The target feature is multinomial. It consists of three classes of wines in a particular region in Italy.

These datasets can be loaded from sklearn. Let’s go through Breast Cancer Data and California Housing Data. We shall leave Wine Data as an exercise (with possible solutions).

Binary Classification Example: Breast Cancer Wisconsin Data #

This dataset contains 569 observations and has 30 input features. The target feature has two classes: 212 “malignant” (M) and 357 “benign” (B).

Preparing Data for Modeling #

We first load the data from sklearn as follows.

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn import preprocessing

df = load_breast_cancer()

Data, target = df.data, df.target

Let’s scale each descriptive feature to be between 0 and 1 before fitting any classifiers.

Data = preprocessing.MinMaxScaler().fit_transform(Data)

The target feature is already encoded. Let’s check.

np.unique(target, return_counts = True)
(array([0, 1]), array([212, 357]))

However, we would like “malignant” to be the positive class (1) and “benign” to be the negative class (0). So we use the “where” function as below to reverse the labels.

target = np.where(target==0, 1, 0)

Let’s check to make sure the labels are now reversed.

np.unique(target, return_counts = True)
(array([0, 1]), array([357, 212]))

Splitting Data into Training and Test Sets #

We split the descriptive features and the target feature into a training set and a test set by a ratio of 70:30. That is, we use 70% of the data to build the classifier and the remaining 30% to evaluate its performance.

To split the data, we use the train_test_split function from sklearn.

In a classification problem, we might have an uneven proportion of classes. In the breast cancer example, the target has 212 “M” and 357 “B” observations. Therefore, when splitting the data into training and test sets, the class proportions in these two sets might end up different from the original one. To ensure that both sets preserve the original 212:357 proportion, we set the stratify option in the train_test_split function to the target array.

Furthermore, in order to be able to replicate our analysis later on, we set the random_state option to 999.

Finally, in order to ensure the data is split randomly, we set the shuffle option to “True” (which, by the way, is “True” by default).

from sklearn.model_selection import train_test_split

# The "\" character below allows us to split the line across multiple lines
D_train, D_test, t_train, t_test = \
    train_test_split(Data, target, test_size = 0.3, 
                     stratify=target, shuffle=True, random_state=999)
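
As a quick sanity check (not part of the original workflow, but a useful habit), we can confirm that stratification preserved the class proportions in both sets; something along the following lines should show roughly the same 0.63/0.37 split in each:

# Class counts and proportions in the training and test sets
for name, labels in [("train", t_train), ("test", t_test)]:
    values, counts = np.unique(labels, return_counts=True)
    print(name, counts, counts / counts.sum())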

Fitting a Nearest Neighbor Classifier #

Let’s try a nearest neighbor classifier with 2 neighbors using the Euclidean distance.

from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=2, p=2)

We can now go ahead and fit the classifier on the train data and evaluate its performance on the test data. Let’s first fit the nearest neighbor classifier on the training set.

# we put a ";" at the end to supress the line's output
knn_classifier.fit(D_train, t_train);

Done! We have created a nearest neighbor classifier. We shall use accuracy to evaluate this classifier on the test set. The accuracy metric is defined as:

\[\text{Accuracy} = \frac{\text{Number of correct predicted labels}}{\text{Number of total observations}}\]

In order to evaluate the performance of our classifier on the test data, we use the score method and set X = D_test and y = t_test.

knn_classifier.score(X=D_test, y=t_test)
0.9707602339181286

The nearest neighbor classifier scores an accuracy rate of 97% in this particular case on the test data. That is impressive.
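
To see what the score method computes under the hood, here is a small sketch (the variable name is illustrative, not from the original tutorial) that reproduces the same accuracy directly from the definition above:

# Accuracy computed manually: the fraction of test labels predicted correctly
t_pred_knn = knn_classifier.predict(D_test)
print(np.mean(t_pred_knn == t_test))  # should match knn_classifier.score(D_test, t_test)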

Fitting a Decision Tree Classifier #

Let’s say we want to fit a decision tree with a maximum depth of 4 (max_depth = 4) using information gain as the split criterion (criterion = 'entropy'). For reproducibility, we set random_state = 999.

from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(max_depth=4,
                                       criterion='entropy',
                                       random_state = 999)

Now let’s fit the decision tree on the training set.

dt_classifier.fit(D_train, t_train);
dt_classifier.score(D_test, t_test)
0.9415204678362573

The decision tree predicts the correct labels on the test set with an accuracy rate of 94%. However, there are other performance metrics, such as precision, recall, and F1 score, to assess model performance from different angles. We shall revisit model evaluation in tutorial SK Part 4: Evaluation.
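
As a small preview (the details are left to that tutorial), scikit-learn’s classification_report conveniently reports precision, recall, and F1 score for each class; for example:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 score for the decision tree on the test set
print(classification_report(t_test, dt_classifier.predict(D_test)))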

Fitting a Gaussian Naive Bayes Classifier #

One last model we would like to fit to the breast cancer dataset is the Gaussian Naive Bayes classifier with a variance smoothing value of \(10^{-3}\).

from sklearn.naive_bayes import GaussianNB

nb_classifier = GaussianNB(var_smoothing=10**(-3))
nb_classifier.fit(D_train, t_train)
nb_classifier.score(D_test, t_test)
0.9532163742690059

We observe that the accuracies of the Gaussian Naive Bayes and decision tree classifiers are slightly lower than that of the nearest neighbor classifier.

We would have to perform multiple runs in a cross-validation setting and then conduct a “paired t-test” in order to determine if this difference is statistically significant or not.

We shall cover this important topic in the SK Part 5 tutorial.
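
As a rough sketch of that idea (the fold count and variable names here are illustrative; SK Part 5 covers the topic properly), one could score two of the classifiers on the same cross-validation folds and then run a paired t-test on the fold-by-fold accuracy scores, for example with scipy.stats.ttest_rel:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from scipy import stats

# Score both classifiers on the same 5 stratified folds so the scores are paired
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
knn_scores = cross_val_score(knn_classifier, Data, target, cv=cv)
nb_scores = cross_val_score(nb_classifier, Data, target, cv=cv)

# Paired t-test on the per-fold accuracies of the two classifiers
print(stats.ttest_rel(knn_scores, nb_scores))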

Regression Example: California Housing Data #

Reading and Splitting Data #

The California Housing Data is available within sklearn datasets. Let’s load the dataset and use 70% of the data for training and the remaining 30% for testing. The goal is to build a decision tree regressor to predict the median house value (in hundreds of thousands of dollars) in each California district. The input data has been cleaned.

from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()

housing_data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

The housing_data dictionary has two keys that we need: data and target, both as NumPy arrays. To see the first few rows of the data and the target, we can use array slicing.

housing_data.data[:3,]
array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02]])
housing_data.target[:3,]
array([4.526, 3.585, 3.521])

Let’s split both the data and the target into train and test respectively.

from sklearn.model_selection import train_test_split
D_train, D_test, t_train, t_test = \
    train_test_split(housing_data.data, housing_data.target, test_size = 0.3,
        shuffle=True, random_state=999)

Fitting and Evaluating a Regressor #

We create a decision tree regressor object (DecisionTreeRegressor) with a maximum depth of 4. Since this is a regression problem, we cannot evaluate the model using accuracy. Instead, we evaluate the regressor using the mean squared error (MSE) performance metric. The MSE is given as:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n}(\hat{t}_{i} - t_{i})^2\]

where

  • \(n\) is the total number of observations in the dataset (either the training or the test set).

  • \(t_{i}\) is the actual target value for the \(i^{\text{th}}\) instance.

  • \(\hat{t}_{i}\) is the predicted target value for the \(i^{\text{th}}\) instance.

A lower MSE value indicates a smaller difference between predicted and actual values on average, and thus better prediction performance.

from sklearn.tree import DecisionTreeRegressor

dt_regressor = DecisionTreeRegressor(max_depth = 4, random_state = 999)
dt_regressor.fit(D_train, t_train)
DecisionTreeRegressor(max_depth=4, random_state=999)

To compute MSE, we first need to predict on the test set.

t_pred = dt_regressor.predict(D_test)

Next, we import mean_squared_error from sklearn.metrics module and compute MSE using the predicted and test target feature values.

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(t_test, t_pred)

round(mse, 3)
np.float64(0.553)
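
Equivalently, as a small sketch to connect the code back to the formula above, the same value can be computed directly from the definition:

# MSE computed manually from its definition
print(round(np.mean((t_pred - t_test) ** 2), 3))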

It is more intuitive to examine the root of MSE, which is denoted by RMSE, rather than MSE itself as RMSE is in the same units as the target feature.

round(np.sqrt(mse), 3)
np.float64(0.743)

Exercises #

Problems #

  1. On the breast cancer dataset, check if the accuracy score improves when we increase max depth from 4 to 5. Note: In upcoming tutorials, we shall demonstrate how to search for the optimal set of parameters such as max depth to improve model accuracy.

  2. Refresher questions for Pandas and Matplotlib:

    • Read the Wine Data by importing load_wine from sklearn.datasets and calling it.

    • Plot a bar chart for target wine classes.

    • Calculate means of all numeric variables for each wine class. Are mean values very different among wine classes for some numeric variables?

  3. Build a decision tree classifier for Wine Data and calculate the accuracy score.

Possible Solutions #

Problem 1

# Load and split the data using stratification

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
cancer_df = load_breast_cancer()
Data, target = cancer_df.data, cancer_df.target

D_train, D_test, t_train, t_test = \
    train_test_split(Data, target, 
        test_size = 0.3, stratify = target)

# Calculate the counts for each label in test and training sets
test_counts  = np.unique(t_test, return_counts = True)
train_counts = np.unique(t_train, return_counts = True)

print('The class proportions in the test set are ' + 
    str(test_counts[1]/sum(test_counts[1])))
print('The class proportions in the training set are ' + 
    str(train_counts[1]/sum(train_counts[1])))

decision_tree1 = DecisionTreeClassifier(max_depth = 4,
                                        criterion = 'entropy',
                                        random_state = 999)
decision_tree2 = DecisionTreeClassifier(max_depth = 5,
                                        criterion = 'entropy',
                                        random_state = 999)
decision_tree1.fit(D_train, t_train)
decision_tree2.fit(D_train, t_train)

print(decision_tree1.score(X = D_test, y = t_test))
print(decision_tree2.score(X = D_test, y = t_test))

Problems 2 and 3

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

Data, target = wine.data, wine.target
print(np.unique(wine.target, return_counts = True))

# prepare for plotting
import matplotlib.pyplot as plt
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

# Draw the bar chart
target_counts = np.unique(target, return_counts = True)
plt.bar(target_counts[0], target_counts[1])
plt.xlabel('Wine type')
plt.ylabel('Counts')
plt.show();

# Get means of all numeric variables for each target
import pandas as pd
all_data = pd.DataFrame(wine.data)
all_data['target'] = target
pd.pivot_table(all_data, index="target", aggfunc="mean")

# Build the model and calculate the accuracy score.
D_train, D_test, t_train, t_test = \
    train_test_split(Data, target, test_size = 0.3, stratify = target)

decision_tree = DecisionTreeClassifier(max_depth = 4,
                                       criterion = 'entropy',
                                       random_state = 999)
decision_tree.fit(D_train, t_train)
print(decision_tree.score(X = D_test, y = t_test))