SK Part 0: Introduction to Predictive Modeling with Python and Scikit-Learn#
This is the first in a series of tutorials on supervised machine learning with Python and Scikit-Learn. It is a short introductory tutorial that provides a bird’s-eye view using a binary classification problem as an example, and it is a simplified version of the tutorial SK Part 1. The chapters mentioned below are from the reference textbook for these tutorials, which can be accessed here.
The classifiers illustrated are as follows:
Nearest neighbors (Chapter on Similarity-based learning)
Decision trees (Chapter on Information-based learning)
Random forests ensemble method (Chapter on Information-based learning)
Naive Bayes (Chapter on Probability-based learning)
Support vector machines (Chapter on Error-based learning)
As an overview, we shall cover various aspects of scikit-learn in the following tutorials:
SK Part 0 (“SK-Intro”): Introduction to machine learning with Python and scikit-learn (this tutorial)
SK Part 1 (“SK-Basics”): Basic model fitting
SK Part 2 (“SK-FS”): Feature selection and ranking
SK Part 3 (“SK-Eval”): Model evaluation (using performance metrics other than simple accuracy)
SK Part 4 (“SK-CV”): Cross-validation and hyper-parameter tuning
SK Part 5 (“SK-Pipes”): Machine learning pipeline, statistical model comparison, and model deployment
Binary Classification Example: Breast Cancer Wisconsin Data#
This dataset is concerned with predicting whether a tissue sample is cancerous using measurements of its cells. It contains 569 observations and 30 input features. The target feature, “diagnosis”, has two classes: 212 “malignant” and 357 “benign”, denoted by “M” and “B” respectively.
The dataset has no missing values and all features are numeric other than the target feature (which is binary).
Reading Breast Cancer Dataset from the Cloud#
We load the data directly from the following GitHub repository.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import io
import requests
# so that we can see all the columns
pd.set_option('display.max_columns', None)
# read a csv file from a GitHub repository
url_name = 'https://raw.githubusercontent.com/akmand/datasets/master/breast_cancer_wisconsin.csv'
# verify=False skips SSL certificate verification (can help behind some proxies)
url_content = requests.get(url_name, verify=False).content
df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
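As a side note, pandas can read a CSV directly from a URL, so the two-step download above can be collapsed into a single call. A minimal alternative sketch, assuming the raw GitHub URL is reachable over HTTPS:
import pandas as pd
# pandas fetches the file from the URL itself; no requests/io needed
url_name = 'https://raw.githubusercontent.com/akmand/datasets/master/breast_cancer_wisconsin.csv'
df = pd.read_csv(url_name)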
Let’s check the shape of this dataset to make sure it has been downloaded correctly.
df.shape
(569, 31)
Let’s have a look at the first 5 rows in this raw dataset.
df.head(5)
  | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | radius_error | texture_error | perimeter_error | area_error | smoothness_error | compactness_error | concavity_error | concave_points_error | symmetry_error | fractal_dimension_error | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension | diagnosis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | M |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | M |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | M |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | M |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | M |
Partitioning Dataset into the Set of Descriptive Features and the Target Feature#
Next, we partition the columns of df into the set of descriptive features and the target.
# The ".values" part below converts the data frame to a 2-dimensional numpy array
Data = df.drop(columns='diagnosis').values
target = df['diagnosis']
Encoding Target#
Keep in mind that scikit-learn requires all data to be numeric, so the target needs to be encoded as 0 and 1.
from sklearn import preprocessing
target = preprocessing.LabelEncoder().fit_transform(target)
Note that the LabelEncoder encodes labels in alphabetical order. That is, “B” is labeled as 0 whereas “M” is labeled as 1 (see the code below).
np.unique(target, return_counts = True)
(array([0, 1]), array([357, 212]))
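To confirm this mapping explicitly, note that a fitted LabelEncoder stores the original labels in its classes_ attribute, sorted alphabetically, and each label’s position is its encoded value. A quick sketch:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder().fit(df['diagnosis'])
# classes_ is sorted alphabetically; the position is the encoded value,
# so 'B' -> 0 and 'M' -> 1
print(label_encoder.classes_)  # ['B' 'M']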
Scaling Descriptive Features#
It is always a good idea to scale your descriptive features before fitting any models. Here, we use “min-max scaling” so that each descriptive feature is scaled to be between 0 and 1. In the rest of this tutorial, we work with the scaled data.
Data = preprocessing.MinMaxScaler().fit_transform(Data)
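Min-max scaling maps each feature value \(x\) to \((x - x_{\min}) / (x_{\max} - x_{\min})\). As a quick sanity check, every column of the scaled data should now have a minimum of 0 and a maximum of 1:
# each descriptive feature should now span exactly [0, 1]
print(Data.min(axis=0).min(), Data.max(axis=0).max())  # 0.0 1.0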
Splitting Data into Training and Test Sets#
We split the descriptive features and the target feature into a training set and a test set with a 70:30 ratio. That is, we use 70% of the data to build our classifiers and evaluate their performance on the remaining 30%. Measuring performance on data the models have never seen gives an honest estimate of how well they generalize, rather than rewarding models that merely overfit the training data. We also set a random state value so that we can replicate our results later on.
from sklearn.model_selection import train_test_split
D_train, D_test, t_train, t_test = train_test_split(Data,
target,
test_size = 0.3,
random_state=999)
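With 569 rows and a 70:30 split, we expect 398 training and 171 test observations, which we can confirm from the array shapes. As a side note, train_test_split also accepts a stratify argument that keeps the class proportions (roughly) the same in both sets; we do not use it in this tutorial, but a sketch would look like this:
# sizes of the two splits
print(D_train.shape, D_test.shape)  # (398, 30) (171, 30)
# optional stratified split (not used here): preserves the
# benign/malignant ratio in both the training and test sets
D_train_s, D_test_s, t_train_s, t_test_s = train_test_split(
    Data, target, test_size=0.3, stratify=target, random_state=999)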
Fitting a Nearest Neighbor Classifier#
Let’s fit a nearest neighbor classifier with 5 neighbors using the Euclidean distance. We fit the model on the train data and evaluate its performance on the test data.
Below, the score method returns the accuracy of the classifier on the test data. Accuracy is defined as the ratio of correctly predicted observations to the total number of observations.
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5, p=2)
knn_classifier.fit(D_train, t_train)
knn_classifier.score(D_test, t_test)
0.9707602339181286
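The value returned by score is exactly this accuracy. As a sanity check, you can reproduce it by comparing the model’s test predictions against the true labels, either manually or via accuracy_score:
from sklearn.metrics import accuracy_score
t_pred = knn_classifier.predict(D_test)
# fraction of correctly predicted test observations;
# both lines should match the score value above
print(np.mean(t_pred == t_test))
print(accuracy_score(t_test, t_pred))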
If you would like to see which parameters are available for a classifier, type the name of the classifier followed by a question mark in a Jupyter notebook or IPython session, e.g., “KNeighborsClassifier?”.
# KNeighborsClassifier?
Fitting a Decision Tree Classifier#
Let’s fit a decision tree classifier with the entropy split criterion and a maximum depth of 4 on the train data, and then evaluate its performance on the test data.
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(criterion='entropy', max_depth=4)
dt_classifier.fit(D_train, t_train)
dt_classifier.score(D_test, t_test)
0.9298245614035088
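A nice property of decision trees is that the fitted model can be printed as human-readable if-then rules. Here is a small sketch using scikit-learn’s export_text; the exact rules you see will depend on the random split:
from sklearn.tree import export_text
feature_names = df.drop(columns='diagnosis').columns.tolist()
# print the top two levels of the fitted tree as if-then rules
print(export_text(dt_classifier, feature_names=feature_names, max_depth=2))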
Fitting a Random Forest Classifier#
An ensemble method is a collection of many sub-classifiers, with the final outcome determined by a vote among them. The random forest classifier is a popular ensemble method based on the idea of “bagging”, where the sub-classifiers are decision trees. Let’s fit a random forest classifier with 100 decision trees.
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_classifier.fit(D_train, t_train)
rf_classifier.score(D_test, t_test)
0.9590643274853801
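Under the hood, the fitted forest keeps its individual trees in the estimators_ attribute, and the ensemble is usually more accurate than its individual trees. A rough sketch to see this; exact numbers will vary from run to run because we did not fix a random state for the forest:
# accuracy of each individual tree vs. the full ensemble
tree_scores = [tree.score(D_test, t_test) for tree in rf_classifier.estimators_]
print('mean single-tree accuracy:', np.round(np.mean(tree_scores), 3))
print('forest accuracy:', np.round(rf_classifier.score(D_test, t_test), 3))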
Fitting a Gaussian Naive Bayes Classifier#
Another model we would like to fit to the breast cancer dataset is the Gaussian Naive Bayes classifier with a variance smoothing value of \(10^{-3}\).
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB(var_smoothing=10**(-3))
nb_classifier.fit(D_train, t_train)
nb_classifier.score(D_test, t_test)
0.935672514619883
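Gaussian Naive Bayes assumes that, within each class, every descriptive feature follows a normal distribution. The fitted per-class means and variances are stored on the estimator. A small sketch for inspecting them; note that older scikit-learn versions expose the variances as sigma_ rather than var_:
# per-class feature means: one row per class, one column per feature
print(nb_classifier.theta_.shape)  # (2, 30)
# per-class feature variances (the var_smoothing term is added to these)
print(nb_classifier.var_.shape)    # (2, 30)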
Fitting a Support Vector Machine#
The last model we fit is a support vector machine (SVM) with all the default values.
from sklearn.svm import SVC
svm_classifier = SVC()
svm_classifier.fit(D_train, t_train)
svm_classifier.score(D_test, t_test)
0.9883040935672515
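Since we relied on the defaults here, we can inspect exactly what they were with the get_params method, which every scikit-learn estimator provides:
# the defaults include C=1.0 and kernel='rbf'
print(svm_classifier.get_params())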
Making Predictions with a Fitted Model#
Once a model is built, a prediction can be made using the predict method of the fitted classifier.
For example, suppose we would like to use the fitted nearest neighbor classifier as our model, and we would like to find out the model’s prediction for the first three rows in the input data. Of course, we already know the labels of these rows (which are all malignant), so this is just to illustrate how you would make a prediction for a new observation.
new_obs = Data[0:3]
knn_classifier.predict(new_obs)
array([1, 1, 1])
The model’s prediction for these three rows is that they are all “1”, that is, they are all “malignant”. Thus, in this particular case, we observe that the model correctly predicts the first three rows in the input data.
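If you would rather report predictions using the original string labels, keep a reference to a fitted LabelEncoder and use its inverse_transform method. A minimal sketch; since we called fit_transform above without saving the encoder, we fit a fresh one here:
label_encoder = preprocessing.LabelEncoder().fit(df['diagnosis'])
# map the numeric predictions back to 'B'/'M'
print(label_encoder.inverse_transform(knn_classifier.predict(new_obs)))  # ['M' 'M' 'M']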
Summary#
This tutorial illustrates that Python and Scikit-Learn together provide a unified interface to model fitting and evaluation, and that they greatly simplify the machine learning workflow.
Of course, there is a whole lot more to supervised machine learning than what is shown here, such as:
Other classification algorithms
Solving prediction problems where the target feature is numeric (a.k.a. regression problems)
Using other model performance metrics (e.g., precision, recall, and mean squared error for regression)
More sophisticated model performance assessment methods (such as cross-validation)
How model parameters can be optimized (also known as hyperparameter tuning)
We cover these topics in the subsequent tutorials.