{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SK Part 0: Introduction to Predictive Modeling with Python and Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the first in a series of tutorials on supervised machine learning with Python and `scikit-learn`. It is a short introduction that gives a bird's-eye view of the workflow using a binary classification problem as an example, and it is a simplified version of the tutorial **SK Part 1**. The reference textbook for these tutorials can be accessed [here](https://machinelearningbook.com/).\n", "\n", "The classifiers illustrated are as follows:\n", "* **Nearest neighbors** (Chapter on Similarity-based learning)\n", "* **Decision trees** (Chapter on Information-based learning)\n", "* **Random forests ensemble method** (Chapter on Information-based learning)\n", "* **Naive Bayes** (Chapter on Probability-based learning)\n", "* **Support vector machines** (Chapter on Error-based learning)\n", "\n", "As an overview, we shall cover various aspects of `scikit-learn` in the following tutorials:\n", "\n", "- **SK Part 0 (\"SK-Intro\"):** Introduction to machine learning with Python and scikit-learn (this tutorial)\n", "- **SK Part 1 (\"SK-Basics\"):** Basic model fitting\n", "- **SK Part 2 (\"SK-FS\"):** Feature selection and ranking\n", "- **SK Part 3 (\"SK-Eval\"):** Model evaluation (using performance metrics other than simple accuracy)\n", "- **SK Part 4 (\"SK-CV\"):** Cross-validation and hyper-parameter tuning\n", "- **SK Part 5 (\"SK-Pipes\"):** Machine learning pipeline, statistical model comparison, and model deployment\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary Classification Example: Breast Cancer Wisconsin Data\n", "\n", "This dataset is concerned with predicting whether a cell tissue sample is cancerous, using the cells' measurement values. It contains 569 observations and 30 input features. 
The target feature, \"diagnosis\", has two classes: 212 \"malignant\" and 357 \"benign\", denoted by \"M\" and \"B\" respectively.\n", "\n", "The dataset has no missing values, and all features are numeric other than the target feature (which is binary)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading Breast Cancer Dataset from the Cloud\n", "\n", "We load the data directly from a CSV file hosted on GitHub." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "# suppress all warnings to keep the notebook output clean\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import io\n", "import requests\n", "\n", "# display all columns when printing a DataFrame\n", "pd.set_option('display.max_columns', None)\n", "\n", "# read a CSV file hosted on GitHub\n", "# note: verify=False skips TLS certificate verification\n", "url_name = 'https://raw.githubusercontent.com/akmand/datasets/master/breast_cancer_wisconsin.csv'\n", "url_content = requests.get(url_name, verify=False).content\n", "df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the shape of this dataset to make sure it has been downloaded correctly." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(569, 31)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the first five rows of this raw dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | radius_error | texture_error | perimeter_error | area_error | smoothness_error | compactness_error | concavity_error | concave_points_error | symmetry_error | fractal_dimension_error | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension | diagnosis |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | M |\n",
"| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | M |\n",
"| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | M |\n",
"| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | M |\n",
"| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | M |\n", "