{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Normal distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we'll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we'll use the graphical tools of Python to assess the normality of a dataset and also learn how to generate random numbers from a normal distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import io\n", "import requests\n", "\n", "df_url = 'https://raw.githubusercontent.com/akmand/datasets/master/openintro/bdims.csv'\n", "url_content = requests.get(df_url, verify=False).content\n", "bdims = pd.read_csv(io.StringIO(url_content.decode('utf-8')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick peek at the first few rows of the data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(507, 25)\n" ] }, { "data": { "text/html": [ "
|   | bia.di | bii.di | bit.di | che.de | che.di | elb.di | wri.di | kne.di | ank.di | sho.gi | che.gi | wai.gi | nav.gi | hip.gi | thi.gi | bic.gi | for.gi | kne.gi | cal.gi | ank.gi | wri.gi | age | wgt | hgt | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42.9 | 26.0 | 31.5 | 17.7 | 28.0 | 13.1 | 10.4 | 18.8 | 14.1 | 106.2 | 89.5 | 71.5 | 74.5 | 93.5 | 51.5 | 32.5 | 26.0 | 34.5 | 36.5 | 23.5 | 16.5 | 21 | 65.6 | 174.0 | 1 |
| 1 | 43.7 | 28.5 | 33.5 | 16.9 | 30.8 | 14.0 | 11.8 | 20.6 | 15.1 | 110.5 | 97.0 | 79.0 | 86.5 | 94.8 | 51.5 | 34.4 | 28.0 | 36.5 | 37.5 | 24.5 | 17.0 | 23 | 71.8 | 175.3 | 1 |
| 2 | 40.1 | 28.2 | 33.3 | 20.9 | 31.7 | 13.9 | 10.9 | 19.7 | 14.1 | 115.1 | 97.5 | 83.2 | 82.9 | 95.0 | 57.3 | 33.4 | 28.8 | 37.0 | 37.3 | 21.9 | 16.9 | 28 | 80.7 | 193.5 | 1 |
| 3 | 44.3 | 29.9 | 34.0 | 18.4 | 28.2 | 13.9 | 11.2 | 20.9 | 15.0 | 104.5 | 97.0 | 77.8 | 78.8 | 94.0 | 53.0 | 31.0 | 26.2 | 37.0 | 34.8 | 23.0 | 16.6 | 23 | 72.6 | 186.5 | 1 |
| 4 | 42.5 | 29.9 | 34.0 | 21.5 | 29.4 | 15.2 | 11.6 | 20.7 | 14.9 | 107.5 | 97.5 | 80.0 | 82.5 | 98.5 | 55.4 | 32.0 | 28.4 | 37.7 | 38.6 | 24.4 | 18.0 | 22 | 78.8 | 187.2 | 1 |
`sim_norm`. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
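As a minimal sketch of the exercise above, the following draws a simulated normal sample named `sim_norm` and computes the coordinates of its normal probability plot with `scipy.stats.probplot`. The mean, standard deviation, and sample size are placeholder assumptions, not values computed from `bdims`; in the tutorial they would come from the real data.

```python
import numpy as np
from scipy import stats

# Hypothetical parameters standing in for the sample mean and standard
# deviation of the real data; they are not taken from the bdims dataset.
assumed_mean, assumed_sd, n = 165.0, 6.5, 260

rng = np.random.default_rng(seed=0)
sim_norm = rng.normal(loc=assumed_mean, scale=assumed_sd, size=n)

# With no plot argument, probplot only computes the plot's coordinates:
# theoretical normal quantiles (osm), sorted sample values (osr),
# and the least-squares line fitted through them.
(osm, osr), (slope, intercept, r) = stats.probplot(sim_norm, dist="norm")

print(f"line: slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

Passing `plot=matplotlib.pyplot` (or a Matplotlib `Axes`) to `probplot` draws the points and the fitted line. Even for truly normal data, a few points in the tails usually stray from the line.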
"fdims['hgt']
look similar to the plots created for the simulated data? That is, do plots provide evidence that the female heights are nearly normal?\n",
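One way to make this comparison less subjective is to simulate several normal samples with the same mean, standard deviation, and size as the observed data, and see whether the observed probability plot looks like one of them. The sketch below is only an illustration: `observed` is synthetic stand-in data (not the real `fdims['hgt']` values), and `probplot_r` is a hypothetical helper that summarises each plot by the correlation of its points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Stand-in for fdims['hgt']: synthetic values, NOT the real measurements.
observed = rng.normal(loc=165.0, scale=6.5, size=260)


def probplot_r(sample):
    """Correlation between sorted sample values and normal quantiles."""
    _, (_, _, r) = stats.probplot(sample, dist="norm")
    return r


# Simulate several normal samples with the observed mean, sd, and size.
sim_r = [
    probplot_r(rng.normal(observed.mean(), observed.std(ddof=1), observed.size))
    for _ in range(8)
]
obs_r = probplot_r(observed)

print(f"observed r = {obs_r:.4f}")
print(f"simulated r range = [{min(sim_r):.4f}, {max(sim_r):.4f}]")
```

If the observed value sits inside or near the simulated range, the data look about as straight on the plot as genuinely normal samples of that size. The correlation is only a rough summary and does not replace looking at the plots themselves.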
"bii.di
) belongs to normal probability plot letter ____.elb.di
) belongs to normal probability plot letter ____.age
) belongs to normal probability plot letter ____.che.de
) belongs to normal probability plot letter ____.kne.di
). Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
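The sketch below illustrates how skewness shows up in a normal probability plot and how a histogram can confirm it. It uses a synthetic right-skewed sample as an assumption for illustration; these are not the `kne.di` values, and the result says nothing about the real variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# Synthetic right-skewed sample standing in for a body measurement;
# these are NOT the kne.di values.
skewed = rng.lognormal(mean=3.0, sigma=0.4, size=260)

(osm, osr), (slope, intercept, r) = stats.probplot(skewed, dist="norm")

# Residuals from the fitted line. With sample values on the vertical axis,
# right-skewed data tend to sit above the line in both tails, so the
# points trace a curve that bends upward.
resid = osr - (slope * osm + intercept)
print("lower tail mean residual:", resid[:10].mean())
print("upper tail mean residual:", resid[-10:].mean())

# Cross-check with the sample skewness and the histogram bin counts.
print("sample skewness:", stats.skew(skewed))
counts, edges = np.histogram(skewed, bins=10)
print("histogram counts:", counts)
```

A left-skewed variable would show the opposite pattern, with both tails falling below the line, while a symmetric, nearly normal variable would scatter around the line with no systematic bend.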