{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Inference for numerical data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## North Carolina births" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the `nc` data set into our notebook." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import io\n", "import requests\n", "\n", "df_url = 'https://raw.githubusercontent.com/akmand/datasets/master/openintro/nc.csv'\n", "url_content = requests.get(df_url, verify=False).content\n", "nc = pd.read_csv(io.StringIO(url_content.decode('utf-8')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| variable | description |\n", "| ---------------- | ------------|\n", "| `fage` | father's age in years. |\n", "| `mage` | mother's age in years. |\n", "| `mature` | maturity status of mother. |\n", "| `weeks` | length of pregnancy in weeks. |\n", "| `premie` | whether the birth was classified as premature (premie) or full-term. |\n", "| `visits` | number of hospital visits during pregnancy. |\n", "| `marital` | whether mother is `married` or `not married` at birth. |\n", "| `gained` | weight gained by mother during pregnancy in pounds. |\n", "| `weight` | weight of the baby at birth in pounds. |\n", "| `lowbirthweight` | whether baby was classified as low birthweight (`low`) or not (`not low`). |\n", "| `gender` | gender of the baby, `female` or `male`. |\n", "| `habit` | status of the mother as a `nonsmoker` or a `smoker`. |\n", "| `whitemom` | whether mom is `white` or `not white`. |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n",
"| | fage | mage | weeks | visits | gained | weight |\n",
"|---|---|---|---|---|---|---|\n",
"| count | 829.000000 | 1000.000000 | 998.000000 | 991.000000 | 973.000000 | 1000.00000 |\n",
"| mean | 30.255730 | 27.000000 | 38.334669 | 12.104945 | 30.325797 | 7.10100 |\n",
"| std | 6.763766 | 6.213583 | 2.931553 | 3.954934 | 14.241297 | 1.50886 |\n",
"| min | 14.000000 | 13.000000 | 20.000000 | 0.000000 | 0.000000 | 1.00000 |\n",
"| 25% | 25.000000 | 22.000000 | 37.000000 | 10.000000 | 20.000000 | 6.38000 |\n",
"| 50% | 30.000000 | 27.000000 | 39.000000 | 12.000000 | 30.000000 | 7.31000 |\n",
"| 75% | 35.000000 | 32.000000 | 40.000000 | 15.000000 | 38.000000 | 8.06000 |\n",
"| max | 55.000000 | 50.000000 | 45.000000 | 30.000000 | 85.000000 | 11.75000 |\n",
"
`habit` and `weight`. What does the plot highlight about the relationship between these two variables?\n",
"`groupby` command above but replacing `mean` with `size`.\n",
"`weeks`) and interpret it in context. Note that since you're doing inference on a single population parameter, there is no explanatory variable, so you can omit the `x` variable from the function.
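
As a rough illustration of the two steps mentioned above (group sizes via `groupby` with `size`, and a 95% interval for a single mean), here is a minimal sketch. It does not use the notebook's own inference function, which is not shown in this section. Instead it assumes a t-based interval computed with `scipy.stats`, a hypothetical helper named `mean_ci`, and a small made-up data frame `demo` standing in for `nc`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the `nc` data frame; the values are made up for illustration.
demo = pd.DataFrame({
    "habit": ["nonsmoker", "nonsmoker", "smoker", "nonsmoker", "smoker", "nonsmoker"],
    "weeks": [39.0, 40.0, 37.0, np.nan, 38.0, 41.0],
})

# Group sizes: the same groupby pattern used for the mean, with size() in its place.
# Note that size() counts rows, including rows where `weeks` is missing.
group_sizes = demo.groupby("habit").size()


def mean_ci(values, level=0.95):
    """Return a t-based confidence interval (lower, upper) for the mean of one sample.

    Hypothetical helper for this sketch; missing values are dropped first.
    """
    x = pd.Series(values).dropna()
    n = x.size
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)                   # standard error of the mean
    t_crit = stats.t.ppf((1 + level) / 2, df=n - 1)   # two-sided critical value
    return mean - t_crit * se, mean + t_crit * se


lower, upper = mean_ci(demo["weeks"])
print(group_sizes)
print(f"95% CI for mean weeks: ({lower:.2f}, {upper:.2f})")
```

With the real data, the same call would be `mean_ci(nc["weeks"])`. Only one column is passed because a single population mean has no explanatory variable.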