{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Foundations for statistical inference - Sampling distributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We're interested in formulating a *sampling distribution* of our estimate in order to learn about the properties of the estimate, such as its distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor's office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import io\n", "import requests\n", "\n", "df_url = 'https://raw.githubusercontent.com/akmand/datasets/master/openintro/ames.csv'\n", "url_content = requests.get(df_url, verify=False).content\n", "ames = pd.read_csv(io.StringIO(url_content.decode('utf-8')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick peek at the first few rows of the data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Order | PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | ... | Pool.Area | Pool.QC | Fence | Misc.Feature | Misc.Val | Mo.Sold | Yr.Sold | Sale.Type | Sale.Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |

5 rows × 82 columns

Take a second sample and call it `samp2`. How does the mean of `samp2` compare with the mean of `samp1`? Suppose we took two more samples, one of size 100 and one of size 1000. Which do you think would provide a more accurate estimate of the population mean?

How many elements are there in `sample_means50`? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

Initialize an object called `sample_means_small`. Run a loop that takes a sample of size 50 from `area` and stores the sample mean in `sample_means_small`, but only iterate from 1 to 100. Print the output. How many elements are there in this object called `sample_means_small`? What does each element represent?

So far, we have focused only on estimating the mean living area of homes in Ames. Now you’ll try to estimate the mean home price.
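
The smaller loop over living-area samples described in the exercise above can be sketched roughly as follows. This is only an illustration under stated assumptions: the array `area` here is a synthetic, right-skewed stand-in generated with NumPy (not the Ames living-area column), and the seed and gamma parameters are arbitrary. In the lab itself, `area` would come from the `ames` data frame.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in population; NOT the real Ames living-area data.
area = rng.gamma(shape=9.0, scale=165.0, size=3000)

# One slot per iteration of the loop.
sample_means_small = np.zeros(100)

# Python's range(100) runs i = 0..99, i.e. 100 iterations in total.
for i in range(100):
    samp = rng.choice(area, size=50, replace=False)  # sample of size 50
    sample_means_small[i] = samp.mean()              # store that sample's mean

print(sample_means_small)
print(len(sample_means_small))
```

Each entry of `sample_means_small` is the mean of one independent sample of 50 values, so the length of the array equals the number of loop iterations, not the sample size.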

Take a random sample from `price`. Using this sample, what is your best point estimate of the population mean?

Simulate the sampling distribution of the mean home price for samples of size 50 and store the sample means in `sample_means50`. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

Repeat the simulation with samples of size 150 and store the sample means in `sample_means150`. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
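
One way to set up the comparison between sample sizes 50 and 150 is sketched below. Everything specific here is an assumption made for illustration: `price` is a synthetic log-normal stand-in (not the Ames `SalePrice` column), `simulate_sample_means` is a hypothetical helper, and the number of repetitions (2000) and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical right-skewed stand-in for sale prices; NOT the real Ames data.
price = rng.lognormal(mean=12.0, sigma=0.4, size=3000)


def simulate_sample_means(population, n, reps, rng):
    """Return `reps` sample means, each from a sample of size `n`
    drawn without replacement from `population`."""
    means = np.empty(reps)
    for i in range(reps):
        means[i] = rng.choice(population, size=n, replace=False).mean()
    return means


sample_means50 = simulate_sample_means(price, n=50, reps=2000, rng=rng)
sample_means150 = simulate_sample_means(price, n=150, reps=2000, rng=rng)

# Compare center and spread of the two sampling distributions.
print("population mean:", round(price.mean(), 1))
for name, means in [("n=50", sample_means50), ("n=150", sample_means150)]:
    print(f"{name:6s} center={means.mean():10.1f}  spread (sd)={means.std():8.1f}")
```

Under these assumptions both sets of sample means center near the population mean, while the means from the larger samples vary less. A histogram of each array (for example with `matplotlib.pyplot.hist`) would show the same thing visually.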