Predicting Age in Census Data#
Introduction#
The objective of this toy project is to predict the age of an individual
in the 1994 US Census data using multiple linear regression. We use the
Statsmodels and Patsy modules for this task, with Python version
>= 3.6. The dataset was sourced from the UCI Machine Learning
Repository at https://archive.ics.uci.edu/ml/datasets/adult (Lichman,
2013).
This report is organized as follows:
- The Overview section describes the dataset and its features.
- The Data Preparation section covers the data cleaning and preparation steps.
- The Data Exploration section explores the features and their inter-relationships.
- The Statistical Modeling and Performance Evaluation section first fits a full multiple linear regression model and performs diagnostic checks. Next, we perform backwards variable selection using p-values to obtain a reduced model, on which we run another set of diagnostic checks.
- The Summary and Conclusions section summarizes our work and presents our findings.
Overview#
Data Source#
The UCI Machine Learning Repository provides five files for this
dataset, but only adult.data, adult.test, and adult.names were useful in
this project. The adult.data and adult.test files are the training and
test datasets respectively, while the adult.names file contains the
details of the variables (a.k.a. features or attributes). The training
dataset has 32,561 observations (a.k.a. instances or records) and the
test dataset has 16,281 observations. Both datasets consist of 14
descriptive (a.k.a. independent) features and one target (a.k.a.
response or dependent) feature. In this project, we combine the training
and test data into one dataset.
Project Objective#
Our goal is to see whether we can predict an individual’s age within an acceptable margin of error using multiple linear regression, primarily with main effects only.
Target Feature#
Our target feature is age, which is a continuous numerical feature.
Hence, this is a regression problem.
Descriptive Features#
The variable descriptions below are from the adult.names
file:
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- income: binary, 1: earns over $50k a year, 0: earns less than $50k a year.
Most of the descriptive features are self-explanatory, except fnlwgt,
which stands for “Final Weight” as defined by the US Census. This weight is
an “estimate of the number of units in the target population that the
responding unit represents” (Lichman, 2013). This feature aims to
allocate similar weights to people with similar demographic
characteristics.
Data Preparation#
Preliminaries#
For further information on how to prepare your data for statistical modeling, please refer to this page on our website.
First, let’s import all the common modules we will be using.
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
###
warnings.filterwarnings('ignore')
###
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn-v0_8")
We read the training and test datasets directly from the data URLs.
Since the datasets do not contain the attribute names, we specify them
explicitly during the data-loading process. The adultData dataset is
read first and is then concatenated with adultTest into a single
dataframe called data.
# Getting data from the web
url = (
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
)
# Specifying the attribute names
attributeNames = [
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'income',
]
# Read in data
adultData = pd.read_csv(url[0], sep = ',', names = attributeNames, header = None)
adultTest = pd.read_csv(url[1] , sep = ',', names = attributeNames, skiprows = 1)
# Join the two datasets together
data = pd.concat([adultData,adultTest])
# we will not need the datasets below anymore, so let's delete them to save memory
del adultData, adultTest
# Display 10 randomly selected rows
data.sample(10, random_state=999)
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
77 | 67 | ? | 212759 | 10th | 6 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 2 | United-States | <=50K |
30807 | 62 | Self-emp-not-inc | 224520 | HS-grad | 9 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 90 | United-States | >50K |
23838 | 32 | Private | 262153 | HS-grad | 9 | Divorced | Craft-repair | Unmarried | White | Male | 0 | 0 | 35 | United-States | <=50K |
24124 | 41 | Private | 39581 | Some-college | 10 | Divorced | Adm-clerical | Not-in-family | Black | Female | 0 | 0 | 24 | El-Salvador | <=50K |
22731 | 66 | ? | 186032 | Assoc-voc | 11 | Widowed | ? | Not-in-family | White | Female | 2964 | 0 | 30 | United-States | <=50K |
2975 | 27 | Private | 37933 | Some-college | 10 | Never-married | Adm-clerical | Unmarried | Black | Female | 0 | 0 | 40 | United-States | <=50K |
4457 | 31 | State-gov | 93589 | HS-grad | 9 | Divorced | Protective-serv | Own-child | Other | Male | 0 | 0 | 40 | United-States | <=50K |
32318 | 34 | Private | 117963 | Bachelors | 13 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
12685 | 61 | Self-emp-not-inc | 176965 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States | >50K |
2183 | 47 | Private | 156926 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 40 | United-States | >50K. |
Note: Alternatively, you can download the datasets from the UCI
Repository as txt files to your computer and then read them in as below.
adultData = pd.read_csv('adult.data.txt', sep = ',', names = attributeNames, header = None)
adultTest = pd.read_csv('adult.test.txt', sep = ',', names = attributeNames, skiprows = 1)
Data Cleaning and Transformation#
We first confirm that the feature types match the descriptions outlined in the documentation.
print(f"Shape of the dataset is {data.shape} \n")
print(f"Data types are below where 'object' indicates a string type: ")
print(data.dtypes)
Shape of the dataset is (48842, 15)
Data types are below where 'object' indicates a string type:
age int64
workclass object
fnlwgt int64
education object
education-num int64
marital-status object
occupation object
relationship object
race object
sex object
capital-gain int64
capital-loss int64
hours-per-week int64
native-country object
income object
dtype: object
Checking for Missing Values#
print(f"\nNumber of missing values for each feature:")
print(data.isnull().sum())
Number of missing values for each feature:
age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
income 0
dtype: int64
On the surface, no attribute contains any missing values. However, as we shall see below, missing values are actually coded with a question mark. We will address this issue later.
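As a quick sanity check, we can count the question-mark entries directly. The snippet below is a minimal sketch; note that at this point the raw values still carry leading whitespace, so we strip each string column before comparing.
# Count '?' entries per column; values still have leading spaces here,
# so strip each string column before comparing
question_marks = (
    data.select_dtypes(include='object')
        .apply(lambda col: col.str.strip().eq('?').sum())
)
print(question_marks)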
Summary Statistics#
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of continuous features</b>'))
data.describe(include='int64')
age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | |
---|---|---|---|---|---|---|
count | 48842.000000 | 4.884200e+04 | 48842.000000 | 48842.000000 | 48842.000000 | 48842.000000 |
mean | 38.643585 | 1.896641e+05 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
std | 13.710510 | 1.056040e+05 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
min | 17.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 28.000000 | 1.175505e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 37.000000 | 1.781445e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 48.000000 | 2.376420e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 90.000000 | 1.490400e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
display(HTML('<b>Table 2: Summary of categorical features</b>'))
data.describe(include='object')
workclass | education | marital-status | occupation | relationship | race | sex | native-country | income | |
---|---|---|---|---|---|---|---|---|---|
count | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 4 |
top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
freq | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43832 | 24720 |
Table 2 shows that the income feature has four categories (i.e., a
cardinality of 4). It should have only 2, since income must be binary.
We shall explain how to fix this cardinality issue later.
Continuous Features#
As discussed earlier, the fnlwgt variable has no predictive power, so
it is removed. In addition, since we already have the categorical
education feature, we also remove the education-num feature, as it
conveys the same information and is therefore redundant.
data = data.drop(columns=['fnlwgt', 'education-num'])
data['age'].describe()
count 48842.000000
mean 38.643585
std 13.710510
min 17.000000
25% 28.000000
50% 37.000000
75% 48.000000
max 90.000000
Name: age, dtype: float64
The range of age appears reasonable, with minimum and maximum ages of
17 and 90 respectively.
Next, we define capital = capital-gain - capital-loss and then remove
the individual gain and loss variables. The summary statistics for
capital are displayed below.
data['capital'] = data['capital-gain'] - data['capital-loss']
data = data.drop(columns=['capital-gain', 'capital-loss'])
data['capital'].describe()
count 48842.000000
mean 991.565313
std 7475.549906
min -4356.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 99999.000000
Name: capital, dtype: float64
Fixing Column Names#
Some column names contain a minus sign (hyphen), which is problematic when modeling: in a regression formula, a minus sign means “exclude this variable”, which is clearly not our intent here. For this reason, we replace the minus signs in the column names with underscores.
data.columns = [colname.replace('-', '_') for colname in list(data.columns)]
data.head()
age | workclass | education | marital_status | occupation | relationship | race | sex | hours_per_week | native_country | income | capital | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 40 | United-States | <=50K | 2174 |
1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 13 | United-States | <=50K | 0 |
2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 40 | United-States | <=50K | 0 |
3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 40 | United-States | <=50K | 0 |
4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 40 | Cuba | <=50K | 0 |
Categorical Features#
Let’s have a look at the unique values of the categorical columns. In
Pandas, string columns have the data type “object”, and these are
usually the categorical features.
categoricalColumns = data.columns[data.dtypes == object].tolist()
for col in categoricalColumns:
    print('Unique values for ' + col)
    print(data[col].unique())
    print('')
Unique values for workclass
[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']
Unique values for education
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
Unique values for marital_status
[' Never-married' ' Married-civ-spouse' ' Divorced'
' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']
Unique values for occupation
[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
' Protective-serv' ' Armed-Forces' ' Priv-house-serv']
Unique values for relationship
[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
' Other-relative']
Unique values for race
[' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']
Unique values for sex
[' Male' ' Female']
Unique values for native_country
[' United-States' ' Cuba' ' Jamaica' ' India' ' ?' ' Mexico' ' South'
' Puerto-Rico' ' Honduras' ' England' ' Canada' ' Germany' ' Iran'
' Philippines' ' Italy' ' Poland' ' Columbia' ' Cambodia' ' Thailand'
' Ecuador' ' Laos' ' Taiwan' ' Haiti' ' Portugal' ' Dominican-Republic'
' El-Salvador' ' France' ' Guatemala' ' China' ' Japan' ' Yugoslavia'
' Peru' ' Outlying-US(Guam-USVI-etc)' ' Scotland' ' Trinadad&Tobago'
' Greece' ' Nicaragua' ' Vietnam' ' Hong' ' Ireland' ' Hungary'
' Holand-Netherlands']
Unique values for income
[' <=50K' ' >50K' ' <=50K.' ' >50K.']
Some categorical values contain extra leading whitespace, which makes
filtering the data difficult. We apply the strip() method to remove it.
for col in categoricalColumns:
    data[col] = data[col].str.strip()
WARNING: The Statsmodels module does not play nice when there is a
minus sign in the levels of categorical variables, especially when you
try to do some sort of automatic variable selection. The reason is that
the minus sign has a special meaning in the underlying Patsy module: it
means “remove this feature”. So, we will replace all the minus signs in
categorical level names with underscores. A dot is not allowed either.
for col in categoricalColumns:
    data[col] = data[col].str.replace('-', '_')
data.head()
age | workclass | education | marital_status | occupation | relationship | race | sex | hours_per_week | native_country | income | capital | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State_gov | Bachelors | Never_married | Adm_clerical | Not_in_family | White | Male | 40 | United_States | <=50K | 2174 |
1 | 50 | Self_emp_not_inc | Bachelors | Married_civ_spouse | Exec_managerial | Husband | White | Male | 13 | United_States | <=50K | 0 |
2 | 38 | Private | HS_grad | Divorced | Handlers_cleaners | Not_in_family | White | Male | 40 | United_States | <=50K | 0 |
3 | 53 | Private | 11th | Married_civ_spouse | Handlers_cleaners | Husband | Black | Male | 40 | United_States | <=50K | 0 |
4 | 28 | Private | Bachelors | Married_civ_spouse | Prof_specialty | Wife | Black | Female | 40 | Cuba | <=50K | 0 |
The workclass, occupation, and native_country features contain some
missing values encoded as “?”. These observations comprise 7.4% of the
total number of observations.
mask = (data['workclass'] == '?') | (data['occupation'] == '?') | (data['native_country'] == '?')
mask.value_counts(normalize = True)*100
False 92.588346
True 7.411654
Name: proportion, dtype: float64
We now remove the rows where workclass, occupation, or native_country is missing, i.e., equal to “?”.
data = data[data['workclass'] != "?"]
data = data[data['occupation'] != "?"]
data = data[data['native_country'] != "?"]
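As an aside, the same filtering can be done in a single statement using the mask computed earlier. This is an alternative to (not a supplement to) the three lines above:
# Alternative: drop all rows flagged by the mask in one statement
# (run this instead of, not after, the three filters above)
data = data[~mask].copy()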
Since native_country is too granular and unbalanced, we group
countries into “Other” and “United_States”. Likewise, we also collapse
race into “Other” and “White”.
data.loc[data['native_country'] != 'United_States', 'native_country'] = 'Other'
data.loc[data['race'] != 'White', 'race'] = 'Other'
TIP 1: As a general comment, numerical features in a dataset sometimes
actually represent categorical features. For example, suppose each
state of Australia is encoded as an integer between 1 and 8, say as a
StateID column in a dataframe called df. This would still be a
categorical variable. For your code to work correctly in such cases,
you need to change the data type of this column from numeric to string
as follows:
df['StateID'] = df['StateID'].astype(str)
TIP 2: If you have a string column, say named col, that is actually
real-valued, you can change its data type as follows:
df['col'] = df['col'].astype(float)
The income Feature#
We need to correct the levels of the income categorical feature so
that it is binary. (Although income is the target in the original UCI
task, it serves as a descriptive feature in this project, since our
target is age.)
print('Before correction, the unique income labels and their counts are:')
print(data['income'].value_counts())
print("")
data['income'] = data['income'].str.rstrip(".")
print('After removing the trailing dot, the unique income labels and their counts are:')
print(data['income'].value_counts())
print("")
Before correction, the unique income labels and their counts are:
income
<=50K 22654
<=50K. 11360
>50K 7508
>50K. 3700
Name: count, dtype: int64
After removing the trailing dot, the unique income labels and their counts are:
income
<=50K 34014
>50K 11208
Name: count, dtype: int64
WARNING: The Statsmodels module does not play nice when there are
mathematical symbols such as “-”, “+”, “<”, and “>” in the levels of
categorical variables (try it and you will get a syntax error). For
this reason, we re-code income as low and high using the replace()
method in Pandas, as below.
data['income'] = data['income'].replace({'<=50K': 'low', '>50K': 'high'})
data['income'].value_counts()
income
low 34014
high 11208
Name: count, dtype: int64
Data Exploration#
Our dataset can now be considered “clean” and ready for visualisation and statistical modeling.
Univariate Visualisation#
ax = data['native_country'].value_counts().plot(kind = 'bar')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
plt.tight_layout()
plt.title('Figure 1: Bar Chart of Native Country', fontsize = 15)
plt.show();
Let’s display a boxplot and a histogram for age. Figure 2 (the boxplot
below) shows that this variable is right-skewed; a sketch of the
histogram cell follows it.
# get a box plot of age
sns.boxplot(data['age']).set_title('Figure 2: Box Plot of Age', fontsize = 15)
plt.show();
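The histogram cell itself (presumably Figure 3, given the figure numbering) is not shown above. Below is a minimal sketch that would produce it; the bin count and title are assumptions.
# Histogram of age (a sketch of the omitted Figure 3 cell; bins assumed)
plt.hist(data['age'], bins=20)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Figure 3: Histogram of Age', fontsize=15)
plt.show();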
Multivariate Visualisation#
Scatterplot of Numeric Features and Age#
The scatterplot in Figure 4 shows no clear correlation between the age and hours_per_week numeric variables; the plotting cell is not shown above, so a sketch follows.
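Below is a minimal sketch of code that would produce such a scatterplot; the exact styling of the original Figure 4 is an assumption.
# Scatterplot of age vs. hours per week (a sketch of the Figure 4 cell)
sns.scatterplot(x='hours_per_week', y='age', data=data, alpha=0.3)
plt.title('Figure 4: Scatterplot of Age vs. Hours per Week', fontsize=15)
plt.show();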
Categorical Attributes by Age#
We can see that the distribution of age across the two genders is similar.
# Creating a boxplot
sns.boxplot(x='sex', y='age', data=data);
plt.title('Figure 5: Boxplot of age by gender', fontsize = 15)
plt.show();
The age distributions of married and single women differ but have similar medians, as seen in Figure 6. The whiskers suggest that women above the age of 70 are more likely to be single.
# Storing a list of booleans corresponding to whether the person is female and a wife or Not_in_family
family_female_mask = (data['relationship'].isin(['Not_in_family','Wife'])) & (data['sex'].isin(['Female']))
# Using the list of booleans previously found to select the index of rows
family_female = data[family_female_mask]
# Creating the boxplot
sns.boxplot(x='relationship', y='age', data=family_female);
plt.title('Figure 6: Boxplot of female relationship status by age', fontsize = 15)
plt.show();
Facet plots#
From Figure 7, we see that, within each relationship level, the age distributions of those with a Bachelors degree and those who are high school graduates are broadly comparable.
# Getting the index of those who have completed their Bachelors or HS graduate
edu_mask = data['education'].isin(['Bachelors','HS_grad'])
# Getting the index of those who are male and Not_in_family or a Husband
family_male_mask = (data['relationship'].isin(['Not_in_family','Husband'])) & (data['sex'].isin(['Male']))
# Selecting the rows of those who are Not_in_family, husband or wife and
# have completed either a Bachelors or just graduated high school
education_relationship = data[(edu_mask & family_female_mask) | (edu_mask & family_male_mask)]
# Creating the boxplot
sns.boxplot(x='relationship', y='age', hue='education', data=education_relationship);
plt.title('Figure 7: Boxplot of age broken down by relationship and education', fontsize = 15)
plt.show();
Although there is no clear overall pattern, we observe in Figure 8 that no government worker under the age of 40 earns over $50k a year unless they work more than 20 hours a week.
# Getting the index of those who work in the government
gov_mask = data['workclass'].isin(['Federal_gov','Local_gov','State_gov'])
# creating a dataframe of those who work in the government
gov = data[gov_mask]
# creating a scatterplot
sns.scatterplot(x='hours_per_week', y='age', hue='income', data=gov)
plt.title('Figure 8: Scatterplot of hours per week by age coloured by income for government workers', fontsize = 15);
plt.legend(loc = 'upper right')
plt.show();
Statistical Modeling and Performance Evaluation#
Full Model#
We begin by fitting a multiple linear regression that predicts age
using all of the available features. We call this the full model.
First, let’s take a quick peek at the clean data.
data.head()
age | workclass | education | marital_status | occupation | relationship | race | sex | hours_per_week | native_country | income | capital | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State_gov | Bachelors | Never_married | Adm_clerical | Not_in_family | White | Male | 40 | United_States | low | 2174 |
1 | 50 | Self_emp_not_inc | Bachelors | Married_civ_spouse | Exec_managerial | Husband | White | Male | 13 | United_States | low | 0 |
2 | 38 | Private | HS_grad | Divorced | Handlers_cleaners | Not_in_family | White | Male | 40 | United_States | low | 0 |
3 | 53 | Private | 11th | Married_civ_spouse | Handlers_cleaners | Husband | Other | Male | 40 | United_States | low | 0 |
4 | 28 | Private | Bachelors | Married_civ_spouse | Prof_specialty | Wife | Other | Female | 40 | Other | low | 0 |
When constructing the regression formula, we could manually type out all the independent features. However, when there are many independent variables, we can use a few string tricks instead, as below.
# short and sweet
formula_string_indep_vars = ' + '.join(data.drop(columns='age').columns)
formula_string = 'age ~ ' + formula_string_indep_vars
print('formula_string: ', formula_string)
formula_string: age ~ workclass + education + marital_status + occupation + relationship + race + sex + hours_per_week + native_country + income + capital
The formula string above works just fine with the Statsmodels module.
The problem, however, is that we cannot do automatic variable selection
with this formula. What we need for this purpose is “one-hot encoding”
of categorical features. For more information on this encoding, please
refer to this page.
In the code chunk below, we first use the get_dummies() function in
Pandas for one-hot encoding of the categorical features and then
construct a new formula string with the encoded features.
# one-hot-encoding of categorical features
# for this to work correctly, variable data types (numeric or categorical)
# must be correctly specified within the Pandas dataframe
data_encoded = pd.get_dummies(data, drop_first=True)
data_encoded.head()
age | hours_per_week | capital | workclass_Local_gov | workclass_Private | workclass_Self_emp_inc | workclass_Self_emp_not_inc | workclass_State_gov | workclass_Without_pay | education_11th | ... | occupation_Transport_moving | relationship_Not_in_family | relationship_Other_relative | relationship_Own_child | relationship_Unmarried | relationship_Wife | race_White | sex_Male | native_country_United_States | income_low | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | 40 | 2174 | False | False | False | False | True | False | False | ... | False | True | False | False | False | False | True | True | True | True |
1 | 50 | 13 | 0 | False | False | False | True | False | False | False | ... | False | False | False | False | False | False | True | True | True | True |
2 | 38 | 40 | 0 | False | True | False | False | False | False | False | ... | False | True | False | False | False | False | True | True | True | True |
3 | 53 | 40 | 0 | False | True | False | False | False | False | True | ... | False | False | False | False | False | False | False | True | True | True |
4 | 28 | 40 | 0 | False | True | False | False | False | False | False | ... | False | False | False | False | False | True | False | False | False | True |
5 rows × 52 columns
formula_string_indep_vars_encoded = ' + '.join(data_encoded.drop(columns='age').columns)
formula_string_encoded = 'age ~ ' + formula_string_indep_vars_encoded
print('formula_string_encoded: ', formula_string_encoded)
formula_string_encoded: age ~ hours_per_week + capital + workclass_Local_gov + workclass_Private + workclass_Self_emp_inc + workclass_Self_emp_not_inc + workclass_State_gov + workclass_Without_pay + education_11th + education_12th + education_1st_4th + education_5th_6th + education_7th_8th + education_9th + education_Assoc_acdm + education_Assoc_voc + education_Bachelors + education_Doctorate + education_HS_grad + education_Masters + education_Preschool + education_Prof_school + education_Some_college + marital_status_Married_AF_spouse + marital_status_Married_civ_spouse + marital_status_Married_spouse_absent + marital_status_Never_married + marital_status_Separated + marital_status_Widowed + occupation_Armed_Forces + occupation_Craft_repair + occupation_Exec_managerial + occupation_Farming_fishing + occupation_Handlers_cleaners + occupation_Machine_op_inspct + occupation_Other_service + occupation_Priv_house_serv + occupation_Prof_specialty + occupation_Protective_serv + occupation_Sales + occupation_Tech_support + occupation_Transport_moving + relationship_Not_in_family + relationship_Other_relative + relationship_Own_child + relationship_Unmarried + relationship_Wife + race_White + sex_Male + native_country_United_States + income_low
For fun, let’s add two interaction terms to our full model: the
interactions of the capital feature with hours_per_week and with
race_White.
formula_string_encoded = formula_string_encoded + ' + hours_per_week:capital + race_White:capital'
Also, let’s add the square of the hours_per_week feature to illustrate
how we can add higher-order terms to a linear regression. Note that
Patsy evaluates the np.power(...) call when building the design matrix,
which is why NumPy needs to be available as np in the calling namespace.
formula_string_encoded = formula_string_encoded + ' + np.power(hours_per_week, 2)'
print('formula_string_encoded: ', formula_string_encoded)
formula_string_encoded: age ~ hours_per_week + capital + workclass_Local_gov + workclass_Private + workclass_Self_emp_inc + workclass_Self_emp_not_inc + workclass_State_gov + workclass_Without_pay + education_11th + education_12th + education_1st_4th + education_5th_6th + education_7th_8th + education_9th + education_Assoc_acdm + education_Assoc_voc + education_Bachelors + education_Doctorate + education_HS_grad + education_Masters + education_Preschool + education_Prof_school + education_Some_college + marital_status_Married_AF_spouse + marital_status_Married_civ_spouse + marital_status_Married_spouse_absent + marital_status_Never_married + marital_status_Separated + marital_status_Widowed + occupation_Armed_Forces + occupation_Craft_repair + occupation_Exec_managerial + occupation_Farming_fishing + occupation_Handlers_cleaners + occupation_Machine_op_inspct + occupation_Other_service + occupation_Priv_house_serv + occupation_Prof_specialty + occupation_Protective_serv + occupation_Sales + occupation_Tech_support + occupation_Transport_moving + relationship_Not_in_family + relationship_Other_relative + relationship_Own_child + relationship_Unmarried + relationship_Wife + race_White + sex_Male + native_country_United_States + income_low + hours_per_week:capital + race_White:capital + np.power(hours_per_week, 2)
Now that we have defined our statistical model formula as a Python string, we fit an OLS (ordinary least squares) model to our encoded data.
model_full = sm.formula.ols(formula=formula_string_encoded, data=data_encoded)
###
model_full_fitted = model_full.fit()
###
print(model_full_fitted.summary())
OLS Regression Results
==============================================================================
Dep. Variable: age R-squared: 0.406
Model: OLS Adj. R-squared: 0.405
Method: Least Squares F-statistic: 571.3
Date: Tue, 23 Jul 2024 Prob (F-statistic): 0.00
Time: 00:54:41 Log-Likelihood: -1.6914e+05
No. Observations: 45222 AIC: 3.384e+05
Df Residuals: 45167 BIC: 3.389e+05
Df Model: 54
Covariance Type: nonrobust
================================================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------------------------
Intercept 58.8565 0.845 69.635 0.000 57.200 60.513
workclass_Local_gov[T.True] -0.5831 0.336 -1.737 0.082 -1.241 0.075
workclass_Private[T.True] -3.1473 0.285 -11.057 0.000 -3.705 -2.589
workclass_Self_emp_inc[T.True] 1.5929 0.381 4.183 0.000 0.847 2.339
workclass_Self_emp_not_inc[T.True] 1.8802 0.331 5.675 0.000 1.231 2.530
workclass_State_gov[T.True] -1.9111 0.362 -5.284 0.000 -2.620 -1.202
workclass_Without_pay[T.True] 7.5036 2.247 3.340 0.001 3.100 11.908
education_11th[T.True] -3.2940 0.387 -8.509 0.000 -4.053 -2.535
education_12th[T.True] -2.4467 0.516 -4.744 0.000 -3.458 -1.436
education_1st_4th[T.True] 5.2143 0.756 6.897 0.000 3.732 6.696
education_5th_6th[T.True] 4.0178 0.575 6.983 0.000 2.890 5.146
education_7th_8th[T.True] 6.8474 0.462 14.827 0.000 5.942 7.753
education_9th[T.True] 1.5526 0.490 3.170 0.002 0.593 2.512
education_Assoc_acdm[T.True] -2.0829 0.399 -5.226 0.000 -2.864 -1.302
education_Assoc_voc[T.True] -1.8617 0.377 -4.941 0.000 -2.600 -1.123
education_Bachelors[T.True] -1.5288 0.329 -4.643 0.000 -2.174 -0.883
education_Doctorate[T.True] 3.3921 0.554 6.118 0.000 2.305 4.479
education_HS_grad[T.True] -0.6005 0.305 -1.968 0.049 -1.199 -0.003
education_Masters[T.True] 1.2709 0.379 3.349 0.001 0.527 2.015
education_Preschool[T.True] 4.8391 1.244 3.891 0.000 2.401 7.277
education_Prof_school[T.True] 0.4821 0.503 0.958 0.338 -0.504 1.468
education_Some_college[T.True] -2.1039 0.313 -6.717 0.000 -2.718 -1.490
marital_status_Married_AF_spouse[T.True] -15.3973 1.896 -8.121 0.000 -19.114 -11.681
marital_status_Married_civ_spouse[T.True] -6.1876 0.605 -10.225 0.000 -7.374 -5.002
marital_status_Married_spouse_absent[T.True] -2.5870 0.457 -5.667 0.000 -3.482 -1.692
marital_status_Never_married[T.True] -11.9959 0.170 -70.591 0.000 -12.329 -11.663
marital_status_Separated[T.True] -2.9265 0.302 -9.687 0.000 -3.519 -2.334
marital_status_Widowed[T.True] 13.5010 0.318 42.520 0.000 12.879 14.123
occupation_Armed_Forces[T.True] -6.4631 2.742 -2.357 0.018 -11.838 -1.089
occupation_Craft_repair[T.True] -1.4627 0.208 -7.025 0.000 -1.871 -1.055
occupation_Exec_managerial[T.True] 0.5189 0.204 2.544 0.011 0.119 0.919
occupation_Farming_fishing[T.True] -0.0065 0.322 -0.020 0.984 -0.638 0.625
occupation_Handlers_cleaners[T.True] -2.6262 0.277 -9.466 0.000 -3.170 -2.082
occupation_Machine_op_inspct[T.True] -1.2234 0.244 -5.024 0.000 -1.701 -0.746
occupation_Other_service[T.True] -1.4299 0.207 -6.895 0.000 -1.836 -1.023
occupation_Priv_house_serv[T.True] 3.0495 0.692 4.406 0.000 1.693 4.406
occupation_Prof_specialty[T.True] -0.7470 0.216 -3.459 0.001 -1.170 -0.324
occupation_Protective_serv[T.True] -1.7801 0.371 -4.800 0.000 -2.507 -1.053
occupation_Sales[T.True] -0.8718 0.203 -4.288 0.000 -1.270 -0.473
occupation_Tech_support[T.True] -1.2322 0.307 -4.015 0.000 -1.834 -0.631
occupation_Transport_moving[T.True] 0.2230 0.268 0.832 0.406 -0.303 0.749
relationship_Not_in_family[T.True] -4.0503 0.603 -6.716 0.000 -5.232 -2.868
relationship_Other_relative[T.True] -6.8905 0.590 -11.689 0.000 -8.046 -5.735
relationship_Own_child[T.True] -11.9082 0.600 -19.862 0.000 -13.083 -10.733
relationship_Unmarried[T.True] -6.1890 0.624 -9.919 0.000 -7.412 -4.966
relationship_Wife[T.True] -3.9954 0.273 -14.611 0.000 -4.531 -3.459
race_White[T.True] -1.0056 0.145 -6.919 0.000 -1.290 -0.721
sex_Male[T.True] -0.1782 0.146 -1.223 0.221 -0.464 0.107
native_country_United_States[T.True] 1.4129 0.185 7.647 0.000 1.051 1.775
income_low[T.True] -2.4367 0.138 -17.616 0.000 -2.708 -2.166
hours_per_week -0.1571 0.014 -11.106 0.000 -0.185 -0.129
capital 0.0001 3.22e-05 3.492 0.000 4.93e-05 0.000
race_White[T.True]:capital 3.006e-05 2.12e-05 1.418 0.156 -1.15e-05 7.16e-05
hours_per_week:capital -2.148e-06 5.02e-07 -4.283 0.000 -3.13e-06 -1.17e-06
np.power(hours_per_week, 2) 0.0010 0.000 6.350 0.000 0.001 0.001
==============================================================================
Omnibus: 3966.388 Durbin-Watson: 2.001
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5729.860
Skew: 0.703 Prob(JB): 0.00
Kurtosis: 4.033 Cond. No. 2.26e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.26e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
The full model has an adjusted R-squared value of 0.405, which means the model explains only about 40% of the variance in age. Looking at the p-values, we observe that most terms are highly significant, though a few are not significant at the 5% level; the sketch below lists them.
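As a quick way to list those insignificant terms directly (a sketch; the 0.05 threshold matches the cutoff we use for feature selection later):
# List terms (excluding the intercept) with p-values above 0.05
pvals = model_full_fitted.pvalues.drop(labels='Intercept')
print(pvals[pvals > 0.05].sort_values(ascending=False))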
Let’s define a new data frame for actual age vs. predicted age and the residuals for the full model. We will use this data frame when plotting predicted values and the regression residuals.
residuals_full = pd.DataFrame({'actual': data_encoded['age'],
'predicted': model_full_fitted.fittedvalues,
'residual': model_full_fitted.resid})
residuals_full.head(10)
actual | predicted | residual | |
---|---|---|---|
0 | 39 | 32.570133 | 6.429867 |
1 | 50 | 49.455511 | 0.544489 |
2 | 38 | 41.508711 | -3.508711 |
3 | 53 | 37.683573 | 15.316427 |
4 | 28 | 36.097893 | -8.097893 |
5 | 37 | 40.570690 | -3.570690 |
6 | 49 | 44.495664 | 4.504336 |
7 | 52 | 49.611524 | 2.388476 |
8 | 31 | 35.683539 | -4.683539 |
9 | 42 | 44.318118 | -2.318118 |
Let’s plot actual age values vs. predicted values.
def plot_line(axis, slope, intercept, **kargs):
    # draw a reference line with the given slope and intercept across the current x-range
    xmin, xmax = axis.get_xlim()
    plt.plot([xmin, xmax], [xmin*slope + intercept, xmax*slope + intercept], **kargs)
# Creating scatter plot
plt.scatter(residuals_full['actual'], residuals_full['predicted'], alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="red");
plt.xlabel('Actual Age');
plt.ylabel('Predicted Age');
plt.title('Figure 9: Scatter plot of actual vs. predicted age for the Full Model', fontsize=15);
plt.show();
From Figure 9, we observe that the model never produces a prediction above 80 even though the oldest person in the dataset is 90.
We will now check the diagnostics for the full model.
Full Model Diagnostic Checks#
We would like to check whether there are indications of violations of the regression assumptions, which are:
- linearity of the relationship between the target variable and the independent variables,
- constant variance of the errors,
- normality of the residual distribution, and
- statistical independence of the residuals.
Let’s first get a scatter plot of the residuals as a function of
predicted age.
plt.scatter(residuals_full['predicted'], residuals_full['residual'], alpha=0.3);
plt.xlabel('Predicted Age');
plt.ylabel('Residuals')
plt.title('Figure 10(a): Scatterplot of residuals vs. predicted age for Full Model', fontsize=15)
plt.show();
From Figure 10(a), we see that, rather than being mostly random and centered around 0, the residuals exhibit a banding pattern, especially when predicted age is below 50. This pattern indicates that the constant variance assumption of linear regression is not quite satisfied in this case.
Let’s now plot actual age vs. residuals.
plt.scatter(residuals_full['actual'], residuals_full['residual'], alpha=0.3);
plt.xlabel('Actual Age');
plt.ylabel('Residuals')
plt.title('Figure 10(b): Scatterplot of residuals vs. actual age for Full Model', fontsize=15)
plt.show();
We notice that the model overestimates younger ages and underestimates older ages. In particular, for those younger than the age of 30, the model predicts much older ages. Also, for those above the age of 80, the model predicts significantly younger ages.
Let’s overlay the histograms of actual vs. predicted age on the same plot.
plt.hist(residuals_full['actual'], label='actual', bins=20, alpha=0.7);
plt.hist(residuals_full['predicted'], label='predicted', bins=20, alpha=0.7);
plt.xlabel('Age');
plt.ylabel('Frequency');
plt.title('Figure 11: Histograms of actual age vs. predicted age for Full Model', fontsize=15);
plt.legend()
plt.show();
We notice that the two distributions are quite different. In particular, the model’s predictions are highly clustered around the mid-40s.
Let’s now have a look at the histogram of the residuals.
plt.hist(residuals_full['residual'], bins = 20);
plt.xlabel('Residual');
plt.ylabel('Frequency');
plt.title('Figure 12: Histogram of residuals for Full Model', fontsize=15);
plt.show();
From Figure 12, the histogram of residuals looks somewhat symmetric, though slightly right-skewed. Nonetheless, it seems the normality assumption of linear regression is not significantly violated in this particular case.
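A normal Q-Q plot gives a sharper view of the normality assumption than a histogram. The snippet below is a complementary sketch, not part of the original analysis:
# Q-Q plot of the full-model residuals against a fitted normal distribution;
# line='s' adds a standardized reference line
sm.qqplot(model_full_fitted.resid, line='s')
plt.title('Q-Q plot of residuals for the Full Model', fontsize=15)
plt.show();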
Backwards Feature Selection#
We now perform backwards feature selection using p-values. It appears
Statsmodels does not provide a canned routine for automatic feature
selection, so we wrote one ourselves.
formula_string_encoded
'age ~ hours_per_week + capital + workclass_Local_gov + workclass_Private + workclass_Self_emp_inc + workclass_Self_emp_not_inc + workclass_State_gov + workclass_Without_pay + education_11th + education_12th + education_1st_4th + education_5th_6th + education_7th_8th + education_9th + education_Assoc_acdm + education_Assoc_voc + education_Bachelors + education_Doctorate + education_HS_grad + education_Masters + education_Preschool + education_Prof_school + education_Some_college + marital_status_Married_AF_spouse + marital_status_Married_civ_spouse + marital_status_Married_spouse_absent + marital_status_Never_married + marital_status_Separated + marital_status_Widowed + occupation_Armed_Forces + occupation_Craft_repair + occupation_Exec_managerial + occupation_Farming_fishing + occupation_Handlers_cleaners + occupation_Machine_op_inspct + occupation_Other_service + occupation_Priv_house_serv + occupation_Prof_specialty + occupation_Protective_serv + occupation_Sales + occupation_Tech_support + occupation_Transport_moving + relationship_Not_in_family + relationship_Other_relative + relationship_Own_child + relationship_Unmarried + relationship_Wife + race_White + sex_Male + native_country_United_States + income_low + hours_per_week:capital + race_White:capital + np.power(hours_per_week, 2)'
import re
## create the patsy model description from formula
patsy_description = patsy.ModelDesc.from_formula(formula_string_encoded)
# initialize feature-selected fit to full model
linreg_fit = model_full_fitted
# do backwards elimination using p-values
p_val_cutoff = 0.05
## WARNING 1: The code below assumes that the Intercept term is present in the model.
## WARNING 2: It will work only with main effects and two-way interactions, if any.
print('\nPerforming backwards feature selection using p-values:')
while True:
    # uncomment the line below if you would like to see the regression summary
    # in each step:
    # print(linreg_fit.summary())
    pval_series = linreg_fit.pvalues.drop(labels='Intercept')
    pval_series = pval_series.sort_values(ascending=False)
    term = pval_series.index[0]
    pval = pval_series.iloc[0]  # largest remaining p-value
    if pval < p_val_cutoff:
        break
    print(f'\nRemoving term "{term}" with p-value {pval:.4}')
    term_components = term.split(':')
    if len(term_components) == 1:  ## this is a main effect term
        term_to_remove = re.sub(r'\[.*?\]', '', term_components[0])
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_to_remove)]))
    else:  ## this is an interaction term
        term_to_remove1 = re.sub(r'\[.*?\]', '', term_components[0])
        term_to_remove2 = re.sub(r'\[.*?\]', '', term_components[1])
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_to_remove1),
                                                          patsy.EvalFactor(term_to_remove2)]))
    linreg_fit = smf.ols(formula=patsy_description, data=data_encoded).fit()
#########
## this is the clean fit after backwards elimination
model_reduced_fitted = smf.ols(formula = patsy_description, data = data_encoded).fit()
#########
#########
print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")
print(f"Regression Adjusted R-squared: {model_reduced_fitted.rsquared_adj:.4f}")
Performing backwards feature selection using p-values:
Removing term "occupation_Farming_fishing[T.True]" with p-value 0.984
Removing term "occupation_Transport_moving[T.True]" with p-value 0.3774
Removing term "education_Prof_school[T.True]" with p-value 0.3591
Removing term "sex_Male[T.True]" with p-value 0.2642
Removing term "race_White[T.True]:capital" with p-value 0.1615
Removing term "workclass_Local_gov[T.True]" with p-value 0.09358
***
OLS Regression Results
==============================================================================
Dep. Variable: age R-squared: 0.406
Model: OLS Adj. R-squared: 0.405
Method: Least Squares F-statistic: 642.5
Date: Tue, 23 Jul 2024 Prob (F-statistic): 0.00
Time: 00:54:44 Log-Likelihood: -1.6914e+05
No. Observations: 45222 AIC: 3.384e+05
Df Residuals: 45173 BIC: 3.388e+05
Df Model: 48
Covariance Type: nonrobust
================================================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------------------------
Intercept 58.5679 0.790 74.140 0.000 57.020 60.116
workclass_Private[T.True] -2.7499 0.172 -15.951 0.000 -3.088 -2.412
workclass_Self_emp_inc[T.True] 1.9866 0.307 6.477 0.000 1.385 2.588
workclass_Self_emp_not_inc[T.True] 2.2638 0.236 9.610 0.000 1.802 2.726
workclass_State_gov[T.True] -1.5082 0.278 -5.422 0.000 -2.053 -0.963
workclass_Without_pay[T.True] 7.8647 2.233 3.522 0.000 3.488 12.242
education_11th[T.True] -3.4571 0.348 -9.931 0.000 -4.139 -2.775
education_12th[T.True] -2.6152 0.487 -5.373 0.000 -3.569 -1.661
education_1st_4th[T.True] 5.0367 0.736 6.843 0.000 3.594 6.479
education_5th_6th[T.True] 3.8423 0.549 6.994 0.000 2.766 4.919
education_7th_8th[T.True] 6.6783 0.429 15.573 0.000 5.838 7.519
education_9th[T.True] 1.3880 0.459 3.021 0.003 0.487 2.289
education_Assoc_acdm[T.True] -2.2605 0.352 -6.419 0.000 -2.951 -1.570
education_Assoc_voc[T.True] -2.0396 0.329 -6.208 0.000 -2.684 -1.396
education_Bachelors[T.True] -1.7276 0.262 -6.588 0.000 -2.242 -1.214
education_Doctorate[T.True] 3.1805 0.503 6.321 0.000 2.194 4.167
education_HS_grad[T.True] -0.7660 0.250 -3.061 0.002 -1.257 -0.275
education_Masters[T.True] 1.0459 0.314 3.326 0.001 0.430 1.662
education_Preschool[T.True] 4.6321 1.231 3.764 0.000 2.220 7.044
education_Some_college[T.True] -2.2759 0.256 -8.882 0.000 -2.778 -1.774
marital_status_Married_AF_spouse[T.True] -15.3512 1.896 -8.097 0.000 -19.067 -11.635
marital_status_Married_civ_spouse[T.True] -6.2001 0.605 -10.247 0.000 -7.386 -5.014
marital_status_Married_spouse_absent[T.True] -2.5991 0.456 -5.696 0.000 -3.494 -1.705
marital_status_Never_married[T.True] -12.0107 0.169 -70.961 0.000 -12.342 -11.679
marital_status_Separated[T.True] -2.9334 0.302 -9.711 0.000 -3.525 -2.341
marital_status_Widowed[T.True] 13.5172 0.317 42.689 0.000 12.897 14.138
occupation_Armed_Forces[T.True] -6.1629 2.732 -2.256 0.024 -11.517 -0.808
occupation_Craft_repair[T.True] -1.5761 0.173 -9.121 0.000 -1.915 -1.237
occupation_Exec_managerial[T.True] 0.4658 0.180 2.581 0.010 0.112 0.819
occupation_Handlers_cleaners[T.True] -2.7455 0.253 -10.860 0.000 -3.241 -2.250
occupation_Machine_op_inspct[T.True] -1.3151 0.219 -6.001 0.000 -1.745 -0.886
occupation_Other_service[T.True] -1.4980 0.185 -8.090 0.000 -1.861 -1.135
occupation_Priv_house_serv[T.True] 3.0454 0.686 4.443 0.000 1.702 4.389
occupation_Prof_specialty[T.True] -0.7794 0.191 -4.078 0.000 -1.154 -0.405
occupation_Protective_serv[T.True] -1.9512 0.353 -5.527 0.000 -2.643 -1.259
occupation_Sales[T.True] -0.9336 0.179 -5.223 0.000 -1.284 -0.583
occupation_Tech_support[T.True] -1.2731 0.294 -4.331 0.000 -1.849 -0.697
relationship_Not_in_family[T.True] -3.9907 0.601 -6.645 0.000 -5.168 -2.814
relationship_Other_relative[T.True] -6.8363 0.587 -11.650 0.000 -7.986 -5.686
relationship_Own_child[T.True] -11.8623 0.597 -19.867 0.000 -13.033 -10.692
relationship_Unmarried[T.True] -6.0981 0.618 -9.875 0.000 -7.308 -4.888
relationship_Wife[T.True] -3.8670 0.240 -16.083 0.000 -4.338 -3.396
race_White[T.True] -0.9984 0.144 -6.919 0.000 -1.281 -0.716
native_country_United_States[T.True] 1.4226 0.185 7.708 0.000 1.061 1.784
income_low[T.True] -2.4513 0.137 -17.895 0.000 -2.720 -2.183
hours_per_week -0.1575 0.014 -11.164 0.000 -0.185 -0.130
capital 0.0001 2.6e-05 5.345 0.000 8.81e-05 0.000
hours_per_week:capital -2.13e-06 5.01e-07 -4.247 0.000 -3.11e-06 -1.15e-06
np.power(hours_per_week, 2) 0.0010 0.000 6.379 0.000 0.001 0.001
==============================================================================
Omnibus: 3959.253 Durbin-Watson: 2.001
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5717.587
Skew: 0.702 Prob(JB): 0.00
Kurtosis: 4.032 Cond. No. 2.25e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.25e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
***
Regression number of terms: 49
Regression F-distribution p-value: 0.0000
Regression R-squared: 0.4057
Regression Adjusted R-squared: 0.4051
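As a quick side-by-side comparison of the two fitted models, we can also print their information criteria (a sketch; the same AIC and BIC values appear in the summaries above):
# Compare the full and reduced models on information criteria
print(f"Full model:    AIC = {model_full_fitted.aic:.0f}, BIC = {model_full_fitted.bic:.0f}")
print(f"Reduced model: AIC = {model_reduced_fitted.aic:.0f}, BIC = {model_reduced_fitted.bic:.0f}")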
Similar to what we did for the full model, let’s define a new data frame for actual age vs. predicted age and the residuals for the reduced model.
residuals_reduced = pd.DataFrame({'actual': data_encoded['age'],
'predicted': model_reduced_fitted.fittedvalues,
'residual': model_reduced_fitted.resid})
residuals_reduced.head(10)
actual | predicted | residual | |
---|---|---|---|
0 | 39 | 32.690335 | 6.309665 |
1 | 50 | 49.461020 | 0.538980 |
2 | 38 | 41.558042 | -3.558042 |
3 | 53 | 37.655981 | 15.344019 |
4 | 28 | 36.062133 | -8.062133 |
5 | 37 | 40.504922 | -3.504922 |
6 | 49 | 44.397983 | 4.602017 |
7 | 52 | 49.654315 | 2.345685 |
8 | 31 | 35.544969 | -4.544969 |
9 | 42 | 44.329267 | -2.329267 |
# get a scatter plot
plt.scatter(residuals_reduced['actual'], residuals_reduced['predicted'], alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="red");
plt.xlabel('Age');
plt.ylabel('Predicted Age');
plt.title('Figure 13: Scatter plot of actual vs. predicted age for Reduced Model', fontsize=15);
plt.show();
This model has an adjusted R-squared of 0.405, meaning the reduced model still explains about 40% of the variance, but with 6 fewer variables. Looking at the p-values, all remaining terms are significant at the 5% level, as expected. From Figure 13, we still have the same issues with our model: it overestimates younger ages and underestimates older ages. We will now perform the diagnostic checks on this reduced model.
Reduced Model Diagnostic Checks#
Let’s first get a scatter plot of residuals (as a function of predicted age).
plt.scatter(residuals_reduced['predicted'], residuals_reduced['residual'], alpha=0.3);
plt.xlabel('Predicted Age');
plt.ylabel('Residuals')
plt.title('Figure 14: Scatter plot of residuals vs. predicted age for Reduced Model', fontsize=15)
plt.show();
Figure 14 looks very similar to Figure 10(a), suggesting that the residuals exhibit the same banding pattern.
Let’s now have a look at the histogram of the residuals for the reduced model.
plt.hist(residuals_reduced['residual'], bins = 20);
plt.xlabel('Residual');
plt.ylabel('Frequency');
plt.title('Figure 15: Histogram of residuals for Reduced Model', fontsize = 15)
plt.show();
From Figure 15, the histogram is again roughly symmetric around zero, suggesting that the residuals are approximately normally distributed.
Summary and Conclusions#
Using our independent variables, we were able to fit a full model with an adjusted R-squared value of about 40%. After backwards variable selection with a p-value cutoff of 0.05, we maintained the same performance with 6 fewer variables. Our final model has 49 terms altogether (including the intercept), with a model p-value of practically 0.
Diagnostic checks with residual scatter plots indicate that, rather than being random and centered around 0, the residuals exhibit a banding pattern, especially when predicted age is below 50. This pattern indicates that the constant variance assumption of linear regression is not quite satisfied in this case. On the other hand, the residual histograms suggest no significant violations of the normality assumption.
The final multiple linear regression model has an adjusted R-squared value of about 40%, which is rather low. It appears that the variables we used are not quite adequate for accurately predicting the age of an individual in the 1994 US Census dataset within a multiple linear regression framework. A good next step might be to add more interaction terms and perhaps other higher-order terms to see whether they improve the adjusted R-squared. Nonetheless, it may be that nonlinear models, such as neural networks, are more appropriate for this task than a linear regression model.
Our regression model appears to predict age within about $\pm40$ years in general, which is clearly far too large a margin of error for the model to be useful for any practical purpose. Furthermore, our model has some rather significant systematic issues: it consistently overestimates younger ages and underestimates older ages. In particular, for those younger than 30, the model predicts much older ages, and for those above 80, it predicts significantly younger ages.
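To quantify the typical (rather than worst-case) error, we could also compute the RMSE and MAE of the reduced model’s residuals. The following is a quick sketch; these metrics were not reported above:
# Quantify typical prediction error for the reduced model
rmse = np.sqrt(np.mean(model_reduced_fitted.resid ** 2))
mae = np.mean(np.abs(model_reduced_fitted.resid))
print(f"RMSE: {rmse:.2f} years, MAE: {mae:.2f} years")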
References#
Lichman, M. (2013). UCI Machine Learning Repository [online]. Available at https://archive.ics.uci.edu/ml/datasets/adult [Accessed 2022-10-07]