Seaborn

Seaborn#

Seaborn is a Python data visualisation library based on Matplotlib. Some tasks are a bit cumbersome to perform using just Matplotlib, such as automatically plotting one numeric variable vs. another numeric variable for each level of a third (categorical) variable. Using Matplotlib functions behind the scenes, Seaborn provides an easy-to-use syntax to perform such tasks.

An additional benefit of Seaborn is that it works natively with Pandas dataframes, which is not the case for Matplotlib. This tutorial provides a brief overview of Seaborn for basic data visualisation.

The BRFSS Dataset#

The dataset used in this tutorial is collected by the Centers for Disease Control and Prevention (CDC). As explained here, the Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS website contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data. We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

The variables in the dataset are genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

We begin by importing necessary modules and setting up Matplotlib environment followed by importing the dataset from the Cloud.

import random
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

# import the pyplot module from the the matplotlib package
import matplotlib.pyplot as plt
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn-v0_8")

import seaborn as sns
sns.set()

import warnings

warnings.filterwarnings("ignore")

import io
import requests

# so that we can see all the columns
pd.set_option("display.max_columns", None)

# how to read a csv file from a github account
url_name = "https://raw.githubusercontent.com/akmand/datasets/master/cdc.csv"
url_content = requests.get(url_name, verify=False).content
cdc = pd.read_csv(io.StringIO(url_content.decode("utf-8")))

cdc.describe(include="all")

	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
count	20000	20000.000000	20000.000000	20000.000000	20000.000000	20000.00000	20000.000000	20000.000000	20000
unique	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2
top	very good	NaN	NaN	NaN	NaN	NaN	NaN	NaN	f
freq	6972	NaN	NaN	NaN	NaN	NaN	NaN	NaN	10431
mean	NaN	0.745700	0.873800	0.472050	67.182900	169.68295	155.093850	45.068250	NaN
std	NaN	0.435478	0.332083	0.499231	4.125954	40.08097	32.013306	17.192689	NaN
min	NaN	0.000000	0.000000	0.000000	48.000000	68.00000	68.000000	18.000000	NaN
25%	NaN	0.000000	1.000000	0.000000	64.000000	140.00000	130.000000	31.000000	NaN
50%	NaN	1.000000	1.000000	0.000000	67.000000	165.00000	150.000000	43.000000	NaN
75%	NaN	1.000000	1.000000	1.000000	70.000000	190.00000	175.000000	57.000000	NaN
max	NaN	1.000000	1.000000	1.000000	93.000000	500.00000	680.000000	99.000000	NaN

cdc.sample(10, random_state=999)

	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
6743	very good	1	1	0	71	170	170	23	m
19360	very good	1	0	1	64	120	117	45	f
8104	good	1	1	1	70	192	170	64	m
8535	excellent	1	1	1	64	165	140	67	f
8275	very good	1	1	0	69	130	140	69	m
3511	very good	0	1	0	63	128	128	37	f
1521	good	1	1	0	68	176	135	37	f
976	fair	0	1	1	64	150	125	43	f
14484	good	1	1	1	68	185	185	78	m
3591	fair	1	1	0	71	165	175	34	m

cdc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   genhlth   20000 non-null  object
 1   exerany   20000 non-null  int64 
 2   hlthplan  20000 non-null  int64 
 3   smoke100  20000 non-null  int64 
 4   height    20000 non-null  int64 
 5   weight    20000 non-null  int64 
 6   wtdesire  20000 non-null  int64 
 7   age       20000 non-null  int64 
 8   gender    20000 non-null  object
dtypes: int64(7), object(2)
memory usage: 1.4+ MB

Let’s get a value count for each one of the categorical variables.

categorical_cols = cdc.columns[cdc.dtypes == object].tolist()
for col in categorical_cols:
    print(f"Column Name: {col}")
    print(cdc[col].value_counts())
    print("\n")

Column Name: genhlth
genhlth
very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: count, dtype: int64


Column Name: gender
gender
f    10431
m     9569
Name: count, dtype: int64

Scatter Plots#

Let’s first plot height vs. weight in the dataset with both univariate and bivariate plots.

sns.jointplot(x="height", y="weight", data=cdc)

<seaborn.axisgrid.JointGrid at 0x10ed33b10>

../_images/e82ecbee9add2da94636665bbacad7332facae36603ddbb5eefaed11b4ee85af.png

Let’s again plot height vs. weight, but this time, let’s break this by gender. For this task, we use the relplot(), which is a workhorse function in Seaborn. The default option with this function is “scatter’. You can change this by setting the kind option to “line” in order to get a line plot. Notice the height option below to make the plot slightly bigger. The hue option specifies how to color the data points.

sns.relplot(x="height", y="weight", hue="gender", height=6, data=cdc)

<seaborn.axisgrid.FacetGrid at 0x10efec490>

../_images/658c5ca5e28bd3e0ec03bba90cb38e2d2224ed360e1d803d76186aed65ec10f8.png

We are curious if the above plot would be different for smokers vs. nonsmokers as well as respondents who exercised in the past month vs. who did not. For this, we use the col and row options.

This kind of plot is called a facetted plot.

sns.relplot(
    x="height", y="weight", hue="gender", col="smoke100", row="exerany", data=cdc
)

<seaborn.axisgrid.FacetGrid at 0x10f079f50>

../_images/e884cf3b3f167f0f55eaeeb9ff8282c0ce3ff82f4512bce73a6b8cd55b141711.png

Boxplots#

We would like to visualize the distribution of weights of the respondents, its quartiles and outliers (if any) with respect to each gender. For this, we need a boxlot. Keep in mind that the “box” in the middle contains exactly 50% of the respective data subset.

Do you see anything different with the people who are in excellent health?

bp = sns.boxplot(x="genhlth", y="weight", hue="gender", data=cdc)

# position the legend outside the chart
bp.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0, title="Gender")

# set the x-axis title
bp.set_xlabel("general health")

Text(0.5, 0, 'general health')

../_images/71664bbbf0eab6067bfd7b96fb835a0fe088e47abc213fa73bb684bb79da0b9a.png

Histograms#

Let’s get a histogram of the heights and weights of the survey respondents. For both histogtams, we will overlay the kernel density estimate (KDE).

sns.distplot(cdc["height"], kde=True, bins=40)

<Axes: xlabel='height', ylabel='Density'>

../_images/5d541c12ca4994e21d6788a2b031e213b6cc2645b3ba8dc3b17d12397edc9d69.png

sns.distplot(cdc["weight"], kde=True, bins=40)

<Axes: xlabel='weight', ylabel='Density'>

../_images/59f89f43ee5a36369c94acb03b59fca14dde6592be7a24c87ff36030d00bdb0b.png

Let’s get a histogram of weights of the respondents, this time across the gender dimension as a faceted histogram.

s = sns.FacetGrid(cdc, height=6, col="gender")

s = s.map(sns.distplot, "weight", color="r")

../_images/6d29e66a5e214e94cae669d71567d26bff082e66e5d9ceb8618522579debc816.png

It might also be a good idea to overlay the weight distribution for males and females in the same plot.

s = sns.FacetGrid(cdc, height=6, hue="gender")

s = s.map(sns.distplot, "weight")

../_images/f0dbb247d0ee105aae4cbd12073dcccfd8dcd3ac952e65294ae722995cf72675.png

Line Plots#

Our BRFSS dataset does not really allow for any meaningful line plots, so we will cook up a scenario for one. We will take the first 18 rows and pretend that they are measurements for the past 18 years.

cdc_series = pd.DataFrame(
    {
        "time": pd.date_range("2000-12-31", periods=18, freq="12M"),
        "height": cdc["height"].iloc[0:18],
        "weight": cdc["weight"].iloc[0:18],
    }
)

cdc_series.head()

	time	height	weight
0	2000-12-31	70	175
1	2001-12-31	64	125
2	2002-12-31	60	105
3	2003-12-31	66	132
4	2004-12-31	61	150

# move the height and weight columns to rows
# using the melt() Pandas function:

cdc_series_long = pd.melt(
    cdc_series, id_vars=["time"], var_name="measurement_type", value_name="values"
)

cdc_series_long.sample(5, random_state=1)

	time	measurement_type	values
30	2012-12-31	weight	185
34	2016-12-31	weight	156
28	2010-12-31	weight	186
3	2003-12-31	height	66
19	2001-12-31	weight	125

lp = sns.lineplot(x="time", y="values", hue="measurement_type", data=cdc_series_long)

# position the legend outside the chart
lp.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

# add a title and axes labels
lp.set(
    title="Line plot of historical body measurements",
    xlabel="year",
    ylabel="measurement value",
)

[Text(0.5, 1.0, 'Line plot of historical body measurements'),
 Text(0.5, 0, 'year'),
 Text(0, 0.5, 'measurement value')]

../_images/fbd32f1a212687c2bcf60102b8fb5e3e3eef412416c6bf2e3bb8d052734c93a6.png

Regression Plots#

Seaborn provides an easy interface for regression plots via the lmplot() function. For illustration, we will use the famous iris flower data set, which is available within the Seaborn module.

A diagram of iris flower parts is shown below (source: US Forest Services).

# load the iris dataset
iris = sns.load_dataset("iris")

iris.sample(10, random_state=999)

	sepal_length	sepal_width	petal_length	petal_width	species
61	5.9	3.0	4.2	1.5	versicolor
64	5.6	2.9	3.6	1.3	versicolor
80	5.5	2.4	3.8	1.1	versicolor
40	5.0	3.5	1.3	0.3	setosa
36	5.5	3.5	1.3	0.2	setosa
0	5.1	3.5	1.4	0.2	setosa
44	5.1	3.8	1.9	0.4	setosa
82	5.8	2.7	3.9	1.2	versicolor
126	6.2	2.8	4.8	1.8	virginica
17	5.1	3.5	1.4	0.3	setosa

iris["species"].value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

First let’s plot petal width vs. petal length without breaking by the species. We will do a scatter plot overlayed with a third-order regression fit.

g = sns.lmplot(x="petal_length", y="petal_width", order=3, data=iris)

# set axis labels
g.set_axis_labels("Sepal length (mm)", "Sepal width (mm)")

<seaborn.axisgrid.FacetGrid at 0x1124592d0>

../_images/3ffdf85dd2f5f8d407f77c9f85eba29796841f399fca6d6d32d75282b16f3852.png

Let’s now break the above regression plot with respect to the species variable and fit a first-order regression to each species’ respective data points in the plot.

g = sns.lmplot(
    x="petal_length", y="petal_width", hue="species", truncate=True, height=8, data=iris
)

# set axis labels
g.set_axis_labels("Sepal length (mm)", "Sepal width (mm)")

<seaborn.axisgrid.FacetGrid at 0x1132b2410>

../_images/356531aaeee3c8d671b0843694764181f6d76d86e0af6e741813fe26517a551c.png

The above plot indicates that breaking the plot by species seems to be a good idea in terms of a regression fit. Let’s do a faceted regression plot using the species variable as the column parameter.

g = sns.lmplot(
    x="petal_length", y="petal_width", col="species", truncate=True, data=iris
)

# set axis labels
g.set_axis_labels("Sepal length (mm)", "Sepal width (mm)")

<seaborn.axisgrid.FacetGrid at 0x11330ee50>

../_images/a806fe5fffbdeae3122aa4f9537fbcd17696d52077b7b291c0cf2423500f499a.png

Scatter Plot Matrices#

Let’s get a pairwise scatterplot matrix for all the petal length and petal width variables in the iris dataset sliced and diced by the species variable.

sp = sns.pairplot(iris, vars=["petal_length", "petal_width"], height=6, hue="species")

# change the legend title
sp._legend.set_title("Species")

../_images/d844e8b5b6f4a8992901cecd56c629b399749b7178f180fe982c613554a87da5.png

Seaborn is far more capable than what is illustrated in this short tutorial. For more details, please refer to the official API page here.