Introduction
In Part 1, we built an application to geographically explore the 500 Cities Project dataset from the CDC. In this post, we will demonstrate further exploratory data analysis (EDA) techniques for getting to know a new dataset. The analysis uses the R packages data.table, ggplot2, and highcharter.
In this post, you will learn how to:
- Build a boxplot
- Build and plot a correlation matrix
- Build a histogram
Load Dataset
It may take a few minutes to download the data. The data is also available here.
library(data.table)
df <- fread('https://chronicdata.cdc.gov/api/views/9z78-nsfp/rows.csv?accessType=DOWNLOAD')
The dataset contains values from cities across the country for 28 separate health measures. The full measure names are cumbersome, so for the remainder of the post we will use Short_Question_Text, which is an abbreviated form of the full measure name. Here are the full measure names (truncated to 40 characters) and their corresponding abbreviations for reference.
##    Measure                                   Short_Question_Text
## 1  Current lack of health insurance among a  Health Insurance
## 2  Arthritis among adults aged >=18 Years    Arthritis
## 3  Binge drinking among adults aged >=18 Ye  Binge Drinking
## 4  High blood pressure among adults aged >=  High Blood Pressure
## 5  Taking medicine for high blood pressure   Taking BP Medication
## 6  Cancer (excluding skin cancer) among adu  Cancer (except skin)
## 7  Current asthma among adults aged >=18 Ye  Current Asthma
## 8  Coronary heart disease among adults aged  Coronary Heart Disease
## 9  Visits to doctor for routine checkup wit  Annual Checkup
## 10 Cholesterol screening among adults aged   Cholesterol Screening
## 11 Fecal occult blood test, sigmoidoscopy,   Colorectal Cancer Screening
## 12 Chronic obstructive pulmonary disease am  COPD
## 13 Physical health not good for >=14 days a  Physical Health
## 14 Older adult men aged >=65 Years who are   Core preventive services for o
## 15 Older adult women aged >=65 Years who ar  Core preventive services for o
## 16 Current smoking among adults aged >=18 Y  Current Smoking
## 17 Visits to dentist or dental clinic among  Dental Visit
## 18 Diagnosed diabetes among adults aged >=1  Diabetes
## 19 High cholesterol among adults aged >=18   High Cholesterol
## 20 Chronic kidney disease among adults aged  Chronic Kidney Disease
## 21 No leisure-time physical activity among   Physical Activity
## 22 Mammography use among women aged 50–74 Y  Mammography
## 23 Mental health not good for >=14 days amo  Mental Health
## 24 Obesity among adults aged >=18 Years      Obesity
## 25 Papanicolaou smear use among adult women  Pap Smear Test
## 26 Sleeping less than 7 hours among adults   Sleep
## 27 Stroke among adults aged >=18 Years       Stroke
## 28 All teeth lost among adults aged >=65 Ye  Teeth Loss
Boxplots
Boxplots graphically depict groups of numerical data through their quartiles. Outliers are shown as points above and below the boxes. This is a good first step in EDA because it shows the range of values associated with each measure. Here we will build a boxplot for each measure in the dataset grouped by category. Each boxplot is built from 500 values, one value for each city.
library(ggplot2)
df_subset <- df[df$GeographicLevel == 'City',]
# grouped boxplot
ggplot(df_subset, aes(x=substr(Short_Question_Text, 1,40), y=Data_Value, fill = Category )) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = 'bottom') +
facet_wrap(~Category,scales = "free_x") + xlab('Measure') + ylab('Percentage')
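To make the quartiles and outlier points concrete, here is a minimal base R sketch on ten made-up percentages (not values from the dataset). It assumes the 1.5 * IQR whisker rule, which is the default in geom_boxplot(); the vector name `values` is hypothetical.

```r
# Ten made-up percentages for one hypothetical measure (not real data)
values <- c(4, 12, 13, 15, 17, 18, 20, 22, 23, 49)

# Quartiles that define the box
q <- quantile(values, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Whisker fences under the 1.5 * IQR rule
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Values beyond the fences are drawn as individual points
outliers <- values[values < lower_fence | values > upper_fence]
print(q)
print(outliers)
```

In this toy vector only the value 49 falls outside the fences, so it is the only one a boxplot would draw as a separate point.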
The boxplots show a wide distribution for most of the measures, especially among the preventive measures. Next we will analyze dependence between measures.
Correlation Plot
The second EDA technique we will demonstrate is a correlation plot which is used to show dependence between measures. The correlation between measures is given by a number between -1 and 1 (1 means perfectly positively correlated and -1 means perfectly negatively correlated). The dataset contains values for each measure at three different geographic levels: US, City, and Census Tract. To compute the correlations, we will use data at the Census Tract level which is the smallest geographic level. The correlation plot will compare the measure values at each Census Tract to determine their correlation coefficient.
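As a rough illustration of what the correlation coefficient measures, the sketch below computes the Pearson correlation by hand and compares it with R's built-in cor(). The vectors `x` and `y` and the helper `pearson()` are made up for this example and are not part of the dataset or the original analysis.

```r
# Two made-up measures observed at five hypothetical locations
x <- c(10, 12, 15, 19, 24)
y <- c(30, 27, 25, 20, 14)

# Pearson correlation: covariance scaled by both standard deviations
pearson <- function(a, b) {
  da <- a - mean(a)
  db <- b - mean(b)
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

print(pearson(x, y))          # negative: y falls as x rises
print(cor(x, y))              # built-in result for comparison
print(pearson(x, 2 * x + 1))  # a perfect positive linear relationship
print(pearson(x, -x))         # a perfect negative linear relationship
```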
The steps to compute a correlation matrix in R are as follows:
- Subset the dataset to select only the necessary columns
- Convert the dataset to a wide-format
- Calculate correlations
- Plot the correlation matrix
Subset the dataset
We only want the location (UniqueID), the measure (Short_Question_Text) and the value (Data_Value) columns in our dataset. We only want rows where GeographicLevel is Census Tract. We subset the dataset to remove the rows and columns which aren’t relevant to our analysis.
df_subset <- df[GeographicLevel == 'Census Tract', c('UniqueID','Short_Question_Text', 'Data_Value')]
head(df_subset)
##               UniqueID Short_Question_Text Data_Value
## 1: 0107000-01073000100    Health Insurance       27.6
## 2: 0107000-01073000300    Health Insurance       32.2
## 3: 0107000-01073000400    Health Insurance       31.8
## 4: 0107000-01073000500    Health Insurance       33.7
## 5: 0107000-01073000700    Health Insurance       38.4
## 6: 0107000-01073000800    Health Insurance       26.5
Convert to Wide Format
The R correlation function cor() requires that the dataset be in wide format. The dcast function from the data.table package simplifies the task of converting the dataset into wide format. Essentially, dcast creates a new column for each unique Short_Question_Text (i.e. it ‘casts’ the Short_Question_Text into wide format) and fills it with the corresponding Data_Value. The UniqueID is designated as the row name and then removed as a column because it is not part of the correlation calculation.
df_wide <- dcast(df_subset, UniqueID ~ Short_Question_Text, value.var = 'Data_Value')
row.names(df_wide) <- df_wide$UniqueID
df_wide$UniqueID <- NULL
df_wide <- df_wide[complete.cases(df_wide),] #Removes any rows with NA values
head(df_wide[,1:5])
##    Annual Checkup Arthritis Binge Drinking COPD Cancer (except skin)
## 1:           76.6      34.0           10.3 11.2                  5.5
## 2:           74.0      32.8           11.0 11.1                  5.0
## 3:           77.5      37.2            9.3 12.9                  5.6
## 4:           78.7      40.1            8.4 14.4                  6.1
## 5:           78.4      40.2            7.4 15.6                  6.0
## 6:           81.1      40.7            8.8 12.5                  7.0
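The long-to-wide conversion is easier to follow on a tiny table. The sketch below uses a made-up three-tract table (the IDs `t1` to `t3` and all values are hypothetical) with the same column names as the post. It also shows one way to keep the IDs as row names by converting to a plain matrix; this is an assumption about what is convenient here, not the approach the post itself takes.

```r
library(data.table)

# Made-up long-format data: one row per (tract, measure) pair
long <- data.table(
  UniqueID = c("t1", "t1", "t2", "t2", "t3"),
  Short_Question_Text = c("Obesity", "Sleep", "Obesity", "Sleep", "Obesity"),
  Data_Value = c(30.1, 35.2, 25.4, 31.0, 28.7)
)

# One row per tract, one column per measure; t3 has no Sleep value, so it gets NA
wide <- dcast(long, UniqueID ~ Short_Question_Text, value.var = "Data_Value")
print(wide)

# Drop incomplete rows, then move the IDs into matrix row names
wide_complete <- wide[complete.cases(wide)]
measure_cols <- setdiff(names(wide_complete), "UniqueID")
mat <- as.matrix(wide_complete[, measure_cols, with = FALSE])
rownames(mat) <- wide_complete$UniqueID
print(mat)
```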
Compute Correlation Matrix
Now that the data is in the appropriate format, we can apply the R function cor() to the wide-format data to compute the correlation between every pair of measures.
cor_plot <- cor(df_wide)
cor_plot <- round(cor_plot, 2)
head(cor_plot[,1:5])
##                       Annual Checkup Arthritis Binge Drinking  COPD
## Annual Checkup                  1.00      0.57          -0.35  0.29
## Arthritis                       0.57      1.00          -0.61  0.81
## Binge Drinking                 -0.35     -0.61           1.00 -0.63
## COPD                            0.29      0.81          -0.63  1.00
## Cancer (except skin)            0.47      0.65          -0.25  0.20
## Cholesterol Screening           0.65      0.44          -0.14 -0.03
##                       Cancer (except skin)
## Annual Checkup                        0.47
## Arthritis                             0.65
## Binge Drinking                       -0.25
## COPD                                  0.20
## Cancer (except skin)                  1.00
## Cholesterol Screening                 0.75
As you can see, the output is a matrix with values between -1 and 1. We will plot this result to make it easier to understand and analyze.
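The complete.cases() filter used earlier drops every row that has a missing value. As a hedged aside, cor() also has a `use` argument that handles missing values directly. The sketch below, on a made-up matrix `m`, suggests that `use = "complete.obs"` matches the row filter, while `use = "pairwise.complete.obs"` keeps more data by using every row available for each pair of columns.

```r
# Made-up matrix with a few missing values (not real data)
m <- cbind(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 5, 9, 10),
  c = c(5, 3, NA, 1, 0)
)

# Default behaviour: missing values propagate into the result
cor_default <- cor(m)

# Filtering rows first and use = "complete.obs" give the same matrix
cor_rows <- cor(m[complete.cases(m), ])
cor_complete <- cor(m, use = "complete.obs")

# Pairwise deletion uses all rows available for each pair of columns
cor_pairwise <- cor(m, use = "pairwise.complete.obs")

print(round(cor_complete, 2))
print(round(cor_pairwise, 2))
```

Pairwise deletion means each coefficient may be based on a different set of rows, so whether it is appropriate depends on the analysis.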
Plot the Correlation Matrix with Highcharts
Highcharter is an R wrapper for the Highcharts library. Highcharts is a data visualization library which makes it simple to develop interactive charts. The hchart() function can be applied to the R correlation function output to build a correlation plot.
library(highcharter)
hchart(cor_plot)
Reorder the Correlation Plot
While this is better than the tabular form, we can make it clearer by grouping correlated features together. We will use a helper function which I found here to reorder the correlation plot.
reorder_cormat <- function(cormat){
  # Use correlation between variables as distance
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  # Return the matrix with rows and columns in clustering order
  cormat[hc$order, hc$order]
}
cor_plot <- reorder_cormat(cor_plot)
hchart(cor_plot)
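To see what the reordering does, the sketch below applies the same helper to a made-up 4 x 4 correlation matrix in which `a` and `c` are strongly correlated, as are `b` and `d`. The variable names and values are hypothetical. hclust() is called with its default settings (complete linkage), as in the helper above.

```r
# Same helper as in the post, repeated so the example is self-contained
reorder_cormat <- function(cormat) {
  dd <- as.dist((1 - cormat) / 2)  # high correlation -> small distance
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}

# Made-up correlation matrix: (a, c) and (b, d) form two correlated pairs
vars <- c("a", "b", "c", "d")
toy <- matrix(-0.3, nrow = 4, ncol = 4, dimnames = list(vars, vars))
toy["a", "c"] <- toy["c", "a"] <- 0.9
toy["b", "d"] <- toy["d", "b"] <- 0.8
diag(toy) <- 1

reordered <- reorder_cormat(toy)
ord <- rownames(reordered)
print(ord)
print(reordered)
```

After reordering, each correlated pair sits in adjacent rows and columns, which is what produces the visible blocks in the plot.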
The correlation plot yields some interesting insights. The first thing I noticed is that binge drinking is negatively correlated with many of the poor health outcomes (e.g. obesity, high cholesterol, stroke, diabetes). My first thought on this is that binge drinking is more common among young people, who generally don’t have as many health issues as older people. However, we would need to analyze data on binge drinking more extensively to derive solid conclusions.
Another insight is that lack of health insurance is positively correlated with negative health outcomes (i.e. places where more people lack health insurance tend to have worse health outcomes). This is interesting although not surprising. In the next section, we will drill into the health insurance measure to analyze how it differs across the nation.
Health Insurance Histograms
First, we will compute the summary statistics and build a histogram for health insurance across the dataset by city.
df_health <- subset(df, Short_Question_Text == 'Health Insurance')
df_health <- subset(df_health, GeographicLevel == 'City' & DataValueTypeID == 'AgeAdjPrv')
summary(df_health$Data_Value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    4.10   13.28   17.40   18.21   22.12   49.00
There is a wide range of health insurance coverage across the cities in the dataset, with a difference of about 45 percentage points between the best and worst.
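The arithmetic behind that spread can be sketched directly from the summary() output. The numbers below are copied from the output above; the variable names and the 1.5 * IQR fence are added here purely for illustration.

```r
# Summary statistics copied from the summary() output above
stats <- c(min = 4.10, q1 = 13.28, median = 17.40,
           mean = 18.21, q3 = 22.12, max = 49.00)

# Gap between the highest and lowest uninsured rates, in percentage points
spread <- stats[["max"]] - stats[["min"]]

# Interquartile range and the upper 1.5 * IQR fence
iqr <- stats[["q3"]] - stats[["q1"]]
upper_fence <- stats[["q3"]] + 1.5 * iqr

# Distance from the median to each extreme
upper_tail <- stats[["max"]] - stats[["median"]]
lower_tail <- stats[["median"]] - stats[["min"]]

print(c(spread = spread, iqr = iqr, upper_fence = upper_fence))
print(c(upper_tail = upper_tail, lower_tail = lower_tail))
```

The maximum of 49.00 sits well above the upper fence, and the distance from the median to the maximum is more than twice the distance to the minimum, so the high end of the distribution is where the unusual cities are.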
hist(df_health$Data_Value, col = 'gray', breaks = 20)
The histogram shows the data is skewed to the right.
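As a small sketch of how to check skew without looking at the plot, hist() can be called with `plot = FALSE` to return the bin counts. The vector `x` below is made up and only loosely resembles the real distribution. Note that when `breaks` is a single number, hist() treats it as a suggestion and chooses round breakpoints, so the number of bins may differ from the number requested.

```r
# Made-up uninsured percentages with a long upper tail (not real data)
x <- c(5, 8, 10, 11, 12, 13, 13, 14, 15, 16, 17, 18, 20, 22, 25, 30, 38, 49)

# Compute the histogram without drawing it
h <- hist(x, breaks = 5, plot = FALSE)
print(h$breaks)  # breakpoints chosen by hist()
print(h$counts)  # number of values in each bin

# For right-skewed data the mean is pulled above the median
print(c(mean = mean(x), median = median(x)))
```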
Next, we will find the cities which have the highest percentage of their population lacking health insurance.
library(ggplot2)
df_health <- subset(df, Short_Question_Text == 'Health Insurance')
df_health <- subset(df_health, GeographicLevel == 'City' & DataValueTypeID == 'AgeAdjPrv')
df_health <- df_health[order(df_health$Data_Value, decreasing = TRUE),]
df_health$City <- paste0(df_health$CityName, ', ', df_health$StateAbbr)
ggplot(df_health[1:20,], aes(x = reorder(City, Data_Value), y = Data_Value, fill = Data_Value)) +
geom_bar(stat = "identity",col = "black")+
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = 'none') +
ggtitle('The 20 cities in the USA with the lowest percentage of health insurance', subtitle = 'Texas leads the way with 6 of the top 7') +
xlab('City') + ylab('Percent Uninsured')
Pharr, TX leads the way with nearly half of its population lacking health insurance, and Texas accounts for 6 of the top 7 cities.
However, Texas is a large state and may simply have more cities in the dataset than other states. Next, we will calculate, for each state, the ratio of its cities in the top 100 for lack of health insurance to its total number of cities in the dataset. The steps to complete this calculation are:
- Select the top 100 cities by lack of health insurance
- Count cities in the top 100 for each state
- Count the total cities in the dataset for each state
- Divide the number of cities in the top 100 by the total number of cities in the dataset
Select the top 100 cities by lack of health insurance.
#Subset dataset
df_health <- subset(df, Short_Question_Text == 'Health Insurance')
df_health <- subset(df_health, GeographicLevel == 'City' & DataValueTypeID == 'AgeAdjPrv')
df_health <- df_health[order(df_health$Data_Value, decreasing = TRUE),] #Order rows by Data_Value
df_health_100 <- df_health[1:100,] #Select only the top 100 rows
head(df_health_100[,1:5])
##    Year StateAbbr StateDesc    CityName GeographicLevel
## 1: 2014        TX     Texas       Pharr            City
## 2: 2014        TX     Texas Brownsville            City
## 3: 2014        TX     Texas      Laredo            City
## 4: 2014        FL   Florida     Hialeah            City
## 5: 2014        TX     Texas     Mission            City
## 6: 2014        TX     Texas    Edinburg            City
Count the number of cities in the top 100 by state using the aggregate function.
agg_100 <- aggregate(df_health_100$Year, by = list(df_health_100$StateDesc), FUN = length)
agg_100 <- agg_100[order(agg_100$x, decreasing = TRUE),]
head(agg_100)
##       Group.1  x
## 21      Texas 29
## 3  California 25
## 6     Florida 10
## 16 New Jersey  7
## 7     Georgia  4
## 12  Louisiana  3
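Calling aggregate() with `FUN = length` simply counts the rows in each group. The sketch below, on a made-up vector of state names, shows that base R's table() gives the same counts. It is offered as an equivalent shortcut, not as what the post uses.

```r
# Made-up state labels for six hypothetical cities
states <- c("Texas", "Texas", "California", "Florida", "Texas", "California")
year <- rep(2014, length(states))

# Counting rows per group with aggregate(), as in the post
agg <- aggregate(year, by = list(states), FUN = length)
agg <- agg[order(agg$x, decreasing = TRUE), ]
print(agg)

# The same counts with table()
tab <- sort(table(states), decreasing = TRUE)
print(tab)
```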
Plot the results.
library(ggplot2)
ggplot(agg_100, aes(x = reorder(Group.1, x), y = x, fill = x)) +
geom_bar(stat = "identity",col = "black")+
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = 'none') +
ggtitle('Number of cities in top 100 for percent uninsured by state', subtitle = 'Texas and California lead the way by a wide margin') +
xlab('State') + ylab('Number of Cities')
We can see Texas and California lead the way with the most cities in the top 100. This makes sense because they are two of the largest states.
Count the number of cities in the dataset for each state.
df_health <- subset(df, Short_Question_Text == 'Health Insurance')
df_health <- subset(df_health, GeographicLevel == 'City' & DataValueTypeID == 'AgeAdjPrv')
agg <- aggregate(df_health$Year, by = list(df_health$StateDesc), FUN = length)
agg <- agg[order(agg$x, decreasing = TRUE),]
#Plot the top 20
ggplot(agg[1:20,], aes(x = reorder(Group.1, x), y = x, fill = x)) +
geom_bar(stat = "identity",col = "black") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = 'none') +
ggtitle('The number of cities from each state in the 500 Cities dataset (Top 20)', subtitle = 'California leads the way by a wide margin') +
xlab('State') + ylab('Number of Cities')
California and Texas also lead the way with the most cities in the dataset. Now, we will divide the number of cities in the top 100 by the total number of cities in the dataset to create a more accurate comparison of states.
agg_merged <- setNames(merge(agg_100, agg, by = 'Group.1'), c('city', 'health_cities', 'total_cities'))
agg_merged$Ratio <- agg_merged$health_cities / agg_merged$total_cities
head(agg_merged)
##          city health_cities total_cities      Ratio
## 1     Arizona             1           12 0.08333333
## 2    Arkansas             1            5 0.20000000
## 3  California            25          121 0.20661157
## 4    Colorado             1           14 0.07142857
## 5 Connecticut             1            8 0.12500000
## 6     Florida            10           33 0.30303030
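Since data.table is already loaded, the same count-and-divide calculation can be sketched in a few grouped steps. The example below uses ten made-up cities in hypothetical states A to D, with `n_top` standing in for the top 100. All names here are illustrative assumptions, not the post's own code.

```r
library(data.table)

# Made-up cities in hypothetical states A-D (not values from the dataset)
cities <- data.table(
  state = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"),
  pct_uninsured = c(30, 28, 10, 25, 12, 9, 8, 7, 26, 11)
)
n_top <- 4  # stands in for the top 100 used in the post

# Sort from most to least uninsured and flag the first n_top cities
setorder(cities, -pct_uninsured)
cities[, in_top := seq_len(.N) <= n_top]

# Per state: cities in the top group, total cities, and their ratio
by_state <- cities[, .(top_cities = sum(in_top), total_cities = .N), by = state]
by_state[, ratio := top_cities / total_cities]
setorder(by_state, -ratio)
print(by_state)
```

One difference from the merge() approach is that states with no cities in the top group stay in the result with a ratio of 0, whereas the default merge() keeps only states present in both tables.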
Plot the ratio to see how the states compare.
ggplot(agg_merged, aes(x = reorder(city, Ratio), y = Ratio, fill = Ratio)) +
geom_bar(stat = 'identity', color = 'black')+
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = 'none') +
ggtitle('Ratio of number of cities in the top 100 over number of cities in the dataset') +
xlab('State')
The plot now paints a different picture than the raw counts. New Jersey leads the way with nearly 80% of its cities in the top 100.
The purpose of this post was to demonstrate common exploratory data analysis techniques. The goal of exploratory data analysis is to provide an understanding of the dataset and generate questions to analyze further. From this analysis, I would be curious as to why so many New Jersey cities rank high for lack of health insurance. I would also be curious whether the lack of health insurance actually causes poor health outcomes, or whether other factors are at play.
As you delve deeper into datasets, you will almost always generate questions that you didn’t think of prior to starting the analysis. This is one of the key values of EDA.