Table of Contents
Need Help? Get in Touch!
Exploratory data analysis (EDA) is generally the first step in any data science project with the goal being to summarize the main features of the dataset. It helps the analyst gain a better understanding of the available data and often can unearth powerful insights. Data visualization is the most common technique in EDA. During this post, I will demonstrate data visualization techniques for EDA using R Shiny and the JavaScript package Leaflet
The data was obtained from the CDC’s ‘500 Cities Project’. The CDC released data on 27 measures of chronic disease related to unhealthy behaviors (5), health outcomes (13), and use of preventive services (9) for 500 cities throughout the United States. This dataset offers a view into health measures by city across the country. During the first part of this tutorial, I will use this dataset to demonstrate how to build a simple R Shiny application to explore the dataset. In part 2, I will delve into other EDA techniques.
Shiny is a package for R which offers the functionality to easily build and deploy interactive web applications. It offers simple integration with many powerful software tools (Leaflet, Highcharts, Plotly, etc) as well as the power of statistical programming in R. Web applications can be built in hours rather than days with Shiny. Because of this, it is an excellent tool for exploratory data analysis (EDA). Users can build and share interactive data applications very quickly.
This app offers functionality to select a health measure and view how it differs by city throuout the country. It also offers a slider to remove cities above or below a chosen prevalence level. This allows users to more easily see where measures have high or low prevalence throughout the country.
The code below is commented to explain each step in the process to build this data application:
#Load packages library(shiny) library(leaflet) library(RColorBrewer) library(DT) library(rgdal) library(gpclib) library(maptools) library(R6) library(raster) library(broom) library(scales) library(reshape2) library(tidyverse) library(data.table) library(highcharter) #Read in dataset. You can read in the dataset directly from the CDC website. I choose not to in this case because it is large and takes a long time to download. Data <- fread('500_Cities__Local_Data_for_Better_Health.csv') #Data <- fread('https://chronicdata.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD')
#Each Shiny application consists of ui and server elements. The syntax to launch an application is shinyApp(ui, server) shinyApp(options = list(height = 800), #height of the application within the Rmarkdown document #Define the user interface element ui = fluidPage( fluidRow( column(5 #Create element to allow user input. The values from this input are accessed in the server function via input$categoryId , selectInput('categoryId', 'Select Category' , choices = unique(Data$CategoryID) ) #uiOutput allows you to render ui elements within the server function. This offers you more flexibility in defining the user-interface. , uiOutput('measures')) , column(3, uiOutput('slider') , selectInput('age', 'Type', choices = unique(Data$DataValueTypeID)) ) ) ,fluidRow(column(8, leafletOutput('mymap')) , column(4, dataTableOutput('table')) ) ) #Define functionality ,server = function(input, output, session){ #Read the data into a reactive function. If data takes user input (reactive values), then it must be contained within a reactive function. I use this reactive function to filter the data set. A reactive function always returns the final line of code. df1 <- reactive ({ df <- Data df <- subset(df, select = c('CityName','StateAbbr', 'GeoLocation', 'Year' , 'Measure', 'Data_Value', 'PopulationCount', 'GeographicLevel' , 'Short_Question_Text', 'CategoryID', 'DataValueTypeID')) #Removes NA values df <- df[!is.na(df$Data_Value),] df <- subset(df, DataValueTypeID == input$age) df <- subset(df, CategoryID == input$categoryId) #df <- subset(df, GeographicLevel == as.character(input$geoLevel)) }) #Here I user renderUI to dynamically generate ui elements. I use here because I only want to show the measures within the category the user selects. Because this takes user input, the ui must be generated within the server function. output$measures <- renderUI({ selectInput('measures', 'Select Measure', choices = unique(df1()$Measure)) }) #Here I'm filtering the data again based on the measure chosen by the user. I couldn't do this in df1 because I first filtered the data based on category to reduce the options within the measures selectInput. df2 <- reactive ({ x <- df1() x <- subset(x, Measure == as.character(input$measures)) }) #Create a slider to filter the map markers. output$slider <- renderUI ({ sliderInput('slider', 'Filter Map', min = min(df2()$Data_Value) , max = max(df2()$Data_Value) , value = c(min(df2()$Data_Value), max(df2()$Data_Value))) }) #Build the leaflet map output$mymap <- renderLeaflet({ df <- df2() #Filter the data set based on values from the slider input df <- subset(df, Data_Value > input$slider[1] & Data_Value < input$slider[2]) #Define color pallete Colors <- brewer.pal(11,"RdBu") #Apply pallete to values from data set binpal <- colorBin(Colors, df$Data_Value, 6, pretty = FALSE) #Separate GeoLocation column into latitude and longitude columns. Required for leaflet lat = vector() lng = vector() for (i in 1: nrow(df)){ x<- unlist(strsplit(df$GeoLocation[i], ",")) #Splits the string at the comma lat[i] <- substr(x[1],2,8) #Selects characters 2 thru 8 of the string for latitude lng[i] <- substr(x[2],2,9) #Selects characters 2 thru 9 of the string for longitude } #Convert to numeric df$lat <- as.numeric(lat) df$lng <- as.numeric(lng) #Build leaflet map leaflet() %>% #Adds state borders to the map addTiles( urlTemplate = "//{s}.tiles.mapbox.com/v3/jcheng.map-5ebohr46/{z}/{x}/{y}.png", attribution = 'Maps by <a href="http://www.mapbox.com/">Mapbox</a>' ) %>% #Add the markers for each location addCircleMarkers(lat = df$lat , lng = df$lng , data = df , label = paste(df$CityName, df$StateAbbr) , color = ~binpal(Data_Value) , radius = 10 , fillColor = ~binpal(Data_Value) , fill = TRUE , opacity = 1 ) %>% addLegend(position = 'bottomleft', pal = binpal, values = df$Data_Value ) }) #Create data table to show values in tabular format output$table <- renderDataTable ({ df <- subset(df2(), select = c(CityName, StateAbbr, Data_Value)) df <- df[order(df$Data_Value, decreasing = TRUE),] df <- setNames(df, c('City', 'State', 'Value')) datatable(df, options = list(pageLength = 10)) }) } )
A few things I’ve noticed from playing with the map and filters with the Type set to Age-Adjusted Prevalence.
- Areas with high incidence of cancer seem to be clustered together.
- Binge drinking is highest in the midwest.
I’m sure there are many other insights which can be gleaned from this interesting CDC dataset. I hope this example application has highlighted the benefits of using Shiny for EDA. In part 2 of this tutorial, I will discuss other common EDA techniques and show examples using this dataset.
Data Provided by Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health
Download the data here and see below for the app in action!
Contact Red Oak Strategic
From cloud migrations to machine learning & AI - maximize your data and analytics capabilities with support from an AWS Advanced Tier consulting partner.
Related Posts
Data
Introduction The age of Big Data is upon us. From smartphones to sensors, the volume of data generated daily is staggering and presents chal...
Patrick Stewart
Data Processing
Introduction The advent of artificial intelligence (AI) in the business world is not just a fleeting trend; it's a transformative force resh...
Tyler Sanders
Data
Amazon Web Services (AWS) is, in my opinion, the single most powerful business-building tool in existence today. Bold claim, but with more t...
Tyler Sanders
Ready to get started?
Kickstart your cloud and data transformation journey with a complimentary conversation with the Red Oak team.