Red Oak Strategic
Mark Stephenson
Friday, 5 January 2018 / Published in Data, R, Code, Analytics, Data Science, Data Visualization, R Shiny

Red Oak Strategic - Exploratory Data Analysis: CDC's 500 Cities Project

Exploratory data analysis (EDA) is generally the first step in any data science project, with the goal of summarizing the main features of the dataset. It helps the analyst gain a better understanding of the available data and can often unearth powerful insights. Data visualization is the most common technique in EDA. In this post, I will demonstrate data visualization techniques for EDA using R Shiny and the JavaScript library Leaflet.

The data was obtained from the CDC’s ‘500 Cities Project’. The CDC released data on 27 measures of chronic disease related to unhealthy behaviors (5), health outcomes (13), and use of preventive services (9) for 500 cities throughout the United States. This dataset offers a view into health measures by city across the country. In this first part of the tutorial, I will use the dataset to demonstrate how to build a simple R Shiny application for exploring it. In part 2, I will delve into other EDA techniques.
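
A quick way to check that breakdown once the file is loaded is to count the distinct measures in each category. The sketch below is a minimal example in base R. The helper name `count_measures` and the toy table are invented for illustration; `CategoryID` and `Measure` are the column names used by the app later in this post.

```r
# Hypothetical helper: number of distinct measures per category.
# Assumes `data` has the columns CategoryID and Measure.
count_measures <- function(data) {
  pairs <- unique(data.frame(CategoryID = data$CategoryID,
                             Measure = data$Measure))
  table(pairs$CategoryID)
}

# Toy rows (not real CDC values), only to show the shape of the result
toy <- data.frame(
  CategoryID = c("A", "A", "A", "B"),
  Measure    = c("m1", "m1", "m2", "m3")
)
count_measures(toy)
```

Applied to the full CDC table, the three counts should sum to 27 if the file matches the description above.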

Shiny is a package for R that makes it easy to build and deploy interactive web applications. It integrates simply with many powerful software tools (Leaflet, Highcharts, Plotly, etc.) and brings the power of statistical programming in R. With Shiny, web applications can be built in hours rather than days, and users can build and share interactive data applications very quickly. This makes it an excellent tool for EDA.

This app lets you select a health measure and view how it differs by city throughout the country. It also offers a slider to remove cities above or below a chosen prevalence level, which makes it easier to see where a measure has high or low prevalence.
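
The slider filter amounts to keeping the rows whose value falls inside a chosen range. As a rough sketch outside of Shiny, assuming a data frame with a `Data_Value` column as in the app, it could look like the following. The helper `filter_by_range` and the toy values are made up for illustration.

```r
# Hypothetical helper: keep rows whose Data_Value lies within [lower, upper].
# Inclusive bounds, so rows sitting exactly at a slider end point are kept.
filter_by_range <- function(df, lower, upper) {
  df[df$Data_Value >= lower & df$Data_Value <= upper, , drop = FALSE]
}

# Toy values, not real CDC figures
toy <- data.frame(
  CityName   = c("City A", "City B", "City C", "City D"),
  Data_Value = c(10.5, 14.2, 18.9, 22.0)
)

# Full range keeps every city, including the minimum and maximum
nrow(filter_by_range(toy, min(toy$Data_Value), max(toy$Data_Value)))

# Narrowing the range drops the cities outside it
filter_by_range(toy, 14, 20)$CityName
```

With strict comparisons (`>` and `<`), the cities at the two extremes would be dropped even when the slider spans the full range, so inclusive bounds are usually the safer choice for this kind of filter.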

The code below is commented to explain each step in the process to build this data application:

#Load the packages used by the app below
library(shiny)
library(leaflet)
library(RColorBrewer)
library(DT)
library(data.table)

#The remaining packages are not called by this app and can be skipped if they are not installed
library(rgdal)
library(gpclib)
library(maptools)
library(R6)
library(raster)
library(broom)
library(scales)
library(reshape2)
library(tidyverse)
library(highcharter)

#Read in dataset. You can read in the dataset directly from the CDC website. I choose not to in this case because it is large and takes a long time to download.

Data <- fread('500_Cities__Local_Data_for_Better_Health.csv')
#Data <- fread('https://chronicdata.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD')
#Each Shiny application consists of ui and server elements. The syntax to launch an application is shinyApp(ui, server)
shinyApp(options = list(height = 800), #height of the application within the Rmarkdown document
  
  #Define the user interface element
  ui = fluidPage(
    fluidRow(
      column(5
             
             #Create element to allow user input. The values from this input are accessed in the server function via input$categoryId
             , selectInput('categoryId', 'Select Category'
                            , choices = unique(Data$CategoryID)
                            )
             #uiOutput allows you to render ui elements within the server function. This offers you more flexibility in defining the user-interface.
             , uiOutput('measures'))
      
      , column(3, uiOutput('slider')
               , selectInput('age', 'Type', choices = unique(Data$DataValueTypeID))
               )
  )
                 ,fluidRow(column(8, leafletOutput('mymap'))
                           , column(4, dataTableOutput('table'))
                 )
  )
  
  #Define functionality
  ,server = function(input, output, session){
    
    
    #Read the data into a reactive function. If data takes user input (reactive values), then it must be contained within a reactive function. I use this reactive function to filter the data set. A reactive function always returns the final line of code.
    df1 <- reactive ({
      df <- Data
      df <- subset(df, select = c('CityName','StateAbbr', 'GeoLocation', 'Year'
                                , 'Measure', 'Data_Value', 'PopulationCount', 'GeographicLevel'
                                , 'Short_Question_Text', 'CategoryID', 'DataValueTypeID'))
      
      #Removes NA values
      df <- df[!is.na(df$Data_Value),]
      df <- subset(df, DataValueTypeID == input$age)
      df <- subset(df, CategoryID == input$categoryId)
      #df <- subset(df, GeographicLevel == as.character(input$geoLevel))
    })
    
    #Here I use renderUI to dynamically generate ui elements. I use it here because I only want to show the measures within the category the user selects. Because this takes user input, the ui must be generated within the server function.
    output$measures <- renderUI({
      selectInput('measures', 'Select Measure', choices = unique(df1()$Measure))
    })
    
    #Here I'm filtering the data again based on the measure chosen by the user. I couldn't do this in df1 because I first filtered the data based on category to reduce the options within the measures selectInput.
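    df2 <- reactive ({
      #req() pauses this reactive until the measure selectInput has been rendered and has a value
      req(input$measures)
      x <- df1()
      x <- subset(x, Measure == as.character(input$measures))
    })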
    
    #Create a slider to filter the map markers. 
    output$slider <- renderUI ({
      sliderInput('slider', 'Filter Map', min = min(df2()$Data_Value) 
                  , max = max(df2()$Data_Value)
                  , value = c(min(df2()$Data_Value), max(df2()$Data_Value)))
    })
    
    #Build the leaflet map
    output$mymap <- renderLeaflet({
      df <- df2()
      
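      #Filter the data set based on values from the slider input. The bounds are inclusive so that cities at the slider end points stay on the map
      req(input$slider)
      df <- subset(df, Data_Value >= input$slider[1] & Data_Value <= input$slider[2])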
      
      #Define color palette
      Colors <- brewer.pal(11,"RdBu")
      
      #Apply palette to values from data set
      binpal <- colorBin(Colors, df$Data_Value, 6, pretty = FALSE)
      
      #Separate GeoLocation column into latitude and longitude columns. Required for leaflet
      lat = vector()
      lng = vector()
      for (i in seq_len(nrow(df))){ #seq_len() skips the loop when no rows remain, unlike 1:nrow(df)
        x<- unlist(strsplit(df$GeoLocation[i], ",")) #Splits the string at the comma
        lat[i] <- substr(x[1],2,8) #Selects characters 2 thru 8 of the string for latitude
        lng[i] <- substr(x[2],2,9) #Selects characters 2 thru 9 of the string for longitude
        
      }
      #Convert to numeric
      df$lat <- as.numeric(lat)
      df$lng <- as.numeric(lng)
      
      #Build leaflet map
      leaflet() %>%
        
        #Adds state borders to the map
        addTiles(
          urlTemplate = "//{s}.tiles.mapbox.com/v3/jcheng.map-5ebohr46/{z}/{x}/{y}.png",
          attribution = 'Maps by <a href="http://www.mapbox.com/">Mapbox</a>'
        ) %>%
        
        #Add the markers for each location
        addCircleMarkers(lat = df$lat
                         , lng = df$lng
                         , data = df
                         , label = paste(df$CityName, df$StateAbbr)
                         , color = ~binpal(Data_Value)
                         , radius = 10
                         , fillColor = ~binpal(Data_Value)
                         , fill = TRUE
                         , opacity = 1

        ) %>%
        addLegend(position = 'bottomleft', pal = binpal, values = df$Data_Value
        )
      
    })
    
    #Create data table to show values in tabular format
    output$table <- renderDataTable ({
      df <- subset(df2(), select = c(CityName, StateAbbr, Data_Value))
      df <- df[order(df$Data_Value, decreasing = TRUE),]
      df <- setNames(df, c('City', 'State', 'Value'))
      datatable(df, options = list(pageLength = 10))
    })
    
  }
)
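
The loop that splits `GeoLocation` can also be written without a loop and without fixed character positions. The following is a sketch in base R that assumes each value looks like `(latitude, longitude)`, which is what the `strsplit`/`substr` code above implies. `parse_geolocation` is a hypothetical helper name and the coordinates are arbitrary sample values.

```r
# Hypothetical helper: split "(lat, lng)" strings into numeric columns.
parse_geolocation <- function(geo) {
  cleaned <- gsub("[() ]", "", geo)              # drop parentheses and spaces
  parts <- strsplit(cleaned, ",", fixed = TRUE)  # one character vector per row
  data.frame(
    lat = as.numeric(vapply(parts, function(p) p[1], character(1))),
    lng = as.numeric(vapply(parts, function(p) p[2], character(1)))
  )
}

# Arbitrary sample strings in the assumed format
parse_geolocation(c("(33.5276, -86.7988)", "(61.1496, -149.1113)"))
```

Unlike `substr(x[1], 2, 8)`, this keeps the full precision of each coordinate rather than a fixed number of characters.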

A few things I’ve noticed from playing with the map and filters with the Type set to Age-Adjusted Prevalence:

  • Cities with a high prevalence of cancer seem to be clustered together.
  • Binge drinking is highest in the Midwest.

I’m sure there are many other insights which can be gleaned from this interesting CDC dataset. I hope this example application has highlighted the benefits of using Shiny for EDA. In part 2 of this tutorial, I will discuss other common EDA techniques and show examples using this dataset.

Data Provided by Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health

Download the data here and see below for the app in action!           


    © 2022 Red Oak Strategic
