15-cs02-data

Professor Shannon Ellis

2023-11-21

CS02: Predicting Air Pollution (Data)

Agenda

  • Background
  • Question
  • Data Intro
  • Wrangle

Background

OpenCaseStudies

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). Predicting Annual Air Pollution (Version v1.0.0). https://github.com/opencasestudies/ocs-bp-air-pollution.

Air Pollutants

Some sources are natural while others are anthropogenic (human-derived):

[source]

Major types of air pollutants

  1. Gaseous - Carbon Monoxide (CO), Ozone (O3), Nitrogen Oxides (NO, NO2), Sulfur Dioxide (SO2)
  2. Particulate - small liquids and solids suspended in the air (includes lead; can include certain types of dust)
  3. Dust - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
  4. Biological - pollen, bacteria, viruses, mold spores

See here for more detail on the types of pollutants in the air.

Particulate Pollution

Air pollution particulates are generally described by their size:

  1. Large Coarse Particulate Matter - has diameter of >10 micrometers (10 µm)

  2. Coarse Particulate Matter (called PM10-2.5) - has diameter of between 2.5 µm and 10 µm

  3. Fine Particulate Matter (called PM2.5) - has diameter of < 2.5 µm

PM10 includes any particulate matter <10 µm (both coarse and fine particulate matter)
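The size classes above can be written as a small lookup. This is a minimal sketch in base R; `classify_pm()` is a hypothetical helper, not part of the case study, and how it treats particles at exactly 2.5 µm or 10 µm is an assumption, since the slide does not say which class the boundaries belong to.

```r
# Hypothetical helper: label a particle by its diameter in micrometers,
# using the three size classes listed on the slide.
classify_pm <- function(diameter_um) {
  cut(
    diameter_um,
    breaks = c(0, 2.5, 10, Inf),
    labels = c("fine (PM2.5)", "coarse (PM10-2.5)", "large coarse"),
    right = FALSE  # assumption: intervals are [0, 2.5), [2.5, 10), [10, Inf)
  )
}

sizes <- c(0.3, 2.5, 7, 50)    # example diameters in micrometers
classes <- classify_pm(sizes)
print(data.frame(sizes, classes))
```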

In relation to a piece of human hair:

[source]

Common Pollutants and their size

[source]

Penetration into the human body

[source]

Negative Health Impacts

Exposure to air pollution is:

  • associated with higher rates of mortality in older adults
  • known to be a risk factor for many diseases and conditions including (but not limited to):
  1. Asthma - fine particle exposure (PM2.5) was found to be associated with higher rates of asthma in children
  2. Inflammation in type 1 diabetes - fine particle exposure (PM2.5) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with Type 1 diabetes
  3. Lung function and emphysema - higher concentrations of ozone (O3), nitrogen oxides (NOx), black carbon, and fine particulate matter (PM2.5) at study baseline were significantly associated with greater increases in percent emphysema per 10 years
  4. Low birthweight - fine particle exposure (PM2.5) was associated with lower birth weight in full-term live births
  5. Viral Infection - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (PM2.5)

See this review article for more information about sources of air pollution and the influence of air pollution on health.

Sparse monitoring is a public health issue

  • Historically, epidemiological studies would assess the influence of air pollution on health outcomes by relying on a number of monitors located around the country.
  • However, these monitors are relatively sparse in certain regions of the country and are not necessarily located near pollution sources.
  • Dramatic differences in pollution rates can be seen even within the same city. (In fact, the term micro-environments describes environments within cities or counties which may vary greatly from one block to another.)

Lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations.

Machine Learning offers a solution

An article published in the Environmental Health journal dealt with this issue by using data, including population density and road density, among other features, to model or predict air pollution levels at a more localized scale using machine learning (ML) methods.

[source]

The authors of this article state that:

“Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations.”

The article above demonstrates that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.

So…we’re going to do the same

Question

Can we predict US annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Data

The State of Global Air

The State of Global Air is a report released every year to communicate the impact of air pollution on public health.

The State of Global Air 2019 report (which uses data from 2017) stated that:

Air pollution is the fifth leading risk factor for mortality worldwide. It is responsible for more deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity. Each year, more people die from air pollution–related disease than from road traffic injuries or malaria.

[source]

In 2017, air pollution is estimated to have contributed to close to 5 million deaths globally — nearly 1 in every 10 deaths.

[source]

The State of Global Air 2018 report (using data from 2016) separated different types of air pollution and found that particulate pollution was particularly associated with mortality.

[source]

The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:

More than 90% of people worldwide live in areas exceeding the World Health Organization (WHO) Guideline for healthy air. More than half live in areas that do not even meet WHO’s least-stringent air quality target.

[source]

Overall Improvement

Looking at the US specifically, air pollution levels are generally improving, with declining national average air pollutant concentrations, as shown in the 2019 Our Nation’s Air report from the US Environmental Protection Agency (EPA):

[source]

An Issue Nonetheless

  • Air pollution continues to contribute to health risk for Americans, particularly in regions with higher-than-national-average rates of pollution that, at times, exceed the WHO’s recommended level.
  • It is important to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.

You can see current air quality conditions at this website, and you will notice variation across different cities.

For example, here are the conditions in San Francisco yesterday:

[source]

  • This website reports particulate values using what is called the Air Quality Index (AQI).
  • This calculator indicates that 138 AQI is equivalent to 50.5 ug/m3 and is considered unhealthy for sensitive individuals.
  • Thus, some areas exceed the WHO annual exposure guideline (10 ug/m3), and this may adversely affect the health of people living in these locations.
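The AQI-to-concentration correspondence quoted above can be reproduced with the piecewise-linear interpolation that the AQI uses. This is a sketch, not the calculator's actual code: the breakpoint table below is an assumption (the EPA 24-hour PM2.5 breakpoints in use before the 2024 revision, truncated to the first four categories), and `pm25_to_aqi()` is a hypothetical helper that handles one concentration at a time.

```r
# Assumed breakpoints: EPA 24-hour PM2.5 AQI table used before the 2024 revision
# (first four categories only).
pm25_breaks <- data.frame(
  conc_lo = c(0.0, 12.1, 35.5, 55.5),    # ug/m3
  conc_hi = c(12.0, 35.4, 55.4, 150.4),  # ug/m3
  aqi_lo  = c(0, 51, 101, 151),
  aqi_hi  = c(50, 100, 150, 200)
)

# Hypothetical helper: linear interpolation within the matching category.
# Expects a single concentration already truncated to one decimal place.
pm25_to_aqi <- function(conc) {
  row <- pm25_breaks[conc >= pm25_breaks$conc_lo & conc <= pm25_breaks$conc_hi, ]
  stopifnot(nrow(row) == 1)
  aqi <- (row$aqi_hi - row$aqi_lo) / (row$conc_hi - row$conc_lo) *
    (conc - row$conc_lo) + row$aqi_lo
  round(aqi)
}

aqi_example <- pm25_to_aqi(50.5)
print(aqi_example)  # 138, matching the value quoted from the calculator
```

Under these assumed breakpoints, 50.5 ug/m3 falls in the 101 to 150 band, which is the "unhealthy for sensitive groups" category.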

Adverse health effects

  • Adverse health effects have been associated with populations experiencing higher pollution exposure despite the levels being below suggested guidelines.
  • It appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution. (For example, see this article for more details.)

Monitor Data

  • Monitor data in this case study come from a system of monitors in which roughly 90% are located within cities.
  • There is an equity issue in terms of capturing the air pollution levels of more rural areas.
  • To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate air pollution levels in areas with little to no monitoring.
  • Specifically, these methods can be used to estimate air pollution in these low monitoring areas so that we can make a map like this where we have annual estimates for all of the contiguous US:

[source]

This is what we aim to achieve in this case study.

Limitations

  1. The data do not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.
  2. Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. Researchers are now developing personal monitoring systems to track air pollution levels on the personal level.
  3. Our analysis will use annual mean estimates of pollution levels, but these can vary greatly by season, day and even hour. There are data sources that have finer levels of temporal data; however, we are interested in long term exposures, as these appear to be the most influential for health outcomes.
  4. These data are US-focused.

Supervised ML

Here, we’ll need:

  1. A continuous outcome variable that we want to predict
  2. A set of feature(s) (or predictor variables) that we use to predict the outcome variable

To build (or train) our model, we use both the outcome and features.

The goal is to identify informative features that can explain a large amount of variation in our outcome variable.

Using this model, we can then predict the outcome for new observations with the same features where we have not observed the outcome.

(More details here)
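The train-then-predict workflow described above can be illustrated on simulated data. This is a toy sketch only: the data are made up, the column names `road_density`, `pop_density`, and `pm` are placeholders, and a plain linear model (`lm()`) stands in for whatever model the case study eventually uses.

```r
set.seed(1)

# Simulated data: two features and a continuous outcome (all values are made up)
n <- 100
toy <- data.frame(road_density = runif(n), pop_density = runif(n))
toy$pm <- 8 + 3 * toy$road_density + 2 * toy$pop_density + rnorm(n, sd = 0.5)

# Train on observations where both the features and the outcome are known
train <- toy[1:80, ]
fit <- lm(pm ~ road_density + pop_density, data = train)

# Share of outcome variation explained by the features on the training data
r_squared <- summary(fit)$r.squared

# Predict the outcome for new observations that have features but no outcome
new_obs <- toy[81:100, c("road_density", "pop_density")]
predicted <- predict(fit, newdata = new_obs)

print(r_squared)
print(head(predicted))
```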

Outcome

The monitor data that we will be using comes from gravimetric monitors (see picture below) operated by the US Environmental Protection Agency (EPA).

[image courtesy of Kirsten Koehler]

These monitors use a filtration system to specifically capture fine particulate matter.

[source]

The weight of this particulate matter is manually measured daily or weekly.

For the EPA standard operating procedure for PM gravimetric analysis in 2008, see here.

In our data set, the value column indicates the PM2.5 monitor average for 2008 in mass of fine particles/volume of air for 876 gravimetric monitors.

The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m3).

Recall the WHO exposure guideline is < 10 ug/m3 on average annually for PM2.5.
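Comparing monitor averages against that guideline takes one logical comparison. The sketch below uses only the first four `value` entries that appear in the `glimpse()` output of this data set; on the full data the same idea would be `mean(pm$value > 10)`, assuming `pm` has been read in.

```r
# First four annual PM2.5 averages (ug/m3) from the value column
value <- c(9.597647, 10.800000, 11.212174, 11.659091)

who_guideline <- 10                    # WHO annual guideline, ug/m3
exceeds <- value > who_guideline       # TRUE where a monitor is above the guideline

n_exceeding <- sum(exceeds)            # number of monitors above the guideline
share_exceeding <- mean(exceeds)       # proportion of monitors above the guideline

print(n_exceeding)
print(share_exceeding)
```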

Data Import

All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change. (That researcher, Roger, now works at UT Austin.)

We have one CSV file that contains both our single outcome variable and all of our features (or predictor variables). You can download this file using the OCSdata package:

# install.packages("OCSdata")
OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())

here::here() helps manage file paths: it always locates files relative to your project root.

# install.packages("here")
pm <- readr::read_csv(here::here("OCS_data", "data","raw", "pm25_data.csv"))

PM 2.5 Data

  • 876 monitors
  • 50 columns
    • value | outcome variable
pm |>
  glimpse()
Rows: 876
Columns: 50
$ id                          <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 10…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips                        <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta                        <dbl> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…

There are 48 features with values for each of the 876 monitors (observations).

The data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).

Features

Variable Details
id Monitor number
– the county number is indicated before the decimal
– the monitor number is indicated after the decimal
Example: 1073.0023 is in Jefferson county (1073), and .0023 identifies one of its 8 monitors
fips Federal information processing standard number for the county where the monitor is located
– 5 digit id code for counties (zero is often the first value and sometimes is not shown)
– the first 2 numbers indicate the state
– the last three numbers indicate the county
Example: Alabama’s state code is 01 because it is first alphabetically
(note: Alaska and Hawaii are not included because they are not part of the contiguous US)
lat Latitude of the monitor in degrees
lon Longitude of the monitor in degrees
state State where the monitor is located
county County where the monitor is located
city City where the monitor is located
CMAQ Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ)
– A modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution
– Does not use any of the PM2.5 gravimetric monitoring data. (There is a version that does use the gravimetric monitoring data, but not this one!)
– Data from the EPA
zcta Zip Code Tabulation Area where the monitor is located
– Postal Zip codes are converted into “generalized areal representations” that are non-overlapping
– Data from the 2010 Census
zcta_area Land area of the zip code area in meters squared
– Data from the 2010 Census
zcta_pop Population in the zip code area
– Data from the 2010 Census
imp_a500 Impervious surface measure
– Within a circle with a radius of 500 meters around the monitor
– Impervious surfaces are roads, concrete, parking lots, buildings
– This is a measure of development
imp_a1000 Impervious surface measure
– Within a circle with a radius of 1000 meters around the monitor
imp_a5000 Impervious surface measure
– Within a circle with a radius of 5000 meters around the monitor
imp_a10000 Impervious surface measure
– Within a circle with a radius of 10000 meters around the monitor
imp_a15000 Impervious surface measure
– Within a circle with a radius of 15000 meters around the monitor
county_area Land area of the county of the monitor in meters squared
county_pop Population of the county of the monitor
log_dist_to_prisec Natural log of the distance to a primary or secondary road from the monitor
– Highway or major road
log_pri_length_5000 Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
– Highways only
log_pri_length_10000 Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
– Highways only
log_pri_length_15000 Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
– Highways only
log_pri_length_25000 Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
– Highways only
log_prisec_length_500 Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_1000 Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_5000 Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_10000 Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_15000 Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_25000 Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
– Highway and secondary roads
log_nei_2008_pm25_sum_10000 Natural log of the tons of PM2.5 emissions (annual data, from the major sources database), summed over all sources within a circle with a radius of 10000 meters around the monitor
log_nei_2008_pm25_sum_15000 Natural log of the tons of PM2.5 emissions (annual data, from the major sources database), summed over all sources within a circle with a radius of 15000 meters around the monitor
log_nei_2008_pm25_sum_25000 Natural log of the tons of PM2.5 emissions (annual data, from the major sources database), summed over all sources within a circle with a radius of 25000 meters around the monitor
log_nei_2008_pm10_sum_10000 Natural log of the tons of PM10 emissions (annual data, from the major sources database), summed over all sources within a circle with a radius of 10000 meters around the monitor
log_nei_2008_pm10_sum_15000 Natural log of the tons of PM10 emissions (annual data, from the major sources database), summed over all sources within a circle with a radius of 15000 meters around the monitor
log_nei_2008_pm10_sum_25000 Natural log of the tons of PM10 emissions (annual data, from the major sources database), summed over all sources within a circle with a radius of 25000 meters around the monitor
popdens_county Population density (number of people per kilometer squared area of the county)
popdens_zcta Population density (number of people per kilometer squared area of zcta)
nohs Percentage of people in the zcta area where the monitor is located who do not have a high school degree
– Data from the Census
somehs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was some high school education
– Data from the Census
hs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a high school degree
– Data from the Census
somecollege Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing some college education
– Data from the Census
associate Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an associate degree
– Data from the Census
bachelor Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a bachelor’s degree
– Data from the Census
grad Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a graduate degree
– Data from the Census
pov Percentage of people in the zcta area where the monitor is located who lived in poverty in 2008 (possibly defined by the 2007 guidelines; see https://aspe.hhs.gov/2007-hhs-poverty-guidelines)
– Data from the Census
hs_orless Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs)
urc2013 2013 Urban-rural classification of the county where the monitor is located
– 6 category variable: 1 is totally urban, 6 is completely rural
– Data from the National Center for Health Statistics
urc2006 2006 Urban-rural classification of the county where the monitor is located
– 6 category variable: 1 is totally urban, 6 is completely rural
– Data from the National Center for Health Statistics
aod Aerosol Optical Depth measurement from a NASA satellite
– based on the diffraction of a laser
– used as a proxy of particulate pollution
– unit-less - higher value indicates more pollution
– Data from NASA
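The `id` and `fips` encodings described in the table can be unpacked with a few lines of base R. This is a sketch using the first three `id` values from the `glimpse()` output; rounding the monitor part to four decimals is an assumption based on the `1073.0023` example.

```r
# First three monitor ids from the data
id <- c(1003.001, 1027.000, 1033.100)

# County (FIPS) number is the part before the decimal
fips <- as.integer(floor(id))

# Monitor number is the part after the decimal
# (rounded to 4 places to remove floating-point noise; an assumption)
monitor <- round(id - fips, 4)

# FIPS codes are 5 digits; restore the leading zero that numeric storage drops
fips_chr <- sprintf("%05d", fips)
state_code  <- substr(fips_chr, 1, 2)  # first two digits: state
county_code <- substr(fips_chr, 3, 5)  # last three digits: county

print(data.frame(id, fips_chr, state_code, county_code, monitor))
```

All three rows come out with state code 01, consistent with the Alabama example in the table.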

Many of these features have to do with the circular area around the monitor called the “buffer”. These are illustrated in the following figure:

[source]
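Several columns in the variable table are derived from others, and the definitions can be checked by hand. The sketch below recomputes a few of them from values shown in the `glimpse()` output (first monitor for the densities and distance, first four monitors for education); treating the distance as meters is an assumption, since the table gives no unit for `log_dist_to_prisec`.

```r
# Population density: people per square kilometer (areas are stored in m^2)
zcta_pop    <- 27829
zcta_area   <- 190980522     # m^2
county_pop  <- 182265
county_area <- 4117521611    # m^2

popdens_zcta   <- zcta_pop / (zcta_area / 1e6)
popdens_county <- county_pop / (county_area / 1e6)

# hs_orless is the sum of nohs, somehs, and hs (first four monitors)
nohs   <- c(3.3, 11.6, 7.3, 14.3)
somehs <- c(4.9, 19.1, 15.8, 16.7)
hs     <- c(25.1, 33.9, 30.6, 35.0)
hs_orless <- nohs + somehs + hs

# Natural-log columns can be put back on the original scale with exp()
log_dist_to_prisec <- 4.648181
dist_to_prisec <- exp(log_dist_to_prisec)  # assumed to be meters

print(c(popdens_zcta = popdens_zcta, popdens_county = popdens_county))
print(hs_orless)
print(dist_to_prisec)
```

The recomputed densities agree with the `popdens_zcta` and `popdens_county` values in the data to the precision shown, and the sums agree with the `hs_orless` column.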

Wrangling

skimr

skimr | A helpful way to get an overall sense of a dataset

# install.packages("skimr")
skimr::skim(pm)
Data summary
Name pm
Number of rows 876
Number of columns 50
_______________________
Column type frequency:
character 3
numeric 47
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 49 0
county 0 1 3 20 0 471 0
city 0 1 4 48 0 607 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 26987.96 1.578761e+04 1003.00 13089.15 26132.00 39118.00 5.603910e+04 ▇▇▆▇▆
value 0 1 10.81 2.580000e+00 3.02 9.27 11.15 12.37 2.316000e+01 ▂▆▇▁▁
fips 0 1 26987.89 1.578763e+04 1003.00 13089.00 26132.00 39118.00 5.603900e+04 ▇▇▆▇▆
lat 0 1 38.48 4.620000e+00 25.47 35.03 39.30 41.66 4.840000e+01 ▁▃▅▇▂
lon 0 1 -91.74 1.496000e+01 -124.18 -99.16 -87.47 -80.69 -6.804000e+01 ▃▂▃▇▃
CMAQ 0 1 8.41 2.970000e+00 1.63 6.53 8.62 10.24 2.313000e+01 ▃▇▃▁▁
zcta 0 1 50890.29 2.778447e+04 1022.00 28788.25 48172.00 74371.00 9.920200e+04 ▅▇▇▅▇
zcta_area 0 1 183173481.91 5.425989e+08 15459.00 14204601.75 37653560.50 160041508.25 8.164821e+09 ▇▁▁▁▁
zcta_pop 0 1 24227.58 1.777216e+04 0.00 9797.00 22014.00 35004.75 9.539700e+04 ▇▇▃▁▁
imp_a500 0 1 24.72 1.934000e+01 0.00 3.70 25.12 40.22 6.961000e+01 ▇▅▆▃▂
imp_a1000 0 1 24.26 1.802000e+01 0.00 5.32 24.53 38.59 6.750000e+01 ▇▅▆▃▁
imp_a5000 0 1 19.93 1.472000e+01 0.05 6.79 19.07 30.11 7.460000e+01 ▇▆▃▁▁
imp_a10000 0 1 15.82 1.381000e+01 0.09 4.54 12.36 24.17 7.209000e+01 ▇▃▂▁▁
imp_a15000 0 1 13.43 1.312000e+01 0.11 3.24 9.67 20.55 7.110000e+01 ▇▃▁▁▁
county_area 0 1 3768701992.12 6.212830e+09 33703512.00 1116536297.50 1690826566.50 2878192209.00 5.194723e+10 ▇▁▁▁▁
county_pop 0 1 687298.44 1.293489e+06 783.00 100948.00 280730.50 743159.00 9.818605e+06 ▇▁▁▁▁
log_dist_to_prisec 0 1 6.19 1.410000e+00 -1.46 5.43 6.36 7.15 1.045000e+01 ▁▁▃▇▁
log_pri_length_5000 0 1 9.82 1.080000e+00 8.52 8.52 10.05 10.73 1.205000e+01 ▇▂▆▅▂
log_pri_length_10000 0 1 10.92 1.130000e+00 9.21 9.80 11.17 11.83 1.302000e+01 ▇▂▇▇▃
log_pri_length_15000 0 1 11.50 1.150000e+00 9.62 10.87 11.72 12.40 1.359000e+01 ▆▂▇▇▃
log_pri_length_25000 0 1 12.24 1.100000e+00 10.13 11.69 12.46 13.05 1.436000e+01 ▅▃▇▇▃
log_prisec_length_500 0 1 6.99 9.500000e-01 6.21 6.21 6.21 7.82 9.400000e+00 ▇▁▂▂▁
log_prisec_length_1000 0 1 8.56 7.900000e-01 7.60 7.60 8.66 9.20 1.047000e+01 ▇▅▆▃▁
log_prisec_length_5000 0 1 11.28 7.800000e-01 8.52 10.91 11.42 11.83 1.278000e+01 ▁▁▃▇▃
log_prisec_length_10000 0 1 12.41 7.300000e-01 9.21 11.99 12.53 12.94 1.385000e+01 ▁▁▃▇▅
log_prisec_length_15000 0 1 13.03 7.200000e-01 9.62 12.59 13.13 13.57 1.441000e+01 ▁▁▃▇▅
log_prisec_length_25000 0 1 13.82 7.000000e-01 10.13 13.38 13.92 14.35 1.523000e+01 ▁▁▃▇▆
log_nei_2008_pm25_sum_10000 0 1 3.97 2.350000e+00 0.00 2.15 4.29 5.69 9.120000e+00 ▆▅▇▆▂
log_nei_2008_pm25_sum_15000 0 1 4.72 2.250000e+00 0.00 3.47 5.00 6.35 9.420000e+00 ▃▃▇▇▂
log_nei_2008_pm25_sum_25000 0 1 5.67 2.110000e+00 0.00 4.66 5.91 7.28 9.650000e+00 ▂▂▇▇▃
log_nei_2008_pm10_sum_10000 0 1 4.35 2.320000e+00 0.00 2.69 4.62 6.07 9.340000e+00 ▅▅▇▇▂
log_nei_2008_pm10_sum_15000 0 1 5.10 2.180000e+00 0.00 3.87 5.39 6.72 9.710000e+00 ▂▃▇▇▂
log_nei_2008_pm10_sum_25000 0 1 6.07 2.010000e+00 0.00 5.10 6.37 7.52 9.880000e+00 ▁▂▆▇▃
popdens_county 0 1 551.76 1.711510e+03 0.26 40.77 156.67 510.81 2.682191e+04 ▇▁▁▁▁
popdens_zcta 0 1 1279.66 2.757490e+03 0.00 101.15 610.35 1382.52 3.041884e+04 ▇▁▁▁▁
nohs 0 1 6.99 7.210000e+00 0.00 2.70 5.10 8.80 1.000000e+02 ▇▁▁▁▁
somehs 0 1 10.17 6.200000e+00 0.00 5.90 9.40 13.90 7.220000e+01 ▇▂▁▁▁
hs 0 1 30.32 1.140000e+01 0.00 23.80 30.75 36.10 1.000000e+02 ▂▇▂▁▁
somecollege 0 1 21.58 8.600000e+00 0.00 17.50 21.30 24.70 1.000000e+02 ▆▇▁▁▁
associate 0 1 7.13 4.010000e+00 0.00 4.90 7.10 8.80 7.140000e+01 ▇▁▁▁▁
bachelor 0 1 14.90 9.710000e+00 0.00 8.80 12.95 19.22 1.000000e+02 ▇▂▁▁▁
grad 0 1 8.91 8.650000e+00 0.00 3.90 6.70 11.00 1.000000e+02 ▇▁▁▁▁
pov 0 1 14.95 1.133000e+01 0.00 6.50 12.10 21.22 6.590000e+01 ▇▅▂▁▁
hs_orless 0 1 47.48 1.675000e+01 0.00 37.92 48.65 59.10 1.000000e+02 ▁▃▇▃▁
urc2013 0 1 2.92 1.520000e+00 1.00 2.00 3.00 4.00 6.000000e+00 ▇▅▃▂▁
urc2006 0 1 2.97 1.520000e+00 1.00 2.00 3.00 4.00 6.000000e+00 ▇▅▃▂▁
aod 0 1 43.70 1.956000e+01 5.00 31.66 40.17 49.67 1.430000e+02 ▃▇▁▁▁

❓ Given the dataset we’re working with, what wrangling should we consider doing here?

❓ What’s something you’ve learned about the data from the skimr output?

  • Consider variable type - need more factors?
  • Understand why ID is not uniformly distributed; figure out which are overrepresented; decide what to do
  • log or other transformations necessary? decide during EDA
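The first two wrangling ideas can be sketched on a tiny stand-in for the data. `toy_pm` below holds the first four rows of a few columns from the `glimpse()` output; which columns should become factors is a judgment call, and `id`, `fips`, and `zcta` are used here as an assumption because they are identifiers rather than quantities. Base R is used so the sketch runs without extra packages.

```r
# Small stand-in for pm: first four rows of selected columns
toy_pm <- data.frame(
  id      = c(1003.001, 1027.000, 1033.100, 1049.100),
  fips    = c(1003, 1027, 1033, 1049),
  zcta    = c(36532, 36251, 35660, 35962),
  state   = c("Alabama", "Alabama", "Alabama", "Alabama"),
  urc2013 = c(4, 6, 4, 6)
)

# Identifier columns are stored as numbers but are really categories
id_cols <- c("id", "fips", "zcta")
toy_pm[id_cols] <- lapply(toy_pm[id_cols], as.factor)

# Count monitors per state to see which locations are overrepresented
monitors_per_state <- sort(table(toy_pm$state), decreasing = TRUE)

str(toy_pm)
print(monitors_per_state)
```

On the full data, the same steps would apply to `pm` in place of `toy_pm`, and counting by county or city instead of state is a natural variation.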

Reminder: to read the data in and run skimr if you haven’t already:

# install.packages("OCSdata")
# install.packages("here")
# install.packages("skimr")

OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())
pm <- readr::read_csv(here::here("OCS_data", "data","raw", "pm25_data.csv"))
skimr::skim(pm)

Things to note:

  • data are summarized by variable type
  • empty/n_missing gives you a sense of how much data are missing for each variable
  • n_unique for state is 49, one more than the 48 contiguous states; the extra value is likely the District of Columbia, which is worth confirming in the data
  • many different distributions for continuous data, but many show bimodal distribution
  • large range of possible values for many variables (e.g., population)