15-cs02-data

Author

Professor Shannon Ellis

Published

November 21, 2023

CS02: Predicting Air Pollution (Data)

Agenda

Background
Question
Data Intro
Wrangle

Background

OpenCaseStudies

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-air-pollution. Predicting Annual Air Pollution (Version v1.0.0).

Air Pollutants

Some sources are natural while others are anthropogenic (human-derived):

[source]

. . .

Major types of air pollutants

Gaseous - Carbon Monoxide (CO), Ozone (O₃), Nitrogen Oxides (NO, NO₂), Sulfur Dioxide (SO₂)
Particulate - small liquids and solids suspended in the air (includes lead and can include certain types of dust)
Dust - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
Biological - pollen, bacteria, viruses, mold spores

See here for more detail on the types of pollutants in the air.

Particulate Pollution

Air pollution particulates are generally described by their size:

Large Coarse Particulate Matter - has diameter of >10 micrometers (10 µm)
Coarse Particulate Matter (called PM_10-2.5) - has diameter of between 2.5 µm and 10 µm
Fine Particulate Matter (called PM_2.5) - has diameter of < 2.5 µm

PM₁₀ includes any particulate matter <10 µm (both coarse and fine particulate matter)
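As a minimal sketch, the size classes above can be written as a lookup in R. The function name `classify_pm` is hypothetical, and the handling of particles that are exactly 2.5 µm or 10 µm in diameter is an assumption, since the definitions above leave those boundaries open.

```r
# Hypothetical helper: label particles by diameter (in micrometers).
# Assumption: exactly 2.5 µm counts as coarse and exactly 10 µm as large coarse.
classify_pm <- function(diameter_um) {
  size_class <- cut(
    diameter_um,
    breaks = c(0, 2.5, 10, Inf),
    labels = c("fine (PM_2.5)", "coarse (PM_10-2.5)", "large coarse"),
    right = FALSE  # intervals are [lower, upper)
  )
  as.character(size_class)
}

classify_pm(c(1, 5, 20))
```

Anything labelled fine or coarse here would also count toward PM₁₀.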

. . .

In relation to a piece of human hair:

[source]

Common Pollutants and their size

[source]

Penetration into the human body

[source]

Negative Health Impacts

Exposure to air pollution is:

associated with higher rates of mortality in older adults
known to be a risk factor for many diseases and conditions including (but not limited to):

Asthma - fine particle exposure (PM_2.5) was found to be associated with higher rates of asthma in children
Inflammation in type 1 diabetes - fine particle exposure (PM_2.5) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with Type 1 diabetes
Lung function and emphysema - higher concentrations of ozone (O₃), nitrogen oxides (NO_x), black carbon, and fine particle exposure (PM_2.5) at study baseline were significantly associated with greater increases in percent emphysema per 10 years
Low birthweight - fine particle exposure (PM_2.5) was associated with lower birth weight in full-term live births
Viral Infection - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (PM_2.5)

See this review article for more information about sources of air pollution and the influence of air pollution on health.

Sparse monitoring is a public health issue

Historically, epidemiological studies would assess the influence of air pollution on health outcomes by relying on a number of monitors located around the country.
However, these monitors are relatively sparse in certain regions of the country and are not necessarily located near pollution sources.
Dramatic differences in pollution rates can be seen even within the same city. (In fact, the term micro-environments describes environments within cities or counties that may vary greatly from one block to another.)

. . .

[source]

. . .

Lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations.

Machine Learning offers a solution

An article published in the Environmental Health journal dealt with this issue by using data, including population density and road density, among other features, to model or predict air pollution levels at a more localized scale using machine learning (ML) methods.

[source]

. . .

The authors of this article state that:

“Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations.”

. . .

The article above demonstrates that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.

. . .

So…we’re going to do the same

Question

Can we predict US annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Data

The State of Global Air

The State of Global Air is a report released every year to communicate the impact of air pollution on public health.

. . .

The State of Global Air 2019 report (which uses data from 2017) stated that:

Air pollution is the fifth leading risk factor for mortality worldwide. It is responsible for more deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity. Each year, more people die from air pollution–related disease than from road traffic injuries or malaria.

[source]

. . .

In 2017, air pollution is estimated to have contributed to close to 5 million deaths globally — nearly 1 in every 10 deaths.

[source]

. . .

The State of Global Air 2018 report (using data from 2016) separated different types of air pollution & found that particulate pollution was particularly associated with mortality.

[source]

. . .

The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:

More than 90% of people worldwide live in areas exceeding the World Health Organization (WHO) Guideline for healthy air. More than half live in areas that do not even meet WHO’s least-stringent air quality target.

[source]

Overall Improvement

Looking at the US specifically, air pollution levels are generally improving, with declining national air pollutant concentration averages as shown from the 2019 Our Nation’s Air report from the US Environmental Protection Agency (EPA):

[source]

An Issue Nonetheless

Air pollution continues to contribute to health risk for Americans, particularly in regions with higher-than-national-average rates of pollution that, at times, exceed the WHO’s recommended level.
It is important to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.

. . .

You can see current air quality conditions at this website, and you will notice variation across different cities.

For example, here are the conditions in San Francisco yesterday:

[source]

. . .

This website reports particulate values using what is called the Air Quality Index (AQI).
This calculator indicates that an AQI of 138 is equivalent to 50.5 µg/m³ and is considered unhealthy for sensitive individuals.
Thus, some areas exceed the WHO annual exposure guideline (10 µg/m³), and this may adversely affect the health of people living in these locations.
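The AQI-to-concentration conversion can be sketched as linear interpolation within one AQI band. The breakpoints below (AQI 101 to 150 mapping to 35.5 to 55.4 µg/m³ of PM_2.5) are an assumption about the EPA table in use when these slides were written, so check the calculator for current values. The function name `aqi_to_conc` is hypothetical.

```r
# Assumed breakpoints for the "unhealthy for sensitive groups" band (PM_2.5).
aqi_lo  <- 101;  aqi_hi  <- 150
conc_lo <- 35.5; conc_hi <- 55.4

# Hypothetical helper: convert an AQI value in this band to a concentration (µg/m³).
aqi_to_conc <- function(aqi) {
  stopifnot(all(aqi >= aqi_lo & aqi <= aqi_hi))
  conc_lo + (aqi - aqi_lo) * (conc_hi - conc_lo) / (aqi_hi - aqi_lo)
}

conc <- aqi_to_conc(138)
round(conc, 1)            # close to the 50.5 µg/m³ quoted above

who_guideline <- 10       # WHO annual guideline (µg/m³), from the text
conc / who_guideline      # how many times the annual guideline this reading is
```

Under these assumed breakpoints, the reading is roughly five times the WHO annual guideline, although a single day's value and an annual average are not directly comparable.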

Adverse health effects

Adverse health effects have been associated with populations experiencing higher pollution exposure despite the levels being below suggested guidelines.
It appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution. (For example, see this article for more details.)

Monitor Data

Monitor data in this case study come from a system of monitors in which roughly 90% are located within cities.
There is an equity issue in terms of capturing the air pollution levels of more rural areas.
To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate air pollution levels in areas with little to no monitoring.
Specifically, these methods can be used to estimate air pollution in these low monitoring areas so that we can make a map like this where we have annual estimates for all of the contiguous US:

. . .

[source]

This is what we aim to achieve in this case study.

Limitations

The data do not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.
Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. Researchers are now developing personal monitoring systems to track air pollution levels on the personal level.
Our analysis will use annual mean estimates of pollution levels, but these can vary greatly by season, day and even hour. There are data sources that have finer levels of temporal data; however, we are interested in long term exposures, as these appear to be the most influential for health outcomes.
These data are US-focused.

Supervised ML

Here, we’ll need:

A continuous outcome variable that we want to predict
A set of feature(s) (or predictor variables) that we use to predict the outcome variable

. . .

To build (or train) our model, we use both the outcome and features.

. . .

The goal is to identify informative features that can explain a large amount of variation in our outcome variable.

. . .

Using this model, we can then predict the outcome for new observations with the same features where we have not observed the outcome.

(More details here)
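A minimal sketch of this train-then-predict workflow, using base R's `lm()` on made-up numbers. The toy data, the single feature `popdens`, and the choice of a linear model are all assumptions for illustration; they are not the data or model used in the case study.

```r
# Toy training data (made up): outcome `value` and one feature `popdens`.
train <- data.frame(
  popdens = c(10, 50, 100, 200, 400, 800),
  value   = c(6.1, 6.5, 7.0, 8.0, 10.0, 14.0)  # constructed as 6 + 0.01 * popdens
)

# Train: the model sees both the outcome and the feature.
fit <- lm(value ~ popdens, data = train)

# Predict: new observations have the feature but no observed outcome.
new_obs <- data.frame(popdens = c(150, 600))
pred <- predict(fit, newdata = new_obs)
pred
```

Because the toy outcome is an exact linear function of the feature, the predictions recover that line; real monitor data will not fit this cleanly.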

Outcome

The monitor data that we will be using comes from gravimetric monitors (see picture below) operated by the US Environmental Protection Agency (EPA).

[image courtesy of Kirsten Koehler]

. . .

These monitors use a filtration system to specifically capture fine particulate matter.

[source]

. . .

The weight of this particulate matter is manually measured daily or weekly.

For the EPA standard operating procedure for PM gravimetric analysis in 2008, see here.

. . .

In our data set, the value column indicates the PM_2.5 monitor average for 2008 in mass of fine particles/volume of air for 876 gravimetric monitors.

. . .

The units are micrograms of fine particulate matter (PM less than 2.5 micrometers in diameter) per cubic meter of air, which is a mass concentration (µg/m³).

. . .

Recall the WHO exposure guideline is < 10 µg/m³ on average annually for PM_2.5.
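As a small sketch, monitor averages can be compared against this guideline directly. The four values are the first entries of the `value` column as shown in the data preview in the Data Import section; the variable names are illustrative.

```r
# First four monitor averages (µg/m³), copied from the `value` column preview.
value <- c(9.597647, 10.800000, 11.212174, 11.659091)
who_guideline <- 10  # WHO annual PM_2.5 guideline (µg/m³)

exceeds <- value > who_guideline  # TRUE where the guideline is exceeded
sum(exceeds)                      # number of monitors above the guideline
mean(exceeds)                     # proportion above the guideline
```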

Data Import

All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change. (That researcher, Roger, now works at UT Austin.)

. . .

We have one CSV file that contains both our single outcome variable and all of our features (or predictor variables). You can download this file using the OCSdata package:

# install.packages("OCSdata")
OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())

. . .

here::here() helps manage file paths; it always locates files relative to your project root.

# install.packages("here")
pm <- readr::read_csv(here::here("OCS_data", "data","raw", "pm25_data.csv"))
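Before reading the file, it can help to confirm that it exists where you expect. The sketch below uses base R's `file.path()` in place of `here::here()` so it runs without extra packages; treating the working directory as the project root is an assumption.

```r
# Assumption: the working directory is the project root.
project_root <- getwd()

# Build the same relative path used above, one folder per argument.
csv_path <- file.path(project_root, "OCS_data", "data", "raw", "pm25_data.csv")

# Report whether the download step has already been run.
if (file.exists(csv_path)) {
  message("Found: ", csv_path)
} else {
  message("Not found (run OCSdata::raw_data() first): ", csv_path)
}
```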

PM 2.5 Data

876 monitors
50 columns
- value | outcome variable

pm |>
  dplyr::glimpse()

Rows: 876
Columns: 50
$ id                          <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 10…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips                        <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta                        <dbl> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…

. . .

There are 48 features with values for each of the 876 monitors (observations).

The data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).

Features

Variable	Details
id	Monitor number – the county number is indicated before the decimal – the monitor number is indicated after the decimal. Example: 1073.0023 is Jefferson county (1073) and .0023 is one of 8 monitors
fips	Federal information processing standard number for the county where the monitor is located – 5 digit id code for counties (zero is often the first value and sometimes is not shown) – the first 2 numbers indicate the state – the last three numbers indicate the county Example: Alabama’s state code is 01 because it is first alphabetically (note: Alaska and Hawaii are not included because they are not part of the contiguous US)
lat	Latitude of the monitor in degrees
lon	Longitude of the monitor in degrees
state	State where the monitor is located
county	County where the monitor is located
city	City where the monitor is located
CMAQ	Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ) – A modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution – *Does not use any of the PM_2.5 gravimetric monitoring data.* (There is a version that does use the gravimetric monitoring data, but not this one!) – Data from the EPA
zcta	Zip Code Tabulation Area where the monitor is located – Postal Zip codes are converted into “generalized areal representations” that are non-overlapping – Data from the 2010 Census
zcta_area	Land area of the zip code area in meters squared – Data from the 2010 Census
zcta_pop	Population in the zip code area – Data from the 2010 Census
imp_a500	Impervious surface measure – Within a circle with a radius of 500 meters around the monitor – Impervious surface are roads, concrete, parking lots, buildings – This is a measure of development
imp_a1000	Impervious surface measure – Within a circle with a radius of 1000 meters around the monitor
imp_a5000	Impervious surface measure – Within a circle with a radius of 5000 meters around the monitor
imp_a10000	Impervious surface measure – Within a circle with a radius of 10000 meters around the monitor
imp_a15000	Impervious surface measure – Within a circle with a radius of 15000 meters around the monitor
county_area	Land area of the county of the monitor in meters squared
county_pop	Population of the county of the monitor
log_dist_to_prisec	Log (Natural log) distance to a primary or secondary road from the monitor – Highway or major road
log_pri_length_5000	Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) – Highways only
log_pri_length_10000	Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) – Highways only
log_pri_length_15000	Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) – Highways only
log_pri_length_25000	Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) – Highways only
log_prisec_length_500	Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_1000	Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_5000	Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_10000	Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_15000	Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_25000	Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) – Highway and secondary roads
log_nei_2008_pm25_sum_10000	Tons of PM_2.5 emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 10000 meters around the monitor (Natural log)
log_nei_2008_pm25_sum_15000	Tons of PM_2.5 emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 15000 meters around the monitor (Natural log)
log_nei_2008_pm25_sum_25000	Tons of PM_2.5 emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 25000 meters around the monitor (Natural log)
log_nei_2008_pm10_sum_10000	Tons of PM_10 emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 10000 meters around the monitor (Natural log)
log_nei_2008_pm10_sum_15000	Tons of PM_10 emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 15000 meters around the monitor (Natural log)
log_nei_2008_pm10_sum_25000	Tons of PM_10 emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 25000 meters around the monitor (Natural log)
popdens_county	Population density (number of people per kilometer squared area of the county)
popdens_zcta	Population density (number of people per kilometer squared area of zcta)
nohs	Percentage of people in the zcta area where the monitor is located who do not have a high school degree – Data from the Census
somehs	Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was some high school education – Data from the Census
hs	Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a high school degree – Data from the Census
somecollege	Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing some college education – Data from the Census
associate	Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an associate degree – Data from the Census
bachelor	Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a bachelor’s degree – Data from the Census
grad	Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a graduate degree – Data from the Census
pov	Percentage of people in the zcta area where the monitor is located who lived in poverty in 2008 (possibly defined using the 2007 guidelines; see https://aspe.hhs.gov/2007-hhs-poverty-guidelines) – Data from the Census
hs_orless	Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs)
urc2013	2013 Urban-rural classification of the county where the monitor is located – 6 category variable - 1 is totally urban 6 is completely rural – Data from the National Center for Health Statistics
urc2006	2006 Urban-rural classification of the county where the monitor is located – 6 category variable - 1 is totally urban 6 is completely rural – Data from the National Center for Health Statistics
aod	Aerosol Optical Depth measurement from a NASA satellite – based on the diffraction of a laser – used as a proxy of particulate pollution – unit-less - higher value indicates more pollution – Data from NASA
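The `id` and `fips` descriptions in the table can be sketched in base R. The ids are copied from the data preview; zero-padding to five digits follows the note that the leading zero is sometimes not shown. Variable names such as `fips_chr` are illustrative.

```r
# Monitor ids copied from the data preview.
id <- c(1003.001, 1027.000, 1033.100)

# The county FIPS code is the part of the id before the decimal.
fips <- floor(id)

# Restore the leading zero that is dropped when FIPS is stored as a number.
fips_chr <- sprintf("%05d", as.integer(fips))

state_code  <- substr(fips_chr, 1, 2)  # first two digits: state
county_code <- substr(fips_chr, 3, 5)  # last three digits: county

data.frame(id, fips_chr, state_code, county_code)
```

All three monitors share state code 01, which matches the Alabama example in the `fips` row.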

. . .

Many of these features have to do with the circular area around the monitor called the “buffer”. These are illustrated in the following figure:

[source]
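Separately from the buffers, a few features in the table are derived from other columns, and this can be checked directly. The sketch below uses the first two monitors' values from the data preview; the tolerance is an arbitrary choice to allow for rounding in the printed values.

```r
# Values for the first two monitors, copied from the data preview.
pm_head <- data.frame(
  zcta_area    = c(190980522, 374132430),   # square meters
  zcta_pop     = c(27829, 5103),
  popdens_zcta = c(145.716431, 13.639555),  # people per square kilometer
  nohs         = c(3.3, 11.6),
  somehs       = c(4.9, 19.1),
  hs           = c(25.1, 33.9),
  hs_orless    = c(33.3, 64.6)
)

# Population density: people divided by area converted from m^2 to km^2.
popdens_check <- pm_head$zcta_pop / (pm_head$zcta_area / 1e6)

# hs_orless: sum of the three lowest education categories.
hs_orless_check <- pm_head$nohs + pm_head$somehs + pm_head$hs

all.equal(popdens_check, pm_head$popdens_zcta, tolerance = 1e-4)
all.equal(hs_orless_check, pm_head$hs_orless, tolerance = 1e-4)
```

Derived columns like these carry no information beyond their source columns, which is worth keeping in mind when choosing features.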

Wrangling

`skimr`

skimr | A helpful way to get an overall sense of a dataset

# install.packages("skimr")
skimr::skim(pm)

Data summary
Name	pm
Number of rows	876
Number of columns	50
_______________________
Column type frequency:
character	3
numeric	47
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
state	1	4	20	49
county	1	3	20	471
city	1	4	48	607

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	26987.96	1.578761e+04	1003.00	13089.15	26132.00	39118.00	5.603910e+04	▇▇▆▇▆
value	1	10.81	2.580000e+00	3.02	9.27	11.15	12.37	2.316000e+01	▂▆▇▁▁
fips	1	26987.89	1.578763e+04	1003.00	13089.00	26132.00	39118.00	5.603900e+04	▇▇▆▇▆
lat	1	38.48	4.620000e+00	25.47	35.03	39.30	41.66	4.840000e+01	▁▃▅▇▂
lon	1	-91.74	1.496000e+01	-124.18	-99.16	-87.47	-80.69	-6.804000e+01	▃▂▃▇▃
CMAQ	1	8.41	2.970000e+00	1.63	6.53	8.62	10.24	2.313000e+01	▃▇▃▁▁
zcta	1	50890.29	2.778447e+04	1022.00	28788.25	48172.00	74371.00	9.920200e+04	▅▇▇▅▇
zcta_area	1	183173481.91	5.425989e+08	15459.00	14204601.75	37653560.50	160041508.25	8.164821e+09	▇▁▁▁▁
zcta_pop	1	24227.58	1.777216e+04	0.00	9797.00	22014.00	35004.75	9.539700e+04	▇▇▃▁▁
imp_a500	1	24.72	1.934000e+01	0.00	3.70	25.12	40.22	6.961000e+01	▇▅▆▃▂
imp_a1000	1	24.26	1.802000e+01	0.00	5.32	24.53	38.59	6.750000e+01	▇▅▆▃▁
imp_a5000	1	19.93	1.472000e+01	0.05	6.79	19.07	30.11	7.460000e+01	▇▆▃▁▁
imp_a10000	1	15.82	1.381000e+01	0.09	4.54	12.36	24.17	7.209000e+01	▇▃▂▁▁
imp_a15000	1	13.43	1.312000e+01	0.11	3.24	9.67	20.55	7.110000e+01	▇▃▁▁▁
county_area	1	3768701992.12	6.212830e+09	33703512.00	1116536297.50	1690826566.50	2878192209.00	5.194723e+10	▇▁▁▁▁
county_pop	1	687298.44	1.293489e+06	783.00	100948.00	280730.50	743159.00	9.818605e+06	▇▁▁▁▁
log_dist_to_prisec	1	6.19	1.410000e+00	-1.46	5.43	6.36	7.15	1.045000e+01	▁▁▃▇▁
log_pri_length_5000	1	9.82	1.080000e+00	8.52	8.52	10.05	10.73	1.205000e+01	▇▂▆▅▂
log_pri_length_10000	1	10.92	1.130000e+00	9.21	9.80	11.17	11.83	1.302000e+01	▇▂▇▇▃
log_pri_length_15000	1	11.50	1.150000e+00	9.62	10.87	11.72	12.40	1.359000e+01	▆▂▇▇▃
log_pri_length_25000	1	12.24	1.100000e+00	10.13	11.69	12.46	13.05	1.436000e+01	▅▃▇▇▃
log_prisec_length_500	1	6.99	9.500000e-01	6.21	6.21	6.21	7.82	9.400000e+00	▇▁▂▂▁
log_prisec_length_1000	1	8.56	7.900000e-01	7.60	7.60	8.66	9.20	1.047000e+01	▇▅▆▃▁
log_prisec_length_5000	1	11.28	7.800000e-01	8.52	10.91	11.42	11.83	1.278000e+01	▁▁▃▇▃
log_prisec_length_10000	1	12.41	7.300000e-01	9.21	11.99	12.53	12.94	1.385000e+01	▁▁▃▇▅
log_prisec_length_15000	1	13.03	7.200000e-01	9.62	12.59	13.13	13.57	1.441000e+01	▁▁▃▇▅
log_prisec_length_25000	1	13.82	7.000000e-01	10.13	13.38	13.92	14.35	1.523000e+01	▁▁▃▇▆
log_nei_2008_pm25_sum_10000	1	3.97	2.350000e+00	0.00	2.15	4.29	5.69	9.120000e+00	▆▅▇▆▂
log_nei_2008_pm25_sum_15000	1	4.72	2.250000e+00	0.00	3.47	5.00	6.35	9.420000e+00	▃▃▇▇▂
log_nei_2008_pm25_sum_25000	1	5.67	2.110000e+00	0.00	4.66	5.91	7.28	9.650000e+00	▂▂▇▇▃
log_nei_2008_pm10_sum_10000	1	4.35	2.320000e+00	0.00	2.69	4.62	6.07	9.340000e+00	▅▅▇▇▂
log_nei_2008_pm10_sum_15000	1	5.10	2.180000e+00	0.00	3.87	5.39	6.72	9.710000e+00	▂▃▇▇▂
log_nei_2008_pm10_sum_25000	1	6.07	2.010000e+00	0.00	5.10	6.37	7.52	9.880000e+00	▁▂▆▇▃
popdens_county	1	551.76	1.711510e+03	0.26	40.77	156.67	510.81	2.682191e+04	▇▁▁▁▁
popdens_zcta	1	1279.66	2.757490e+03	0.00	101.15	610.35	1382.52	3.041884e+04	▇▁▁▁▁
nohs	1	6.99	7.210000e+00	0.00	2.70	5.10	8.80	1.000000e+02	▇▁▁▁▁
somehs	1	10.17	6.200000e+00	0.00	5.90	9.40	13.90	7.220000e+01	▇▂▁▁▁
hs	1	30.32	1.140000e+01	0.00	23.80	30.75	36.10	1.000000e+02	▂▇▂▁▁
somecollege	1	21.58	8.600000e+00	0.00	17.50	21.30	24.70	1.000000e+02	▆▇▁▁▁
associate	1	7.13	4.010000e+00	0.00	4.90	7.10	8.80	7.140000e+01	▇▁▁▁▁
bachelor	1	14.90	9.710000e+00	0.00	8.80	12.95	19.22	1.000000e+02	▇▂▁▁▁
grad	1	8.91	8.650000e+00	0.00	3.90	6.70	11.00	1.000000e+02	▇▁▁▁▁
pov	1	14.95	1.133000e+01	0.00	6.50	12.10	21.22	6.590000e+01	▇▅▂▁▁
hs_orless	1	47.48	1.675000e+01	0.00	37.92	48.65	59.10	1.000000e+02	▁▃▇▃▁
urc2013	1	2.92	1.520000e+00	1.00	2.00	3.00	4.00	6.000000e+00	▇▅▃▂▁
urc2006	1	2.97	1.520000e+00	1.00	2.00	3.00	4.00	6.000000e+00	▇▅▃▂▁
aod	1	43.70	1.956000e+01	5.00	31.66	40.17	49.67	1.430000e+02	▃▇▁▁▁

. . .

❓ Given the dataset we’re working with, what wrangling should we consider doing here?

❓ What’s something you’ve learned about the data from the skimr output?

Consider variable type - need more factors?
Understand why ID is not uniformly distributed; figure out which are overrepresented; decide what to do
log or other transformations necessary? decide during EDA
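Two of these checks can be sketched in base R on a few rows. The values are copied from the data preview; treating `urc2013` as an ordered factor is one possible choice, not a decision made in these slides.

```r
# A few rows copied from the data preview.
pm_small <- data.frame(
  id      = c(1003.001, 1027.000, 1033.100, 1049.100),
  fips    = c(1003, 1027, 1033, 1049),
  urc2013 = c(4, 6, 4, 6)
)

# Identifier-like columns carry no numeric meaning, so store them as factors.
pm_small$id   <- factor(pm_small$id)
pm_small$fips <- factor(pm_small$fips)

# urc2013 runs from 1 (most urban) to 6 (most rural); an ordered factor keeps that ordering.
pm_small$urc2013 <- factor(pm_small$urc2013, levels = 1:6, ordered = TRUE)

# Count monitors per county to see which counties are overrepresented.
monitors_per_county <- table(pm_small$fips)
sort(monitors_per_county, decreasing = TRUE)
```

On the full dataset, the same `table()` call would show which counties contribute many monitors.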

Reminder: to read the data in and run skimr if you haven’t already:

# install.packages("OCSdata")
# install.packages("here")
# install.packages("skimr")

OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())
pm <- readr::read_csv(here::here("OCS_data", "data","raw", "pm25_data.csv"))
skimr::skim(pm)

. . .

Things to note:

data are summarized by variable type
empty/n_missing gives you a sense of how much data are missing for each variable
n_unique for state indicates 49 distinct values, one more than the 48 contiguous states, so the extra entry is worth identifying
many different distributions for continuous data, and many appear bimodal
large range of possible values for many variables (e.g., population)