16-cs02-eda

Author

Professor Shannon Ellis

Published

November 28, 2023

CS02: Predicting Air Pollution (EDA)

Q&A

Q: I’m a bit unclear about the direction of case studies. Do just try to visualize data differently to find out the best biomarkers or do we use other methods like modeling to do so?
A: A little bit of both. If you can rule out a matrix/compound during EDA by looking at visualizations, do that. But, once you have a set of compounds that you want to include for analysis, then you would turn to sensitivity/specificity/ROC curves to help you decide. Modelling is also an option.

Q: I saw in the dataset, in some variables I think it’s the log values ( so I assumed it’s already normalized?). Why is it better to use the log data and not the raw values?
A: Great question! We’ll get to this a bit today!

Q: If the monitors are not localized at the granularity of zip code level, how do we assess the accuracy of our predictions? Seems like we can’t know the correct result.
A: We’ll get to this in the next set of notes!

Course Announcements

CS01 due Thursday (group work survey due Friday)
Lab07 due Friday
CS02 due next Friday
Final Project due Tues of Finals week (report + presentation)

. . .

Vote:

Option 1: Continue as planned cs01 (12.5) + cs02 (12.5) + final project (18)
Option 2:
- CS01 due Thursday (17)
- Complete either CS02 w/ external data req’d for extension OR Final Project (with final project group; due Finals week) (26)

. . .

CS02 and Final Project repos + emails went out yesterday

please look for your repo/make sure you have access!
please get in communication with your groups if you’re not already in communication

Question

Can we predict US annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, road density, as well as, satellite pollution data and chemical modeling data?

EDA

Getting to “know” the data

The Data

pm <- readr::read_csv(here::here("OCS_data", "data","raw", "pm25_data.csv"))

`skimr`

skimr::skim(pm)

Data summary
Name	pm
Number of rows	876
Number of columns	50
_______________________
Column type frequency:
character	3
numeric	47
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
state	1	4	20	49
county	1	3	20	471
city	1	4	48	607

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	26987.96	1.578761e+04	1003.00	13089.15	26132.00	39118.00	5.603910e+04	▇▇▆▇▆
value	1	10.81	2.580000e+00	3.02	9.27	11.15	12.37	2.316000e+01	▂▆▇▁▁
fips	1	26987.89	1.578763e+04	1003.00	13089.00	26132.00	39118.00	5.603900e+04	▇▇▆▇▆
lat	1	38.48	4.620000e+00	25.47	35.03	39.30	41.66	4.840000e+01	▁▃▅▇▂
lon	1	-91.74	1.496000e+01	-124.18	-99.16	-87.47	-80.69	-6.804000e+01	▃▂▃▇▃
CMAQ	1	8.41	2.970000e+00	1.63	6.53	8.62	10.24	2.313000e+01	▃▇▃▁▁
zcta	1	50890.29	2.778447e+04	1022.00	28788.25	48172.00	74371.00	9.920200e+04	▅▇▇▅▇
zcta_area	1	183173481.91	5.425989e+08	15459.00	14204601.75	37653560.50	160041508.25	8.164821e+09	▇▁▁▁▁
zcta_pop	1	24227.58	1.777216e+04	0.00	9797.00	22014.00	35004.75	9.539700e+04	▇▇▃▁▁
imp_a500	1	24.72	1.934000e+01	0.00	3.70	25.12	40.22	6.961000e+01	▇▅▆▃▂
imp_a1000	1	24.26	1.802000e+01	0.00	5.32	24.53	38.59	6.750000e+01	▇▅▆▃▁
imp_a5000	1	19.93	1.472000e+01	0.05	6.79	19.07	30.11	7.460000e+01	▇▆▃▁▁
imp_a10000	1	15.82	1.381000e+01	0.09	4.54	12.36	24.17	7.209000e+01	▇▃▂▁▁
imp_a15000	1	13.43	1.312000e+01	0.11	3.24	9.67	20.55	7.110000e+01	▇▃▁▁▁
county_area	1	3768701992.12	6.212830e+09	33703512.00	1116536297.50	1690826566.50	2878192209.00	5.194723e+10	▇▁▁▁▁
county_pop	1	687298.44	1.293489e+06	783.00	100948.00	280730.50	743159.00	9.818605e+06	▇▁▁▁▁
log_dist_to_prisec	1	6.19	1.410000e+00	-1.46	5.43	6.36	7.15	1.045000e+01	▁▁▃▇▁
log_pri_length_5000	1	9.82	1.080000e+00	8.52	8.52	10.05	10.73	1.205000e+01	▇▂▆▅▂
log_pri_length_10000	1	10.92	1.130000e+00	9.21	9.80	11.17	11.83	1.302000e+01	▇▂▇▇▃
log_pri_length_15000	1	11.50	1.150000e+00	9.62	10.87	11.72	12.40	1.359000e+01	▆▂▇▇▃
log_pri_length_25000	1	12.24	1.100000e+00	10.13	11.69	12.46	13.05	1.436000e+01	▅▃▇▇▃
log_prisec_length_500	1	6.99	9.500000e-01	6.21	6.21	6.21	7.82	9.400000e+00	▇▁▂▂▁
log_prisec_length_1000	1	8.56	7.900000e-01	7.60	7.60	8.66	9.20	1.047000e+01	▇▅▆▃▁
log_prisec_length_5000	1	11.28	7.800000e-01	8.52	10.91	11.42	11.83	1.278000e+01	▁▁▃▇▃
log_prisec_length_10000	1	12.41	7.300000e-01	9.21	11.99	12.53	12.94	1.385000e+01	▁▁▃▇▅
log_prisec_length_15000	1	13.03	7.200000e-01	9.62	12.59	13.13	13.57	1.441000e+01	▁▁▃▇▅
log_prisec_length_25000	1	13.82	7.000000e-01	10.13	13.38	13.92	14.35	1.523000e+01	▁▁▃▇▆
log_nei_2008_pm25_sum_10000	1	3.97	2.350000e+00	0.00	2.15	4.29	5.69	9.120000e+00	▆▅▇▆▂
log_nei_2008_pm25_sum_15000	1	4.72	2.250000e+00	0.00	3.47	5.00	6.35	9.420000e+00	▃▃▇▇▂
log_nei_2008_pm25_sum_25000	1	5.67	2.110000e+00	0.00	4.66	5.91	7.28	9.650000e+00	▂▂▇▇▃
log_nei_2008_pm10_sum_10000	1	4.35	2.320000e+00	0.00	2.69	4.62	6.07	9.340000e+00	▅▅▇▇▂
log_nei_2008_pm10_sum_15000	1	5.10	2.180000e+00	0.00	3.87	5.39	6.72	9.710000e+00	▂▃▇▇▂
log_nei_2008_pm10_sum_25000	1	6.07	2.010000e+00	0.00	5.10	6.37	7.52	9.880000e+00	▁▂▆▇▃
popdens_county	1	551.76	1.711510e+03	0.26	40.77	156.67	510.81	2.682191e+04	▇▁▁▁▁
popdens_zcta	1	1279.66	2.757490e+03	0.00	101.15	610.35	1382.52	3.041884e+04	▇▁▁▁▁
nohs	1	6.99	7.210000e+00	0.00	2.70	5.10	8.80	1.000000e+02	▇▁▁▁▁
somehs	1	10.17	6.200000e+00	0.00	5.90	9.40	13.90	7.220000e+01	▇▂▁▁▁
hs	1	30.32	1.140000e+01	0.00	23.80	30.75	36.10	1.000000e+02	▂▇▂▁▁
somecollege	1	21.58	8.600000e+00	0.00	17.50	21.30	24.70	1.000000e+02	▆▇▁▁▁
associate	1	7.13	4.010000e+00	0.00	4.90	7.10	8.80	7.140000e+01	▇▁▁▁▁
bachelor	1	14.90	9.710000e+00	0.00	8.80	12.95	19.22	1.000000e+02	▇▂▁▁▁
grad	1	8.91	8.650000e+00	0.00	3.90	6.70	11.00	1.000000e+02	▇▁▁▁▁
pov	1	14.95	1.133000e+01	0.00	6.50	12.10	21.22	6.590000e+01	▇▅▂▁▁
hs_orless	1	47.48	1.675000e+01	0.00	37.92	48.65	59.10	1.000000e+02	▁▃▇▃▁
urc2013	1	2.92	1.520000e+00	1.00	2.00	3.00	4.00	6.000000e+00	▇▅▃▂▁
urc2006	1	2.97	1.520000e+00	1.00	2.00	3.00	4.00	6.000000e+00	▇▅▃▂▁
aod	1	43.70	1.956000e+01	5.00	31.66	40.17	49.67	1.430000e+02	▃▇▁▁▁

Why 49 “states”?

pm %>% 
  distinct(state)

# A tibble: 49 × 1
   state               
   <chr>               
 1 Alabama             
 2 Arizona             
 3 Arkansas            
 4 California          
 5 Colorado            
 6 Connecticut         
 7 Delaware            
 8 District Of Columbia
 9 Florida             
10 Georgia             
# ℹ 39 more rows

DC is included
Alaska and Hawaii are not

Number of monitors per city?

pm %>% filter(city == "San Diego")

# A tibble: 2 × 50
     id value  fips   lat   lon state      county    city   CMAQ  zcta zcta_area
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      <chr>     <chr> <dbl> <dbl>     <dbl>
1 6073.  11.3  6073  32.8 -117. California San Diego San …  9.68 92123  21148247
2 6073.  13.7  6073  32.7 -117. California San Diego San …  9.68 92113  13647793
# ℹ 39 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
#   imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
#   county_pop <dbl>, log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
#   log_pri_length_10000 <dbl>, log_pri_length_15000 <dbl>,
#   log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
#   log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
#   log_prisec_length_10000 <dbl>, log_prisec_length_15000 <dbl>, …

. . .

pm %>% filter(city == "Baltimore")

# A tibble: 5 × 50
      id value  fips   lat   lon state    county     city   CMAQ  zcta zcta_area
   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>    <chr>      <chr> <dbl> <dbl>     <dbl>
1 24510.  12.2 24510  39.3 -76.6 Maryland Baltimore… Balt…  10.9 21251    461424
2 24510.  12.5 24510  39.3 -76.7 Maryland Baltimore… Balt…  10.9 21215  17645223
3 24510.  12.8 24510  39.3 -76.5 Maryland Baltimore… Balt…  10.9 21224  24539976
4 24510.  14.3 24510  39.2 -76.6 Maryland Baltimore… Balt…  10.9 21226  25718732
5 24510.  13.3 24510  39.3 -76.6 Maryland Baltimore… Balt…  10.9 21202   4111039
# ℹ 39 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
#   imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
#   county_pop <dbl>, log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
#   log_pri_length_10000 <dbl>, log_pri_length_15000 <dbl>,
#   log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
#   log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
#   log_prisec_length_10000 <dbl>, log_prisec_length_15000 <dbl>, …

. . .

San Diego has 2, while Baltimore has 5, despite having very similar population densities (popdens_county)…and San Diego having a much larger population (county_pop) and land area (county_area).

Feature Correlation

Why do we care if variables in our dataset are correlated?

we don’t want to include redundant variables
can add unnecessary noise to our algorithm causing a reduction in prediction accuracy
can cause our algorithm to be slower
can also make it difficult to interpret what variables are actually predictive

. . .

Taking a look at our numeric variables…

PM_cor <- cor(pm %>% dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)

deep blue | strongly, positively correlated
deep red | strongly, negatively correlated

. . .

If we don’t care about direction, but only strength…and using hierarchical clustering:

corrplot::corrplot(abs(PM_cor), order = "hclust", tl.cex = 0.5, cl.lim = c(0, 1))

. . .

Observations:

development variables (imp), road density (pri), and the emission (nei) variables all seem to be correlated with their group
none of the predictors are correlated with value (our outcome)

Development (`imp`)

# we used GGally in a previous set of notes
# will need to install if you haven't yet
select(pm, contains("imp")) %>%
  GGally::ggpairs()

Emmissoins (`nei`)

select(pm, contains("nei")) %>%
  GGally::ggpairs()

. . .

Road Density (`pri`)

select(pm, contains("pri")) %>%
  GGally::ggcorr(hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

Warning: colors are reversed from above. If included in final report, you’d want consistency.

Are the categories correlated with one another?

pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county, 
       log_pri_length_10000, imp_a10000, county_pop) %>%
  GGally::ggpairs()

. . .

Reminder:

log_nei_2008_pm25_sum_10000 | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
popdens_county | Population density (number of people per kilometer squared area of the county)
log_pri_length_10000 | Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
– Highways only
imp_a10000 | Impervious surface measure
– Within a circle with a radius of 10000 meters around the monitor
county_pop | Population of the county of the monitor

Log-transforming right-skewed data

pm %>%
  mutate(log_popdens_county=log(popdens_county),
         log_pop_county = log(county_pop)) %>%
  select(log_nei_2008_pm25_sum_10000, log_popdens_county, 
       log_pri_length_10000, imp_a10000, log_pop_county) %>%
  GGally::ggpairs()

Your Turn

💪 Try to learn at least three things about the data that we haven’t yet discussed now on your own.

Put a green sticky on the front of your computer when you’re done. Put a pink if you want help/have a question.

Outcome variable

summary(pm$value)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.024   9.268  11.153  10.808  12.369  23.161

ggplot(pm, aes(value)) + geom_histogram()

What do we know about the monitors that have very high values?

pm |> 
  filter(value > 20) |>
  select(state)

# A tibble: 6 × 1
  state     
  <chr>     
1 California
2 California
3 California
4 California
5 California
6 California

CS02: Predicting Air Pollution (EDA)

Q&A

Course Announcements

Question

EDA

The Data

skimr

Why 49 “states”?

Number of monitors per city?

Feature Correlation

Development (imp)

Emmissoins (nei)

Road Density (pri)

Are the categories correlated with one another?

Log-transforming right-skewed data

Your Turn

`skimr`

Development (`imp`)

Emmissoins (`nei`)

Road Density (`pri`)