17-cs02-analysis

Professor Shannon Ellis

2023-11-30

CS02: Predicting Air Pollution (Analysis)

Q&A

Q: Just to confirm, if we choose to do the CS02 do we still do it with our final group or the assigned CS02 group?
A: Final group!

Q: If we choose to do the final project instead of CS02, will everything be the same as our originally planned final project in terms of content, except for the grade weighting?
A: Yup!

Q: Understanding some of the variables’ meanings was tricky; maybe it would have been good for us to recode them to make them more intuitive.
A: A good suggestion if you decide to go this route for the final!

Q: For the distributions on the diagonal of the correlation graphs, what do the x/y axes represent?
A: It’s a density plot (showing the distribution) for each individual variable, namely the one in that row.

Course Announcements

  • CS01 due tonight (group work survey due Friday)
  • Lab07 due tomorrow (Friday)
  • Final Project due Tues of Finals week (report + presentation + general communication)
  • Any CS01 questions?

Question

Can we predict US annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Analysis

Recall:

The Data

pm <- read_csv(here::here("OCS_data", "data", "raw", "pm25_data.csv"))
Rows: 876 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): state, county, city
dbl (47): id, value, fips, lat, lon, CMAQ, zcta, zcta_area, zcta_pop, imp_a5...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
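As the message itself suggests, that printout can be quieted; a minimal variant of the same call:

# same file, but suppressing the column-type message
pm <- read_csv(here::here("OCS_data", "data", "raw", "pm25_data.csv"),
               show_col_types = FALSE)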
# Converting to factors as discussed last class
pm <- pm %>%
  dplyr::mutate(across(c(id, fips, zcta), as.factor)) 

Data Splitting

Specify the split:

set.seed(1234)
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
<Training/Testing/Total>
<584/292/876>
  • set.seed() ensures we all get the exact same random split
  • output displayed: <training sample size / testing sample size / total sample size>

More on how people decide what proportions to use for data splitting here
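As an aside, initial_split() also accepts a strata argument; a hypothetical variation (not used in this case study, and the object name is illustrative) that keeps the outcome’s distribution similar across the two partitions:

# stratified split on the (automatically binned) numeric outcome
set.seed(1234)
pm_split_strat <- rsample::initial_split(data = pm, prop = 2/3, strata = value)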

Split the Data

train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
 
# Scroll through the output!
count(train_pm, state)
# A tibble: 49 × 2
   state                    n
   <chr>                <int>
 1 Alabama                 13
 2 Arizona                 12
 3 Arkansas                 8
 4 California              55
 5 Colorado                10
 6 Connecticut             12
 7 Delaware                 3
 8 District Of Columbia     2
 9 Florida                 22
10 Georgia                 20
# ℹ 39 more rows

❓ What do you observe about the output?

Pre-processing: recipe() + bake()

Need to:

  • specify predictors vs. outcome
  • scale variables
  • remove redundant variables (feature engineering)

recipe() provides a standardized format for specifying a sequence of pre-processing steps

Step 1: Specify variable roles

The simplest approach…

simple_rec <- train_pm %>%
  recipes::recipe(value ~ .)

simple_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 49

…but we need to specify which column includes ID information

simple_rec <- train_pm %>%
  recipes::recipe(value ~ .) %>%
  recipes::update_role(id, new_role = "id variable")

simple_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:      1
predictor:   48
id variable:  1

…and which are our predictors and which is our outcome

simple_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable")

simple_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:      1
predictor:   48
id variable:  1

❓ Can someone summarize what this code is specifying?

Summarizing our recipe thus far:

summary(simple_rec)
# A tibble: 50 × 4
   variable type      role        source  
   <chr>    <list>    <chr>       <chr>   
 1 id       <chr [3]> id variable original
 2 value    <chr [2]> outcome     original
 3 fips     <chr [3]> predictor   original
 4 lat      <chr [2]> predictor   original
 5 lon      <chr [2]> predictor   original
 6 state    <chr [3]> predictor   original
 7 county   <chr [3]> predictor   original
 8 city     <chr [3]> predictor   original
 9 CMAQ     <chr [2]> predictor   original
10 zcta     <chr [3]> predictor   original
# ℹ 40 more rows

Step 2: Specify pre-processing with step_*()

There are step functions for a variety of purposes:

  1. Imputation – filling in missing values based on the existing data
  2. Transformation – changing all values of a variable in the same way, typically to make it more normal or easier to interpret
  3. Discretization – converting continuous values into discrete or nominal values - binning for example to reduce the number of possible levels (However this is generally not advisable!)
  4. Encoding / Creating Dummy Variables – creating a numeric code for categorical variables (More on one-hot and Dummy Variables encoding)
  5. Data type conversions – changing a variable’s type, for example from integer to factor or from numeric to date
  6. Interaction terms – adding terms for pairs of predictors that influence each other’s capacity to predict the outcome
  7. Normalization – centering and scaling the data to a similar range of values
  8. Dimensionality Reduction/ Signal Extraction – reducing the space of features or predictors to a smaller set of variables that capture the variation or signal in the original variables (ex. Principal Component Analysis and Independent Component Analysis)
  9. Filtering – filtering options for removing variables (ex. remove variables that are highly correlated to others or remove variables with very little variance and therefore likely little predictive capacity)
  10. Row operations – performing functions on the values within the rows (ex. rearranging, filtering, imputing)
  11. Checking functions – Gut checks to look for missing values, to look at the variable classes etc.

This link and this link show the many options for recipe step functions.

There are several ways to select what variables to apply steps to:

  1. Using tidyselect methods: contains(), matches(), starts_with(), ends_with(), everything(), num_range()
  2. Using the type: all_nominal(), all_numeric(), has_type()
  3. Using the role: all_predictors(), all_outcomes(), has_role()
  4. Using the name - use the actual name of the variable/variables of interest (several selectors appear in the sketch below)
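To make the step functions and selectors concrete, here is a hypothetical recipe (not one we will actually use) that mixes a few of the step types above with different selector styles:

# illustrative only: imputation, transformation, normalization, and PCA
recipe(value ~ ., data = train_pm) %>%
  step_impute_mean(all_numeric_predictors()) %>%    # fill NAs with column means (select by type/role)
  step_log(starts_with("county_")) %>%              # log-transform (tidyselect by name)
  step_normalize(all_numeric_predictors()) %>%      # center and scale
  step_pca(all_numeric_predictors(), num_comp = 5)  # reduce to 5 principal components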

One-hot encoding categorical variables:

simple_rec %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE)
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:      1
predictor:   48
id variable:  1
── Operations 
• Dummy variables from: state, county, city, zcta

❓ Can anyone remind us what one-hot encoding does?
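As a reminder, one-hot encoding creates one 0/1 indicator column per factor level. A toy sketch (hypothetical data, not from the case study):

toy <- tibble::tibble(color = c("red", "green", "blue"), y = 1:3)
recipe(y ~ color, data = toy) %>%
  step_dummy(color, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)
# yields color_blue, color_green, color_red indicator columns (one per level);
# with one_hot = FALSE, one reference level would be dropped instead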

  • fips includes the numeric codes for state and county, so it’s another way to specify county
  • so, we’ll change fips’ role
  • we get to decide what to call it ("county id")
simple_rec %>%
  update_role("fips", new_role = "county id")
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:      1
predictor:   47
county id:    1
id variable:  1

Removing highly correlated variables:

simple_rec %>%
  step_corr(all_predictors(), - CMAQ, - aod)
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:      1
predictor:   48
id variable:  1
── Operations 
• Correlation filter on: all_predictors(), -CMAQ, -aod
  • specifying to KEEP CMAQ and aod

Removing variables with near-zero variance:

simple_rec %>%
  step_nzv(all_predictors(), - CMAQ, - aod)
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:      1
predictor:   48
id variable:  1
── Operations 
• Sparse, unbalanced variable filter on: all_predictors(), -CMAQ, -aod

Putting our recipe together

simple_rec <- simple_rec %>% 
  update_role("fips", new_role = "county id") %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
  step_corr(all_predictors(), - CMAQ, - aod) %>%
  step_nzv(all_predictors(), - CMAQ, - aod)
  
simple_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:      1
predictor:   47
county id:    1
id variable:  1
── Operations 
• Dummy variables from: state, county, city, zcta
• Correlation filter on: all_predictors(), -CMAQ, -aod
• Sparse, unbalanced variable filter on: all_predictors(), -CMAQ, -aod

Note: order of steps matters (e.g., running step_corr() before step_dummy() would filter correlations among a different set of columns than running it after)

Step 3: Running the pre-processing (prep)

There are some important arguments to know about:

  1. training - you must supply a training data set to estimate parameters for pre-processing operations (recipe steps) - this may already be included in your recipe - as is the case for us
  2. fresh - if fresh=TRUE, will retrain and estimate parameters for any previous steps that were already prepped if you add more steps to the recipe (default is FALSE)
  3. verbose - if verbose=TRUE, shows the progress as the steps are evaluated and the size of the pre-processed training set (default is FALSE)
  4. retain - if retain=TRUE, then the pre-processed training set will be saved within the recipe (as template). This is good if you are likely to add more steps and do not want to rerun the prep() on the previous steps. However this can make the recipe size large. This is necessary if you want to actually look at the pre-processed data (default is TRUE)
prepped_rec <- prep(simple_rec, verbose = TRUE, retain = TRUE )
oper 1 step dummy [training] 
oper 2 step corr [training] 
oper 3 step nzv [training] 
The retained training set is ~ 0.26 Mb  in memory.
names(prepped_rec)
 [1] "var_info"       "term_info"      "steps"          "template"      
 [5] "levels"         "retained"       "requirements"   "tr_info"       
 [9] "orig_lvls"      "last_term_info"

This output includes a lot of information:

  1. the steps that were run
  2. the original variable info (var_info)
  3. the updated variable info after pre-processing (term_info)
  4. the new levels of the variables
  5. the original levels of the variables (orig_lvls)
  6. info about the training data set size and completeness (tr_info)

Step 4: Extract pre-processed training data using bake()

bake(): apply our modeling steps (in this case just pre-processing on the training data) and see what it would do to the data

baked_train <- bake(prepped_rec, new_data = NULL)
glimpse(baked_train)
Rows: 584
Columns: 37
$ id                          <fct> 18003.0004, 55041.0007, 6065.1003, 39009.0…
$ value                       <dbl> 11.699065, 6.956780, 13.289744, 10.742000,…
$ fips                        <fct> 18003, 55041, 6065, 39009, 39061, 24510, 6…
$ lat                         <dbl> 41.09497, 45.56300, 33.94603, 39.44217, 39…
$ lon                         <dbl> -85.10182, -88.80880, -117.40063, -81.9088…
$ CMAQ                        <dbl> 10.383231, 3.411247, 11.404085, 7.971165, …
$ zcta_area                   <dbl> 16696709, 370280916, 41957182, 132383592, …
$ zcta_pop                    <dbl> 21306, 4141, 44001, 1115, 6566, 934, 41192…
$ imp_a500                    <dbl> 28.9783737, 0.0000000, 30.3901384, 0.00000…
$ imp_a15000                  <dbl> 13.0547959, 0.3676404, 23.7457506, 0.33079…
$ county_area                 <dbl> 1702419942, 2626421270, 18664696661, 13043…
$ county_pop                  <dbl> 355329, 9304, 2189641, 64757, 802374, 6209…
$ log_dist_to_prisec          <dbl> 6.621891, 8.415468, 7.419762, 6.344681, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 10.150514, 8.517193, 9…
$ log_pri_length_25000        <dbl> 12.77378, 10.16440, 13.14450, 10.12663, 13…
$ log_prisec_length_500       <dbl> 6.214608, 6.214608, 6.214608, 6.214608, 7.…
$ log_prisec_length_1000      <dbl> 9.240294, 7.600902, 7.600902, 8.793450, 8.…
$ log_prisec_length_5000      <dbl> 11.485093, 9.425537, 10.155961, 10.562382,…
$ log_prisec_length_10000     <dbl> 12.75582, 11.44833, 11.59563, 11.69093, 12…
$ log_nei_2008_pm10_sum_10000 <dbl> 4.91110140, 3.86982666, 4.03184660, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 5.399131, 3.883689, 5.459257, 0.000000, 6.…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.816047, 3.887264, 6.884537, 3.765635, 6.…
$ popdens_county              <dbl> 208.719947, 3.542463, 117.314577, 49.64834…
$ popdens_zcta                <dbl> 1276.059851, 11.183401, 1048.711994, 8.422…
$ nohs                        <dbl> 4.3, 5.1, 3.7, 4.8, 2.1, 0.0, 2.5, 7.7, 0.…
$ somehs                      <dbl> 6.7, 10.4, 5.9, 11.5, 10.5, 0.0, 4.3, 7.5,…
$ hs                          <dbl> 31.7, 40.3, 17.9, 47.3, 30.0, 0.0, 17.8, 2…
$ somecollege                 <dbl> 27.2, 24.1, 26.3, 20.0, 27.1, 0.0, 26.1, 2…
$ associate                   <dbl> 8.2, 7.4, 8.3, 3.1, 8.5, 71.4, 13.2, 7.6, …
$ bachelor                    <dbl> 15.0, 8.6, 20.2, 9.8, 14.2, 0.0, 23.4, 17.…
$ grad                        <dbl> 6.8, 4.2, 17.7, 3.5, 7.6, 28.6, 12.6, 12.3…
$ pov                         <dbl> 13.500, 18.900, 6.700, 14.400, 12.500, 3.5…
$ hs_orless                   <dbl> 42.7, 55.8, 27.5, 63.6, 42.6, 0.0, 24.6, 3…
$ urc2006                     <dbl> 3, 6, 1, 5, 1, 1, 2, 1, 2, 6, 4, 4, 4, 4, …
$ aod                         <dbl> 54.11111, 31.16667, 83.12500, 33.36364, 50…
$ state_California            <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
$ city_Not.in.a.city          <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
  • new_data = NULL specifies that we’re not (yet) looking at our testing data
  • We’re down to 37 variables (34 predictors + 2 ID variables + 1 outcome)
  • categorical variables (state, county, city, zcta) are gone (one-hot encoded)
  • state_California remains - the only state indicator that survives the near-zero variance filter (California has the largest # of monitors)

Step 5: Extract pre-processed testing data using bake()

bake() takes a trained recipe and applies the operations to a data set to create a design matrix. For example: it applies the centering to new data sets using these means used to create the recipe. - tidymodels documentation

Typically, you want to avoid using your testing data…but our data set is not that large and NA values in our testing dataset could cause issues later on.

baked_test_pm <- recipes::bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
Rows: 292
Columns: 37
$ id                          <fct> 1033.1002, 1055.001, 1069.0003, 1073.0023,…
$ value                       <dbl> 11.212174, 12.375394, 10.508850, 15.591017…
$ fips                        <fct> 1033, 1055, 1069, 1073, 1073, 1073, 1073, …
$ lat                         <dbl> 34.75878, 33.99375, 31.22636, 33.55306, 33…
$ lon                         <dbl> -87.65056, -85.99107, -85.39077, -86.81500…
$ CMAQ                        <dbl> 9.402679, 9.241744, 9.121892, 10.235612, 1…
$ zcta_area                   <dbl> 16716984, 154069359, 162685124, 26929603, …
$ zcta_pop                    <dbl> 9042, 20045, 30217, 9010, 16140, 3699, 137…
$ imp_a500                    <dbl> 19.17301038, 16.49307958, 19.13927336, 41.…
$ imp_a15000                  <dbl> 5.2472094, 5.1612102, 4.7401296, 17.452484…
$ county_area                 <dbl> 1534877333, 1385618994, 1501737720, 287819…
$ county_pop                  <dbl> 54428, 104430, 101547, 658466, 658466, 194…
$ log_dist_to_prisec          <dbl> 5.760131, 5.261457, 7.112373, 6.600958, 6.…
$ log_pri_length_5000         <dbl> 8.517193, 9.066563, 8.517193, 11.156977, 1…
$ log_pri_length_25000        <dbl> 10.15769, 12.01356, 10.12663, 12.98762, 12…
$ log_prisec_length_500       <dbl> 8.611945, 8.740680, 6.214608, 6.214608, 6.…
$ log_prisec_length_1000      <dbl> 9.735569, 9.627898, 7.600902, 9.075921, 8.…
$ log_prisec_length_5000      <dbl> 11.770407, 11.728889, 12.298627, 12.281645…
$ log_prisec_length_10000     <dbl> 12.840663, 12.768279, 12.994141, 13.278416…
$ log_nei_2008_pm10_sum_10000 <dbl> 6.69187313, 4.43719884, 0.92888890, 8.2097…
$ log_nei_2008_pm10_sum_15000 <dbl> 6.70127741, 4.46267932, 3.67473904, 8.6488…
$ log_nei_2008_pm10_sum_25000 <dbl> 7.148858, 4.678311, 3.744629, 8.858019, 8.…
$ popdens_county              <dbl> 35.460814, 75.367038, 67.619664, 228.77763…
$ popdens_zcta                <dbl> 540.8870404, 130.1037411, 185.7391706, 334…
$ nohs                        <dbl> 7.3, 4.3, 5.8, 7.1, 2.7, 11.1, 9.7, 3.0, 8…
$ somehs                      <dbl> 15.8, 13.3, 11.6, 17.1, 6.6, 11.6, 21.6, 1…
$ hs                          <dbl> 30.6, 27.8, 29.8, 37.2, 30.7, 46.0, 39.3, …
$ somecollege                 <dbl> 20.9, 29.2, 21.4, 23.5, 25.7, 17.2, 21.6, …
$ associate                   <dbl> 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, 5.2, 6.6, 4…
$ bachelor                    <dbl> 12.7, 10.0, 13.7, 5.9, 17.6, 7.1, 2.2, 7.8…
$ grad                        <dbl> 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, 0.4, 4.2, 3.…
$ pov                         <dbl> 19.0, 8.8, 15.6, 25.5, 7.3, 8.1, 13.3, 23.…
$ hs_orless                   <dbl> 53.7, 45.4, 47.2, 61.4, 40.0, 68.7, 70.6, …
$ urc2006                     <dbl> 4, 4, 4, 1, 1, 1, 2, 3, 3, 3, 2, 5, 4, 1, …
$ aod                         <dbl> 36.000000, 43.416667, 33.000000, 39.583333…
$ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city          <dbl> NA, NA, NA, 0, 1, 1, 1, NA, NA, NA, 0, NA,…

Hmm… lots of NAs now in city_Not.in.a.city

Likely b/c there are cities in our testing dataset that were not in our training dataset…

traincities <- train_pm %>% distinct(city)
testcities <- test_pm %>% distinct(city)

#get the number of cities that were different
dim(dplyr::setdiff(traincities, testcities))
[1] 381   1
#get the number of cities that overlapped
dim(dplyr::intersect(traincities, testcities))
[1] 55  1

Aside: return to wrangling

A quick return to wrangling…and re-splitting our data

pm <- pm %>%
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))

set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
<Training/Testing/Total>
<584/292/876>
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)

And a recipe update…(putting it all together)

novel_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
    step_corr(all_numeric()) %>%
    step_nzv(all_numeric()) 

re-bake()

prepped_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)
oper 1 step dummy [training] 
oper 2 step corr [training] 
oper 3 step nzv [training] 
The retained training set is ~ 0.27 Mb  in memory.
baked_train <- bake(prepped_rec, new_data = NULL)

Looking at the output

glimpse(baked_train)
Rows: 584
Columns: 38
$ id                          <fct> 18003.0004, 55041.0007, 6065.1003, 39009.0…
$ value                       <dbl> 11.699065, 6.956780, 13.289744, 10.742000,…
$ fips                        <fct> 18003, 55041, 6065, 39009, 39061, 24510, 6…
$ lat                         <dbl> 41.09497, 45.56300, 33.94603, 39.44217, 39…
$ lon                         <dbl> -85.10182, -88.80880, -117.40063, -81.9088…
$ CMAQ                        <dbl> 10.383231, 3.411247, 11.404085, 7.971165, …
$ zcta_area                   <dbl> 16696709, 370280916, 41957182, 132383592, …
$ zcta_pop                    <dbl> 21306, 4141, 44001, 1115, 6566, 934, 41192…
$ imp_a500                    <dbl> 28.9783737, 0.0000000, 30.3901384, 0.00000…
$ imp_a15000                  <dbl> 13.0547959, 0.3676404, 23.7457506, 0.33079…
$ county_area                 <dbl> 1702419942, 2626421270, 18664696661, 13043…
$ county_pop                  <dbl> 355329, 9304, 2189641, 64757, 802374, 6209…
$ log_dist_to_prisec          <dbl> 6.621891, 8.415468, 7.419762, 6.344681, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 10.150514, 8.517193, 9…
$ log_pri_length_25000        <dbl> 12.77378, 10.16440, 13.14450, 10.12663, 13…
$ log_prisec_length_500       <dbl> 6.214608, 6.214608, 6.214608, 6.214608, 7.…
$ log_prisec_length_1000      <dbl> 9.240294, 7.600902, 7.600902, 8.793450, 8.…
$ log_prisec_length_5000      <dbl> 11.485093, 9.425537, 10.155961, 10.562382,…
$ log_prisec_length_10000     <dbl> 12.75582, 11.44833, 11.59563, 11.69093, 12…
$ log_prisec_length_25000     <dbl> 13.98749, 13.15082, 13.44293, 13.58697, 14…
$ log_nei_2008_pm10_sum_10000 <dbl> 4.91110140, 3.86982666, 4.03184660, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 5.399131, 3.883689, 5.459257, 0.000000, 6.…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.816047, 3.887264, 6.884537, 3.765635, 6.…
$ popdens_county              <dbl> 208.719947, 3.542463, 117.314577, 49.64834…
$ popdens_zcta                <dbl> 1276.059851, 11.183401, 1048.711994, 8.422…
$ nohs                        <dbl> 4.3, 5.1, 3.7, 4.8, 2.1, 0.0, 2.5, 7.7, 0.…
$ somehs                      <dbl> 6.7, 10.4, 5.9, 11.5, 10.5, 0.0, 4.3, 7.5,…
$ hs                          <dbl> 31.7, 40.3, 17.9, 47.3, 30.0, 0.0, 17.8, 2…
$ somecollege                 <dbl> 27.2, 24.1, 26.3, 20.0, 27.1, 0.0, 26.1, 2…
$ associate                   <dbl> 8.2, 7.4, 8.3, 3.1, 8.5, 71.4, 13.2, 7.6, …
$ bachelor                    <dbl> 15.0, 8.6, 20.2, 9.8, 14.2, 0.0, 23.4, 17.…
$ grad                        <dbl> 6.8, 4.2, 17.7, 3.5, 7.6, 28.6, 12.6, 12.3…
$ pov                         <dbl> 13.500, 18.900, 6.700, 14.400, 12.500, 3.5…
$ hs_orless                   <dbl> 42.7, 55.8, 27.5, 63.6, 42.6, 0.0, 24.6, 3…
$ urc2006                     <dbl> 3, 6, 1, 5, 1, 1, 2, 1, 2, 6, 4, 4, 4, 4, …
$ aod                         <dbl> 54.11111, 31.16667, 83.12500, 33.36364, 50…
$ state_California            <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
$ city_Not.in.a.city          <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …

Making sure the NA issue is taken care of:

baked_test_pm <- bake(prepped_rec, new_data = test_pm)

glimpse(baked_test_pm)
Rows: 292
Columns: 38
$ id                          <fct> 1033.1002, 1055.001, 1069.0003, 1073.0023,…
$ value                       <dbl> 11.212174, 12.375394, 10.508850, 15.591017…
$ fips                        <fct> 1033, 1055, 1069, 1073, 1073, 1073, 1073, …
$ lat                         <dbl> 34.75878, 33.99375, 31.22636, 33.55306, 33…
$ lon                         <dbl> -87.65056, -85.99107, -85.39077, -86.81500…
$ CMAQ                        <dbl> 9.402679, 9.241744, 9.121892, 10.235612, 1…
$ zcta_area                   <dbl> 16716984, 154069359, 162685124, 26929603, …
$ zcta_pop                    <dbl> 9042, 20045, 30217, 9010, 16140, 3699, 137…
$ imp_a500                    <dbl> 19.17301038, 16.49307958, 19.13927336, 41.…
$ imp_a15000                  <dbl> 5.2472094, 5.1612102, 4.7401296, 17.452484…
$ county_area                 <dbl> 1534877333, 1385618994, 1501737720, 287819…
$ county_pop                  <dbl> 54428, 104430, 101547, 658466, 658466, 194…
$ log_dist_to_prisec          <dbl> 5.760131, 5.261457, 7.112373, 6.600958, 6.…
$ log_pri_length_5000         <dbl> 8.517193, 9.066563, 8.517193, 11.156977, 1…
$ log_pri_length_25000        <dbl> 10.15769, 12.01356, 10.12663, 12.98762, 12…
$ log_prisec_length_500       <dbl> 8.611945, 8.740680, 6.214608, 6.214608, 6.…
$ log_prisec_length_1000      <dbl> 9.735569, 9.627898, 7.600902, 9.075921, 8.…
$ log_prisec_length_5000      <dbl> 11.770407, 11.728889, 12.298627, 12.281645…
$ log_prisec_length_10000     <dbl> 12.840663, 12.768279, 12.994141, 13.278416…
$ log_prisec_length_25000     <dbl> 13.79973, 13.70026, 13.85550, 14.45221, 13…
$ log_nei_2008_pm10_sum_10000 <dbl> 6.69187313, 4.43719884, 0.92888890, 8.2097…
$ log_nei_2008_pm10_sum_15000 <dbl> 6.70127741, 4.46267932, 3.67473904, 8.6488…
$ log_nei_2008_pm10_sum_25000 <dbl> 7.148858, 4.678311, 3.744629, 8.858019, 8.…
$ popdens_county              <dbl> 35.460814, 75.367038, 67.619664, 228.77763…
$ popdens_zcta                <dbl> 540.8870404, 130.1037411, 185.7391706, 334…
$ nohs                        <dbl> 7.3, 4.3, 5.8, 7.1, 2.7, 11.1, 9.7, 3.0, 8…
$ somehs                      <dbl> 15.8, 13.3, 11.6, 17.1, 6.6, 11.6, 21.6, 1…
$ hs                          <dbl> 30.6, 27.8, 29.8, 37.2, 30.7, 46.0, 39.3, …
$ somecollege                 <dbl> 20.9, 29.2, 21.4, 23.5, 25.7, 17.2, 21.6, …
$ associate                   <dbl> 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, 5.2, 6.6, 4…
$ bachelor                    <dbl> 12.7, 10.0, 13.7, 5.9, 17.6, 7.1, 2.2, 7.8…
$ grad                        <dbl> 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, 0.4, 4.2, 3.…
$ pov                         <dbl> 19.0, 8.8, 15.6, 25.5, 7.3, 8.1, 13.3, 23.…
$ hs_orless                   <dbl> 53.7, 45.4, 47.2, 61.4, 40.0, 68.7, 70.6, …
$ urc2006                     <dbl> 4, 4, 4, 1, 1, 1, 2, 3, 3, 3, 2, 5, 4, 1, …
$ aod                         <dbl> 36.000000, 43.416667, 33.000000, 39.583333…
$ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …

Specifying our model (parsnip)

There are four things we need to define about our model:

  1. The type of model (using specific functions in parsnip like rand_forest(), logistic_reg() etc.)
  2. The package or engine that we will use to implement the type of model selected (using the set_engine() function)
  3. The mode of learning - classification or regression (using the set_mode() function)
  4. Any arguments necessary for the model/package selected (using the set_args() function - for example the mtry = argument for random forest, which is the number of variables to be used as options for splitting at each tree node; see the combined sketch below)
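Putting those four pieces together in one hypothetical specification (the specifications we actually fit are below):

parsnip::rand_forest() %>%                # 1. type of model
  parsnip::set_engine("randomForest") %>% # 2. engine/package
  parsnip::set_mode("regression") %>%     # 3. mode of learning
  parsnip::set_args(mtry = 10)            # 4. model-specific arguments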

Step 1: Specify the model

  • We’ll start with linear regression, but move to random forest
  • See here for modeling options in parsnip.
lm_PM_model <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm") %>%
  set_mode("regression")

lm_PM_model
Linear Regression Model Specification (regression)

Computational engine: lm 

Step 2: Fit the model

  • workflows package allows us to keep track of both our pre-processing steps and our model specification
  • It also allows us to implement fancier optimizations in an automated way, and it can handle post-processing operations.
PM_wflow <- workflows::workflow() %>%
            workflows::add_recipe(novel_rec) %>%
            workflows::add_model(lm_PM_model)
PM_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

❓ Who can explain the difference between a recipe, baking, and a workflow?

Step 3: Fit the workflow (estimate the parameters)

PM_wflow_fit <- parsnip::fit(PM_wflow, data = train_pm)
PM_wflow_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                (Intercept)                          lat  
                  2.936e+02                    3.261e-02  
                        lon                         CMAQ  
                  1.586e-02                    2.463e-01  
                  zcta_area                     zcta_pop  
                 -3.433e-10                    1.013e-05  
                   imp_a500                   imp_a15000  
                  5.064e-03                   -3.066e-03  
                county_area                   county_pop  
                 -2.324e-11                   -7.576e-08  
         log_dist_to_prisec          log_pri_length_5000  
                  6.214e-02                   -2.006e-01  
       log_pri_length_25000        log_prisec_length_500  
                 -5.411e-02                    2.204e-01  
     log_prisec_length_1000       log_prisec_length_5000  
                  1.154e-01                    2.374e-01  
    log_prisec_length_10000      log_prisec_length_25000  
                 -3.436e-02                    5.224e-01  
log_nei_2008_pm10_sum_10000  log_nei_2008_pm10_sum_15000  
                  1.829e-01                   -2.355e-02  
log_nei_2008_pm10_sum_25000               popdens_county  
                  2.403e-02                    2.203e-05  
               popdens_zcta                         nohs  
                 -2.132e-06                   -2.983e+00  
                     somehs                           hs  
                 -2.956e+00                   -2.962e+00  
                somecollege                    associate  
                 -2.967e+00                   -2.999e+00  
                   bachelor                         grad  
                 -2.979e+00                   -2.978e+00  
                        pov                    hs_orless  
                  1.859e-03                           NA  
                    urc2006                          aod  
                  2.577e-01                    1.535e-02  
           state_California           city_Not.in.a.city  
                  3.114e+00                   -4.250e-02  

Step 4: Assess model fit

wflowoutput <- PM_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  broom::tidy() 

wflowoutput
# A tibble: 36 × 5
   term         estimate std.error statistic       p.value
   <chr>           <dbl>     <dbl>     <dbl>         <dbl>
 1 (Intercept)  2.94e+ 2  1.18e+ 2     2.49  0.0130       
 2 lat          3.26e- 2  2.28e- 2     1.43  0.153        
 3 lon          1.59e- 2  1.01e- 2     1.58  0.115        
 4 CMAQ         2.46e- 1  3.97e- 2     6.20  0.00000000108
 5 zcta_area   -3.43e-10  1.60e-10    -2.15  0.0320       
 6 zcta_pop     1.01e- 5  5.33e- 6     1.90  0.0578       
 7 imp_a500     5.06e- 3  7.42e- 3     0.683 0.495        
 8 imp_a15000  -3.07e- 3  1.16e- 2    -0.263 0.792        
 9 county_area -2.32e-11  1.97e-11    -1.18  0.238        
10 county_pop  -7.58e- 8  9.29e- 8    -0.815 0.415        
# ℹ 26 more rows
  • We have fit our model on our training data
  • We have created a model to predict values of air pollution based on the predictors that we have included

Understanding what variables are most important in our model…

PM_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  vip::vip(num_features = 10)

A closer look at monitors in CA:

baked_train %>% 
  mutate(state_California = as.factor(state_California)) %>%
  mutate(state_California = recode(state_California, 
                                   "0" = "Not California", 
                                   "1" = "California")) %>%
  ggplot(aes(x = state_California, y = value)) + 
  geom_boxplot() +
  geom_jitter(width = .05) + 
  xlab("Location of Monitor")

Remember: machine learning (ML) can be framed as an optimization problem that tries to minimize the distance between our predicted outcome \(\hat{Y} = f(X)\) and the actual outcome \(Y\), using our features (or predictor variables) \(X\) as input to a function \(f\) that we want to estimate.

\[d(Y - \hat{Y})\]

Let’s pull out our predicted outcome values \(\hat{Y} = f(X)\) from the models we fit (using different approaches).

wf_fit <- PM_wflow_fit %>% 
  extract_fit_parsnip()


wf_fitted_values <- 
  broom::augment(wf_fit[["fit"]], data = baked_train) %>% 
  select(value, .fitted:.std.resid)

head(wf_fitted_values)
# A tibble: 6 × 6
  value .fitted   .hat .sigma   .cooksd .std.resid
  <dbl>   <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
1 11.7    12.2  0.0370   2.05 0.0000648     -0.243
2  6.96    9.14 0.0496   2.05 0.00179       -1.09 
3 13.3    12.6  0.0484   2.05 0.000151       0.322
4 10.7    10.4  0.0502   2.05 0.0000504      0.183
5 14.5    11.9  0.0243   2.05 0.00113        1.26 
6 12.2     9.52 0.476    2.04 0.0850         1.81 

Visualizing Model Performance

wf_fitted_values %>% 
  ggplot(aes(x =  value, y = .fitted)) + 
  geom_point() + 
  xlab("actual outcome values") + 
  ylab("predicted outcome values")

❓ What do you notice about/learn from these results?

Quantifying Model Performance

\[RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(\hat{y}_i - y_i)}^2}{n}}\]
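For intuition, RMSE can also be computed directly from this definition using the fitted values above; the result should match yardstick’s estimate below (~1.98):

# RMSE by hand from the wf_fitted_values tibble
sqrt(mean((wf_fitted_values$.fitted - wf_fitted_values$value)^2))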

We can use the yardstick package to calculate this: rmse() computes RMSE alone, while metrics() returns RMSE, \(R^2\), and MAE at once:

yardstick::metrics(wf_fitted_values,
                   truth = value, estimate = .fitted)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.98 
2 rsq     standard       0.392
3 mae     standard       1.47 
  • RMSE isn’t too bad
  • \(R^2\) suggests model is only explaining 39% of the variance in the data
  • The MAE value suggests that the average difference between the value predicted and the real value was 1.47 ug/m3. The range of the values was 3-22 in the training data, so this is a relatively small amount

Cross-Validation

Resampling + Re-partitioning:

Preparing the data for cross-validation:

Note: this is called v-fold or k-fold CV

Implementing with rsample

set.seed(1234)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
vfold_pm
#  4-fold cross-validation 
# A tibble: 4 × 2
  splits            id   
  <list>            <chr>
1 <split [438/146]> Fold1
2 <split [438/146]> Fold2
3 <split [438/146]> Fold3
4 <split [438/146]> Fold4
pull(vfold_pm, splits)
[[1]]
<Analysis/Assess/Total>
<438/146/584>

[[2]]
<Analysis/Assess/Total>
<438/146/584>

[[3]]
<Analysis/Assess/Total>
<438/146/584>

[[4]]
<Analysis/Assess/Total>
<438/146/584>

Visualizing this process:

Model Assessment on v-folds

Where this workflow thing really shines…

resample_fit <- tune::fit_resamples(PM_wflow, vfold_pm)
→ A | warning: the standard deviation is zero, The correlation matrix has missing values. 415 columns were excluded from the filter.
There were issues with some computations   A: x1
→ B | warning: There are new levels in a factor: Maine, There are new levels in a factor: Forest, Mecklenburg, Clermont, Camden, Trumbull, Yellowstone, Caddo, Hinds, Codington, Preble, Broward, Rowan, Beaver, Dauphin, Buncombe, LaPorte, Ashley, Clayton, Talladega, Queens, Jones, Mitchell, Kalamazoo, Seminole, Henderson, Sussex, Ingham, Sangamon, Aroostook, Muscogee, Plumas, Dodge, Bennington, Sumner, Butler, Butte, Passaic, Page, Custer, Sainte Genevieve, Bullitt, Palo Alto, Rapides, Faulkner, San Francisco, Ravalli, San Mateo, Delaware, Davis, Fremont, Santa Clara, White, Carter, DeSoto, Wilkinson, Muscatine, Hampden, Yakima, Solano, Mendocino, Mobile, Roanoke City, Wake, Gwinnett, Alamance, prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
There were issues with some computations   A: x1
→ C | warning: the standard deviation is zero, The correlation matrix has missing values. 408 columns were excluded from the filter.
There were issues with some computations   A: x1
→ D | warning: There are new levels in a factor: Athens, Kent, Linn, Stark, Cabell, Arlington, St. Lucie, Grafton, Champaign, Brewster, Morgan, Lawrence, Tarrant, Yolo, Weber, Mille Lacs, Clarke, Harrison, Will, Grant, Morris, Santa Cruz, Taylor, Klamath, Prince George's, Howard, Buchanan, Cedar, Ventura, Monongalia, Bolivar, Medina, Dona Ana, Hancock, Missoula, Chittenden, Monroe, Knox, Essex, Pierce, Tuscaloosa, Ellis, Contra Costa, Apache, Harris, Edgecombe, Stearns, Outagamie, Escambia, Hidalgo, Teton, Loudoun, Belknap, Sauk, Pittsburg, Charles, Gibson, Marshall, Chester, prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
There were issues with some computations   A: x1
→ E | warning: the standard deviation is zero, The correlation matrix has missing values. 417 columns were excluded from the filter.
There were issues with some computations   A: x1
→ F | warning: There are new levels in a factor: Nevada, North Dakota, There are new levels in a factor: Placer, Bay, Niagara, DeKalb, Hampton City, Oconee, Spencer, Sutter, Cobb, Randolph, Anne Arundel, Houston, Kane, Genesee, Dane, Yavapai, Lenawee, Washtenaw, Durham, Scioto, Henry, Spartanburg, Harney, Converse, Portage, St. Croix, Colusa, Berkshire, Lenoir, Lancaster, Haywood, Iberville, Adams, Catawba, St. Clair, Lynchburg City, Nassau, Brookings, Raleigh, Summit, Sebastian, Ouachita, Westmoreland, Rock Island, Duplin, Erie, Burleigh, Vilas, Kanawha, Rutland, San Joaquin, Washoe, Sandoval, Josephine, Kenosha, Plymouth, Stanislaus, Caswell, Cameron, Lucas, Sarpy, West Baton Rouge, Mayes, Cass, Chautauqua, Terrebonne, Sweetwater, Glynn, Harford, Spokane, La Salle, prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
There were issues with some computations   A: x1
There were issues with some computations   A: x1   B: x1   C: x2   D: x1   E: x…
→ G | warning: There are new levels in a factor: Allen, McDowell, Macon, Chaves, Yuma, Dougherty, Flathead, Ashland, Manistee, Hartford, Park, Box Elder, East Baton Rouge, Chesterfield, Woodbury, Bell, Citrus, New London, Cumberland, Fairfax, Forrest, Allegan, Ohio, Pueblo, Gaston, Bernalillo, Sullivan, Nevada, McLean, McCracken, Potter, Mahoning, Porter, Albemarle, Manitowoc, Shawnee, Ocean, Ottawa, El Paso, Baldwin, Bannock, Cheshire, Clay, Jersey, Brown, Lexington, Clinton, Peoria, Macomb, Davidson, Tooele, Dubois, Robeson, St. Lawrence, Lincoln, Virginia Beach City, Shelby, prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
There were issues with some computations   A: x1   B: x1   C: x2   D: x1   E: x…
There were issues with some computations   A: x1   B: x1   C: x2   D: x1   E: x…

Gives us a sense of the RMSE across the four folds:

tune::show_best(resample_fit, metric = "rmse")
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard    2.12     4  0.0444 Preprocessor1_Model1

Random Forest

Fitting a different model

Based on a decision tree [source].

But in the case of random forest:

  • multiple decision trees are created (hence: forest),
  • each tree is built using a random subset of the training data, sampled with replacement (hence: random),
  • this helps to keep the algorithm from overfitting the data,
  • the mean of the predictions from each of the trees is used as the final output (see the standalone sketch below).
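To see this in isolation, here is a minimal standalone sketch using the randomForest package directly on a built-in dataset (a toy example, separate from our tidymodels workflow):

library(randomForest)
# 500 trees by default, each grown on a bootstrap resample of the rows
rf_toy <- randomForest(Ozone ~ ., data = airquality, na.action = na.omit)
# regression predictions average the individual trees' predictions
predict(rf_toy, newdata = head(na.omit(airquality)))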


Updating our recipe()

RF_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_novel("state") %>%
    step_string2factor("state", "county", "city") %>%
    step_rm("county") %>%
    step_rm("zcta") %>%
    step_corr(all_numeric()) %>%
    step_nzv(all_numeric())
  • can use our categorical data as is (no dummy coding)
  • step_novel() is necessary here for the state variable to get all cross-validation folds to work, because different levels will appear in each fold’s analysis and assessment sets; previously unseen levels in an assessment set would otherwise cause an error. Per the recipes documentation, “step_novel creates a specification of a recipe step that will assign a previously unseen factor level to a new value.” (Toy illustration below.)
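A toy illustration of step_novel() with hypothetical data (not from the case study):

train_toy <- tibble::tibble(state = c("California", "Ohio", "Ohio"), y = c(1, 2, 3))
test_toy  <- tibble::tibble(state = "Maine", y = 4)  # level unseen in training

rec_toy <- recipe(y ~ state, data = train_toy) %>%
  step_novel(state) %>%
  prep()

bake(rec_toy, new_data = test_toy)
# state should bake to the placeholder level "new" rather than erroring or returning NA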

Model Specification

Model parameters:

  1. mtry - The number of predictor variables (or features) that will be randomly sampled at each split when creating the tree models. The default number for regression analyses is the number of predictors divided by 3.
  2. min_n - The minimum number of data points in a node that are required for the node to be split further.
  3. trees - the number of trees in the ensemble
# install.packages("randomForest")
RF_PM_model <- parsnip::rand_forest(mtry = 10, min_n = 3) %>% 
  set_engine("randomForest") %>%
  set_mode("regression")

RF_PM_model
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: randomForest 

Workflow

RF_wflow <- workflows::workflow() %>%
  workflows::add_recipe(RF_rec) %>%
  workflows::add_model(RF_PM_model)

RF_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: randomForest 

Fit the Data

RF_wflow_fit <- parsnip::fit(RF_wflow, data = train_pm)

RF_wflow_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────

Call:
 randomForest(x = maybe_data_frame(x), y = y, mtry = min_cols(~10,      x), nodesize = min_rows(~3, x)) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 2.633639
                    % Var explained: 59.29

Assess Feature Importance

RF_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  vip::vip(num_features = 10)

❓ What’s your interpretation of these results?

Assess Model Performance

set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   1.67      4  0.101  Preprocessor1_Model1
2 rsq     standard   0.591     4  0.0514 Preprocessor1_Model1

For comparison:

collect_metrics(resample_fit)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   2.12      4  0.0444 Preprocessor1_Model1
2 rsq     standard   0.307     4  0.0263 Preprocessor1_Model1

Model Tuning

Hyperparameters are settings that we must specify about a model in advance; unlike model parameters, they are not learned from the data during fitting. Instead of specifying them arbitrarily, we can try to determine the values that give the best model performance through a process called tuning.

Rather than specifying values, we can use tune():

tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")
    
tune_RF_model
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  min_n = tune()

Computational engine: randomForest 

Create Workflow:

RF_tune_wflow <- workflows::workflow() %>%
  workflows::add_recipe(RF_rec) %>%
  workflows::add_model(tune_RF_model)

RF_tune_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  min_n = tune()

Computational engine: randomForest 

Detect how many cores you have access to:

n_cores <- parallel::detectCores()
n_cores
[1] 10

This code will take some time to run:

# install.packages("doParallel")
doParallel::registerDoParallel(cores = n_cores)

set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
i Creating pre-processing data to finalize unknown parameter: mtry
tune_RF_results
# Tuning results
# 4-fold cross-validation 
# A tibble: 4 × 4
  splits            id    .metrics          .notes          
  <list>            <chr> <list>            <list>          
1 <split [438/146]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
2 <split [438/146]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
3 <split [438/146]> Fold3 <tibble [40 × 6]> <tibble [1 × 3]>
4 <split [438/146]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>

There were issues with some computations:

  - Warning(s) x1: 36 columns were requested but there were 35 predictors in the dat...

Run `show_notes(.Last.tune.result)` for more information.

Check Metrics:

tune_RF_results %>%
  collect_metrics()
# A tibble: 40 × 8
    mtry min_n .metric .estimator  mean     n std_err .config              
   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1    12    33 rmse    standard   1.72      4  0.0866 Preprocessor1_Model01
 2    12    33 rsq     standard   0.562     4  0.0466 Preprocessor1_Model01
 3    27    35 rmse    standard   1.69      4  0.102  Preprocessor1_Model02
 4    27    35 rsq     standard   0.563     4  0.0511 Preprocessor1_Model02
 5    22    40 rmse    standard   1.71      4  0.106  Preprocessor1_Model03
 6    22    40 rsq     standard   0.556     4  0.0543 Preprocessor1_Model03
 7     1    27 rmse    standard   2.03      4  0.0501 Preprocessor1_Model04
 8     1    27 rsq     standard   0.440     4  0.0245 Preprocessor1_Model04
 9     6    32 rmse    standard   1.77      4  0.0756 Preprocessor1_Model05
10     6    32 rsq     standard   0.552     4  0.0435 Preprocessor1_Model05
# ℹ 30 more rows
show_best(tune_RF_results, metric = "rmse", n = 1)
# A tibble: 1 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1    32    11 rmse    standard    1.65     4   0.113 Preprocessor1_Model10

Final Model Evaluation

tuned_RF_values <- select_best(tune_RF_results, metric = "rmse")
tuned_RF_values
# A tibble: 1 × 3
   mtry min_n .config              
  <int> <int> <chr>                
1    32    11 Preprocessor1_Model10

The testing data!

# specify best combination from tune in workflow
RF_tuned_wflow <- RF_tune_wflow %>%
  tune::finalize_workflow(tuned_RF_values)

# fit the finalized workflow on the training data and evaluate on the test data
overallfit <- RF_tuned_wflow %>%
  tune::last_fit(pm_split)

collect_metrics(overallfit)
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       1.72  Preprocessor1_Model1
2 rsq     standard       0.608 Preprocessor1_Model1

Results are similar to what we saw during cross-validation on the training data (best tuned RMSE: 1.65)

Getting the predictions for the test data:

test_predictions <- collect_predictions(overallfit)

Visualizing our results

A map of the US

Packages needed:

  1. sf - the simple features package helps to convert geographical coordinates into geometry variables which are useful for making 2D plots
  2. maps - this package contains geographical outlines and plotting functions to create plots with maps
  3. rnaturalearth - this allows for easy interaction with map data from Natural Earth, which is a public domain map dataset
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(maps)

Attaching package: 'maps'
The following object is masked from 'package:purrr':

    map
library(rnaturalearth)
Support for Spatial objects (`sp`) will be deprecated in {rnaturalearth} and will be removed in a future release of the package. Please use `sf` objects with {rnaturalearth}. For example: `ne_download(returnclass = 'sf')`

Outline of the US

world <- ne_countries(scale = "medium", returnclass = "sf")
glimpse(world)
Rows: 241
Columns: 64
$ scalerank  <int> 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 1, 5, 3, 1, 1, 1, 1, 1, 1,…
$ featurecla <chr> "Admin-0 country", "Admin-0 country", "Admin-0 country", "A…
$ labelrank  <dbl> 5, 3, 3, 6, 6, 6, 6, 4, 2, 6, 4, 4, 5, 6, 6, 2, 4, 5, 6, 2,…
$ sovereignt <chr> "Netherlands", "Afghanistan", "Angola", "United Kingdom", "…
$ sov_a3     <chr> "NL1", "AFG", "AGO", "GB1", "ALB", "FI1", "AND", "ARE", "AR…
$ adm0_dif   <dbl> 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,…
$ level      <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ type       <chr> "Country", "Sovereign country", "Sovereign country", "Depen…
$ admin      <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ adm0_a3    <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ geou_dif   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ geounit    <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ gu_a3      <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ su_dif     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ subunit    <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ su_a3      <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ brk_diff   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ name       <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ name_long  <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ brk_a3     <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ brk_name   <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ brk_group  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ abbrev     <chr> "Aruba", "Afg.", "Ang.", "Ang.", "Alb.", "Aland", "And.", "…
$ postal     <chr> "AW", "AF", "AO", "AI", "AL", "AI", "AND", "AE", "AR", "ARM…
$ formal_en  <chr> "Aruba", "Islamic State of Afghanistan", "People's Republic…
$ formal_fr  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ note_adm0  <chr> "Neth.", NA, NA, "U.K.", NA, "Fin.", NA, NA, NA, NA, "U.S.A…
$ note_brk   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Multiple claim…
$ name_sort  <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ name_alt   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ mapcolor7  <dbl> 4, 5, 3, 6, 1, 4, 1, 2, 3, 3, 4, 4, 1, 7, 2, 1, 3, 1, 2, 3,…
$ mapcolor8  <dbl> 2, 6, 2, 6, 4, 1, 4, 1, 1, 1, 5, 5, 2, 5, 2, 2, 1, 6, 2, 2,…
$ mapcolor9  <dbl> 2, 8, 6, 6, 1, 4, 1, 3, 3, 2, 1, 1, 2, 9, 5, 2, 3, 5, 5, 1,…
$ mapcolor13 <dbl> 9, 7, 1, 3, 6, 6, 8, 3, 13, 10, 1, NA, 7, 11, 5, 7, 4, 8, 8…
$ pop_est    <dbl> 103065, 28400000, 12799293, 14436, 3639453, 27153, 83888, 4…
$ gdp_md_est <dbl> 2258.0, 22270.0, 110300.0, 108.9, 21810.0, 1563.0, 3660.0, …
$ pop_year   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lastcensus <dbl> 2010, 1979, 1970, NA, 2001, NA, 1989, 2010, 2010, 2001, 201…
$ gdp_year   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ economy    <chr> "6. Developing region", "7. Least developed region", "7. Le…
$ income_grp <chr> "2. High income: nonOECD", "5. Low income", "3. Upper middl…
$ wikipedia  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fips_10    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ iso_a2     <chr> "AW", "AF", "AO", "AI", "AL", "AX", "AD", "AE", "AR", "AM",…
$ iso_a3     <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALA", "AND", "ARE", "AR…
$ iso_n3     <chr> "533", "004", "024", "660", "008", "248", "020", "784", "03…
$ un_a3      <chr> "533", "004", "024", "660", "008", "248", "020", "784", "03…
$ wb_a2      <chr> "AW", "AF", "AO", NA, "AL", NA, "AD", "AE", "AR", "AM", "AS…
$ wb_a3      <chr> "ABW", "AFG", "AGO", NA, "ALB", NA, "ADO", "ARE", "ARG", "A…
$ woe_id     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ adm0_a3_is <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALA", "AND", "ARE", "AR…
$ adm0_a3_us <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ adm0_a3_un <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ adm0_a3_wb <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ continent  <chr> "North America", "Asia", "Africa", "North America", "Europe…
$ region_un  <chr> "Americas", "Asia", "Africa", "Americas", "Europe", "Europe…
$ subregion  <chr> "Caribbean", "Southern Asia", "Middle Africa", "Caribbean",…
$ region_wb  <chr> "Latin America & Caribbean", "South Asia", "Sub-Saharan Afr…
$ name_len   <dbl> 5, 11, 6, 8, 7, 5, 7, 20, 9, 7, 14, 10, 23, 22, 17, 9, 7, 1…
$ long_len   <dbl> 5, 11, 6, 8, 7, 13, 7, 20, 9, 7, 14, 10, 27, 35, 19, 9, 7, …
$ abbrev_len <dbl> 5, 4, 4, 4, 4, 5, 4, 6, 4, 4, 9, 4, 7, 10, 6, 4, 5, 4, 4, 5…
$ tiny       <dbl> 4, NA, NA, NA, NA, 5, 5, NA, NA, NA, 3, NA, NA, 2, 4, NA, N…
$ homepart   <dbl> NA, 1, 1, NA, 1, NA, 1, 1, 1, 1, NA, 1, NA, NA, 1, 1, 1, 1,…
$ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((-69.89912 1..., MULTIPOLYGON (…

World map:

ggplot(data = world) +
    geom_sf() 

Just the US

According to this link, these are the latitude and longitude bounds of the continental US:

  • top = 49.3457868 # north lat
  • left = -124.7844079 # west long
  • right = -66.9513812 # east long
  • bottom = 24.7433195 # south lat
ggplot(data = world) +
    geom_sf() +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE)

Adding in our monitors…

ggplot(data = world) +
    geom_sf() +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE)+
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
               shape = 23, fill = "darkred")

Adding in county lines

counties <- sf::st_as_sf(maps::map("county", plot = FALSE,
                                   fill = TRUE))

counties
Simple feature collection with 3076 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.6813 ymin: 25.12993 xmax: -67.00742 ymax: 49.38323
Geodetic CRS:  +proj=longlat +ellps=clrk66 +no_defs +type=crs
First 10 features:
                               ID                           geom
alabama,autauga   alabama,autauga MULTIPOLYGON (((-86.50517 3...
alabama,baldwin   alabama,baldwin MULTIPOLYGON (((-87.93757 3...
alabama,barbour   alabama,barbour MULTIPOLYGON (((-85.42801 3...
alabama,bibb         alabama,bibb MULTIPOLYGON (((-87.02083 3...
alabama,blount     alabama,blount MULTIPOLYGON (((-86.9578 33...
alabama,bullock   alabama,bullock MULTIPOLYGON (((-85.66866 3...
alabama,butler     alabama,butler MULTIPOLYGON (((-86.8604 31...
alabama,calhoun   alabama,calhoun MULTIPOLYGON (((-85.74313 3...
alabama,chambers alabama,chambers MULTIPOLYGON (((-85.59416 3...
alabama,cherokee alabama,cherokee MULTIPOLYGON (((-85.46812 3...

And now onto the map…

monitors <- ggplot(data = world) +
    geom_sf(data = counties, fill = NA, color = gray(.5))+
      coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE) +
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
               shape = 23, fill = "darkred") +
    ggtitle("Monitor Locations") +
    theme(axis.title.x=element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank())

monitors

Wrangle counties:

  • separate county and state into separate columns
  • make title case
  • combine with PM data
counties <- counties %>% 
  tidyr::separate(ID, into = c("state", "county"), sep = ",") %>% 
  dplyr::mutate(county = stringr::str_to_title(county))

map_data <- dplyr::inner_join(counties, pm, by = "county")

Map: Truth

truth <- ggplot(data = world) +
  coord_sf(xlim = c(-125,-66),
           ylim = c(24.5, 50),
           expand = FALSE) +
  geom_sf(data = map_data, aes(fill = value)) +
  scale_fill_gradientn(colours = topo.colors(7),
                       na.value = "transparent",
                       breaks = c(0, 10, 20),
                       labels = c(0, 10, 20),
                       limits = c(0, 23.5),
                       name = "PM ug/m3") +
  ggtitle("True PM 2.5 levels") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Coordinate system already present. Adding new coordinate system, which will
replace the existing one.

Map: Predictions

# fit data
RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_pm)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_pm)

# get predictions on training data
values_pred_train <- predict(RF_final_train_fit, train_pm) %>% 
  bind_cols(train_pm %>% select(value, fips, county, id)) 

# get predictions on testing data
values_pred_test <- predict(RF_final_test_fit, test_pm) %>% 
  bind_cols(test_pm %>% select(value, fips, county, id)) 
values_pred_test
# A tibble: 292 × 5
   .pred value fips  county     id       
   <dbl> <dbl> <fct> <chr>      <fct>    
 1  11.6  11.2 1033  Colbert    1033.1002
 2  11.9  12.4 1055  Etowah     1055.001 
 3  11.1  10.5 1069  Houston    1069.0003
 4  13.9  15.6 1073  Jefferson  1073.0023
 5  12.0  12.4 1073  Jefferson  1073.1005
 6  11.3  11.1 1073  Jefferson  1073.1009
 7  11.5  11.8 1073  Jefferson  1073.5003
 8  11.0  10.0 1097  Mobile     1097.0003
 9  11.9  12.0 1101  Montgomery 1101.0007
10  12.9  13.2 1113  Russell    1113.0001
# ℹ 282 more rows
# combine
all_pred <- bind_rows(values_pred_test, values_pred_train)
map_data <- inner_join(counties, all_pred, by = "county")

pred <- ggplot(data = world) +
  coord_sf(xlim = c(-125,-66),
           ylim = c(24.5, 50),
           expand = FALSE) +
  geom_sf(data = map_data, aes(fill = .pred)) +
  scale_fill_gradientn(colours = topo.colors(7),
                       na.value = "transparent",
                       breaks = c(0, 10, 20),
                       labels = c(0, 10, 20),
                       limits = c(0, 23.5),
                       name = "PM ug/m3") +
  ggtitle("Predicted PM 2.5 levels") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
Coordinate system already present. Adding new coordinate system, which will
replace the existing one.

Final Plot

library(patchwork)

(truth/pred) + 
  plot_annotation(title = "Machine Learning Methods Allow for Prediction of Air Pollution", subtitle = "A random forest model predicts true monitored levels of fine particulate matter (PM 2.5) air pollution based on\ndata about population density and other predictors reasonably well, thus suggesting that we can use similar methods to predict levels\nof pollution in places with poor monitoring",
                  theme = theme(plot.title = element_text(size =12, face = "bold"), 
                                plot.subtitle = element_text(size = 8)))

❓ What do you learn from these results?