library(tidyverse)
library(tidymodels)
09-mlr
Multiple Linear Regression
Q&A
Q: What are some ways we can make our visuals accessible? For example, how can we account for color blind people when incorporating graphs and figures in our presentation?
A: Love this question! One way to do this is to use tools built for exactly this purpose. For color blindness, for example, ColorBrewer2 allows you to toggle for color palettes that are "colorblind safe." Similarly, for written content online, ensure that you're including alt-text for all images, to enable understanding by those with vision impairments/differences. For those with learning differences (e.g., dyslexia, ADHD), there is research out there with respect to viz, and I should make myself more familiar with it, but avoiding very similar acronyms/initialisms for different concepts and using consistent icons (rather than letters) across visualizations can really help individuals with learning differences and overall make your viz more consistent and understandable. This is just the tip of the iceberg. Additional suggestions here.
Course Announcements
- 🎃 Happy Halloween!
- No OH today after class
- Make up: tomorrow 10-11AM (CSB 243)
- 🔬 There is NO lab to turn in this week
- Lab will be used for midterm review
- Come to lab with questions!
- 🏫 Midterm is due next Monday at 11:59PM
- will be released 5PM Friday
- completed individually
- open notes/open Internet
- NOT timed
- typically takes students ~2.5h to complete (big range: 2-15h)
- Lab05 & HW03 will be released next Monday
Suggested Reading
Introduction to Modern Statistics Chapter 8: Linear Regression with Multiple Predictors
Agenda
- Multiple Linear Regression
- Multiple predictors
- Main vs interaction effects
- Model comparison
- Backward selection
Packages & Data
. . .
Data: Paris Paintings
pp <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/paris_paintings.csv",
               na = c("n/a", "", "NA")) |>
  mutate(log_price = log(price))
- Number of observations: 3393
- Number of variables: 62
Two numerical predictors
Multiple predictors
- Response variable: log_price
- Explanatory variables: width (Width_in) and height (Height_in)
. . .
lin_mod <- linear_reg() |>
  set_engine("lm")

pp_fit <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.77 0.0579 82.4 0
2 Width_in 0.0269 0.00373 7.22 6.58e-13
3 Height_in -0.0133 0.00395 -3.36 7.93e- 4
Linear model with multiple predictors
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.77 0.0579 82.4 0
2 Width_in 0.0269 0.00373 7.22 6.58e-13
3 Height_in -0.0133 0.00395 -3.36 7.93e- 4
\[\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height\]
. . .
❓ How do we interpret this model?
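One way to build intuition before the next slides (a sketch; the coefficient values are hard-coded from the tidy(pp_fit) output above): because the response is log(price), exponentiating a coefficient gives the multiplicative change in predicted price for a one-unit increase in that predictor.

```r
# Coefficients copied from the tidy(pp_fit) output above
b_width  <- 0.0269
b_height <- -0.0133

# With a log-transformed response, exp(coefficient) is the multiplicative
# change in predicted price per one-inch increase, holding the other
# predictor constant.
exp(b_width)   # ~1.027: each extra inch of width -> ~2.7% higher predicted price
exp(b_height)  # ~0.987: each extra inch of height -> ~1.3% lower predicted price
```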
Numerical and categorical predictors
Price, surface area, and living artist
- Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living
- First visualize and explore, then model
. . .
- But first, prep the data:
pp <- pp |>
  mutate(artistliving = case_when(artistliving == 0 ~ "Deceased",
                                  TRUE ~ "Living"))

pp |>
  count(artistliving)
# A tibble: 2 × 2
artistliving n
<chr> <int>
1 Deceased 2937
2 Living 456
Typical surface area
Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.
ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
geom_histogram(binwidth = 500) +
facet_grid(artistliving ~ .) +
scale_fill_manual(values = c("#E48957", "#071381")) +
guides(fill = "none") +
labs(x = "Surface area", y = NULL) +
geom_vline(xintercept = 1000) +
geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")
Narrowing the scope
For simplicity let’s focus on the paintings with Surface < 5000:

pp_Surf_lt_5000 <- pp |>
  filter(Surface < 5000)
ggplot(data = pp_Surf_lt_5000,
aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
geom_point(alpha = 0.5) +
labs(color = "Artist", shape = "Artist") +
scale_color_manual(values = c("#E48957", "#071381"))
Facet to get a better look
ggplot(data = pp_Surf_lt_5000,
aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
geom_point(alpha = 0.5) +
facet_wrap(~artistliving) +
scale_color_manual(values = c("#E48957", "#071381")) +
labs(color = "Artist", shape = "Artist")
Two ways to model
- Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living.
- Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living.
Interacting explanatory variables
- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
- This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two explanatory variables.
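In R's formula syntax this product is created automatically: x1 * x2 expands to x1 + x2 + x1:x2. A tiny sketch with made-up data (the data frame d here is illustrative, not from the paintings data):

```r
# x1 * x2 in a formula expands to main effects plus their product
d  <- data.frame(x1 = c(1, 2, 3), x2 = c(0, 1, 1))
mm <- model.matrix(~ x1 * x2, data = d)

colnames(mm)    # "(Intercept)" "x1" "x2" "x1:x2"
mm[, "x1:x2"]   # 0 2 3 -- literally the elementwise product of x1 and x2
```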
Two ways to model
- Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living
- Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living
. . .
❓ Which does your intuition/knowledge of the data suggest is more appropriate?
Put a green sticky if you think main; pink if you think interaction.
Fit model with main effects
- Response variable: log_price
- Explanatory variables: Surface area and artistliving
pp_main_fit <- lin_mod |>
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.88 0.0424 115. 0
2 Surface 0.000265 0.0000415 6.39 1.85e-10
3 artistlivingLiving 0.137 0.0970 1.41 1.57e- 1
. . .
\[\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving\]
Solving the model
- Non-living artist: Plug in 0 for artistliving
\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 0\)
\(= 4.88 + 0.000265 \times surface\)
. . .
- Living artist: Plug in 1 for artistliving
\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 1\)
\(= 5.017 + 0.000265 \times surface\)
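The arithmetic above can be checked directly (a sketch; coefficients hard-coded from the tidy(pp_main_fit) output above):

```r
# Main-effects model: same slope for both groups, intercept shifted by the
# artistliving coefficient
intercept <- 4.88
b_surface <- 0.000265
b_living  <- 0.137

intercept + b_living * 0  # 4.88  -- intercept for deceased artists
intercept + b_living * 1  # 5.017 -- intercept for living artists
```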
Visualizing main effects
- Same slope: Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists.
- Different intercept: Paintings by living artists are consistently more expensive than paintings by non-living artists.
Interpreting main effects
tidy(pp_main_fit) |>
mutate(exp_estimate = exp(estimate)) |>
select(term, estimate, exp_estimate)
# A tibble: 3 × 3
term estimate exp_estimate
<chr> <dbl> <dbl>
1 (Intercept) 4.88 132.
2 Surface 0.000265 1.00
3 artistlivingLiving 0.137 1.15
- All else held constant, for each additional square inch in a painting’s surface area, the price of the painting is predicted, on average, to be higher by a factor of exp(0.000265) ≈ 1.0003 (the printed 1.00 is rounded).
- All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
- Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres.
Main vs. interaction effects
- The way we specified our main effects model only lets artistliving affect the intercept.
- The model implicitly assumes that paintings with living and deceased artists have the same slope and only allows for different intercepts.
. . .
❓ What seems more appropriate in this case?
- Same slope and same intercept for both colors
- Same slope and different intercept for both colors
- Different slope and different intercept for both colors
Interaction: Surface * artistliving
Fit model with interaction effects
- Response variable: log_price
- Explanatory variables: Surface area, artistliving, and their interaction
pp_int_fit <- lin_mod |>
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.91 0.0432 114. 0
2 Surface 0.000206 0.0000442 4.65 0.00000337
3 artistlivingLiving -0.126 0.119 -1.06 0.289
4 Surface:artistlivingLiving 0.000479 0.000126 3.81 0.000139
Linear model with interaction effects
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.91 0.0432 114. 0
2 Surface 0.000206 0.0000442 4.65 0.00000337
3 artistlivingLiving -0.126 0.119 -1.06 0.289
4 Surface:artistlivingLiving 0.000479 0.000126 3.81 0.000139
\[\widehat{log\_price} = 4.91 + 0.00021 \times surface - 0.126 \times artistliving\] \[+ ~ 0.00048 \times surface \times artistliving\]
Interpretation of interaction effects
- Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes)
- Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercept).
. . .
- Non-living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 0 + 0.00048 \times surface \times 0\) \(= 4.91 + 0.00021 \times surface\)
- Living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 1 + 0.00048 \times surface \times 1\) \(= 4.91 + 0.00021 \times surface\) \(- 0.126 + 0.00048 \times surface\) \(= 4.784 + 0.00069 \times surface\)
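Equivalently, the two fitted lines can be recovered from the interaction model's coefficients (a sketch; values hard-coded from the tidy(pp_int_fit) output above):

```r
b0       <- 4.91      # intercept (deceased baseline)
b_surf   <- 0.00021   # Surface slope (deceased baseline)
b_living <- -0.126    # shift in intercept for living artists
b_int    <- 0.00048   # shift in slope for living artists

# Deceased: intercept b0, slope b_surf
# Living:   both intercept and slope shift by the artistliving terms
c(intercept_living = b0 + b_living,   # 4.784
  slope_living     = b_surf + b_int)  # 0.00069
```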
Comparing models
R-squared
- \(R^2\) is the percentage of variability in the response variable explained by the regression model.
glance(pp_main_fit)$r.squared
[1] 0.01320884
glance(pp_int_fit)$r.squared
[1] 0.0176922
. . .
- Clearly the model with interactions has a higher \(R^2\).
. . .
- However using \(R^2\) for model selection in models with multiple explanatory variables is not a good idea as \(R^2\) increases when any variable is added to the model.
Adjusted R-squared
It appears that adding the interaction actually increased adjusted \(R^2\), so we should indeed use the model with the interactions.
glance(pp_main_fit)$adj.r.squared
[1] 0.01258977
glance(pp_int_fit)$adj.r.squared
[1] 0.01676753
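The reason adjusted \(R^2\) can decrease when a variable adds little is its penalty for model size. With \(n\) observations and \(k\) predictors:

\[R^2_{adj} = 1 - (1 - R^2) \times \frac{n - 1}{n - k - 1}\]

Adding a predictor never decreases \(R^2\), but it does increase \(k\); adjusted \(R^2\) improves only when the gain in \(R^2\) outweighs the larger penalty factor.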
Third order interactions
- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in context of the data.
In pursuit of Occam’s razor
Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
Model selection follows this principle.
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
In other words, we prefer the simplest best model, i.e. parsimonious model.
Backward selection
For this demo, we’ll ignore interaction effects…and just model main effects to start:
pp_full <- lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface + artistliving, data = pp)
glance(pp_full)$adj.r.squared
[1] 0.02570141
- Adjusted \(R^2\) (full): 0.0257
. . .
Remove artistliving
pp_noartist <- lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface, data = pp)
glance(pp_noartist)$adj.r.squared
[1] 0.02579859
. . .
- Adjusted \(R^2\) (full): 0.0257
- Adjusted \(R^2\) (no artistliving): 0.0258
…Improved variance explained
. . .
Remove Surface
pp_noartist_nosurface <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data = pp)
glance(pp_noartist_nosurface)$adj.r.squared
[1] 0.02231559
. . .
- Adjusted \(R^2\) (full): 0.0257
- Adjusted \(R^2\) (no artistliving): 0.0258
- Adjusted \(R^2\) (no artistliving or Surface): 0.0223
. . .
…no longer gaining improvement, so we stick with: log_price ~ Width_in + Height_in + Surface
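The drop-one procedure above can be sketched as a small loop. This helper (step_back is a hypothetical name, not a tidymodels or olsrr function) repeatedly drops whichever predictor most improves adjusted \(R^2\) and stops when no removal helps:

```r
step_back <- function(response, predictors, data) {
  # adjusted R-squared of a model using the given predictors
  adj_r2 <- function(preds) {
    f <- reformulate(preds, response = response)
    summary(lm(f, data = data))$adj.r.squared
  }
  current <- adj_r2(predictors)
  repeat {
    if (length(predictors) == 1) break
    # adjusted R-squared of each model that drops one predictor
    candidates <- sapply(seq_along(predictors),
                         function(i) adj_r2(predictors[-i]))
    if (max(candidates) <= current) break
    predictors <- predictors[-which.max(candidates)]
    current    <- max(candidates)
  }
  list(predictors = predictors, adj_r_squared = current)
}

# e.g., on the built-in mtcars data:
res <- step_back("mpg", c("wt", "hp", "qsec", "gear"), mtcars)
res
```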
Other approach:
# requires package installation:
# install.packages("olsrr")
library(olsrr)
. . .
Step 1: Fit model (w/o tidymodels)

# fit the model (not using tidymodels)
mod <- lm(log_price ~ Width_in + Height_in + Surface + artistliving, data = pp_Surf_lt_5000)
. . .
Step 2: Determine which variables to remove
ols_step_backward_p(mod)
Elimination Summary
------------------------------------------------------------------------------
Variable Adj.
Step Removed R-Square R-Square C(p) AIC RMSE
------------------------------------------------------------------------------
1 artistliving 0.0261 0.0251 3.8495 12603.7727 1.8315
------------------------------------------------------------------------------
…specifies that artistliving should be removed
. . .
Step 2 (alternate): Compare all possible models…
ols_step_all_possible(mod) |>
arrange(desc(adjr))
   Index N Predictors                               R-Square     Adj. R-Square Mallow's Cp
1     11 3 Width_in Height_in Surface              0.0260749939  0.0251349118    3.849487
2     15 4 Width_in Height_in Surface artistliving 0.0263412027  0.0250876993    5.000000
3      5 2 Width_in Height_in                      0.0256902566  0.0250634893    3.077206
4     12 3 Width_in Height_in artistliving         0.0259732581  0.0250330779    4.174132
5      6 2 Width_in Surface                        0.0249136264  0.0242863596    5.555476
6     13 3 Width_in Surface artistliving           0.0251787948  0.0242378477    6.709309
7      7 2 Width_in artistliving                   0.0212864021  0.0206568018   17.130153
8      1 1 Width_in                                0.0209415833  0.0206267736   16.230489
9      8 2 Surface artistliving                    0.0132088377  0.0125897717   51.971190
10     2 1 Surface                                 0.0125899681  0.0122803381   52.001268
11    14 3 Height_in Surface artistliving          0.0130836930  0.0121310711   45.305459
12     9 2 Height_in Surface                       0.0126782901  0.0120431523   44.599123
13    10 2 Height_in artistliving                  0.0062698155  0.0056305552   65.048926
14     3 1 Height_in                               0.0058797727  0.0055601199   64.293574
15     4 1 artistliving                            0.0005531617  0.0002397573   91.485605
End of Material Covered on Midterm
Recap
- Can you model and interpret linear models with multiple predictors?
- Can you explain the difference in a model with main effects vs. interaction effects?
- Can you compare different models and determine how to proceed?
- Can you carry out and explain backward selection?