09-mlr

Professor Shannon Ellis

2023-10-31

Multiple Linear Regression

Q&A

Q: What are some ways we can make our visuals accessible? For example, how can we account for color blind people when incorporating graphs and figures in our presentation?
A: Love this question! One way to do this is to use tools built for just this purpose! For example, for color blindness, ColorBrewer2 allows you to toggle for color palletes that are “colorblind safe.” Similarly, for written content online, ensuring that you’re including alt-text for all images, to enable visual understanding by those with vision impariments/differences. For those with learning differences (i.e. dyslexia, ADHD), I’m sure there’s research out there with respect to viz, and I should make myself more familiar, but ensuring that you’re not using very similar acronyms/initialisms for different context or using consistent icons (Rather than letters) across visualizations, etc. can really help individuals with learning differences and overall make your viz more consistent/understandable. This is just the tip of the iceberg. Additional suggestions here.

Course Announcements

  • 🎃 Happy Halloween!
    • No OH today after class
    • Make up: tomorrow 10-11AM (CSB 243)
  • 🔬 There is NO lab to turn in this week
    • Lab will be used for midterm review
    • Come to lab with questions!
  • 🏫 Midterm is due next Monday at 11:59PM
    • will be released 5PM Friday
    • completed individually
    • open notes/open Internet
    • NOT timed
    • typically takes students ~2.5h to complete (big range: 2-15h)
  • Lab05 & HW03 will be released next Monday

Suggested Reading

Introduction to Modern Statistics Chapter 8: Linear Regression with Multiple Predictors

Agenda

  • Multiple Linear Regression
    • Multiple predictors
    • Main vs interaction effects
    • Model comparison
    • Backward selection

Packages & Data

library(tidyverse)
library(tidymodels)

Data: Paris Paintings

pp <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/paris_paintings.csv", 
               na = c("n/a", "", "NA")) |> 
  mutate(log_price = log(price))
  • Number of observations: 3393
  • Number of variables: 62

Two numerical predictors

Multiple predictors

  • Response variable: log_price
  • Explanatory variables: Width and height
lin_mod <- linear_reg() |>
  set_engine("lm")

pp_fit <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4.77     0.0579      82.4  0       
2 Width_in      0.0269   0.00373      7.22 6.58e-13
3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4

Linear model with multiple predictors

# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4.77     0.0579      82.4  0       
2 Width_in      0.0269   0.00373      7.22 6.58e-13
3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4


\[\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height\]

❓ How do we interpret this model?

Numerical and categorical predictors

Price, surface area, and living artist

  • Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living
  • First visualize and explore, then model
  • But first, prep the data:
pp <- pp |>
  mutate(artistliving = case_when(artistliving == 0 ~ "Deceased", 
                                  TRUE ~ "Living"))

pp |>
  count(artistliving)
# A tibble: 2 × 2
  artistliving     n
  <chr>        <int>
1 Deceased      2937
2 Living         456

Typical surface area

Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.

ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
  geom_histogram(binwidth = 500) + 
  facet_grid(artistliving ~ .) +
  scale_fill_manual(values = c("#E48957", "#071381")) +
  guides(fill = "none") +
  labs(x = "Surface area", y = NULL) +
  geom_vline(xintercept = 1000) +
  geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")

Narrowing the scope

For simplicity let’s focus on the paintings with Surface < 5000:

pp_Surf_lt_5000 <- pp |>
  filter(Surface < 5000)

ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  labs(color = "Artist", shape = "Artist") +
  scale_color_manual(values = c("#E48957", "#071381"))

Facet to get a better look

ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~artistliving) +
  scale_color_manual(values = c("#E48957", "#071381")) +
  labs(color = "Artist", shape = "Artist")

Two ways to model

  • Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living.
  • Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living.

Interacting explanatory variables

  • Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
  • This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
  • This can be accomplished by adding an interaction variable: the product of two explanatory variables.

Two ways to model

  • Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living
  • Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living

❓ Which does your intuition/knowledge of the data suggest is more appropriate?

Put a green sticky if you think main; pink if you think interaction.

Fit model with main effects

  • Response variable: log_price
  • Explanatory variables: Surface area and artistliving
pp_main_fit <- lin_mod |>
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
# A tibble: 3 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        4.88     0.0424       115.   0       
2 Surface            0.000265 0.0000415      6.39 1.85e-10
3 artistlivingLiving 0.137    0.0970         1.41 1.57e- 1

\[\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving\]

Solving the model

  • Non-living artist: Plug in 0 for artistliving

\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 0\)
\(= 4.88 + 0.000265 \times surface\)

  • Living artist: Plug in 1 for artistliving

\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 1\)
\(= 5.017 + 0.000265 \times surface\)

Visualizing main effects

  • Same slope: Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists.
  • Different intercept: Paintings by living artists are consistently more expensive than paintings by non-living artists.

Interpreting main effects

tidy(pp_main_fit) |> 
  mutate(exp_estimate = exp(estimate)) |>
  select(term, estimate, exp_estimate)
# A tibble: 3 × 3
  term               estimate exp_estimate
  <chr>                 <dbl>        <dbl>
1 (Intercept)        4.88           132.  
2 Surface            0.000265         1.00
3 artistlivingLiving 0.137            1.15
  • All else held constant, for each additional square inch in painting’s surface area, the price of the painting is predicted, on average, to be higher by a factor of 1.
  • All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
  • Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres.

Main vs. interaction effects

  • The way we specified our main effects model only lets artistliving affect the intercept.
  • Model implicitly assumes that paintings with living and deceased artists have the same slope and only allows for different intercepts.

❓ What seems more appropriate in this case?

  • Same slope and same intercept for both colors
  • Same slope and different intercept for both colors
  • Different slope and different intercept for both colors

Interaction: Surface * artistliving

Fit model with interaction effects

  • Response variable: log_price
  • Explanatory variables: Surface area, artistliving, and their interaction
pp_int_fit <- lin_mod |>
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
# A tibble: 4 × 5
  term                        estimate std.error statistic    p.value
  <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                 4.91     0.0432       114.   0         
2 Surface                     0.000206 0.0000442      4.65 0.00000337
3 artistlivingLiving         -0.126    0.119         -1.06 0.289     
4 Surface:artistlivingLiving  0.000479 0.000126       3.81 0.000139  

Linear model with interaction effects

# A tibble: 4 × 5
  term                        estimate std.error statistic    p.value
  <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                 4.91     0.0432       114.   0         
2 Surface                     0.000206 0.0000442      4.65 0.00000337
3 artistlivingLiving         -0.126    0.119         -1.06 0.289     
4 Surface:artistlivingLiving  0.000479 0.000126       3.81 0.000139  

\[\widehat{log\_price} = 4.91 + 0.00021 \times surface - 0.126 \times artistliving\] \[+ ~ 0.00048 \times surface * artistliving\]

Interpretation of interaction effects

  • Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes)
  • Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercept).
  • Non-living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 0 + 0.00048 \times surface \times 0\) \(= 4.91 + 0.00021 \times surface\)
  • Living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 1 + 0.00048 \times surface \times 1\) \(= 4.91 + 0.00021 \times surface\) \(- 0.126 + 0.00048 \times surface\) \(= 4.784 + 0.00069 \times surface\)

Comparing models

R-squared

  • \(R^2\) is the percentage of variability in the response variable explained by the regression model.
glance(pp_main_fit)$r.squared
[1] 0.01320884
glance(pp_int_fit)$r.squared
[1] 0.0176922
  • Clearly the model with interactions has a higher \(R^2\).
  • However using \(R^2\) for model selection in models with multiple explanatory variables is not a good idea as \(R^2\) increases when any variable is added to the model.

Adjusted R-squared

It appears that adding the interaction actually increased adjusted \(R^2\), so we should indeed use the model with the interactions.

glance(pp_main_fit)$adj.r.squared
[1] 0.01258977
glance(pp_int_fit)$adj.r.squared
[1] 0.01676753

Third order interactions

  • Can you? Yes
  • Should you? Probably not if you want to interpret these interactions in context of the data.

In pursuit of Occam’s razor

  • Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

  • Model selection follows this principle.

  • We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

  • In other words, we prefer the simplest best model, i.e. parsimonious model.

Backward selection

For this demo, we’ll ignore interaction effects…and just model main effects to start:

pp_full <-  lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface + artistliving, data=pp) 

glance(pp_full)$adj.r.squared
[1] 0.02570141
  • \(R^2\) (full): 0.0257014

Remove artistliving

pp_noartist <- lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface, data=pp) 

glance(pp_noartist)$adj.r.squared
[1] 0.02579859

. . .

  • \(R^2\) (full): 0.0257
  • \(R^2\) (no artistliving): 0.0258

…Improved variance explained

. . .

Remove Surface

pp_noartist_nosurface <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data=pp) 

glance(pp_noartist_nosurface)$adj.r.squared
[1] 0.02231559

. . .

  • \(R^2\) (full): 0.0257
  • \(R^2\) (no artistliving): 0.0258
  • \(R^2\) (no artistliving or Surface): 0.0223

. . .

…no longer gaining improvement, so we stick with: log_price ~ Width_in + Height_in + Surface

Other approach:

# requires package installation: 
# install.packages("olsrr")
library(olsrr)

Step 1: Fit model (w/o tidymodels)

# fit the model (not using tidymodels)
mod <- lm(log_price ~ Width_in + Height_in + Surface + artistliving, data=pp_Surf_lt_5000)

Step 2: Determine which variables to remove

ols_step_backward_p(mod)

                             Elimination Summary                               
------------------------------------------------------------------------------
        Variable                      Adj.                                        
Step      Removed       R-Square    R-Square     C(p)        AIC         RMSE     
------------------------------------------------------------------------------
   1    artistliving      0.0261      0.0251    3.8495    12603.7727    1.8315    
------------------------------------------------------------------------------

…specifies that artistliving should be removed

Step 2 (alternate): Compare all possible models…

ols_step_all_possible(mod) |>
  arrange(desc(adjr))
   Index N                              Predictors     R-Square Adj. R-Square
1     11 3              Width_in Height_in Surface 0.0260749939  0.0251349118
2     15 4 Width_in Height_in Surface artistliving 0.0263412027  0.0250876993
3      5 2                      Width_in Height_in 0.0256902566  0.0250634893
4     12 3         Width_in Height_in artistliving 0.0259732581  0.0250330779
5      6 2                        Width_in Surface 0.0249136264  0.0242863596
6     13 3           Width_in Surface artistliving 0.0251787948  0.0242378477
7      7 2                   Width_in artistliving 0.0212864021  0.0206568018
8      1 1                                Width_in 0.0209415833  0.0206267736
9      8 2                    Surface artistliving 0.0132088377  0.0125897717
10     2 1                                 Surface 0.0125899681  0.0122803381
11    14 3          Height_in Surface artistliving 0.0130836930  0.0121310711
12     9 2                       Height_in Surface 0.0126782901  0.0120431523
13    10 2                  Height_in artistliving 0.0062698155  0.0056305552
14     3 1                               Height_in 0.0058797727  0.0055601199
15     4 1                            artistliving 0.0005531617  0.0002397573
   Mallow's Cp
1     3.849487
2     5.000000
3     3.077206
4     4.174132
5     5.555476
6     6.709309
7    17.130153
8    16.230489
9    51.971190
10   52.001268
11   45.305459
12   44.599123
13   65.048926
14   64.293574
15   91.485605

End of Material Covered on Midterm

Recap

  • Can you model and interpret linear models with multiple predictors?
  • Can you explain the difference in a model with main effects vs. interaction effects?
  • Can you compare different models and determine how to proceed?
  • Can you carry out and explain backward selection?