library(tidyverse)
library(tidymodels)
09-mlr
Multiple Linear Regression
Q&A
Q: What are some ways we can make our visuals accessible? For example, how can we account for color blind people when incorporating graphs and figures in our presentation?
A: Love this question! One way to do this is to use tools built for exactly this purpose. For color blindness, for example, ColorBrewer2 allows you to toggle for color palettes that are "colorblind safe." Similarly, for written content online, ensure that you're including alt-text for all images, to enable understanding by those with vision impairments/differences. For those with learning differences (e.g., dyslexia, ADHD), there is research out there with respect to viz, and I should make myself more familiar with it, but avoiding very similar acronyms/initialisms for different concepts and using consistent icons (rather than letters) across visualizations can really help individuals with learning differences and overall make your viz more consistent and understandable. This is just the tip of the iceberg. Additional suggestions here.
Course Announcements
- 🎃 Happy Halloween!
- No OH today after class
- Make up: tomorrow 10-11AM (CSB 243)
- 🔬 There is NO lab to turn in this week
- Lab will be used for midterm review
- Come to lab with questions!
- 🏫 Midterm is due next Monday at 11:59PM
- will be released 5PM Friday
- completed individually
- open notes/open Internet
- NOT timed
- typically takes students ~2.5h to complete (big range: 2-15h)
- Lab05 & HW03 will be released next Monday
Suggested Reading
Introduction to Modern Statistics Chapter 8: Linear Regression with Multiple Predictors
Agenda
- Multiple Linear Regression
- Multiple predictors
- Main vs interaction effects
- Model comparison
- Backward selection
Packages & Data
. . .
Data: Paris Paintings
pp <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/paris_paintings.csv",
               na = c("n/a", "", "NA")) |>
  mutate(log_price = log(price))
- Number of observations: 3393
- Number of variables: 62
Two numerical predictors
Multiple predictors
- Response variable: log_price
- Explanatory variables: width (Width_in) and height (Height_in)
. . .
lin_mod <- linear_reg() |>
  set_engine("lm")

pp_fit <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.77 0.0579 82.4 0
2 Width_in 0.0269 0.00373 7.22 6.58e-13
3 Height_in -0.0133 0.00395 -3.36 7.93e- 4
Linear model with multiple predictors
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.77 0.0579 82.4 0
2 Width_in 0.0269 0.00373 7.22 6.58e-13
3 Height_in -0.0133 0.00395 -3.36 7.93e- 4
\[\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height\]
. . .
❓ How do we interpret this model?
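One way to build intuition before the next slides (a sketch; the coefficient values are hard-coded from the tidy(pp_fit) output above): because the response is log(price), exponentiating a coefficient gives the multiplicative change in predicted price for a one-unit increase in that predictor.

```r
# Coefficients copied from the tidy(pp_fit) output above
b_width  <- 0.0269
b_height <- -0.0133

# With a log-transformed response, exp(coefficient) is the multiplicative
# change in predicted price per one-inch increase, holding the other
# predictor constant.
exp(b_width)   # ~1.027: each extra inch of width -> ~2.7% higher predicted price
exp(b_height)  # ~0.987: each extra inch of height -> ~1.3% lower predicted price
```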
Numerical and categorical predictors
Price, surface area, and living artist
- Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living
- First visualize and explore, then model
. . .
- But first, prep the data:
pp <- pp |>
  mutate(artistliving = case_when(artistliving == 0 ~ "Deceased",
                                  TRUE ~ "Living"))

pp |>
  count(artistliving)
# A tibble: 2 × 2
artistliving n
<chr> <int>
1 Deceased 2937
2 Living 456
Typical surface area
Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.
ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
geom_histogram(binwidth = 500) +
facet_grid(artistliving ~ .) +
scale_fill_manual(values = c("#E48957", "#071381")) +
guides(fill = "none") +
labs(x = "Surface area", y = NULL) +
geom_vline(xintercept = 1000) +
geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")
Narrowing the scope
For simplicity let’s focus on the paintings with Surface < 5000:

pp_Surf_lt_5000 <- pp |>
  filter(Surface < 5000)
ggplot(data = pp_Surf_lt_5000,
aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
geom_point(alpha = 0.5) +
labs(color = "Artist", shape = "Artist") +
scale_color_manual(values = c("#E48957", "#071381"))
Facet to get a better look
ggplot(data = pp_Surf_lt_5000,
aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
geom_point(alpha = 0.5) +
facet_wrap(~artistliving) +
scale_color_manual(values = c("#E48957", "#071381")) +
labs(color = "Artist", shape = "Artist")
Two ways to model
- Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living.
- Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living.
Interacting explanatory variables
- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
- This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two explanatory variables.
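In R's formula syntax this product is created automatically: x1 * x2 expands to x1 + x2 + x1:x2. A tiny sketch with made-up data (the data frame d here is illustrative, not from the paintings data):

```r
# x1 * x2 in a formula expands to main effects plus their product
d  <- data.frame(x1 = c(1, 2, 3), x2 = c(0, 1, 1))
mm <- model.matrix(~ x1 * x2, data = d)

colnames(mm)    # "(Intercept)" "x1" "x2" "x1:x2"
mm[, "x1:x2"]   # 0 2 3 -- literally the elementwise product of x1 and x2
```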
Two ways to model
- Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living
- Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living
. . .
❓ Which does your intuition/knowledge of the data suggest is more appropriate?
Put a green sticky if you think main; pink if you think interaction.
Fit model with main effects
- Response variable: log_price
- Explanatory variables: Surface area and artistliving
pp_main_fit <- lin_mod |>
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.88 0.0424 115. 0
2 Surface 0.000265 0.0000415 6.39 1.85e-10
3 artistlivingLiving 0.137 0.0970 1.41 1.57e- 1
. . .
\[\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving\]
Solving the model
- Non-living artist: Plug in 0 for artistliving
\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 0\)
\(= 4.88 + 0.000265 \times surface\)
. . .
- Living artist: Plug in 1 for artistliving
\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 1\)
\(= 5.017 + 0.000265 \times surface\)
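The arithmetic above can be checked directly (a sketch; coefficients hard-coded from the tidy(pp_main_fit) output above):

```r
# Main-effects model: same slope for both groups, intercept shifted by the
# artistliving coefficient
intercept <- 4.88
b_surface <- 0.000265
b_living  <- 0.137

intercept + b_living * 0  # 4.88  -- intercept for deceased artists
intercept + b_living * 1  # 5.017 -- intercept for living artists
```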
Visualizing main effects
- Same slope: Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists.
- Different intercept: Paintings by living artists are consistently more expensive than paintings by non-living artists.
Interpreting main effects
tidy(pp_main_fit) |>
mutate(exp_estimate = exp(estimate)) |>
select(term, estimate, exp_estimate)
# A tibble: 3 × 3
term estimate exp_estimate
<chr> <dbl> <dbl>
1 (Intercept) 4.88 132.
2 Surface 0.000265 1.00
3 artistlivingLiving 0.137 1.15
- All else held constant, for each additional square inch in a painting’s surface area, the price of the painting is predicted, on average, to be higher by a factor of exp(0.000265) ≈ 1.0003 (the printed 1.00 is rounded).
- All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
- Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres.
Main vs. interaction effects
- The way we specified our main effects model only lets artistliving affect the intercept.
- The model implicitly assumes that paintings with living and deceased artists have the same slope and only allows for different intercepts.
. . .
❓ What seems more appropriate in this case?
- Same slope and same intercept for both colors
- Same slope and different intercept for both colors
- Different slope and different intercept for both colors
Interaction: Surface * artistliving
Fit model with interaction effects
- Response variable: log_price
- Explanatory variables: Surface area, artistliving, and their interaction
pp_int_fit <- lin_mod |>
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.91 0.0432 114. 0
2 Surface 0.000206 0.0000442 4.65 0.00000337
3 artistlivingLiving -0.126 0.119 -1.06 0.289
4 Surface:artistlivingLiving 0.000479 0.000126 3.81 0.000139
Linear model with interaction effects
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.91 0.0432 114. 0
2 Surface 0.000206 0.0000442 4.65 0.00000337
3 artistlivingLiving -0.126 0.119 -1.06 0.289
4 Surface:artistlivingLiving 0.000479 0.000126 3.81 0.000139
\[\widehat{log\_price} = 4.91 + 0.00021 \times surface - 0.126 \times artistliving\] \[+ ~ 0.00048 \times surface \times artistliving\]
Interpretation of interaction effects
- Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes)
- Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercept).
. . .
- Non-living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 0 + 0.00048 \times surface \times 0\) \(= 4.91 + 0.00021 \times surface\)
- Living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 1 + 0.00048 \times surface \times 1\) \(= 4.91 + 0.00021 \times surface\) \(- 0.126 + 0.00048 \times surface\) \(= 4.784 + 0.00069 \times surface\)
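Equivalently, the two fitted lines can be recovered from the interaction model's coefficients (a sketch; values hard-coded from the tidy(pp_int_fit) output above):

```r
b0       <- 4.91      # intercept (deceased baseline)
b_surf   <- 0.00021   # Surface slope (deceased baseline)
b_living <- -0.126    # shift in intercept for living artists
b_int    <- 0.00048   # shift in slope for living artists

# Deceased: intercept b0, slope b_surf
# Living:   both intercept and slope shift by the artistliving terms
c(intercept_living = b0 + b_living,   # 4.784
  slope_living     = b_surf + b_int)  # 0.00069
```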
Comparing models
R-squared
- \(R^2\) is the percentage of variability in the response variable explained by the regression model.
glance(pp_main_fit)$r.squared
[1] 0.01320884
glance(pp_int_fit)$r.squared
[1] 0.0176922
. . .
- Clearly the model with interactions has a higher \(R^2\).
. . .
- However using \(R^2\) for model selection in models with multiple explanatory variables is not a good idea as \(R^2\) increases when any variable is added to the model.
Adjusted R-squared
It appears that adding the interaction actually increased adjusted \(R^2\), so we should indeed use the model with the interactions.
glance(pp_main_fit)$adj.r.squared
[1] 0.01258977
glance(pp_int_fit)$adj.r.squared
[1] 0.01676753
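The reason adjusted \(R^2\) can decrease when a variable adds little is its penalty for model size. With \(n\) observations and \(k\) predictors:

\[R^2_{adj} = 1 - (1 - R^2) \times \frac{n - 1}{n - k - 1}\]

Adding a predictor never decreases \(R^2\), but it does increase \(k\); adjusted \(R^2\) improves only when the gain in \(R^2\) outweighs the larger penalty factor.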
Third order interactions
- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in context of the data.
In pursuit of Occam’s razor
Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
Model selection follows this principle.
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
In other words, we prefer the simplest best model, i.e. parsimonious model.
Backward selection
For this demo, we’ll ignore interaction effects…and just model main effects to start:
pp_full <- lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface + artistliving, data = pp)
glance(pp_full)$adj.r.squared
[1] 0.02570141
- Adjusted \(R^2\) (full): 0.0257
. . .
Remove artistliving
pp_noartist <- lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface, data = pp)
glance(pp_noartist)$adj.r.squared
[1] 0.02579859
. . .
- Adjusted \(R^2\) (full): 0.0257
- Adjusted \(R^2\) (no artistliving): 0.0258
…Improved variance explained
. . .
Remove Surface
pp_noartist_nosurface <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data = pp)
glance(pp_noartist_nosurface)$adj.r.squared
[1] 0.02231559
. . .
- Adjusted \(R^2\) (full): 0.0257
- Adjusted \(R^2\) (no artistliving): 0.0258
- Adjusted \(R^2\) (no artistliving or Surface): 0.0223
. . .
…no longer gaining improvement, so we stick with: log_price ~ Width_in + Height_in + Surface
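The drop-one procedure above can be sketched as a small loop. This helper (step_back is a hypothetical name, not a tidymodels or olsrr function) repeatedly drops whichever predictor most improves adjusted \(R^2\) and stops when no removal helps:

```r
step_back <- function(response, predictors, data) {
  # adjusted R-squared of a model using the given predictors
  adj_r2 <- function(preds) {
    f <- reformulate(preds, response = response)
    summary(lm(f, data = data))$adj.r.squared
  }
  current <- adj_r2(predictors)
  repeat {
    if (length(predictors) == 1) break
    # adjusted R-squared of each model that drops one predictor
    candidates <- sapply(seq_along(predictors),
                         function(i) adj_r2(predictors[-i]))
    if (max(candidates) <= current) break
    predictors <- predictors[-which.max(candidates)]
    current    <- max(candidates)
  }
  list(predictors = predictors, adj_r_squared = current)
}

# e.g., on the built-in mtcars data:
res <- step_back("mpg", c("wt", "hp", "qsec", "gear"), mtcars)
res
```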
Other approach:
# requires package installation:
# install.packages("olsrr")
library(olsrr)
. . .
Step 1: Fit model (w/o tidymodels)

# fit the model (not using tidymodels)
mod <- lm(log_price ~ Width_in + Height_in + Surface + artistliving, data = pp_Surf_lt_5000)
. . .
Step 2: Determine which variables to remove
ols_step_backward_p(mod)
Elimination Summary
------------------------------------------------------------------------------
Variable Adj.
Step Removed R-Square R-Square C(p) AIC RMSE
------------------------------------------------------------------------------
1 artistliving 0.0261 0.0251 3.8495 12603.7727 1.8315
------------------------------------------------------------------------------
…specifies that artistliving should be removed
. . .
Step 2 (alternate): Compare all possible models…
ols_step_all_possible(mod) |>
arrange(desc(adjr))
   Index N Predictors                               R-Square     Adj. R-Square Mallow's Cp
1     11 3 Width_in Height_in Surface              0.0260749939  0.0251349118    3.849487
2     15 4 Width_in Height_in Surface artistliving 0.0263412027  0.0250876993    5.000000
3      5 2 Width_in Height_in                      0.0256902566  0.0250634893    3.077206
4     12 3 Width_in Height_in artistliving         0.0259732581  0.0250330779    4.174132
5      6 2 Width_in Surface                        0.0249136264  0.0242863596    5.555476
6     13 3 Width_in Surface artistliving           0.0251787948  0.0242378477    6.709309
7      7 2 Width_in artistliving                   0.0212864021  0.0206568018   17.130153
8      1 1 Width_in                                0.0209415833  0.0206267736   16.230489
9      8 2 Surface artistliving                    0.0132088377  0.0125897717   51.971190
10     2 1 Surface                                 0.0125899681  0.0122803381   52.001268
11    14 3 Height_in Surface artistliving          0.0130836930  0.0121310711   45.305459
12     9 2 Height_in Surface                       0.0126782901  0.0120431523   44.599123
13    10 2 Height_in artistliving                  0.0062698155  0.0056305552   65.048926
14     3 1 Height_in                               0.0058797727  0.0055601199   64.293574
15     4 1 artistliving                            0.0005531617  0.0002397573   91.485605
End of Material Covered on Midterm
Recap
- Can you model and interpret linear models with multiple predictors?
- Can you explain the difference in a model with main effects vs. interaction effects?
- Can you compare different models and determine how to proceed?
- Can you carry out and explain backward selection?