07-linear-models

Author

Professor Shannon Ellis

Published

October 24, 2023

Linear Models

Q&A

Q: I was just wondering if you could still provide those videos you talked about with like syntax stuff so that we can follow along. I would also appreciate if it was really through with like a walk through of how you got to each point
A: I like this suggestion, and I’ll try to make these. No promises though. Just have to find magical time in the schedule to make them. Note that lectures will be the more detailed walk through with more time on your own to figure it out (rather than detailed walk through outside of class), so you will get the information…the time/location will just differ. But, I’ll try to make some supplemental videos for those interested.

Q: Had a question about the modeling and the last bit showing how decreasing the alpha showed greater clustering at the bottom left corner. If the bottom left corner looks like 0 height and 0 width, how does that translate into the dimensions of an actual painting?
A: Good observation. The follow-up to this is…are there any paintings with zero width or zero height? And, if you dig in the data (i.e. min(pp$Height_in, na.rm=TRUE)), you’ll see that there are some very small paintings, but that none are zero.

Q: If we want to built our own model, can we plot them with ggplot2?
A: Yup! This post starts to get at that. It does so for a linear model, but the logic follows for other models.

Q: For the paintings dataset, how could we perform EDA on specific subject matter, like seeing how many portraits include Jesus as part of the subject?
A: Love this question. There is a whole field of natural language processing that would have sophisticated ways to analyze this. A simple first pass would be to, for example, filter for paintings that include “Jesus” in the subject column.

Q: I think the segmented bar plots based on proportion seem difficult to read. I’m not sure why we should be using this instead of the stacked plots?
A: Grouped bar blots are typically most quickly understood. Proportion stacked plots are then easiest to understand proportion across categories. Stacked plots of raw numbers take longer (for most) to understand and thus are often avoided, but like all viz, it depends on context and audience.

Q: In the last lecture, we talked about how segmented bar plots might not be the ideal choice for data visualization, but we still demonstrated them in today’s lecture. So, specifically in what cases should we choose segmented bar plots as a data analysis tool?
A: They can be helpful when the audience is familiar with them, but typically are most helpful when you want to display relative proportions rather than counts

Course Announcements

Due Dates:

  • Lab 04 due Friday
    • Model Interpretations
    • Text, code, & viz all matter
  • Lecture Participation survey “due” after class
  • HW02 due Monday (10/30; 11:59 PM)
  • discuss displaying image in Markdown

Agenda

  • Linear Models
    • Quantitative Predictor
    • Categorical Predictor (2 & >2 levels)
    • residuals
    • data transformations

Suggested Reading

tidymodels

Data: Paris Paintings

pp <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/paris_paintings.csv", 
               na = c("n/a", "", "NA"))
  • Number of observations: 3393
  • Number of variables: 61

Goal: Predict height from width

\[\widehat{height}_{i} = \beta_0 + \beta_1 \times width_{i}\]

tidymodels

  • NOT a core tidyverse package
  • follows the structure of a tidyverse package

# should already be installed for you on datahub
library(tidymodels)

Step 1: Specify model

linear_reg()
Linear Regression Model Specification (regression)

Computational engine: lm 

Step 2: Set model fitting engine

linear_reg() |>
  set_engine("lm") # lm: linear model
Linear Regression Model Specification (regression)

Computational engine: lm 

Step 3: Fit model & estimate parameters

… using formula syntax

linear_reg() |>
  set_engine("lm") |>
  fit(Height_in ~ Width_in, data = pp)
parsnip model object


Call:
stats::lm(formula = Height_in ~ Width_in, data = data)

Coefficients:
(Intercept)     Width_in  
     3.6214       0.7808  

A closer look at model output

parsnip model object


Call:
stats::lm(formula = Height_in ~ Width_in, data = data)

Coefficients:
(Intercept)     Width_in  
     3.6214       0.7808  

\[\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}\]

A tidy look at model output

linear_reg() |>
  set_engine("lm") |>
  fit(Height_in ~ Width_in, data = pp) |>
  tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    3.62    0.254        14.3 8.82e-45
2 Width_in       0.781   0.00950      82.1 0       

\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]

Slope and intercept

\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]

  • Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.781 inches.
  • Intercept: Paintings that are 0 inches wide are expected to be 3.62 inches high, on average. (Does this make sense?)

Correlation does not imply causation

Remember this when interpreting model coefficients

Source: XKCD, Cell phones

Parameter Estimation

Linear model with a single predictor

  • We’re interested in \(\beta_0\) (population parameter for the intercept) and \(\beta_1\) (population parameter for the slope) in the following model:

\[\hat{y}_{i} = \beta_0 + \beta_1~x_{i}\]

  • Tough luck, you can’t have them…
  • So we use sample statistics to estimate them:

\[\hat{y}_{i} = b_0 + b_1~x_{i}\]

Least squares regression

  • The regression line minimizes the sum of squared residuals.
  • If \(e_i = y_i - \hat{y}_i\), then, the regression line minimizes \(\sum_{i = 1}^n e_i^2\).

Visualizing residuals

Visualizing residuals (cont.)

Visualizing residuals (cont.)

Properties of least squares regression

  • The regression line goes through the center of mass point, the coordinates corresponding to average \(x\) and average \(y\), \((\bar{x}, \bar{y})\):

\[\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}\]

  • The slope has the same sign as the correlation coefficient: \(b_1 = r \frac{s_y}{s_x}\)
  • The sum of the residuals is zero: \(\sum_{i = 1}^n e_i = 0\)
  • The residuals and \(x\) values are uncorrelated

Models with categorical explanatory variables

Categorical predictor with 2 levels

# A tibble: 3,393 × 3
   name      Height_in landsALL
   <chr>         <dbl>    <dbl>
 1 L1764-2          37        0
 2 L1764-3          18        0
 3 L1764-4          13        1
 4 L1764-5a         14        1
 5 L1764-5b         14        1
 6 L1764-6           7        0
 7 L1764-7a          6        0
 8 L1764-7b          6        0
 9 L1764-8          15        0
10 L1764-9a          9        0
11 L1764-9b          9        0
12 L1764-10a        16        1
13 L1764-10b        16        1
14 L1764-10c        16        1
15 L1764-11         20        0
16 L1764-12a        14        1
17 L1764-12b        14        1
18 L1764-13a        15        1
19 L1764-13b        15        1
20 L1764-14         37        0
# ℹ 3,373 more rows
  • landsALL = 0: No landscape features
  • landsALL = 1: Some landscape features

Height & landscape features

m_ht_lands <- linear_reg() |>
  set_engine("lm") |>
  fit(Height_in ~ factor(landsALL), data = pp)

m_ht_lands |> tidy()
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)          22.7      0.328      69.1 0       
2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26

Height & landscape features

\[\widehat{Height_{in}} = 22.7 - 5.645~landsALL\]

  • Slope: Paintings with landscape features are expected, on average, to be 5.645 inches shorter than paintings that without landscape features
    • Compares baseline level (landsALL = 0) to the other level (landsALL = 1)
  • Intercept: Paintings that don’t have landscape features are expected, on average, to be 22.7 inches tall

Categorical predictor with >2 levels

# A tibble: 3,393 × 3
   name      Height_in school_pntg
   <chr>         <dbl> <chr>      
 1 L1764-2          37 F          
 2 L1764-3          18 I          
 3 L1764-4          13 D/FL       
 4 L1764-5a         14 F          
 5 L1764-5b         14 F          
 6 L1764-6           7 I          
 7 L1764-7a          6 F          
 8 L1764-7b          6 F          
 9 L1764-8          15 I          
10 L1764-9a          9 D/FL       
11 L1764-9b          9 D/FL       
12 L1764-10a        16 X          
13 L1764-10b        16 X          
14 L1764-10c        16 X          
15 L1764-11         20 D/FL       
16 L1764-12a        14 D/FL       
17 L1764-12b        14 D/FL       
18 L1764-13a        15 D/FL       
19 L1764-13b        15 D/FL       
20 L1764-14         37 F          
# ℹ 3,373 more rows
  • school from which painting came (details in a few slides)

Relationship between height and school

linear_reg() |>
  set_engine("lm") |>
  fit(Height_in ~ school_pntg, data = pp) |>
  tidy()
# A tibble: 7 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)        14.0       10.0     1.40  0.162  
2 school_pntgD/FL     2.33      10.0     0.232 0.816  
3 school_pntgF       10.2       10.0     1.02  0.309  
4 school_pntgG        1.65      11.9     0.139 0.889  
5 school_pntgI       10.3       10.0     1.02  0.306  
6 school_pntgS       30.4       11.4     2.68  0.00744
7 school_pntgX        2.87      10.3     0.279 0.780  

Dummy variables

# A tibble: 7 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)        14.0       10.0     1.40  0.162  
2 school_pntgD/FL     2.33      10.0     0.232 0.816  
3 school_pntgF       10.2       10.0     1.02  0.309  
4 school_pntgG        1.65      11.9     0.139 0.889  
5 school_pntgI       10.3       10.0     1.02  0.306  
6 school_pntgS       30.4       11.4     2.68  0.00744
7 school_pntgX        2.87      10.3     0.279 0.780  
  • When the categorical explanatory variable has many levels, they’re encoded to dummy variables
  • Each coefficient describes the expected difference between heights in that particular school compared to the baseline level

Categorical predictor with 3+ levels

school_pntg D_FL F G I S X
A 0 0 0 0 0 0
D/FL 1 0 0 0 0 0
F 0 1 0 0 0 0
G 0 0 1 0 0 0
I 0 0 0 1 0 0
S 0 0 0 0 1 0
X 0 0 0 0 0 1
# A tibble: 3,393 × 3
   name      Height_in school_pntg
   <chr>         <dbl> <chr>      
 1 L1764-2          37 F          
 2 L1764-3          18 I          
 3 L1764-4          13 D/FL       
 4 L1764-5a         14 F          
 5 L1764-5b         14 F          
 6 L1764-6           7 I          
 7 L1764-7a          6 F          
 8 L1764-7b          6 F          
 9 L1764-8          15 I          
10 L1764-9a          9 D/FL       
11 L1764-9b          9 D/FL       
12 L1764-10a        16 X          
13 L1764-10b        16 X          
14 L1764-10c        16 X          
15 L1764-11         20 D/FL       
16 L1764-12a        14 D/FL       
17 L1764-12b        14 D/FL       
18 L1764-13a        15 D/FL       
19 L1764-13b        15 D/FL       
20 L1764-14         37 F          
# ℹ 3,373 more rows

The linear model with multiple predictors

  • Population model:

\[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]

  • Sample model that we use to estimate the population model:

\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]

Relationship b/w height and school

# A tibble: 7 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)        14.0       10.0     1.40  0.162  
2 school_pntgD/FL     2.33      10.0     0.232 0.816  
3 school_pntgF       10.2       10.0     1.02  0.309  
4 school_pntgG        1.65      11.9     0.139 0.889  
5 school_pntgI       10.3       10.0     1.02  0.306  
6 school_pntgS       30.4       11.4     2.68  0.00744
7 school_pntgX        2.87      10.3     0.279 0.780  
  • Austrian school (A) paintings are expected, on average, to be 14 inches tall.
  • Dutch/Flemish school (D/FL) paintings are expected, on average, to be 2.33 inches taller than Austrian school paintings.
  • French school (F) paintings are expected, on average, to be 10.2 inches taller than Austrian school paintings.
  • German school (G) paintings are expected, on average, to be 1.65 inches taller than Austrian school paintings.
  • Italian school (I) paintings are expected, on average, to be 10.3 inches taller than Austrian school paintings.
  • Spanish school (S) paintings are expected, on average, to be 30.4 inches taller than Austrian school paintings.
  • Paintings whose school is unknown (X) are expected, on average, to be 2.87 inches taller than Austrian school paintings. ]

Prediction with models

Predict height from width

❓ On average, how tall are paintings that are 60 inches wide? \[\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\]

3.62 + 0.78 * 60
[1] 50.42

“On average, we expect paintings that are 60 inches wide to be 50.42 inches high.”

Warning: We “expect” this to happen, but there will be some variability. (We’ll learn about measuring the variability around the prediction later.)

Prediction vs. extrapolation

❓ On average, how tall are paintings that are 400 inches wide? \[\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\]

Watch out for extrapolation!

“When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.”1
Stephen Colbert, April 6th, 2010

Measuring model fit

Measuring the strength of the fit

  • The strength of the fit of a linear model is most commonly evaluated using \(R^2\).

  • It tells us what percent of variability in the response variable is explained by the model.

  • The remainder of the variability is explained by variables not included in the model.

  • \(R^2\) is sometimes called the coefficient of determination.

Obtaining \(R^2\) in R

  • Height vs. width
glance(m_ht_wt)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.683         0.683  8.30     6749.       0     1 -11083. 22173. 22191.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m_ht_wt)$r.squared # extract R-squared
[1] 0.6829468

Roughly 68% of the variability in heights of paintings can be explained by their widths.

  • Height vs. landscape features
glance(m_ht_lands)$r.squared
[1] 0.03456724

Exploring linearity

Data: Paris Paintings

Price vs. width

❓ Describe the relationship between price and width of paintings whose width is less than 100in.

Price vs. width

❓ Which plot shows a more linear relationship?

Price vs. width, residuals

❓ Which plot shows a residuals that are uncorrelated with predicted values from the model?

❓What’s the unit of residuals?

Transforming the data

  • We saw that price has a right-skewed distribution, and the relationship between price and width of painting is non-linear.
  • In these situations a transformation applied to the response variable may be useful.
  • In order to decide which transformation to use, we should examine the distribution of the response variable.
  • The extremely right skewed distribution suggests that a log transformation may be useful.
    • log = natural log, \(ln\)
    • Default base of the log function in R is the natural log:
      log(x, base = exp(1))

Logged price vs. width

❓ How do we interpret the slope of this model?

Interpreting models with log transformation

m_lprice_wt <- linear_reg() |>
  set_engine("lm") |>
  fit(log(price) ~ Width_in, data = pp_wt_lt_100)

m_lprice_wt |>
  tidy() |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 3))
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    4.67 
2 Width_in       0.019

Linear model with log transformation

\[ \widehat{log(price)} = 4.67 + 0.02 Width \]

  • For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 livres.
  • which is not a very useful statement…

Working with logs

  • Subtraction and logs: \(log(a) − log(b) = log(a / b)\)
  • Natural logarithm: \(e^{log(x)} = x\)
  • We can use these identities to “undo” the log transformation

Interpreting models with log transformation

The slope coefficient for the log transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.

\[ log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 \]

\[ log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 \]

\[ e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} \]

\[ \frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02 \]

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.

Shortcuts in R

m_lprice_wt |>
  tidy() |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 3))
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)    4.67 
2 Width_in       0.019
m_lprice_wt |>
  tidy() |>
  select(term, estimate) |>
  mutate(estimate = round(exp(estimate), 3))
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)   107.  
2 Width_in        1.02

Recap: Log Transformations

  • Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable.
  • The most common transformation when the response variable is right skewed is the log transform: \(log(y)\), especially useful when the response variable is (extremely) right skewed.
  • This transformation is also useful for variance stabilization.
  • When using a log transformation on the response variable the interpretation of the slope changes: *“For each unit increase in x, y is expected on average to be higher/lower*
    by a factor of \(e^{b_1}\).”
  • Another useful transformation is the square root: \(\sqrt{y}\), especially useful when the response variable is counts.

Aside: when \(y = 0\)

In some cases the value of the response variable might be 0, and

log(0)
[1] -Inf

The trick is to add a very small number to the value of the response variable for these cases so that the log function can still be applied:

log(0 + 0.00001)
[1] -11.51293

Recap

  • Can I carry out linear regression using the tidymodels approach?
  • Can I interpret and explain the results from a linear model with a single predictor?
  • Do I understand the limitations of modelling data w/ linear regression?
  • Can I describe and implement the use of a dummy variable in linear regression?
  • Can I determine when logistic transformation may be appropriate? Can I interpret these results?

Footnotes

  1. Introduction to Modern Statistics. “Extrapolation is treacherous.”↩︎