Lab05: Modelling course evaluations, Pt 2 - Answers

Author

Sean Trott

Getting Started

Load packages

library(tidyverse)
library(tidymodels)

Data

Exercise 1

First, we read in the data from the appropriate directory.

We also recreate the bty_avg column as we did in Lab 4, which is just the mean of beauty ratings from each of the relevant groups.

evals <- read_csv("data/evals-mod.csv")
evals <- evals |>
  rowwise() |>
  mutate(bty_avg = mean(c(bty_f1lower, bty_f1upper,
                          bty_f2upper, bty_m1lower,
                          bty_m1upper, bty_m2upper))) |>
  ungroup()
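The rowwise() approach works, but since this is just a row-wise mean of numeric columns, a vectorized alternative is faster and arguably more idiomatic. A sketch (pick() requires dplyr >= 1.1.0):

## Equivalent, vectorized version: rowMeans() averages the six
## beauty columns without needing rowwise()/ungroup().
evals <- evals |>
  mutate(bty_avg = rowMeans(pick(bty_f1lower, bty_f1upper,
                                 bty_f2upper, bty_m1lower,
                                 bty_m1upper, bty_m2upper)))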

Part 1

Exercise 2

As in Lab 4, we fit a model m_bty predicting score from bty_avg.

The linear equation would be written as follows:

\(\hat{y} = 3.88 + 0.07X\)

Additionally, this model has an \(R^2\) of 0.035 and an adjusted \(R^2\) of 0.0329.

m_bty <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ bty_avg, data = evals)

m_bty |>
  tidy() |>
  select(term, estimate)
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)   3.88  
2 bty_avg       0.0666
m_bty |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0350        0.0329 0.535      16.7 0.0000508     1  -366.  738.  751.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Part 2

Exercise 3

\(\hat{y} = 3.75 + 0.07X_1 + 0.17X_2\)

Where \(X_1\) is bty_avg and \(X_2\) corresponds to a binary variable that takes on the value of 0 for female professors and 1 for male professors.

This model has an \(R^2\) of 0.059 and an adjusted \(R^2\) of 0.055.

m_bty_gen <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ bty_avg + gender, data = evals)

m_bty_gen |>
  tidy() |>
  select(term, estimate)
# A tibble: 3 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)   3.75  
2 bty_avg       0.0742
3 gendermale    0.172 
m_bty_gen |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0591        0.0550 0.529      14.5 0.000000818     2  -360.  729.  745.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Exercise 4

The intercept means that female professors with a bty_avg of 0 are predicted to have a score of 3.75.

The slope for bty_avg means that for every 1-unit increase in bty_avg, the predicted score increases by approximately 0.07. This holds for both male and female professors, but predictions for male professors are also shifted by the gender coefficient: holding bty_avg constant, male professors are predicted to score approximately 0.172 higher than female professors.

m_bty_gen |>
  tidy() |>
  select(term, estimate)
# A tibble: 3 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)   3.75  
2 bty_avg       0.0742
3 gendermale    0.172 
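To see these interpretations in action, we can compare predictions for a hypothetical female and male professor with the same bty_avg (the input values here are made up for illustration). The two predictions should differ by exactly the gendermale coefficient, about 0.172.

## Hypothetical professors: same bty_avg, different gender
new_profs <- tibble(
  bty_avg = c(5, 5),
  gender  = c("female", "male")
)

predict(m_bty_gen, new_data = new_profs)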

Exercise 5

This model has an \(R^2\) of 0.059, which means that the model explains about 5.9% of the variance in score.

Exercise 6

For male professors, the coefficient for \(X_2\) (gender) can be folded into the intercept (3.75 + 0.17 ≈ 3.92), giving us a simplified expression:

\(\hat{y} = 3.92 + 0.07X_1\)
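We can verify this arithmetic directly from the fitted coefficients:

## Male-professor intercept: overall intercept + gendermale slope
## (3.75 + 0.172, i.e., about 3.92)
coefs <- m_bty_gen |> tidy()
coefs$estimate[coefs$term == "(Intercept)"] +
  coefs$estimate[coefs$term == "gendermale"]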

Exercise 7

Holding bty_avg constant, male professors are predicted to have a score that’s about 0.172 higher than female professors. Thus, according to the model, male professors with equivalent beauty ratings are predicted to receive higher scores.

Exercise 8

The adjusted \(R^2\) for the larger model (m_bty_gen) is 0.055, whereas it is 0.0329 for the smaller model (m_bty). The difference between these values is 0.0221.

This can be interpreted as follows: adding gender to a model explains an additional 2.21% of variance in score beyond the variance already explained by bty_avg.
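This difference can be computed directly from the two fitted models:

## Difference in adjusted R^2 (about 0.0221)
glance(m_bty_gen)$adj.r.squared - glance(m_bty)$adj.r.squared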

Exercise 9

The slope for bty_avg in m_bty is 0.067. The slope for bty_avg in m_bty_gen is 0.074.

The short answer is yes: adding gender has changed the slope for bty_avg. Specifically, it has increased the steepness of this slope: the predicted increase in score for each 1-unit increase in bty_avg is higher in a model that’s also equipped with gender than in a model without gender.

m_bty |>
  tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   3.88      0.0761     51.0  1.56e-191
2 bty_avg       0.0666    0.0163      4.09 5.08e-  5
m_bty_gen |>
  tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   3.75      0.0847     44.3  6.23e-168
2 bty_avg       0.0742    0.0163      4.56 6.48e-  6
3 gendermale    0.172     0.0502      3.43 6.52e-  4

Exercise 10

m_bty_rank <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ bty_avg + rank, data = evals)

m_bty_rank |>
  tidy() |>
  select(term, estimate)
# A tibble: 4 × 2
  term             estimate
  <chr>               <dbl>
1 (Intercept)        3.98  
2 bty_avg            0.0678
3 ranktenure track  -0.161 
4 ranktenured       -0.126 

\(\hat{y} = 3.98 + 0.068X_1 - 0.16X_2 - 0.13X_3\)

Where \(X_1\) is bty_avg, \(X_2\) = 1 for tenure track professors (and 0 for all else), and \(X_3\) = 1 for tenured professors (and 0 for all else).

The intercept can be interpreted as follows: teaching professors with a bty_avg of 0 are predicted to have a score of 3.98.

For each 1-unit increase in bty_avg, the predicted score increases by 0.068 for professors of all ranks. What differs across ranks is the offset applied on top of the intercept: teaching professors (the baseline) get no offset, tenure track professors have their predicted score reduced by 0.16, and tenured professors have theirs reduced by 0.13.
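As a quick illustration (the bty_avg value is made up, and the rank labels are taken from the coefficient names above), predictions for the three ranks at the same bty_avg differ only by these offsets:

## One hypothetical professor per rank level, same bty_avg
new_ranks <- tibble(
  bty_avg = rep(4, 3),
  rank    = c("teaching", "tenure track", "tenured")
)

predict(m_bty_rank, new_data = new_ranks)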

Part 3

Exercise 11

Personally, I would expect either cls_did_eval or cls_perc_eval to be the worst predictors. All the other variables seem more likely to correlate with score, either because of potential societal biases (like age or gender), or because of how they might correlate with other factors that could make a class a better or worse experience for students, on average (e.g., cls_students).

Of course, it’s possible that the proportion of students who fill out the survey (cls_perc_eval) is correlated with score: maybe people are more motivated to fill out the evaluations for really great professors. This suggests that cls_did_eval might be the worst predictor, since it won’t accurately reflect the proportion of students who chose to fill it out, and it’s confounded with cls_students.

Exercise 12

It turns out I was wrong. The worst \(R^2\) values (and adjusted \(R^2\) values) are obtained from models with cls_profs (whether the class was taught by a single professor or by multiple professors) or cls_students (the number of students).

The absolute worst model uses cls_profs (with an \(R^2\) of 0.0006). In comparison, the model with cls_did_eval is about 6x better (with an \(R^2\) of 0.004). Neither is very good at all, but my prediction/intuition was incorrect.

The coefficient for cls_profssingle (i.e., classes with a single professor) is negative, but the standard error is almost twice as large as the coefficient, suggesting a lot of uncertainty about that estimate.

## Fit with cls_students
m_bty_cls_students <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ cls_students, 
      data = evals)

m_bty_cls_students |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1  0.000674      -0.00149 0.544     0.311   0.577     1  -374.  755.  767.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## Fit with cls_profs
m_bty_cls_profs <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ cls_profs, 
      data = evals)

m_bty_cls_profs |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1  0.000649      -0.00152 0.544     0.299   0.585     1  -374.  755.  767.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
m_bty_cls_profs |>
  tidy()
# A tibble: 2 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)       4.18      0.0311   134.      0    
2 cls_profssingle  -0.0292    0.0534    -0.547   0.585
## Fit model with the variable I predicted would be worst
m_bty_cls_did_eval <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ cls_did_eval, 
      data = evals)

m_bty_cls_did_eval |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1   0.00395       0.00179 0.543      1.83   0.177     1  -374.  753.  766.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Exercise 13

If we’re already going to include cls_students (the number of students in a class) and cls_perc_eval (the percentage who filled out the evaluations), we wouldn’t need cls_did_eval: it is fully determined (up to rounding) by the other two, since cls_did_eval is just cls_students multiplied by cls_perc_eval / 100.

Including this parameter shouldn’t improve the fit of the model, and it’ll only negatively impact metrics like AIC or adjusted R-squared that take into account the number of parameters in the model.
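A quick sanity check of this claim (assuming cls_perc_eval is on a 0-100 scale):

## cls_did_eval should be recoverable (up to rounding) from the
## other two columns
evals |>
  mutate(did_eval_check = cls_students * cls_perc_eval / 100) |>
  select(cls_students, cls_perc_eval, cls_did_eval, did_eval_check) |>
  head()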

Exercise 14

Here, we fit a full model with every parameter but cls_did_eval.

The adjusted \(R^2\) is 0.141.

m_bty_full <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ bty_avg + rank +
        ethnicity + gender + language + age +
        cls_perc_eval + cls_students + cls_level +
        cls_profs + cls_credits, 
      data = evals)

m_bty_full |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.164         0.141 0.504      7.33 2.40e-12    12  -333.  694.  752.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Exercise 15

Now, we fit a series of models, taking out one variable at a time.

At each step, we remove each remaining parameter in turn and ask which parameter’s removal results in the highest adjusted \(R^2\).

In order of removal from the full model in Exercise 14:

  • cls_profs (yields an adjusted \(R^2\) of 0.143).
  • cls_level (yields an adjusted \(R^2\) of 0.1448).
  • rank (yields an adjusted \(R^2\) of 0.145).

After these three removals, no remaining parameter can be dropped without lowering the adjusted \(R^2\).
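The manual fits below step through this process one model at a time. For reference, here is a small programmatic sketch of the same leave-one-out comparison (the helper adj_r2_without is my own name, not part of any package):

## For each candidate predictor, refit the model without it and
## record the adjusted R^2 of the reduced model.
adj_r2_without <- function(drop_var, predictors, data) {
  f <- reformulate(setdiff(predictors, drop_var), response = "score")
  linear_reg() |>
    set_engine("lm") |>
    fit(f, data = data) |>
    glance() |>
    pull(adj.r.squared)
}

preds <- c("bty_avg", "rank", "ethnicity", "gender", "language", "age",
           "cls_perc_eval", "cls_students", "cls_level",
           "cls_profs", "cls_credits")

## Adjusted R^2 after dropping each predictor from the full model;
## the top row is the best candidate for removal.
tibble(removed = preds) |>
  mutate(adj_r2 = map_dbl(removed, adj_r2_without,
                          predictors = preds, data = evals)) |>
  arrange(desc(adj_r2))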

## First reduced model, no cls_profs
linear_reg() |>
  set_engine("lm") |>
  fit(score ~ 
        bty_avg + 
        rank +
        ethnicity + 
        gender + 
        language + 
        age +
        cls_perc_eval + 
        cls_students + 
        cls_level +
        # cls_profs +
        cls_credits,
      data = evals) |>
  glance() |>
  pull(adj.r.squared)
[1] 0.1430708
## Reduced model 2, no cls_level or cls_profs
linear_reg() |>
  set_engine("lm") |>
  fit(score ~ 
        bty_avg + 
        rank +
        ethnicity + 
        gender + 
        language + 
        age +
        cls_perc_eval + 
        cls_students + 
        # cls_level +
        # cls_profs +
        cls_credits,
      data = evals) |>
  glance() |>
  pull(adj.r.squared)
[1] 0.1447586
## Reduced model 3, no cls_level or cls_profs or rank
linear_reg() |>
  set_engine("lm") |>
  fit(score ~ 
        bty_avg + 
        # rank +
        ethnicity + 
        gender + 
        language + 
        age +
        cls_perc_eval + 
        cls_students + 
        # cls_level +
        # cls_profs +
        cls_credits,
      data = evals) |>
  glance() |>
  pull(adj.r.squared)
[1] 0.1453683

The final model is below, with the corresponding linear equation:

\(\hat{y} = 3.39 + 0.062X_1 + 0.2X_2 + 0.18X_3 - 0.15X_4 - 0.005X_5 + 0.006X_6 + 0.0004X_7 + 0.52X_8\)

Where:

  • \(X_1\) is bty_avg
  • \(X_2\) is ethnicity (1 for professors who are not a minority)
  • \(X_3\) is gender (1 for male professors)
  • \(X_4\) is language (1 for professors educated at a non-English-speaking school)
  • \(X_5\) is age
  • \(X_6\) is cls_perc_eval
  • \(X_7\) is cls_students
  • \(X_8\) is cls_credits (1 for one-credit classes)

## Final model.
m_final <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ 
        bty_avg + 
        # rank +
        ethnicity + 
        gender + 
        language + 
        age +
        cls_perc_eval + 
        cls_students + 
        # cls_level +
        # cls_profs +
        cls_credits,
      data = evals)

m_final |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.160         0.145 0.503      10.8 5.46e-14     8  -334.  688.  730.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
m_final |>
  tidy() |>
  select(term, estimate)
# A tibble: 9 × 2
  term                   estimate
  <chr>                     <dbl>
1 (Intercept)            3.39    
2 bty_avg                0.0619  
3 ethnicitynot minority  0.204   
4 gendermale             0.177   
5 languagenon-english   -0.151   
6 age                   -0.00487 
7 cls_perc_eval          0.00575 
8 cls_students           0.000407
9 cls_creditsone credit  0.523   

A shorter non-tidy approach…

library(olsrr)

Attaching package: 'olsrr'
The following object is masked from 'package:datasets':

    rivers
# fit model
mod <- lm(score ~ 
        bty_avg + 
        rank +
        ethnicity + 
        gender + 
        language + 
        age +
        cls_perc_eval + 
        cls_students + 
        cls_level +
        cls_profs +
        cls_credits,
      data = evals)

# determine which predictors to eliminate
ols_step_backward_p(mod)
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted

Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted

Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted

Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted

                           Elimination Summary                            
-------------------------------------------------------------------------
        Variable                   Adj.                                      
Step     Removed     R-Square    R-Square     C(p)       AIC        RMSE     
-------------------------------------------------------------------------
   1    cls_profs      0.1635      0.1431    9.0279    692.3066    0.5035    
   2    cls_level      0.1633      0.1448    7.1373    690.4192    0.5030    
   3    rank           0.1602      0.1454    6.8068    688.1332    0.5028    
-------------------------------------------------------------------------

Exercise 16

The variable cls_credits encodes whether a class is worth multiple credits or just one. Its coefficient is positive, 0.523, corresponding to the predicted difference for one-credit classes relative to multi-credit classes. That is, all else held equal, professors of one-credit classes are predicted to have a score 0.523 higher.

The variable cls_perc_eval encodes the percentage of students who filled out the evaluation. Its coefficient is 0.00575, meaning that for every one-percentage-point increase in the share of students who filled out the evaluation, the predicted score increases by 0.00575.
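To make the cls_credits effect concrete, we can predict scores for two hypothetical classes that differ only in credits (the other input values are made up, and the factor labels are inferred from the coefficient names). The two predictions should differ by exactly 0.523.

## Two identical hypothetical classes, except for cls_credits
new_cls <- tibble(
  bty_avg = 4, ethnicity = "not minority", gender = "female",
  language = "english", age = 40, cls_perc_eval = 75,
  cls_students = 30,
  cls_credits = c("multi credit", "one credit")
)

predict(m_final, new_data = new_cls)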

Exercise 17

A high score would be associated with a professor who is:

  • male
  • not a minority
  • perceived as more attractive
  • younger
  • educated at an English-speaking school

Additionally, scores should be higher for classes that:

  • had a higher proportion of students fill out the evaluation
  • have more students enrolled
  • are worth one credit (as opposed to multiple credits)
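As an illustration, we can score a hypothetical professor matching this profile (all input values are made up, and factor labels are inferred from the coefficient names):

## Hypothetical "high score" profile under m_final
high_prof <- tibble(
  bty_avg = 8, ethnicity = "not minority", gender = "male",
  language = "english", age = 30, cls_perc_eval = 95,
  cls_students = 100, cls_credits = "one credit"
)

predict(m_final, new_data = high_prof)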

Exercise 18

The original sample of evaluations was taken from classes and professors at the University of Texas at Austin.

As always, there are constraints on generalizability. It is possible that the effects observed here are specific to the professors or students (or both) who happen to be at UT Austin. For example, maybe at other universities teacher evaluations are less impacted by the perceived attractiveness of the professor, or maybe they are more impacted. In either case, the parameter estimate here might generalize poorly, over- or underestimating the impact of perceived attractiveness on evaluations.

On the other hand, I don’t necessarily have a strong reason a priori to think that UT Austin students or professors are somehow outliers in the USA higher-education system. I would feel more comfortable generalizing if I had similar results from another US university, but as it stands, I would tentatively conclude that there’s evidence for the effects we observed, and suggest that the effect should be replicated in other universities.