14-tidymodels

Professor Shannon Ellis

2023-11-21

tidymodels

Q&A

Q: I had a question about the presentations for the final projects; since it is due during finals week, is it a live presentation in class or do we submit a video? If it is a live presentation, do we present during our designated final day/time on webreg?
A: Video submission!

Q: I also wanted to mention that the mid/pre-course extra credit surveys don’t reflect a change in grade on Canvas. (For ex., if I put a 0 or 100 for E.C., my grade stays the same.)
A: Correct - I add these in at the end. Canvas can do many things, but it doesn’t handle EC well (from what I can tell).

Q: I’m overwhelmed/confused by “the code :’) it’s quite a bit to take in”
A: Yes! It’s a lot! This is why we have group mates on the case study. I encourage everyone to sit with the code after class and then work through it together as you complete the case study!

Q: For oral fluid, you mentioned looking more into why there’s that big dip in specificity, and that we should look into it on Friday with EDA. Would that be slightly guided? I have no idea where to start with that.
A: I would make some plots that specifically look at the data/numbers there to figure out what could be leading to that drop at that particular time window.

Q: Why are specificity graphs so high?
A: Good question - this is generally because people who didn’t smoke have values very close to zero across compounds, so they will rarely be above the cutoff. That makes the cutoff very effective at identifying individuals who did not smoke.
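
To make the specificity answer concrete, here is a minimal sketch in base R. The compound values and the cutoff are made-up numbers for illustration only (they are not the case study data); the point is just that near-zero values in the non-smoking group almost never cross a cutoff.

```r
# Hypothetical compound levels (NOT the case study data)
did_not_smoke <- c(0, 0, 0.1, 0, 0.2, 0, 0, 0.05)
smoked        <- c(4.2, 0.8, 6.1, 0.3, 2.5)

# Hypothetical cutoff: at or above it we call the sample "recent use"
cutoff <- 1

# Specificity: share of people who did not smoke that fall below the cutoff
specificity <- mean(did_not_smoke < cutoff)

# Sensitivity: share of people who smoked that are at or above the cutoff
sensitivity <- mean(smoked >= cutoff)

specificity  # 1
sensitivity  # 0.6
```

Because every non-smoker value sits near zero, specificity stays at 1 for almost any reasonable cutoff, while sensitivity depends much more on where the cutoff is placed.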

Q: What is the dplyr::select notation? Is it a way to use select from dplyr without calling library() first?
A: Yes!
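
A small sketch of the `package::function` notation, assuming dplyr is installed; `mtcars` is a data set that ships with R and is used here only for illustration.

```r
# No library(dplyr) call here: pkg::fun looks the function up
# in the installed package directly.
cars_small <- dplyr::select(mtcars, mpg, cyl)

names(cars_small)  # "mpg" "cyl"
```

The same notation is also handy when two loaded packages both define a function with the same name (for example, MASS also has a `select()`), since it states explicitly which one you mean.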

Q: Also, separate topic, but do we have information on impairment so we can account for that with recent use?
A: Great question - impairment is very hard to define here. We (the researchers) have data on self-reported high and what the police officers determined, but y’all don’t have that data. So, we’re using knowledge from other studies (see 11-cs01-data notes) to understand what is known about impairment, but we are only focusing on detecting recent use here.

Q: I am unable to locate where to sign up for groups for the final project
A: This form was just released (sorry for delay). link to survey

Q: I think I need more time to digest how the code works together to produce the visuals that we saw.
A: I agree. I think I could balance and give more time in class…but I will say this is an exercise I want groups to work through together!

Course Announcements

Due Dates:

  • No class this Thursday; no lab this Friday (Happy Thanksgiving!)
  • CS01 due Monday 11/27
    • group work survey due Tues 11/28

Notes:

  • Be sure you watch the video from last Thursday on Canvas
  • Any questions about CS01?

Agenda

  • machine learning intro
  • (re)introduce tidymodels
  • worked example: ML in tidymodels

Suggested Resources

  • The package itself has some worked examples: https://www.tidymodels.org/start/models/
  • There’s a whole book (written by the developer of tidymodels) that covers the tidymodels package: https://www.tmwr.org/

tidymodels: philosophy

“Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: pre-processing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret. The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. We are working towards higher-level APIs that can replicate and extend what the current model packages can do.” - Max Kuhn (tidymodels developer)

Benefits:

  1. Standardized workflow/format/notation across different types of machine learning algorithms
  2. Can easily modify pre-processing, algorithm choice, and hyper-parameter tuning, which makes optimization easy
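
The standardized notation in the benefits above can be sketched with parsnip. This is an illustrative example, not code from the lecture: the `mtcars` data, the formula, and the object names are assumptions chosen to keep it self-contained, and it assumes the parsnip and rpart packages are installed.

```r
library(parsnip)

# Same outcome and predictors for both models (mtcars ships with R)
model_formula <- mpg ~ wt + hp

# Algorithm 1: linear regression, fit with the "lm" engine
lm_spec <- linear_reg() |> set_engine("lm")
lm_fit  <- fit(lm_spec, model_formula, data = mtcars)

# Algorithm 2: a decision tree, fit with the "rpart" engine.
# Only the model specification changes; fit() and predict() stay the same.
tree_spec <- decision_tree(mode = "regression") |> set_engine("rpart")
tree_fit  <- fit(tree_spec, model_formula, data = mtcars)

# Both return a tibble with a single .pred column
lm_preds   <- predict(lm_fit, new_data = mtcars)
tree_preds <- predict(tree_fit, new_data = mtcars)
```

Swapping algorithms means editing the specification lines only, which is what makes trying several models comparatively painless.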

tidymodels: ecosystem

The main packages (and their roles):

Machine Learning: intro

In intro stats, you should have learned the central dogma of statistics: we sample from a population.

The data from the sample are used to make an inference about the population:

For prediction, we have a similar sampling problem:

But now we are trying to build a rule that can be used to predict a single observation’s value of some characteristic using characteristics of the other observations.

ML: the goal

The goal is to build a machine learning algorithm that uses features as input and predicts an outcome variable in the situation where we do not know the outcome variable.

Classic ML

Typically, you use data where you have both the input and output data to train a machine learning algorithm.

What you need:

  1. A data set to train from.
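  2. An algorithm or set of algorithms you can use to try candidate prediction functions \(f\), where \(\hat{Y} = f(X)\) is the predicted outcome for features \(X\).
  3. A distance metric \(d\) for measuring how close the true outcome \(Y\) is to the prediction \(\hat{Y}\).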
  4. A definition of what a “good” distance is.
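
The four ingredients above can be sketched in base R. Everything specific here is an assumption made for illustration: the `mtcars` data, linear regression as the algorithm, RMSE as the distance metric, and the threshold of 3 as the definition of "good".

```r
set.seed(123)  # so the random split is reproducible

# 1. A data set to train from (mtcars ships with R); hold some rows out
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# 2. An algorithm that estimates f: here, linear regression
f_hat <- lm(mpg ~ wt + hp, data = train)

# 3. A distance metric d between Y and Y-hat: root mean squared error
y     <- test$mpg
y_hat <- predict(f_hat, newdata = test)
rmse_value <- sqrt(mean((y - y_hat)^2))

# 4. A definition of a "good" distance (hypothetical threshold, in mpg)
good_enough <- rmse_value < 3
```

Measuring the distance on held-out rows rather than the training rows is what tells you how the rule behaves when the outcome is not already known.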

tidymodels for ML

How these packages fit together for carrying out machine learning:

tidymodels: steps
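
One common ordering of the steps, sketched end to end. This is not the worked example from class: the `mtcars` data, the formula, and all object names are assumptions for illustration, and it assumes the tidymodels meta-package is installed.

```r
library(tidymodels)

set.seed(123)  # reproducible split

# 1. rsample: split the data into training and testing sets
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# 2. recipes: declare the outcome, predictors, and pre-processing
car_recipe <- recipe(mpg ~ wt + hp, data = car_train) |>
  step_normalize(all_numeric_predictors())

# 3. parsnip: specify the model and its engine
car_model <- linear_reg() |> set_engine("lm")

# 4. workflows: bundle the recipe and the model together
car_wflow <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(car_model)

# 5. Fit the workflow on the training data
car_fit <- fit(car_wflow, data = car_train)

# 6. yardstick: predict on the test data and measure performance
car_preds <- predict(car_fit, new_data = car_test) |>
  bind_cols(car_test)
car_rmse <- rmse(car_preds, truth = mpg, estimate = .pred)
```

Each numbered comment maps to one package, so changing the pre-processing or the algorithm means touching only step 2 or step 3.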

Recap

  • Can you describe the basics of machine learning?
  • Can you describe the goals of and general steps in tidymodels?