19-wrap-up

Professor Shannon Ellis

12/6/23

Wrap-up

Q&A

Q: Are we required to try models other than linear models/random forest if we are doing cs02 as our final project?
A: No. In fact, I think you could just do the random forest model (and not discuss the linear regression models). Using additional models would be a good extension. Of course, you’d also need to consider some outside dataset.

Q: How much workload are you expecting for the final project compared to CS01? Our group spent like at least 14 hours on CS01 when your expectation was to spend 4-6 hours if I remember correctly. Should I expect to spend a similar amount of time for the final project or more?
A: Historically, groups have spent less time on the final project relative to the case studies b/c they choose more straightforward datasets/questions.

Q: I am still curious about presentation styles, and what is the most effective manner. In my time taking courses in the COGS and DSC departments at UCSD, I have noticed the usage of emojis a lot in programming assignments (esp. Jupyter notebook) but recently got some mixed opinions of emoji use in my Jupyter notebook by someone looking over a personal project. I am curious to know what the conventions are, and to delve deeper into presenting things catered to a specific audience. Additionally, I was wondering if there are any resources on how to make a data science portfolio/website for graduate school and internship/job applications.
A: We’ll discuss a bit about the second question soon. As for the first, my response is that it depends on your audience and the setting. If it’s a very serious/stuffy conference, maybe be more formal (fewer emojis), but in data science, presentations are typically more casual/fun (relative to other fields), so I’d say do what you’re comfortable with.

Course Announcements

  • Please fill out your SET course evaluations (due Sat 12/9 at 8AM)
  • Final Project due Tues 12/12 at 11:59 PM
    • .Rmd (report/slides)
    • Presentation (recording; submit on Canvas)
    • General Communication
    • group work survey (due Wednesday)
  • Post-course survey “due” next Wednesday (for EC)

Final Project Details

  • Data Analysis option
    • if wrangling not needed…don’t make wrangling up
    • want you to demonstrate your skills across the final project
  • Presentation: at the level of a COGS 137 student
    • pre-recorded
    • the time limit matters
    • probably best to reference the effective communication notes
  • General communication: to a non-technical audience
    • for a technical presentation, likely best to think of it as an “ad” for the package/statistical approach

Final Project

  • Who has a plan for what they’re doing for the final project?
  • What questions do you have about the final project?

Open Q&A

What questions do you have about data science, stats, R, jobs/internships, life, analysis, communication, my life/opinions, etc.?

Where is R used?

  • R is used by data scientists
  • particularly popular in certain fields: (bio)statistics, biology, economics, psychology, finance, healthcare, business analytics, government/public policy, data journalism, education, etc.
  • It is less popular than Python
  • Really great for: data wrangling, visualization, and modelling

Next Steps in R

  • Interactive Visualization
  • Package Development
  • Books, Slides, and Personal websites
  • Shiny Apps

Note: Any packages described today ARE allowed to be used for the final project, if you’re going the technical presentation route.

Packages

library(ggplot2)
library(plotly)
library(gganimate)
library(gapminder) # the dataset being used

The Data: Gapminder

The gapminder visualization was made famous by Hans Rosling. The dataset used here includes life expectancy, population, and GDP across 142 countries and 5 continents from 1952-2007.
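
As a quick check of the description above, here is a minimal sketch that summarizes the `gapminder` data frame shipped with the gapminder package. It assumes dplyr is installed; the name `gapminder_summary` is just an illustrative choice.

```r
library(dplyr)
library(gapminder)

# Count the distinct countries and continents and find the year range
gapminder_summary <- gapminder |>
  summarize(
    n_countries  = n_distinct(country),
    n_continents = n_distinct(continent),
    first_year   = min(year),
    last_year    = max(year)
  )

gapminder_summary
```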

Interactive Viz

plotly, gganimate, and r2d3

plotly

  • wrapper around ggplot plots: ggplotly()
  • when it works, it works
  • less control over specifics
p <- gapminder |>
  filter(year == 1977) |>
  ggplot(aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  theme_bw()

p <- ggplotly(p)
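
One place where you can regain some control is the hover label. The sketch below is one possible approach, not the only one: it maps a `text` aesthetic in ggplot2 and asks `ggplotly()` to show only that via `tooltip = "text"`. The log-scaled x-axis and the names `p_labeled` and `p_interactive` are illustrative additions, not part of the slide above.

```r
library(dplyr)
library(ggplot2)
library(plotly)
library(gapminder)

# Build the static plot, adding a `text` aesthetic used only for hover labels
p_labeled <- gapminder |>
  dplyr::filter(year == 1977) |>
  ggplot(aes(gdpPercap, lifeExp, size = pop, color = continent,
             text = paste0(country, "\nLife expectancy: ", lifeExp))) +
  geom_point() +
  scale_x_log10() +  # assumption: log scale added for readability
  theme_bw()

# Convert to an interactive plot; show only the custom text on hover
p_interactive <- ggplotly(p_labeled, tooltip = "text")
p_interactive
```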

plotly

gg <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent, frame = year)) +
  geom_point() +
  theme_bw()

gg <- ggplotly(gg) |> 
  highlight("plotly_hover")

gganimate

Extends grammar of graphics for use in animation:

  • transition_*() defines how the data should be spread out and how it relates to itself across time.
  • view_*() defines how the positional scales should change along the animation.
  • shadow_*() defines how data from other points in time should be presented in the given point in time.
  • enter_*()/exit_*() defines how new data should appear and how old data should disappear during the course of the animation.
  • ease_aes() defines how different aesthetics should be eased during transitions.

Source: https://gganimate.com/

  • more control
  • slower to render
  • generates GIFs

For example…

gg <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  theme_bw() +
  # gganimate-specific bits (transition_time() handles the frames, so no frame aesthetic is needed)
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')
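
To actually produce the GIF, the animation has to be rendered and saved. The following is a minimal sketch, assuming the gifski package is installed for the GIF renderer. The frame count, frame rate, image size, and output path are arbitrary choices for illustration.

```r
library(ggplot2)
library(gganimate)
library(gapminder)

# Same kind of plot as above, rebuilt here so the block stands alone
anim <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  theme_bw() +
  labs(title = "Year: {frame_time}", x = "GDP per capita", y = "Life expectancy") +
  transition_time(year) +
  ease_aes("linear")

# Render to a GIF; fewer frames render faster but look choppier
rendered <- animate(anim, nframes = 24, fps = 8, width = 480, height = 360,
                    renderer = gifski_renderer())

# Save the rendered animation to a temporary file (path is an assumption)
out_file <- file.path(tempdir(), "gapminder.gif")
anim_save(out_file, animation = rendered)
```

Rendering time grows with `nframes`, which is the main reason gganimate feels slower than plotly.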

r2d3

  • D3 is a JavaScript library for producing viz for HTML
  • able to use custom D3 Visualizations within R
  • create D3.js scripts
  • call them from RMarkdown/Shiny/etc.
  • Example here
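
As a rough sketch of that workflow, the block below writes a small D3 script to a temporary file and calls it from R with `r2d3()`. The bar-chart script follows the pattern used in the r2d3 documentation, where `svg`, `data`, `width`, and `height` are provided to the script by r2d3. The file name and data values are placeholders.

```r
library(r2d3)

# A tiny D3 script: one horizontal bar per data value
d3_script <- "
var barHeight = Math.floor(height / data.length);

svg.selectAll('rect')
  .data(data)
  .enter().append('rect')
    .attr('width', function(d) { return d * width; })
    .attr('height', barHeight)
    .attr('y', function(d, i) { return i * barHeight; })
    .attr('fill', 'steelblue');
"

# Write the script to a temporary .js file (hypothetical file name)
script_path <- file.path(tempdir(), "barchart.js")
writeLines(d3_script, script_path)

# Pass R data to the D3 script; the result is an htmlwidget
bars <- r2d3(data = c(0.3, 0.6, 0.8, 0.95, 0.40, 0.20), script = script_path)
bars
```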

Package Development

Why develop an R package?

  • reproducibility
  • include data + code
  • organize a project
  • tools needed: devtools and usethis
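
To give a feel for those tools, here is a minimal sketch of starting a package in a temporary directory. The package name `demopkg` and the function `to_percent()` are hypothetical; in practice you would pick your own path and typically work from within the new project.

```r
library(usethis)
library(devtools)

# Create a package skeleton (DESCRIPTION, NAMESPACE, R/) in a temp directory
pkg_path <- file.path(tempdir(), "demopkg")
create_package(pkg_path, open = FALSE)

# Add one documented function to the package's R/ directory
writeLines(c(
  "#' Convert a fraction to a percentage",
  "#'",
  "#' @param x A numeric vector of fractions.",
  "#' @return A numeric vector of percentages.",
  "#' @export",
  "to_percent <- function(x) {",
  "  x * 100",
  "}"
), file.path(pkg_path, "R", "to_percent.R"))

# Generate help files and NAMESPACE from the roxygen comments
document(pkg_path)

# Load the package as if it were installed, then try the function
load_all(pkg_path)
to_percent(0.25)
```

From there, `devtools::check()` runs the standard package checks, and `usethis::use_data()` can bundle a dataset alongside the code.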

Other Package Suggestions

  • Dataviz: a handful of packages extend the functionality of ggplot2
  • Good Tables: gt, formattable, and reactable
  • Modelling: we only really used the tidymodels and broom packages, so the other packages in tidymodels are options
  • There are lots of packages for doing machine learning. caret is a precursor to tidymodels and good for this
  • Webscraping: rvest
  • Sports: nwslR, baseballr, NFL
  • A whole lot more curated here
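
As one small illustration of combining two of these suggestions, the sketch below fits a simple linear model to the gapminder data, tidies the coefficients with broom, and formats them with gt. The model formula and the table title are arbitrary choices for demonstration, not a recommended analysis.

```r
library(dplyr)
library(broom)
library(gt)
library(gapminder)

# Fit a simple linear model on the 2007 data (illustrative model only)
fit <- lm(lifeExp ~ log10(gdpPercap) + continent,
          data = dplyr::filter(gapminder, year == 2007))

# One row per coefficient, with confidence intervals
coef_table <- tidy(fit, conf.int = TRUE)

# Turn the tidy data frame into a formatted table
gt_tbl <- coef_table |>
  gt() |>
  fmt_number(
    columns = c(estimate, std.error, statistic, p.value, conf.low, conf.high),
    decimals = 3
  ) |>
  tab_header(title = "Life expectancy model, 2007")

gt_tbl
```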

Books, Slides and Personal Websites

bookdown, xaringan, blogdown, and quarto

An ode to Yihui Xie

Books: bookdown

An R package by Yihui Xie to write online books, with the philosophy that it “should be technically easy to write a book, visually pleasant to view the book, fun to interact with the book, convenient to navigate through the book, straightforward for readers to contribute or leave feedback to the book author(s), and more importantly, authors should not always be distracted by typesetting details”

Slides: xaringan

An RMarkdown extension (based on JS library remark.js) to generate slides from .Rmd documents.

Websites: blogdown

Enables personal website creation using R Markdown and Hugo (or Jekyll)

Quarto

An open-source scientific and technical publishing system built on Pandoc

  • outputs: HTML, PDF, MS Word, ePub, etc.
  • language-agnostic
  • allows for multiple programming languages in a single document
  • Website: Quarto

Shiny Apps

Shiny

Shiny is an R package that allows you to build interactive web apps directly from R (originally created by Joe Cheng at RStudio; Winston Chang is a core developer and the package maintainer)
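
To make that concrete, here is a minimal sketch of a Shiny app built on the gapminder data: a year slider drives a scatterplot. The input/output IDs (`year`, `scatter`), the app title, and the log-scaled axis are illustrative choices.

```r
library(shiny)
library(dplyr)
library(ggplot2)
library(gapminder)

# UI: a slider to pick the year and a placeholder for the plot
ui <- fluidPage(
  titlePanel("Gapminder explorer"),
  sliderInput("year", "Year", min = 1952, max = 2007,
              value = 1977, step = 5, sep = ""),
  plotOutput("scatter")
)

# Server: redraw the plot whenever the slider changes
server <- function(input, output, session) {
  output$scatter <- renderPlot({
    gapminder |>
      dplyr::filter(year == input$year) |>
      ggplot(aes(gdpPercap, lifeExp, size = pop, color = continent)) +
      geom_point() +
      scale_x_log10() +
      theme_bw()
  })
}

app <- shinyApp(ui, server)

# Uncomment to launch the app locally
# runApp(app)
```

The slider steps by 5 because the gapminder data are recorded every five years.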

Quarto Dashboards

In the upcoming release of Quarto, dashboards will be even simpler to generate (currently available in pre-release)…here

Tidy Tuesday

An online community that works with a new dataset every week. You could continue your R practice. There is a Twitter hashtag to share your work: #TidyTuesday

Note: your first midterm dataset came from Tidy Tuesday.

These are also options for portfolios/personal projects…

What’s a DS portfolio?

A public showcase of your work!

Kaggle is a great place to get practice, but not necessarily for personal projects for your portfolio

…b/c literally millions of other people have already worked with the data/done the project.

You want your portfolio to 1) demonstrate your skills and 2) set you apart

DS Portfolio Examples

Your Turn: Get started on one of these…

Always wanted a personal website? Get Started with blogdown! Have a data-centric app you want to share with the world? Shiny it up! Have slides that need to be created for a final project? Give xaringan a go! Have a visualization that needs animation? Make it move!

The Wrap Up

COGS 137: Where We’ve Been

  • R, RMarkdown & RStudio
  • Data Wrangling w/ the tidyverse
  • Dataviz w/ ggplot2
  • CS01: Biomarkers of Recent THC Use (Inference)
  • CS02: Predicting Air Pollution (ML)
  • Next Steps in R: Shiny, bookdown, blogdown, plotly/gganimate

COGS 137: A Semi-New Course

Lots of thanks!

  • course staff! (Kunal & Shenova - feedback, grading, labs, office hours, etc.)
  • all of you
  • Mine Çetinkaya-Rundel, Open Case Studies Team, Posit (RStudio, quarto & tidyverse teams)
  • Sean Kross & Prof Drew Walker

Good Luck on Finals, Get Sleep, Be Safe, Drink Water, Take Care of Yourselves, & Have a Wonderful Winter Break!