library(ggplot2)
library(plotly)
library(gganimate)
library(gapminder) # the dataset being used
Wrap-up
Q&A
Q: Are we required to try models other than linear models/random forest if we are doing cs02 as our final project?
A: No. In fact, I think you could just do the random forest model (and not discuss the linear regression models). Using additional models would be a good extension. Of course, you’d also need to consider some outside dataset as well.
Q: How much workload are you expecting for the final project compared to CS01? Our group spent like at least 14 hours on CS01 when your expectation was to spend 4-6 hours if I remember correctly. Should I expect to spend a similar amount of time for the final project or more?
A: Historically, groups have spent less time on the final relative to the case studies b/c they choose more straightforward datasets/questions.
Q: I am still curious about presentation styles, and what is the most effective manner. In my time taking courses in the COGS and DSC departments at UCSD, I have noticed the usage of emojis a lot in programming assignments (esp. Jupyter notebook) but recently got some mixed opinions of emoji use in my Jupyter notebook by someone looking over a personal project. I am curious to know what the conventions are, and to delve deeper into presenting things catered to a specific audience. Additionally, I was wondering if there are any resources on how to make a data science portfolio/website for graduate school and internship/job applications.
A: We’ll discuss a bit about the second question soon. As for the first, my response is that it depends on your audience and the setting. If it’s a very serious/stuffy conference, maybe be more formal (fewer emojis)…but in data science, presentations are typically more casual/fun (relative to other fields), so I’d say do what you’re comfortable with.
Course Announcements
- Please fill out your SET course evaluations (due Sat 12/9 at 8AM)
- Final Project due Tues 12/12 at 11:59 PM
- .Rmd (report/slides)
- Presentation (recording; submit on Canvas)
- General Communication
- group work survey (due Wednesday)
- Post-course survey “due” next Wednesday (for EC)
Final Project Details
- Data Analysis option
- if wrangling not needed…don’t make wrangling up
- want you to demonstrate your skills across the final project
- Presentation: at the level of a COGS 137 student
- pre-recorded
- the time limit matters
- probably best to reference the effective communication notes
- General communication: to a non-technical audience
- for a technical presentation, likely best to think of it as an “ad” for the package/statistical approach
Final Project
- Who has a plan for what they’re doing for the final project?
- What questions do you have about the final project?
Open Q&A
What questions do you have about data science, stats, R, jobs/internships, life, analysis, communication, my life/opinions, etc.?
Where is R used?
- R is used by data scientists
- particularly popular in certain fields: (bio)statistics, biology, economics, psychology, finance, healthcare, business analytics, government/public policy, data journalism, education, etc.
- It is less popular than Python
- Really great for: data wrangling, visualization, and modelling
Next Steps in R
- Interactive Visualization
- Package Development
- Books, Slides, and Personal websites
- Shiny Apps
. . .
Note: Any packages described today ARE allowed to be used for the final project, if you’re going the technical presentation route.
Packages
The Data: Gapminder
The gapminder visualization was made famous by Hans Rosling. The dataset used here includes life expectancy, population, and GDP across 142 countries and 5 continents from 1952-2007.
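As a quick sanity check on that description, here is a minimal sketch that counts the countries, continents, and years in the `gapminder` data frame shipped with the gapminder package. The variable names (`n_countries`, etc.) are just illustrative.

```r
library(gapminder)

# Count distinct countries and continents, and find the span of years covered
n_countries  <- length(unique(gapminder$country))
n_continents <- length(unique(gapminder$continent))
year_range   <- range(gapminder$year)

# Look at the column names and types before plotting
str(gapminder)
```

The columns used in the plots that follow are `gdpPercap`, `lifeExp`, `pop`, `continent`, and `year`.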
Interactive Viz
plotly, gganimate, and r2d3
plotly
- wrapper around ggplot plots: ggplotly()
- when it works, it works
- less control over specifics
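p <- gapminder |>
  dplyr::filter(year == 1977) |> # filter() is dplyr's; dplyr is not attached above, so call it explicitly
  ggplot(aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  theme_bw()
p <- ggplotly(p) # convert the static ggplot into an interactive plotly widget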
plotly
gg <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent, frame = year)) +
  geom_point() +
  theme_bw()
gg <- ggplotly(gg) |>
  highlight("plotly_hover")
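When ggplotly() does not give enough control, one alternative is to build the chart with plotly's own plot_ly() interface. The sketch below is one possible version of the animated Gapminder chart, not the approach used in these notes; the log-scaled x-axis, axis titles, and the 1000 ms frame duration are assumptions chosen for illustration.

```r
library(plotly)
library(gapminder)

# Build the animated scatterplot directly with plot_ly();
# `frame = ~year` adds the play button and year slider
fig <- plot_ly(
  data = gapminder,
  x = ~gdpPercap, y = ~lifeExp,
  size = ~pop, color = ~continent,
  frame = ~year,
  text = ~country, hoverinfo = "text",
  type = "scatter", mode = "markers"
) |>
  layout(
    xaxis = list(title = "GDP per capita", type = "log"), # log scale is an assumption
    yaxis = list(title = "Life expectancy")
  ) |>
  animation_opts(frame = 1000, easing = "linear") # 1000 ms per frame (assumed value)

# In an interactive session or a rendered document, print `fig` to display it
```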
gganimate
Extends grammar of graphics for use in animation:
- transition_*() defines how the data should be spread out and how it relates to itself across time.
- view_*() defines how the positional scales should change along the animation.
- shadow_*() defines how data from other points in time should be presented in the given point in time.
- enter_*()/exit_*() defines how new data should appear and how old data should disappear during the course of the animation.
- ease_aes() defines how different aesthetics should be eased during transitions.
Source: https://gganimate.com/
. . .
- more control
- slower to render
- generates GIFs
. . .
For example…
gg <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent, frame = year)) +
  geom_point() +
  theme_bw() +
  # gganimate-specific bits
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')
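Since rendering is the slow part, it can help to control it explicitly. The following is a minimal sketch of rendering a plot like the one above to a GIF and saving it. It assumes the gifski package is installed for gifski_renderer(); the frame count, frame rate, image size, output file name, and the added scale_x_log10() are illustrative choices, not part of the original example.

```r
library(ggplot2)
library(gganimate)
library(gapminder)

anim_plot <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() + # assumption: log scale spreads out the low-GDP countries
  theme_bw() +
  labs(title = "Year: {frame_time}", x = "GDP per capita", y = "Life expectancy") +
  transition_time(year) +
  ease_aes("linear")

# Fewer frames render faster; more frames give smoother motion
anim <- animate(
  anim_plot,
  nframes = 50, fps = 10,
  width = 600, height = 400,
  renderer = gifski_renderer()
)

# Write the GIF to a temporary location (hypothetical file name)
gif_path <- file.path(tempdir(), "gapminder.gif")
anim_save(gif_path, animation = anim)
```

While iterating on a plot, a small `nframes` keeps the feedback loop short; raise it for the final render.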
r2d3
- D3 is a JavaScript library for producing visualizations for HTML
- able to use custom D3 Visualizations within R
- create D3.js scripts
- call them from RMarkdown/Shiny/etc.
- Example here
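The two steps above (write a D3 script, then call it from R) can be sketched as follows. This is a minimal bar-chart sketch in the spirit of the r2d3 introductory example: the script is written to a temporary file only so the example is self-contained, and the data values are made up. Inside the script, r2d3 supplies the `svg`, `data`, `width`, and `height` variables.

```r
library(r2d3)

# Write a small D3 script to a temporary .js file
d3_script <- tempfile(fileext = ".js")
writeLines(c(
  "// r2d3 provides `svg`, `data`, `width`, and `height`",
  "var barHeight = Math.floor(height / data.length);",
  "svg.selectAll('rect')",
  "  .data(data)",
  "  .enter().append('rect')",
  "    .attr('width', function(d) { return d * width; })",
  "    .attr('height', barHeight)",
  "    .attr('y', function(d, i) { return i * barHeight; })",
  "    .attr('fill', 'steelblue');"
), d3_script)

# Pass R data to the script; each value becomes one horizontal bar
viz <- r2d3(data = c(0.3, 0.6, 0.8, 0.95, 0.40, 0.20), script = d3_script)

# In an interactive session or an R Markdown chunk, print `viz` to display it
```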
Package Development
Why develop an R package?
- reproducibility
- include data + code
- organize a project
- tools needed: devtools and usethis
- Book: R Packages, by Jenny Bryan and Hadley Wickham
- Blogpost: Writing an R Package from Scratch, by Hilary Parker (Disclaimer: this is from 2014 and does not implement usethis)
- Blogpost: Your First Package in 1 hour, by Shannon Pileggi for R-Ladies Philly in 2020
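To give a feel for how devtools and usethis fit together, here is a rough sketch of a first pass at a package. The package name (`gapminderhelpers`), the function (`total_gdp`), and the temporary location are all hypothetical, and the resources above cover the full workflow (for example, usethis::use_r() for creating files in R/ interactively).

```r
library(usethis)
library(devtools)

# Hypothetical package name; a temporary directory keeps the sketch disposable
pkg_path <- file.path(tempdir(), "gapminderhelpers")

# Create the package skeleton: DESCRIPTION, NAMESPACE, and an R/ directory
create_package(pkg_path, open = FALSE)

# Add one documented function (roxygen2 comments start with #')
writeLines(c(
  "#' Total GDP from per-capita GDP and population",
  "#'",
  "#' @param gdp_percap GDP per capita.",
  "#' @param pop Population.",
  "#' @return Total GDP.",
  "#' @export",
  "total_gdp <- function(gdp_percap, pop) {",
  "  gdp_percap * pop",
  "}"
), file.path(pkg_path, "R", "total_gdp.R"))

document(pkg_path) # turn the roxygen comments into man/ pages and NAMESPACE entries
load_all(pkg_path) # load the package as if it were installed

total_gdp(2, 10)

# check(pkg_path)  # full R CMD check; slower, worth running before sharing
```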
Other Package Suggestions
- Dataviz: handful of packages that extend the functionality of ggplot2; Good Tables: gt, formattable, and reactable
- Modelling: we only really used the tidymodels and broom packages, so the other packages in tidymodels are options
- There are lots of packages for doing machine learning. caret is a precursor to tidymodels and good for this
- Webscraping: rvest
- Sports: nwslR, baseballr, NFL
- A whole lot more curated here
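As one example from the list, here is a small sketch of a formatted table with gt, using a slice of the Gapminder data. The choice of rows (Oceania in 2007), the title, and the column labels are illustrative.

```r
library(gt)
library(gapminder)

# Keep a small slice of the data so the table stays readable
latest <- subset(
  gapminder,
  year == 2007 & continent == "Oceania",
  select = c(country, lifeExp, pop, gdpPercap)
)

tbl <- gt(latest) |>
  tab_header(title = "Oceania in 2007", subtitle = "Source: gapminder") |>
  fmt_number(columns = c(lifeExp, gdpPercap), decimals = 1) |>
  fmt_number(columns = pop, decimals = 0) |>
  cols_label(
    country = "Country",
    lifeExp = "Life expectancy",
    pop = "Population",
    gdpPercap = "GDP per capita"
  )

# In an interactive session or a rendered document, print `tbl` to display it
```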
Books, Slides and Personal Websites
bookdown, xaringan, blogdown, and quarto
An ode to Yihui Xie
Books: bookdown
An R package by Yihui Xie to write online books, with the philosophy that it “should be technically easy to write a book, visually pleasant to view the book, fun to interact with the book, convenient to navigate through the book, straightforward for readers to contribute or leave feedback to the book author(s), and more importantly, authors should not always be distracted by typesetting details”
- Book: bookdown: Authoring Books and Technical Documents with R Markdown, by Yihui Xie
- bookdown gallery
- Example: What they forgot to teach you about R, by Jenny Bryan and Jim Hester
Slides: xaringan
An RMarkdown extension (based on JS library remark.js) to generate slides from .Rmd documents.
- Book Chapter: xaringan Presentations
- Slide Show: Meet Xaringan, by Alison Hill
Websites: blogdown
Enables personal website creation using R Markdown and Hugo (or Jekyll)
- Book: blogdown: Creating websites with R Markdown, by Yihui Xie, Amber Thomas, and Alison Presmanes Hill
- Blogpost: Up & running with blogdown in 2021, by Alison Presmanes Hill
- Some examples: Alison Hill, Yihui, Prof
Quarto
an open-source scientific and technical publishing system built on Pandoc
- outputs: HTML, PDF, MS Word, ePub, etc.
- language-agnostic
- allows for multiple programming languages in a single document
- Website: Quarto
Class notes, slides, and website for COGS 137 were all built utilizing quarto. Generating slides for technical presentation using quarto is an option for the final project. (.qmd rather than .Rmd)
Shiny Apps
Shiny
Shiny is an R package that allows you to build interactive web apps directly from R (initially developed by Winston Chang)
- Website: Shiny
- Examples: Freedom of the Press Index, COVID-19 Tracker, and recount
- How-To: How To Build a Shiny app
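To make the idea concrete, here is a minimal sketch of a Shiny app built on the Gapminder data: a slider picks the year and the scatterplot redraws in response. The input and output IDs (`year`, `scatter`), the title, and the log-scaled x-axis are illustrative choices.

```r
library(shiny)
library(ggplot2)
library(gapminder)

# UI: a year slider (the data come in 5-year steps) and a plot
ui <- fluidPage(
  titlePanel("Gapminder explorer"),
  sliderInput("year", "Year", min = 1952, max = 2007, value = 1977, step = 5, sep = ""),
  plotOutput("scatter")
)

# Server: re-run the plot whenever input$year changes
server <- function(input, output, session) {
  output$scatter <- renderPlot({
    year_data <- gapminder[gapminder$year == input$year, ]
    ggplot(year_data, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
      geom_point() +
      scale_x_log10() +
      theme_bw()
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch only when running interactively (e.g., from the RStudio console)
if (interactive()) runApp(app)
```

The same two-part structure (a `ui` describing inputs and outputs, a `server` function connecting them) scales up to much larger apps.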
Quarto Dashboards
In the upcoming release of quarto, dashboards will be even simpler to generate…(currently available in pre-release)…here
Tidy Tuesday
An online community that works with a new dataset every week. You could continue your R practice. There is a Twitter hashtag to share your work: #TidyTuesday
Note: your first midterm dataset came from Tidy Tuesday.
. . .
These are also options for portfolios/personal projects…
What’s a DS portfolio?
A public showcase of your work!
. . .
Kaggle is a great place to get practice, but not necessarily for personal projects for your portfolio
. . .
…b/c literally millions of other people have already worked with the data/done the project.
. . .
You want your portfolio to 1) demonstrate your skills and 2) set you apart
DS Portfolio Examples
Your Turn: Get started on one of these…
- Always wanted a personal website? Get started with blogdown!
- Have a data-centric app you want to share with the world? Shiny it up!
- Have slides that need to be created for a final project? Give xaringan a go!
- Have a visualization that needs animation? Make it move!
The Wrap Up
COGS 137: Where We’ve Been
- R, RMarkdown & RStudio
- Data Wrangling w/ the tidyverse
- Dataviz w/ ggplot2
- CS01: Biomarkers of Recent THC Use (Inference)
- CS02: Predicting Air Pollution (ML)
- Next Steps in R: Shiny, bookdown, blogdown, plotly/gganimate
COGS 137: A Semi-New Course
Lots of thanks!
- course staff! (Kunal & Shenova - feedback, grading, labs, office hours, etc.)
- all of you
- Mine Çetinkaya-Rundel, Open Case Studies Team, Posit (RStudio, quarto & tidyverse teams)
- Sean Kross & Prof Drew Walker