00-welcome

Professor Shannon Ellis

9/28/23

Welcome to COGS 137!

Practical Data Science in R

Please take one green sticky and one pink sticky as they come around. If you’re able, try and save these. We’ll use them most classes. (But, I’ll always have extra!)

Agenda

  1. Describe what this class is
  2. Describe how the class will run
  3. Go over the tooling for this course: R, RStudio, GitHub

What is R?

R is a statistical programming language.

While R has most/all of the functionality of YFPL (your favorite programming language), it was designed for the specific use of analyzing data.

What is data science?

Data science is the scientific process of using data to answer interesting questions and/or solve important problems.

Practical Data Science in R

  • Program at the introductory level in the R statistical programming language
  • Employ the tidyverse suite of packages to interact with, wrangle, visualize, and model data
  • Explain & apply statistical concepts (estimation, linear regression, logistic regression, etc.) for data analysis
  • Communicate data science projects through effective visualization, oral presentation, and written reports

Who am I?

Shannon Ellis: Associate Teaching Professor, Mom & wife, volleyball-obsessed, and baking & cooking lover

  sellis@ucsd.edu
  shanellis.com
  MOS 0204
  Tu/Th 2-3:20PM (Lab: Fri 3-3:50PM)

Who all is involved?

Instructor: Shannon Ellis (sellis@ucsd.edu)
  Office hours: Wed 11A-12P, Virtual (see Canvas); Th 12:50-1:50, CSB 243
TA: Kunal Rustagi
  Office hours: Time TBD, Location TBD
IA: Shenova Davis
  Office hours: Time TBD, Location TBD

Course Staff

Kunal Rustagi (TA) Shenova Davis (IA)

What is this course?

Everything you want to know about the course, and everything you will need for the course will be posted at: https://cogs137.github.io/website/

  • Is this an intro CS course? No.
  • Will we be doing computing? Yes.
  • What computing language will we learn? R.
  • Is this an intro stats course? No.
  • Will we be doing stats? Yes.
  • Are there any prerequisites? Yes, an intro statistics course!

So…I don’t have to know how to program already?

Nope! The first few weeks of the course will be all about getting comfortable using the R programming language!


After that, we’ll focus on delving into interesting statistical analyses through case studies.

Course Structure and Policies

The General Plan

  • Weeks 1-4: Learn to program in the tidyverse in R
  • Weeks 5-10: Communication, Data Analysis, Statistics, & Case Studies (two Case Studies)

Note: This course is back-loaded. But, that’s when group work happens.

The Nitty Gritty

Class Meetings

  • Interactive
  • Lectures & lots of learn-by-doing
  • Bring your laptop to class every day

In-person, synchronous learning

  1. I will be teaching (so long as I’m healthy and have child care) in person.
  2. Lectures and lab will be podcast.
  3. Attendance will be incentivized using a daily participation survey.
  4. If you’re not feeling well, please stay home. I will do the same.
  5. Exam will be take-home.

The (Dreaded) Waitlist

  • Course enrollment is supposed to be 50 for this course
  • There are 72 people currently enrolled
  • I don’t control the waitlist (cogsadvising@ucsd.edu does)
  • I’d anticipate our staff adding 3-5 people from the waitlist (but cannot guarantee this)

Lab & Office Hours

  • Office hours begin week 1
    • Prof: Tu: 3:30-4:30 (drop-in); W 11-12 (10 min slots; appt.)
  • Lab begins week 1 (next Friday)
    • it’s not in a computer lab, so you’ll need to bring your own laptop
    • details about labs covered on Tues and in lab
    • typically labs will be released Monday and due Friday
  • I will hang out after class today for questions/concerns from students

Course Materials

  • Textbooks are free and available online
  • Course platforms:
    • Website : schedule, policies, due dates, etc.
    • GitHub : retrieving assignments, labs, exams, etc.
    • datahub : completing assignments, labs, exams etc.
    • Canvas : grades, course-specific links
    • Piazza : Q&A

Diversity & Inclusion:

Goal: every student is well served by this course

Philosophy: The diversity of students in this class is a huge asset to our learning community; our differences provide opportunities for learning and understanding.

Plan: Present course materials that are conscious of and respectful to diversity (gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, politics, and culture)

But… if I ever fall short or if you ever have suggestions for improvement, please do share with me! There is also an anonymous Google Form if you’re more comfortable there.

A new-ish course!

  • Offered twice previously
  • If something doesn’t make sense, tell me!
  • If you’ve got feedback/suggestions, I’m all ears!

Changes since last iteration (based on feedback):

  • spread out second half
  • likely changing the heaviness of a case study
  • add a portion on communicating to the public
  • one fewer HW assignment

How to get help

  • Lab
  • Office Hours
  • Piazza

A few (Piazza) guidelines:

1. No duplicates.
2. Public posts are best.
3. Posts should include your question, what you've tried so far, & resources used.
4. Helping others is encouraged.
5. No assignment code in public posts.
6. We're not robots.

The R Community

R Rollercoaster

Artwork by @allison_horst

Academic integrity

Don’t cheat.

Teamwork is allowed, but you should be able to answer “Yes” to each of the following:

  • Can I explain each piece of code and each analysis carried out in what I’m submitting?
  • Could I reproduce this code/analysis on my own?

The Internet is a great resource. Cite your sources.

Teamwork is not allowed on your midterm. It is open-notes and open-Google/ChatGPT. You cannot discuss the questions on the exam with anyone.

When To (Can I) Use ChatGPT/LLMs?

For anything in this course.

How To Use ChatGPT/LLMs

Probably never first or right away.

To learn: Think first. Try first. Then use external resources.

Always read/think about/understand the output.

ChatGPT: What to Avoid

  • Over-reliance (thwarts learning)
  • Having to look everything up (wastes time)
  • Leaving tasks to the last minute (can lead to bad decisions/academic integrity issues)
  • Taking the output without thinking (thwarts learning; limits critical thinking practice)
  • Using it right away for brainstorming ideas (limits ideas generated)

Course components:

  • Labs (8): Individual submission; graded on effort
  • Homework (3): Individual submission; graded on correctness
  • Exam (1): Individual completion & submission, take-home midterm
  • Case Studies (2): Team submission, technical analysis report
  • Final Project (1) : Team submission, due Tues of finals week

Grading

Your final grade will be composed of the following:

Assignment (#)             % of grade
Labs (8)                   16%
Homework (3)               32%
Midterm (1)                15%
Case Study Projects* (2)   20%
Final project* (1)         17%

* indicates group submission
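
As a worked example of how these weights combine, here is a minimal sketch in R. The component scores are made up for illustration, and the sketch assumes each component is scored out of 100 and combined as a simple weighted average; the actual grade calculation may differ in its details.

```r
# Weights from the grading breakdown above (percent of final grade)
weights <- c(labs = 16, homework = 32, midterm = 15,
             case_studies = 20, final_project = 17)

# Hypothetical component scores out of 100 (illustration only)
scores <- c(labs = 100, homework = 90, midterm = 80,
            case_studies = 85, final_project = 95)

# The weights should account for the whole grade
stopifnot(sum(weights) == 100)

# Weighted average: sum(score * weight) / sum(weight)
final_grade <- weighted.mean(scores, weights)
final_grade
```

Because homework carries the largest weight, a change in the homework score moves the final grade about twice as much as the same change in the lab score.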

Late/missed work policy

  • Homework and case study projects: accepted up to 3 days (72 hours) after the assigned deadline for a 25% deduction

  • No late deadlines for labs, the exam, or the final project

Note: Prof Ellis is a reasonable person; reach out to her if you have an extenuating circumstance at any point in the quarter.

Tooling

Datahub

Datahub is a platform hosted by UCSD that gives students access to computational resources.

This means that while you’ll be typing on your keyboard, you’ll be using UCSD’s computers in this class.

Website: https://datahub.ucsd.edu/

Launch Environment

When working on “stuff” for this course, select the COGS 137 environment.

Datahub Usage

Q: Do I have to use datahub?

A: Nope. You could download and install all the packages we use and complete the course locally! However, many packages have already been installed for you on datahub, so working locally will be a tiny bit more work up front…but you won’t be dependent on the internet/datahub!

Toolkit

  • Scriptability → R

  • Literate programming (code, narrative, output in one place) → R Markdown

  • Version control → Git / GitHub

  • The Internet (Google/ChatGPT/etc.)

R and RStudio

R & RStudio

  • R is a statistical programming language
  • RStudio is a convenient interface for R (an integrated development environment, IDE)

[DEMO]

Concepts introduced:

  • Console
  • Using R as a calculator
  • Environment
  • Loading and viewing a data frame
  • Accessing a variable in a data frame
  • R functions

Your Turn

  1. Log in to datahub
  2. Carry out a mathematical operation in the console
  3. View the airquality data frame
  4. Access a column from the airquality data frame
  5. Calculate the median for one of the numeric columns
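
One possible way through steps 2 to 5, as a rough sketch to type into the console. It uses the airquality data frame that ships with R; the arithmetic and the choice of the Temp and Ozone columns are arbitrary examples, not the only valid answers.

```r
# Step 2: use the console as a calculator
(3 + 4) * 2

# Step 3: look at the data frame (in RStudio, View(airquality) opens a viewer)
head(airquality)
dim(airquality)  # number of rows and columns

# Step 4: access a single column with the $ operator
temp <- airquality$Temp

# Step 5: median of a numeric column
temp_median <- median(temp)

# Ozone contains missing values (NA), so median() returns NA
# unless na.rm = TRUE tells it to drop them first
ozone_median <- median(airquality$Ozone, na.rm = TRUE)

temp_median
ozone_median
```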

Put a green sticky on the front of your computer when you’re done. Put a pink if you want help/have a question.

  • Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
  • As of Sept 2023, there are ~19,941 R packages available on CRAN (the Comprehensive R Archive Network).
  • We’re going to work with a small (but important) subset of these!

What is the Tidyverse?

tidyverse.org
  • The tidyverse is an opinionated collection of R packages designed for data science.
  • All packages share an underlying philosophy and a common syntax.
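
To give a feel for that common syntax, here is a small sketch using dplyr, one of the tidyverse packages. It assumes dplyr is installed (it already is on datahub, per the Datahub Usage note) and reuses the built-in airquality data; the particular summary is only an illustration.

```r
library(dplyr)

# Each verb takes a data frame and returns a data frame,
# so steps chain together with the pipe (|>)
monthly_temp <- airquality |>
  filter(!is.na(Ozone)) |>          # keep days with an ozone reading
  group_by(Month) |>                # one group per month
  summarize(mean_temp = mean(Temp), # average temperature per month
            n_days = n())           # number of days kept per month

monthly_temp
```

The same pattern (data first, one verb per step) carries over to the other tidyverse packages used in this course.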

RStudio Projects

  • Built-in functionality to keep all files for a single project organized

R Markdown

  • Fully reproducible reports – each time you knit, the document is executed from top to bottom
  • Simple markdown syntax for text
  • Code goes in chunks, delimited by three backticks; narrative goes outside of chunks
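
To make the chunk structure concrete, the sketch below builds a tiny R Markdown document from R and knits it. The file contents and title are hypothetical, and it calls knitr::knit() directly (producing Markdown) rather than the Knit button in RStudio, which goes on to render the final report; it assumes the knitr package is installed.

```r
# Three backticks delimit a code chunk; built with strrep() so they
# can be written inside this example
fence <- strrep("`", 3)

# A minimal R Markdown document: YAML header, narrative, one code chunk
rmd_lines <- c(
  "---",
  "title: \"Air quality notes\"",
  "---",
  "",
  "Narrative text lives outside of chunks.",
  "",
  paste0(fence, "{r}"),
  "median(airquality$Temp)",
  fence
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Knitting executes the chunks from top to bottom and
# weaves their output into the document
if (requireNamespace("knitr", quietly = TRUE)) {
  md_path <- knitr::knit(rmd_path, output = tempfile(fileext = ".md"),
                         quiet = TRUE)
  knitted <- readLines(md_path)
  cat(knitted, sep = "\n")
}
```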

R Markdown tips

  • Keep the R Markdown cheat sheet and Markdown Quick Reference (Help -> Markdown Quick Reference) handy; we’ll refer to them often as the course progresses

  • The workspace of your R Markdown document is separate from the Console



[DEMO]

How will we use R Markdown?

  • Every lab / midterm / project / homework / notes / etc. is an R Markdown document
  • You’ll always have a template R Markdown document to start with
  • The amount of scaffolding in the template will decrease over the quarter

Collaboration: Git & GitHub

  • The statistical programming language we’ll use is R
  • The software we use to interface with R is RStudio
  • But how do I get you the course materials that you can build on for your assignments?
    • I’m not going to email you documents, that would be a mess!

Version control

  • We introduced GitHub as a platform for collaboration
  • But it’s much more than that…
  • It’s actually designed for version control

Versioning

Lego versions

Versioning

with human readable messages

Lego versions with commit messages

Why do we need version control?

PhD Comics

Git and GitHub tips

  • Git is a version control system – like the “Track Changes” feature in Google Docs…but optimized for code. GitHub is the home for your Git-based projects on the internet – like Drive with additional features for code.
  • There are millions of git commands – ok, that’s an exaggeration, but there are a lot of them – and very few people know them all. 99% of the time you will use git to add, commit, push, and pull.
  • We will be doing Git things and interfacing with GitHub through RStudio, but if you google for help you might come across methods for doing these things in the command line – skip that and move on to the next resource unless you feel comfortable trying it out.
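
For the curious, here is a rough sketch of what add and commit do, driven from R in a throwaway folder. This is not how the course does it (we use the RStudio interface); the file name, commit message, and demo identity are all made up, and the sketch assumes git is installed and on your PATH. Push and pull appear only as comments because they need a remote repository on GitHub.

```r
# Throwaway folder standing in for a project
repo <- tempfile("demo-repo-")
dir.create(repo)

# Small helper: run a git command inside the demo folder
git <- function(...) {
  system2("git", c("-C", shQuote(repo), ...), stdout = TRUE, stderr = TRUE)
}

if (nzchar(Sys.which("git"))) {
  git("init")                                   # start tracking this folder
  writeLines("x <- 1", file.path(repo, "analysis.R"))

  git("add", "analysis.R")                      # add: stage the new file
  git("-c", "user.name=Demo", "-c", "user.email=demo@example.com",
      "commit", "-m", shQuote("Add first analysis script"))  # commit: save a version

  log_lines <- git("log", "--oneline")          # one line per saved version
  print(log_lines)

  # With a GitHub remote configured, the other two everyday commands are:
  #   git push   (send your commits to GitHub)
  #   git pull   (fetch and merge commits from GitHub)
}
```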

Let’s take a tour – Git / GitHub

We’ll cover this time permitting; you’ll see it again in lab this week.

Concepts introduced:

  • Connect an R project to a GitHub repository
  • Working with a local and remote repository
  • Committing, Pushing and Pulling

There is a bit more of GitHub that we’ll use in this class, but for today this is enough.

Getting Help

  • Trying things out
  • Understanding Documentation
  • Using ChatGPT/LLMs

Documentation

Consider ggplot2 (a package we’ll use a lot)
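
A minimal sketch of pulling up documentation from the console. It uses median() from base R so that it runs without any extra packages; the same calls should work for ggplot2 functions once that package is loaded with library(ggplot2).

```r
# Open the help page for a function (these two lines are equivalent)
?median
help("median")

# Quick look at a function's arguments without opening the help page
args(median)

# Run the examples from the bottom of the help page
example(median)
```

Help pages follow a consistent layout (Description, Usage, Arguments, Examples), so skimming Usage and Examples first is often the fastest way in.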

ChatGPT: What it could look like

Imagine: You’ve been asked to carry out a number of wrangling operations on a dataset and make a plot…

[DEMO]

Additional help

  • classmates
  • course staff (OH, Piazza, class, lab)

Recap

Can you answer these questions?

  • What is R vs RStudio?
  • What are RStudio Projects?
  • What is version control, and why do we care?
  • What is git vs GitHub (and do I need to care)?

Additional git Resources

Version Control (git and GitHub):

Slides to PDF

  1. Toggle into Print View using the Esc key (or using the Navigation Menu)
  2. Open the in-browser print dialog (CTRL/CMD+P).
  3. Change the Destination setting to Save as PDF.
  4. Change the Layout to Landscape.
  5. Change the Margins to None.
  6. Enable the Background graphics option.
  7. Click Save 🎉

Students

Who’s in this class?

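# Packages used by the roster plots in this section
library(tidyverse)       # ggplot2, dplyr, forcats
library(googlesheets4)   # read_sheet()

# Read the class roster from a Google Sheet, identified by its sheet ID
roster <- read_sheet('10kG09t5Uvjy2zLt4sToHvveBRnqYXTwfaAhPpgEFr3s')

# Number of students per college
ggplot(roster, aes(x = College)) +
  geom_bar() +
  labs(title = "COGS 137") +
  theme_bw(base_size = 14) +
  theme(plot.title.position = "plot")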

Who’s in this class?

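# Number of students per major (first two characters of Major),
# ordered from most to least common
roster |>
  mutate(major = substr(Major, 1, 2)) |>
  ggplot(aes(fct_infreq(major))) +
  geom_bar() +
  labs(title = "COGS 137",
       x = "Major") +
  theme_bw(base_size = 12) +
  theme(plot.title.position = "plot")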

Who’s in this class?

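# Number of students per class level, with JR and SR ordered first.
# "SO" is left out of fct_relevel(): it is not a level in this roster,
# and listing it only produced an "unknown level" warning.
roster |>
  ggplot(aes(fct_relevel(Level, "JR", "SR"))) +
  geom_bar() +
  labs(title = "COGS 137",
       x = "Level") +
  theme_bw(base_size = 14) +
  theme(plot.title.position = "plot")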

I’d like to know more!

(required) Student Survey: complete by Tuesday at 11:59 PM.

This is required and completion will be used for CAA/#finaid. DO complete this even if you’re on the waitlist, please.

(optional) Daily Post-Lecture Feedback

  • opportunity to reflect on learning
  • opportunity to ask questions (I will read and answer these.)
  • opportunity for extra credit on final project