2023-09-28
Practical Data Science in R
Please take one green sticky and one pink sticky as they come around. If you’re able, try to save these. We’ll use them most classes. (But I’ll always have extras!)
R is a statistical programming language.
While R has most/all of the functionality of YFPL (your favorite programming language), it was designed for the specific use of analyzing data.
Data science is the scientific process of using data to answer interesting questions and/or solve important problems.
Shannon Ellis: Associate Teaching Professor, Mom & wife, volleyball-obsessed, and baking & cooking lover
sellis@ucsd.edu
shanellis.com
MOS 0204
Tu/Th 2-3:20PM (Lab: Fri 3-3:50PM)
Role | Name | Email | Office Hours | Location |
---|---|---|---|---|
Instructor | Shannon Ellis | sellis@ucsd.edu | Wed 11A-12P | Virtual (see canvas) |
 | | | Th 12:50-1:50 | CSB 243 |
TA | Kunal Rustagi | | Time TBD | Location TBD |
IAs | Shenova Davis | | Time TBD | Location TBD |
Everything you want to know about the course, and everything you will need for the course will be posted at: https://cogs137.github.io/website/
Nope! The first few weeks of the course will be all about getting comfortable using the R programming language!
After that, we’ll focus on delving into interesting statistical analyses through case studies.
Artwork by @allison_horst
Note: This course is back-loaded. But, that’s when group work happens.
Class Meetings
In-person, synchronous learning
The (Dreaded) Waitlist
Lab & Office Hours
Course Materials
Goal: every student is well served by this course
Philosophy: The diversity of students in this class is a huge asset to our learning community; our differences provide opportunities for learning and understanding.
Plan: Present course materials that are conscious and respectful of diversity (gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, politics, and culture)
But… if I ever fall short or if you ever have suggestions for improvement, please do share with me! There is also an anonymous Google Form if you’re more comfortable there.
Changes since last iteration (based on feedback):
A few (Piazza) guidelines:
1. No duplicates.
2. Public posts are best.
3. Posts should include your question, what you've tried so far, & resources used.
4. Helping others is encouraged.
5. No assignment code in public posts.
6. We're not robots.
Artwork by @allison_horst
Don’t cheat.
Teamwork is allowed, but you should be able to answer “Yes” to each of the following:
The Internet is a great resource. Cite your sources.
Teamwork is not allowed on your midterm. It is open-notes and open-Google/ChatGPT. You cannot discuss the questions on the exam with anyone.
For anything in this course.
Probably never first or right away.
To learn: Think first. Try first. Then use external resources.
Always read/think about/understand the output.
Your final grade will be composed of the following:
Assignment (#) | % of grade |
---|---|
Labs (8) | 16% |
Homework (3) | 32% |
Midterm (1) | 15% |
Case Study Projects* (2) | 20% |
Final project* (1) | 17% |
* indicates group submission
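As a rough illustration of how the weights above combine, here is a minimal sketch in R. The category scores are made up for the example, and the variable names are hypothetical; only the weights come from the table.

```r
# Weights from the grading table (as proportions of the final grade)
weights <- c(labs = 0.16, homework = 0.32, midterm = 0.15,
             case_studies = 0.20, final_project = 0.17)

# Sanity check: the weights should account for 100% of the grade
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical category averages (out of 100), for illustration only
scores <- c(labs = 95, homework = 88, midterm = 80,
            case_studies = 90, final_project = 92)

# Weighted average: multiply each score by its weight and sum
final_grade <- sum(weights * scores[names(weights)])
final_grade  # 89
```

Indexing `scores` by `names(weights)` keeps each score lined up with its weight even if the two vectors are written in a different order.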
Homework and case study projects: accepted up to 3 days (72 hours) after the assigned deadline, with a 25% deduction
No late deadlines for labs, the exam, or the final project
Note: Prof Ellis is a reasonable person; reach out to her if you have an extenuating circumstance at any point in the quarter.
Datahub is a platform hosted by UCSD that gives students access to computational resources.
This means that while you’ll be typing on your keyboard, you’ll be using UCSD’s computers in this class.
Website: https://datahub.ucsd.edu/
Launch Environment
When working on “stuff” for this course, select the COGS 137 environment.
## Datahub Usage
Q: Do I have to use datahub?
A: Nope. You could download and install all the packages we use and complete the course locally! Many packages have already been installed for you on datahub, so working locally will be a tiny bit more work up front…but you won’t be dependent on the internet/datahub!
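If you do go the local route, a small check like the one below can show which packages still need installing. This is only a sketch: the package list is an assumption based on the tools named in these slides (R Markdown and ggplot2), and the full set used in class may differ.

```r
# Hypothetical package list: these slides name R Markdown and ggplot2;
# the full set used in class may differ.
needed <- c("rmarkdown", "ggplot2")

# requireNamespace() returns TRUE when a package is installed and loadable
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # Uncomment to install from CRAN:
  # install.packages(to_install)
} else {
  message("All listed packages are available.")
}
```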
Scriptability \(\rightarrow\) R
Literate programming (code, narrative, output in one place) \(\rightarrow\) R Markdown
Version control \(\rightarrow\) Git / GitHub
The Internet (Google/ChatGPT/etc.)
R & RStudio
Concepts introduced:
Your Turn
airquality dataframe
airquality dataframe
Put a green sticky on the front of your computer when you’re done. Put a pink sticky up if you want help or have a question.
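For a first look at `airquality`, which ships with base R's datasets package, something like the following may help. These particular calls are an illustration of common ways to inspect a data frame, not necessarily the steps of the exercise.

```r
# airquality is a built-in data frame; data() makes it explicit in the workspace
data(airquality)

# Dimensions: 153 rows (daily observations) and 6 columns
n_rows <- nrow(airquality)
n_cols <- ncol(airquality)
dim(airquality)

# Column names and the first few rows
names(airquality)
head(airquality)

# Structure: the type of each column plus a preview of its values
str(airquality)

# Count of missing values in each column
colSums(is.na(airquality))
```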
Keep the R Markdown cheat sheet and Markdown Quick Reference (Help -> Markdown Quick Reference) handy; we’ll refer to them often as the course progresses
The workspace of your R Markdown document is separate from the Console
with human readable messages
We’ll cover this time permitting; you’ll see it again in lab this week
Concepts introduced:
There is a bit more of GitHub that we’ll use in this class, but for today this is enough.
Consider ggplot2
(a package we’ll learn a lot about)
Imagine: You’ve been asked to carry out a number of wrangling operations on a dataset and make a plot…
Can you answer these questions?
git Resources
git from the command line
git (Part 1), by COGS 108 TA Ganesh (youtube, 22min tutorial)
git with GitHub Desktop, by COGS 108 TA Sidharth Suresh (youtube, 13min tutorial)
(required) Student Survey - complete by Tuesday at 11:59 PM.
This is required and completion will be used for CAA/#finaid. DO complete this even if you’re on the waitlist, please.
(optional) Daily Post-Lecture Feedback