03-tidyr

Professor Shannon Ellis

2023-10-10

Tidy Data with tidyr

[ad] Data Science Student Society

Join DS3 at their Fall General Body Meeting to learn more about the events they’re offering this quarter, open board positions for the year, and free food! It takes place on Wednesday (10/11) from 6-8 PM in PC Ballroom West.

Q&A

Q: Is it possible to integrate with GitHub using systems other than Datahub? Datahub has already been spotty for me in this course and is notorious for slumping at critical points in the quarter.
A: Yup. The same steps can be carried out by downloading RStudio onto your computer and connecting it with GitHub.

Q: Should we write in the console or in the Rmd file first when writing code?
A: Great question! I’d suggest starting in the Rmd file and editing there. That way you don’t have to copy+paste once you get it right. It’s already there.

Q: How do you take notes for coding classes? I know there are lecture notes available, but how would you recommend taking notes for this class?
A: I would recommend opening a blank Rmd each day for class and saving it with the lecture number. I’d keep notes and things I tried in that file. But, I wouldn’t copy+paste everything, since the other lecture notes are available.

Course Announcements

Due Dates:

  • Lab 02 due Friday
  • HW01 now available; due Monday (10/16; 11:59 PM)
  • Lecture Participation survey open until Thursday

Notes:

  • Lab01 scores and feedback posted
  • Datahub: Launch RStudio (possible solution?)
  • Staff office hours updated (see Canvas or website)

Student Comment

I have been struggling to grasp the material in the course. It feels like we are diving into the content in the labs, but I don’t even feel like I truly understand what I’m doing. It often seems like I’m just copying and pasting code from the website without a clear understanding of the bigger picture. I’m particularly stuck because I feel like I don’t have a solid grasp of the fundamental concepts of coding in R; it feels so new. I understand that the pace of the course may be challenging, but I think a somewhat stronger focus on the foundational aspects of coding in R would greatly benefit students like me who are struggling with the content. I’m looking forward to the course and I hope I can grasp the content as we go through the next week. I’m concerned about learning the material and also how that may affect my grade.

Let’s see how y’all feel in a week. The first week can be a lot in this course. Often, students feel a lot more comfortable come week 3.

Student Survey

  • 89% know Python; 15% know R; most (but not all!) have programmed before
  • 64% feel confident about effective data science communication
  • Reasons for taking course: learn R, add to resume, analyze data, improve data science skills

My favorite boring facts:

  • I was actually born on my birthday
  • I don’t like to eat eggs but my roommate loves them
  • I like to have a midday nap.
  • I can raise my eyebrows really well
  • I eat peanut butter straight from the jar
  • I have a Jack Russell terrier... named Jack (we weren’t feeling creative)

Suggested Reading

R4DS:

Tidy Data

The opinionated tidyverse is so named because it assumes (and requires) that your data be “tidy”.

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” - Hadley Wickham

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
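To make the three rules concrete, here is a minimal sketch that reshapes a small made-up table into tidy form with tidyr's pivot_longer(). The data, the column names (cats_2022, cats_2023), and the object names untidy and tidy are assumptions for illustration and are not part of the course datasets. In the wide version, the variable "year" is hidden in the column names, which breaks rule 1.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the variable "year" is stored in column names
untidy <- tribble(
  ~name, ~cats_2022, ~cats_2023,
  "Ana", 1,          2,
  "Ben", 0,          3
)

# Move the year out of the column names and into its own column,
# so that each row is one observation (one person in one year)
tidy <- untidy |>
  pivot_longer(
    cols = -name,                 # reshape every column except name
    names_to = "year",            # old column names go here
    names_prefix = "cats_",       # strip the shared prefix from those names
    values_to = "number_of_cats"  # cell values go here
  )

tidy
```

After reshaping, each variable (name, year, number_of_cats) has its own column and each person-year combination has its own row. Note that year is stored as text here, since it came from column names.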

Tidy or not?

❓ Given the rules discussed, is the cat_lovers dataset tidy?

library(tidyverse)  # provides read_csv() and read_csv2()
library(DT)         # provides datatable()
cat_lovers <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/cat-lovers.csv")
Rows: 60 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, number_of_cats, handedness

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cat_lovers |> datatable()
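The column specification above reports number_of_cats as chr, which hints that at least one entry is not a plain number. One possible way to hunt for such entries is sketched below on a hypothetical stand-in table; the values in cats_example are invented for illustration and are not taken from the real file.

```r
library(tibble)

# Hypothetical stand-in for a few rows of cat_lovers (values are invented)
cats_example <- tibble(
  name = c("A", "B", "C"),
  number_of_cats = c("0", "three", "2"),
  handedness = c("left", "right", "right")
)

# as.numeric() returns NA (with a warning) for text that is not a number,
# so the NA positions point at the entries that need cleaning
parsed <- suppressWarnings(as.numeric(cats_example$number_of_cats))

# Show only the rows whose count could not be read as a number
cats_example[is.na(parsed), ]
```

The same idea could be tried on the real cat_lovers object by swapping it in for cats_example.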

❓ Given the rules discussed, is the bike dataset tidy?

bike <- read_csv2("https://raw.githubusercontent.com/COGS137/datasets/main/nc_bike_crash.csv", 
                  na = c("NA", "", "."))
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 5716 Columns: 54
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (44): AmbulanceR, BikeAge_Gr, Bike_Alc_D, Bike_Dir, Bike_Injur, Bike_Po...
dbl   (8): FID, OBJECTID, Bike_Age, Crash_Hour, Crash_Ty_1, Crash_Year, Drvr...
dttm  (1): Crash_Time
date  (1): Crash_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike |> datatable()
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
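For context on the import call above: read_csv2() reads semicolon-delimited files that use a comma as the decimal mark, and the na argument lists the strings to treat as missing. The sketch below shows both on a tiny inline example. The inline text and the object name small are made up for illustration, and passing literal data with I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical semicolon-delimited text with comma decimals and "." for missing
small <- read_csv2(
  I("id;speed;note\n1;3,5;.\n2;4,0;ok\n"),
  na = c("NA", "", "."),   # same missing-value markers as in the bike import
  show_col_types = FALSE   # quiet the column specification message
)

small
```

Here "3,5" is read as the number 3.5, and the lone "." in the note column becomes NA rather than a piece of text.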