02-dplyr

Author

Professor Shannon Ellis

Published

October 5, 2023

Data Manipulation with dplyr

Q&A

Q: How many people are in a group for case studies and final project?
A: 3-4

Q: How to turn in assignments
A: We’ll discuss this today!

Q: Why don’t we have uniform keyboard-shortcut (like run code, new cell) for both R and Python and other coding environment?
A: Lack of communication? Preferences of developers? I think we’ll get there…

Q: Wasn’t clear about the ‘single quotes’ vs “double quotes” thing
A: When creating a string, or any time you need to use quotes in R, single and double quotes are interchangeable. R doesn’t care which you use. However, your code will be stylistically better if you consistently use one.

Q: What are useful libraries that we can use to analyze data?
A: We’ll be discussing lots, but the tidyverse packages (the first of which we’ll discuss is dplyr) is a great place to start. There are also different packages for basically every statistical analysis out there

Q: Is there any way to prevent coercion? / I was wondering if you can types cast a variable when concatenation
A: Yup. You can explicitly state as._____() when creating a variable (i.e. as.character()) and when reading in data you can specify. You’ll find that R does a pretty good job at guessing, but we can always fix to what we want after the fact.

Q: What is the difference between mylist[1] and mylist[[1]]? It looked like class(mylist[1]) returned list and class(mylist[[1]]) returned the class of the element.
A: Double brackets returns the element directly. Single bracket (for lists) always returns a list.

Q: I’m curious about how to handle dataframes in R
A: Excellent - we’ll start this discussion today and continue throughout the quarter!

Course Announcements

Due Dates:

  • Lab 01 due tomorrow (Friday; 11:59 PM)
  • Student survey open until next Thursday
  • HW01 and Lab02 will both be released Monday
  • Lecture Participation survey “due” after class

Suggested Reading

R4DS:

Agenda

  • dplyr
    • philosophy
    • pipes
    • common operations

Philosophy

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges

Pipes

The pipe in baseR

  • |> should be read as “and then”
  • for example “Wake up |> brush teeth” would be read as “wake up and then brush teeth”

Where does the name come from?

The pipe operator was first implemented in the package magrittr.

You will see this frequently in code online. It’s equivalent to |>.

Review: How does a pipe work?

  • You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.

. . .

  • Expressed as a set of nested functions in R pseudocode this would look like:
park(drive(start_car(find("keys")), to = "campus"))

. . .

  • Writing it out using pipes give it a more natural (and easier to read) structure:
find("keys") |>
  start_car() |>
  drive(to = "campus") |>
  park()

Data

To get started with lecture code: library(tidyverse)

NC DOT Fatal Crashes in North Carolina

From OpenDurham’s Data Portal

bike <- read_csv2("https://raw.githubusercontent.com/COGS137/datasets/main/nc_bike_crash.csv", 
                  na = c("NA", "", "."))

Variables

View the names of variables via

names(bike)
 [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
 [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
[11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
[16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
[21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
[26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
[31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
[36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
[41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
[46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
[51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"  

Viewing your data

  • In the Environment, click on the name of the data frame to view it in the data viewer (or use the View function)

  • Use the glimpse function to take a peek

glimpse(bike)
Rows: 5,716
Columns: 54
$ FID        <dbl> 18, 29, 33, 35, 49, 53, 56, 60, 63, 66, 72, 75, 82, 84, 85,…
$ OBJECTID   <dbl> 19, 30, 34, 36, 50, 54, 57, 61, 64, 67, 73, 76, 83, 85, 86,…
$ AmbulanceR <chr> "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", …
$ BikeAge_Gr <chr> NA, "50-59", NA, "16-19", NA, "50-59", "16-19", "40-49", "1…
$ Bike_Age   <dbl> 6, 51, 10, 17, 6, 52, 18, 40, 6, 7, 45, 30, 17, 20, 14, 15,…
$ Bike_Alc_D <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ Bike_Dir   <chr> "Not Applicable", "With Traffic", "With Traffic", NA, "Faci…
$ Bike_Injur <chr> "C: Possible Injury", "C: Possible Injury", "Injury", "B: E…
$ Bike_Pos   <chr> "Driveway / Alley", "Travel Lane", "Travel Lane", "Travel L…
$ Bike_Race  <chr> "Black", "Black", "Black", "White", "Black", "White", "Blac…
$ Bike_Sex   <chr> "Female", "Male", "Male", "Male", "Male", "Male", "Female",…
$ City       <chr> "Durham", "Greenville", "Farmville", "Charlotte", "Charlott…
$ County     <chr> "Durham", "Pitt", "Pitt", "Mecklenburg", "Mecklenburg", "Du…
$ CrashAlcoh <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ CrashDay   <chr> "01-01-06", "01-01-02", "01-01-07", "01-01-05", NA, NA, NA,…
$ Crash_Date <date> 2007-01-06, 2007-01-09, 2007-01-14, 2007-01-12, 2007-01-15…
$ Crash_Grp  <chr> "Bicyclist Failed to Yield - Midblock", "Crossing Paths - O…
$ Crash_Hour <dbl> 13, 23, 16, 19, 12, 20, 19, 14, 16, 0, 17, 18, 14, 17, 19, …
$ Crash_Loc  <chr> "Non-Intersection", "Intersection-Related", "Intersection",…
$ Crash_Mont <chr> NA, NA, NA, NA, NA, "01-04-01", "01-04-01", NA, "01-02-01",…
$ Crash_Time <dttm> 0001-01-01 13:17:58, 0001-01-01 23:08:58, 0001-01-01 16:44…
$ Crash_Type <chr> "Bicyclist Ride Out - Residential Driveway", "Crossing Path…
$ Crash_Ty_1 <dbl> 353311, 211180, 111144, 119139, 112114, 311231, 119144, 132…
$ Crash_Year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007,…
$ Crsh_Sevri <chr> "C: Possible Injury", "C: Possible Injury", "O: No Injury",…
$ Developmen <chr> "Residential", "Commercial", "Residential", "Residential", …
$ DrvrAge_Gr <chr> "60-69", "30-39", "50-59", "30-39", NA, "20-24", "40-49", N…
$ Drvr_Age   <dbl> 66, 34, 52, 33, NA, 20, 40, NA, 17, 51, NA, 64, 50, 66, 30,…
$ Drvr_Alc_D <chr> "No", "No", "No", "No", "Missing", "No", "No", "Missing", "…
$ Drvr_EstSp <chr> "11-15 mph", "0-5 mph", "21-25 mph", "46-50 mph", "16-20 mp…
$ Drvr_Injur <chr> "O: No Injury", "O: No Injury", "O: No Injury", "O: No Inju…
$ Drvr_Race  <chr> "Black", "Black", "White", "White", "/Missing", "White", "B…
$ Drvr_Sex   <chr> "Male", "Male", "Female", "Female", NA, "Female", "Male", N…
$ Drvr_VehTy <chr> "Pickup", "Passenger Car", "Passenger Car", "Sport Utility"…
$ ExcsSpdInd <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ Hit_Run    <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No…
$ Light_Cond <chr> "Daylight", "Dark - Lighted Roadway", "Daylight", "Dark - R…
$ Locality   <chr> "Mixed (30% To 70% Developed)", "Urban (>70% Developed)", "…
$ Num_Lanes  <chr> "2 lanes", "5 lanes", "2 lanes", "4 lanes", "2 lanes", "4 l…
$ Num_Units  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ Rd_Charact <chr> "Straight - Level", "Straight - Level", "Straight - Level",…
$ Rd_Class   <chr> "Local Street", "Local Street", "Local Street", "NC Route",…
$ Rd_Conditi <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dr…
$ Rd_Config  <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unprotected Medi…
$ Rd_Defects <chr> "None", "None", "None", "None", "None", "None", "None", "No…
$ Rd_Feature <chr> "No Special Feature", "Four-Way Intersection", "Four-Way In…
$ Rd_Surface <chr> "Smooth Asphalt", "Smooth Asphalt", "Smooth Asphalt", "Smoo…
$ Region     <chr> "Piedmont", "Coastal", "Coastal", "Piedmont", "Piedmont", "…
$ Rural_Urba <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Urban", "Urba…
$ Speed_Limi <chr> "20 - 25  MPH", "40 - 45  MPH", "30 - 35  MPH", "40 - 45  M…
$ Traff_Cntr <chr> "No Control Present", "Stop And Go Signal", "Stop Sign", "S…
$ Weather    <chr> "Clear", "Clear", "Clear", "Cloudy", "Clear", "Clear", "Cle…
$ Workzone_I <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ Location   <chr> "36.002743, -78.8785", "35.612984, -77.39265", "35.595676, …

dplyr

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.

Single data frame functions / verbs:

  • filter: pick rows matching criteria
  • slice: pick rows using index(es)
  • select: pick columns by name
  • pull: grab a column as a vector
  • rename: rename specific columns
  • arrange: reorder rows
  • mutate: add new variables
  • transmute: create new data frame with variables
  • distinct: filter for unique rows
  • sample_n / sample_frac: randomly sample rows
  • summarize: reduce variables to values
  • … (many more)

dplyr rules for functions

  1. First argument is always a data frame

  2. Subsequent arguments say what to do with that data frame

  3. Always return a data frame

  4. Do not modify in place

  5. Performance via lazy evaluation

Filter rows with filter

  • Select a subset of rows in a data frame.
  • Easily filter for many conditions at once.

filter

for crashes in Durham County

bike |>
  filter(County == "Durham")
# A tibble: 253 × 54
     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir  Bike_Injur
   <dbl>    <dbl> <chr>      <chr>         <dbl> <chr>      <chr>     <chr>     
 1    18       19 No         <NA>              6 No         Not Appl… C: Possib…
 2    53       54 Yes        50-59            52 No         With Tra… A: Disabl…
 3    56       57 Yes        16-19            18 No         <NA>      C: Possib…
 4   209      210 No         16-19            16 No         Facing T… C: Possib…
 5   228      229 Yes        40-49            40 No         With Tra… B: Eviden…
 6   620      621 Yes        50-59            55 No         With Tra… B: Eviden…
 7   667      668 Yes        60-69            61 No         Not Appl… B: Eviden…
 8   458      459 Yes        60-69            62 No         With Tra… B: Eviden…
 9   576      577 No         40-49            49 No         With Tra… C: Possib…
10   618      619 No         20-24            23 No         With Tra… C: Possib…
# ℹ 243 more rows
# ℹ 46 more variables: Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>,
#   City <chr>, County <chr>, CrashAlcoh <chr>, CrashDay <chr>,
#   Crash_Date <date>, Crash_Grp <chr>, Crash_Hour <dbl>, Crash_Loc <chr>,
#   Crash_Mont <chr>, Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <dbl>,
#   Crash_Year <dbl>, Crsh_Sevri <chr>, Developmen <chr>, DrvrAge_Gr <chr>,
#   Drvr_Age <dbl>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, Drvr_Injur <chr>, …

filter

for crashes in Durham County where biker was < 10 yrs old

bike |>
  filter(County == "Durham", Bike_Age < 10)
# A tibble: 20 × 54
     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir  Bike_Injur
   <dbl>    <dbl> <chr>      <chr>         <dbl> <chr>      <chr>     <chr>     
 1    18       19 No         <NA>              6 No         Not Appl… C: Possib…
 2    47       48 No         10-Jun            9 No         Not Appl… O: No Inj…
 3   124      125 Yes        10-Jun            8 No         With Tra… C: Possib…
 4   531      532 Yes        10-Jun            7 No         With Tra… C: Possib…
 5   704      705 Yes        10-Jun            9 No         Not Appl… C: Possib…
 6    42       43 No         10-Jun            8 No         With Tra… O: No Inj…
 7   392      393 Yes        0-5               2 No         Not Appl… B: Eviden…
 8   941      942 No         10-Jun            9 No         With Tra… C: Possib…
 9   436      437 Yes        10-Jun            6 No         Not Appl… O: No Inj…
10   160      161 Yes        10-Jun            7 No         With Tra… C: Possib…
11   273      274 Yes        10-Jun            7 No         Facing T… C: Possib…
12    78       79 Yes        10-Jun            7 No         With Tra… C: Possib…
13   422      423 No         10-Jun            9 No         Not Appl… O: No Inj…
14   570      571 No         <NA>              0 Missing    Not Appl… Injury    
15   683      684 Yes        10-Jun            8 No         Not Appl… C: Possib…
16    62       63 Yes        10-Jun            7 No         With Tra… C: Possib…
17   248      249 No         0-5               4 No         Not Appl… O: No Inj…
18   306      307 Yes        10-Jun            8 No         With Tra… C: Possib…
19   231      232 Yes        10-Jun            8 No         With Tra… C: Possib…
20   361      362 Yes        10-Jun            9 No         With Tra… B: Eviden…
# ℹ 46 more variables: Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>,
#   City <chr>, County <chr>, CrashAlcoh <chr>, CrashDay <chr>,
#   Crash_Date <date>, Crash_Grp <chr>, Crash_Hour <dbl>, Crash_Loc <chr>,
#   Crash_Mont <chr>, Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <dbl>,
#   Crash_Year <dbl>, Crsh_Sevri <chr>, Developmen <chr>, DrvrAge_Gr <chr>,
#   Drvr_Age <dbl>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, Drvr_Injur <chr>,
#   Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ExcsSpdInd <chr>, …

Aside: real data is messy!

  What in the world does a BikeAge_gr of 10-Jun or 15-Nov mean?

bike |>
  group_by(BikeAge_Gr) |>
  summarize(crash_count = n())
# A tibble: 13 × 2
   BikeAge_Gr crash_count
   <chr>            <int>
 1 0-5                 60
 2 10-Jun             421
 3 15-Nov             747
 4 16-19              605
 5 20-24              680
 6 25-29              430
 7 30-39              658
 8 40-49              920
 9 50-59              739
10 60-69              274
11 70                  12
12 70+                 58
13 <NA>               112

Careful data scientists clean up their data first!

  • We’re going to need to do some text parsing to clean up these data
    • 10-Jun should be 6-10
    • 15-Nov should be 11-15

Correct and overwrite mutate

  • Remember we want to do the following in the BikeAge_Gr variable
    • 10-Jun should be 6-10
    • 15-Nov should be 11-15
bike <- bike |>
  mutate(
    BikeAge_Gr = case_when(
      BikeAge_Gr == "10-Jun" ~ "6-10",
      BikeAge_Gr == "15-Nov" ~ "11-15",
      TRUE                   ~ BikeAge_Gr     # everything else
    )
  )
  • Note that we’re overwriting existing data and columns, so be careful!
    • But remember, it’s easy to revert if you make a mistake since we didn’t touch the raw data, we can always reload it and start over

Check before you move on

Always check your changes and confirm code did what you wanted it to do

bike |>
  group_by(BikeAge_Gr) |>
  summarize(count = n())
# A tibble: 13 × 2
   BikeAge_Gr count
   <chr>      <int>
 1 0-5           60
 2 11-15        747
 3 16-19        605
 4 20-24        680
 5 25-29        430
 6 30-39        658
 7 40-49        920
 8 50-59        739
 9 6-10         421
10 60-69        274
11 70            12
12 70+           58
13 <NA>         112

mutate to add new variables

  How is the new alcohol variable determined?

bike |>
  mutate(alcohol = case_when(
    Bike_Alc_D == "No" & Drvr_Alc_D == "No"      ~ "No",
    Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes"    ~ "Yes",
    Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing",
    Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing"
  ))

“Save” when you mutate

Most often when you define a new variable with mutate you’ll also want to save the resulting data frame, often by writing over the original data frame.

bike <- bike |>
  mutate(alcohol = case_when(
    Bike_Alc_D == "No" & Drvr_Alc_D == "No"      ~ "No",
    Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes"    ~ "Yes",
    Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing",
    Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing"
  ))

transmute to create a new dataset

You’ll use this much less often than mutate but when you need it, you need it.

bike |> 
  transmute(ID = paste(FID, OBJECTID, sep = "-"))
# A tibble: 5,716 × 1
   ID   
   <chr>
 1 18-19
 2 29-30
 3 33-34
 4 35-36
 5 49-50
 6 53-54
 7 56-57
 8 60-61
 9 63-64
10 66-67
# ℹ 5,706 more rows

mutate vs. transmute

  • mutate adds new and keeps original
  • transmute adds new; drops existing

Your Turn

How many accidents in our dataset required an ambulance ride (AmbulanceR) and had the Crash_Type “Bicyclist Lost Control - Mechanical Problems”?

Put a green sticky on the front of your computer when you’re done. Put a pink if you want help/have a question.

slice for certain row numbers

First five

bike |>
  slice(1:5)
# A tibble: 5 × 54
    FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir   Bike_Injur
  <dbl>    <dbl> <chr>      <chr>         <dbl> <chr>      <chr>      <chr>     
1    18       19 No         <NA>              6 No         Not Appli… C: Possib…
2    29       30 Yes        50-59            51 No         With Traf… C: Possib…
3    33       34 No         <NA>             10 No         With Traf… Injury    
4    35       36 Yes        16-19            17 No         <NA>       B: Eviden…
5    49       50 No         <NA>              6 No         Facing Tr… O: No Inj…
# ℹ 46 more variables: Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>,
#   City <chr>, County <chr>, CrashAlcoh <chr>, CrashDay <chr>,
#   Crash_Date <date>, Crash_Grp <chr>, Crash_Hour <dbl>, Crash_Loc <chr>,
#   Crash_Mont <chr>, Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <dbl>,
#   Crash_Year <dbl>, Crsh_Sevri <chr>, Developmen <chr>, DrvrAge_Gr <chr>,
#   Drvr_Age <dbl>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, Drvr_Injur <chr>,
#   Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ExcsSpdInd <chr>, …

slice for certain row numbers

Last five

last_row <- nrow(bike)
bike |>
  slice((last_row - 4):last_row)
# A tibble: 5 × 54
    FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir   Bike_Injur
  <dbl>    <dbl> <chr>      <chr>         <dbl> <chr>      <chr>      <chr>     
1   460      461 Yes        6-10              7 No         Not Appli… C: Possib…
2   474      475 Yes        50-59            50 No         With Traf… B: Eviden…
3   479      480 Yes        16-19            16 No         Not Appli… C: Possib…
4   487      488 No         40-49            47 Yes        With Traf… C: Possib…
5   488      489 Yes        30-39            35 No         Facing Tr… C: Possib…
# ℹ 46 more variables: Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>,
#   City <chr>, County <chr>, CrashAlcoh <chr>, CrashDay <chr>,
#   Crash_Date <date>, Crash_Grp <chr>, Crash_Hour <dbl>, Crash_Loc <chr>,
#   Crash_Mont <chr>, Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <dbl>,
#   Crash_Year <dbl>, Crsh_Sevri <chr>, Developmen <chr>, DrvrAge_Gr <chr>,
#   Drvr_Age <dbl>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, Drvr_Injur <chr>,
#   Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ExcsSpdInd <chr>, …

select to keep only the variables you mention

bike |>
  select(Crash_Loc, Hit_Run) |>
  table()
                      Hit_Run
Crash_Loc                No  Yes
  Intersection         2223  275
  Intersection-Related  252   42
  Location                3    7
  Non-Intersection     2213  462
  Non-Roadway           205   30

or select to exclude variables

bike |>
  select(-OBJECTID)
# A tibble: 5,716 × 53
     FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir  Bike_Injur Bike_Pos
   <dbl> <chr>      <chr>         <dbl> <chr>      <chr>     <chr>      <chr>   
 1    18 No         <NA>              6 No         Not Appl… C: Possib… Drivewa…
 2    29 Yes        50-59            51 No         With Tra… C: Possib… Travel …
 3    33 No         <NA>             10 No         With Tra… Injury     Travel …
 4    35 Yes        16-19            17 No         <NA>      B: Eviden… Travel …
 5    49 No         <NA>              6 No         Facing T… O: No Inj… Travel …
 6    53 Yes        50-59            52 No         With Tra… A: Disabl… Travel …
 7    56 Yes        16-19            18 No         <NA>      C: Possib… Travel …
 8    60 No         40-49            40 No         Facing T… B: Eviden… Sidewal…
 9    63 Yes        6-10              6 No         Facing T… B: Eviden… Travel …
10    66 Yes        6-10              7 No         <NA>      B: Eviden… Non-Roa…
# ℹ 5,706 more rows
# ℹ 45 more variables: Bike_Race <chr>, Bike_Sex <chr>, City <chr>,
#   County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>,
#   Crash_Grp <chr>, Crash_Hour <dbl>, Crash_Loc <chr>, Crash_Mont <chr>,
#   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <dbl>, Crash_Year <dbl>,
#   Crsh_Sevri <chr>, Developmen <chr>, DrvrAge_Gr <chr>, Drvr_Age <dbl>,
#   Drvr_Alc_D <chr>, Drvr_EstSp <chr>, Drvr_Injur <chr>, Drvr_Race <chr>, …

or select a range of variables

bike |>
  select(OBJECTID:Bike_Injur)
# A tibble: 5,716 × 7
   OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir       Bike_Injur 
      <dbl> <chr>      <chr>         <dbl> <chr>      <chr>          <chr>      
 1       19 No         <NA>              6 No         Not Applicable C: Possibl…
 2       30 Yes        50-59            51 No         With Traffic   C: Possibl…
 3       34 No         <NA>             10 No         With Traffic   Injury     
 4       36 Yes        16-19            17 No         <NA>           B: Evident…
 5       50 No         <NA>              6 No         Facing Traffic O: No Inju…
 6       54 Yes        50-59            52 No         With Traffic   A: Disabli…
 7       57 Yes        16-19            18 No         <NA>           C: Possibl…
 8       61 No         40-49            40 No         Facing Traffic B: Evident…
 9       64 Yes        6-10              6 No         Facing Traffic B: Evident…
10       67 Yes        6-10              7 No         <NA>           B: Evident…
# ℹ 5,706 more rows

pull to extract a column as a vector

bike |>
  slice(1:6) |>
  pull(Location)
[1] "36.002743, -78.8785"  "35.612984, -77.39265" "35.595676, -77.59074"
[4] "35.076767, -80.7728"  "35.19999, -80.75713"  "35.966644, -78.96749"
bike |>
  slice(1:6) |>
  select(Location)
# A tibble: 6 × 1
  Location            
  <chr>               
1 36.002743, -78.8785 
2 35.612984, -77.39265
3 35.595676, -77.59074
4 35.076767, -80.7728 
5 35.19999, -80.75713 
6 35.966644, -78.96749

The two pulls in your lives

  • Don’t get pull happy when wrangling data! Only extract out variables if you truly need to, otherwise keep in data frame.

  • But always ⬇️ Pull before starting your work when collaborating on GitHub.

rename specific columns

Useful for correcting typos, and renaming to make variable names shorter and/or more informative

  • Original names:
names(bike)
 [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
 [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
[11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
[16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
[21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
[26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
[31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
[36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
[41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
[46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
[51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"  

rename specific columns

  • Rename Speed_Limi to Speed_Limit:
bike <- bike |>
  rename(Speed_Limit = Speed_Limi)

Check before you move on

Always check your changes and confirm code did what you wanted it to do

names(bike)
 [1] "FID"         "OBJECTID"    "AmbulanceR"  "BikeAge_Gr"  "Bike_Age"   
 [6] "Bike_Alc_D"  "Bike_Dir"    "Bike_Injur"  "Bike_Pos"    "Bike_Race"  
[11] "Bike_Sex"    "City"        "County"      "CrashAlcoh"  "CrashDay"   
[16] "Crash_Date"  "Crash_Grp"   "Crash_Hour"  "Crash_Loc"   "Crash_Mont" 
[21] "Crash_Time"  "Crash_Type"  "Crash_Ty_1"  "Crash_Year"  "Crsh_Sevri" 
[26] "Developmen"  "DrvrAge_Gr"  "Drvr_Age"    "Drvr_Alc_D"  "Drvr_EstSp" 
[31] "Drvr_Injur"  "Drvr_Race"   "Drvr_Sex"    "Drvr_VehTy"  "ExcsSpdInd" 
[36] "Hit_Run"     "Light_Cond"  "Locality"    "Num_Lanes"   "Num_Units"  
[41] "Rd_Charact"  "Rd_Class"    "Rd_Conditi"  "Rd_Config"   "Rd_Defects" 
[46] "Rd_Feature"  "Rd_Surface"  "Region"      "Rural_Urba"  "Speed_Limit"
[51] "Traff_Cntr"  "Weather"     "Workzone_I"  "Location"   

Your Turn

Your boss in Cumberland County gets overwhelmed by data easily, but he wants some data from you. He wants all bike accidents from his County, but he only wants to know the road’s speed limit, the age of the biker, and to know if alcohol was involved. If you have time, mine as well make the column names very clear to your boss while you’re at it…

Put a green sticky on the front of your computer when you’re done. Put a pink if you want help/have a question.

summarize to reduce variables to values

The values are summarized in a data frame

bike |>
  group_by(BikeAge_Gr) |>
  summarize(crash_count = n())
# A tibble: 13 × 2
   BikeAge_Gr crash_count
   <chr>            <int>
 1 0-5                 60
 2 11-15              747
 3 16-19              605
 4 20-24              680
 5 25-29              430
 6 30-39              658
 7 40-49              920
 8 50-59              739
 9 6-10               421
10 60-69              274
11 70                  12
12 70+                 58
13 <NA>               112

and arrange to order rows

bike |>
  group_by(BikeAge_Gr) |>
  summarize(crash_count = n()) |>
  arrange(desc(crash_count))
# A tibble: 13 × 2
   BikeAge_Gr crash_count
   <chr>            <int>
 1 40-49              920
 2 11-15              747
 3 50-59              739
 4 20-24              680
 5 30-39              658
 6 16-19              605
 7 25-29              430
 8 6-10               421
 9 60-69              274
10 <NA>               112
11 0-5                 60
12 70+                 58
13 70                  12

count to group by then count

bike |>
  count(BikeAge_Gr)
# A tibble: 13 × 2
   BikeAge_Gr     n
   <chr>      <int>
 1 0-5           60
 2 11-15        747
 3 16-19        605
 4 20-24        680
 5 25-29        430
 6 30-39        658
 7 40-49        920
 8 50-59        739
 9 6-10         421
10 60-69        274
11 70            12
12 70+           58
13 <NA>         112

  If you wanted to arrange these in ascending order what would you add to the pipe?

Select rows with sample_n or sample_frac

  • sample_n: randomly sample 5 observations
bike_n5 <- bike |>
  sample_n(5, replace = FALSE)

dim(bike_n5)
[1]  5 54
  • sample_frac: randomly sample 20% of observations
bike_perc20 <- bike |>
  sample_frac(0.2, replace = FALSE)

dim(bike_perc20)
[1] 1143   54

distinct to filter for unique rows

bike |> 
  select(County, City) |> 
  distinct() |> 
  arrange(County, City)
# A tibble: 360 × 2
   County    City              
   <chr>     <chr>             
 1 Alamance  Alamance          
 2 Alamance  Burlington        
 3 Alamance  Elon College      
 4 Alamance  Gibsonville       
 5 Alamance  Graham            
 6 Alamance  Green Level       
 7 Alamance  Mebane            
 8 Alamance  None - Rural Crash
 9 Alexander None - Rural Crash
10 Alleghany None - Rural Crash
# ℹ 350 more rows

distinct has a .keep_all parameter

bike |> 
  distinct(County, City, .keep_all = TRUE) |> 
  arrange(County, City)
# A tibble: 360 × 54
     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir  Bike_Injur
   <dbl>    <dbl> <chr>      <chr>         <dbl> <chr>      <chr>     <chr>     
 1   524      525 Yes        11-15            12 No         <NA>      B: Eviden…
 2    84       85 Yes        20-24            20 No         With Tra… B: Eviden…
 3   571      572 Yes        16-19            16 No         Not Appl… B: Eviden…
 4   509      510 Yes        40-49            43 Yes        With Tra… K: Killed 
 5   855      856 Yes        30-39            30 No         With Tra… A: Disabl…
 6     5        6 Yes        40-49            44 Yes        With Tra… C: Possib…
 7   163      164 Yes        30-39            35 No         Not Appl… C: Possib…
 8    96       97 Yes        30-39            36 No         With Tra… C: Possib…
 9    46       47 Yes        50-59            53 No         With Tra… B: Eviden…
10   485      486 Yes        60-69            62 No         With Tra… C: Possib…
# ℹ 350 more rows
# ℹ 46 more variables: Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>,
#   City <chr>, County <chr>, CrashAlcoh <chr>, CrashDay <chr>,
#   Crash_Date <date>, Crash_Grp <chr>, Crash_Hour <dbl>, Crash_Loc <chr>,
#   Crash_Mont <chr>, Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <dbl>,
#   Crash_Year <dbl>, Crsh_Sevri <chr>, Developmen <chr>, DrvrAge_Gr <chr>,
#   Drvr_Age <dbl>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, Drvr_Injur <chr>, …

Factors

Factors

Factor objects are how R stores data for categorical variables (fixed numbers of discrete values).

(x = factor(c("BS", "MS", "PhD", "MS")))
[1] BS  MS  PhD MS 
Levels: BS MS PhD
glimpse(x)
 Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
typeof(x)
[1] "integer"

Returning to: Cat lovers

Reading in the cat-lovers data…

cat_lovers <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/cat-lovers.csv")

Read data in as character strings

glimpse(cat_lovers)
Rows: 60
Columns: 3
$ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass", "Tyro…
$ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0", "0", …
$ handedness     <chr> "left", "left", "left", "left", "left", "left", "left",…

But coerce when plotting

ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()

Use forcats to manipulate factors

cat_lovers <- cat_lovers |>
  mutate(handedness = fct_relevel(handedness, 
                                  "right", "left", "ambidextrous"))
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()

forcats functionality

  • R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.

  • factors are still useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors.

Recap

  • Understand the basic tenants of dplyr
  • Describe and utilize the pipe in workflows
  • Describe and use common verbs (functions)
  • Understand the documentation for dplyr functions
  • Understand what factors are an that forcats is a package with functionality for working with them