Lab 02 - Wrangling (Ans)

Load packages

library(tidyverse) 
Warning

Your numbers/output may differ depending on the version of the storms dataset you used to complete the lab. The logic will still follow what you see here. This is expected behavior.

Exercise 1

How many unique hurricanes are included in this dataset?

(Note the specific value may differ based on the version of the dataset you’re using, but the code would not change.)

n_unique <- storms |> 
  filter(status == "hurricane") |>
  distinct(name, year, .keep_all = TRUE) |>
  count() |>
  pull(n)

# OR

storms |> 
  filter(status == "hurricane") |>
  group_by(year, name) |> 
  count() |>
  nrow()
[1] 310

There are 310 unique hurricanes.

(Note that we need to group by name *and* year, because the same name is reused for different storms in different years. A storm that runs from December into January could still be double-counted; if you accounted for that, excellent work!)
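One possible way to handle the December-to-January case is to start a new storm only when there is a long gap between consecutive observations of the same name. The sketch below uses a small made-up tibble rather than storms, and the 30-day threshold is an assumption chosen purely for illustration.

```r
library(dplyr)

# Toy data (hypothetical values): "Zeta" runs from December into January,
# while the name "Ana" is reused for a different storm six years later.
obs <- tibble(
  name  = c("Zeta", "Zeta", "Zeta", "Ana", "Ana"),
  year  = c(2005, 2005, 2006, 2003, 2009),
  month = c(12, 12, 1, 4, 5),
  day   = c(30, 31, 1, 20, 28)
)

# Naive count: one storm per (name, year) pair, so Zeta is counted twice
naive <- obs |>
  distinct(name, year) |>
  nrow()

# Gap-based count: a new storm starts at the first observation of a name,
# or when more than 30 days (assumed threshold) separate two observations
by_gap <- obs |>
  mutate(date = as.Date(ISOdate(year, month, day))) |>
  arrange(name, date) |>
  group_by(name) |>
  mutate(
    new_storm = is.na(lag(date)) | as.numeric(date - lag(date)) > 30,
    storm_id  = cumsum(new_storm)
  ) |>
  ungroup() |>
  distinct(name, storm_id) |>
  nrow()

naive
by_gap
```

On this toy data the naive count is one higher than the gap-based count, because Zeta's two calendar years are treated as two storms.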

Exercise 2

Note: If you used storms on datahub, the ts_diameter column has missing information, so you were likely unable to complete this question. Otherwise, this would have been the approach:

Which tropical storm affected the largest area experiencing tropical storm strength winds? And, what was the maximum sustained wind speed for that storm?

storms |> 
  filter(status == "tropical storm", 
         !is.na(ts_diameter)) |> 
  slice_max(ts_diameter)

OR

storms |>
  filter(status == "tropical storm",
         !is.na(ts_diameter)) |> 
  filter(ts_diameter == max(ts_diameter, na.rm=TRUE))

Sandy (2012) had the largest area affected.
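The second half of the question (the maximum sustained wind speed for that storm) can be answered by first identifying the storm and then summarising all of its observations. The sketch below uses a made-up tibble with the same column names as above (name, year, status, wind, ts_diameter); the values are hypothetical, and taking the maximum across every observation of the storm regardless of status is one reading of the question.

```r
library(dplyr)

# Toy data (hypothetical values) mimicking the relevant storms columns
toy <- tibble(
  name        = c("Big", "Big", "Big", "Small", "Small"),
  year        = c(2012, 2012, 2012, 2010, 2010),
  status      = c("tropical storm", "hurricane", "tropical storm",
                  "tropical storm", "tropical storm"),
  wind        = c(50, 85, 60, 45, 55),
  ts_diameter = c(700, 650, 800, 300, NA)
)

# Step 1: find the tropical storm observation with the largest diameter
largest <- toy |>
  filter(status == "tropical storm", !is.na(ts_diameter)) |>
  slice_max(ts_diameter, n = 1, with_ties = FALSE) |>
  select(name, year)

# Step 2: keep every observation of that storm and take the maximum wind
result <- toy |>
  semi_join(largest, by = c("name", "year")) |>
  summarise(max_wind = max(wind))

result
```

To restrict the answer to observations taken while the storm had tropical storm status, add filter(status == "tropical storm") before the summarise() step. If your version of the dataset names the diameter column differently, substitute that name.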

Exercise 3

Among all storms in this dataset, in which month are storms most common? Does this depend on the status of the storm? (In other words, are hurricanes more common in certain months than tropical depressions? or tropical storms?)

# most common month
storms |> 
  distinct(name, year, .keep_all=TRUE) |>
  group_by(month) |>
  summarise(n = n()) |> # could alternatively use count() here
  arrange(desc(n))
# A tibble: 10 × 2
   month     n
   <dbl> <int>
 1     9   208
 2     8   173
 3    10    99
 4     7    67
 5     6    41
 6    11    29
 7     5    13
 8    12     5
 9     1     2
10     4     2

September is the most common month.

# depend on status?
storms |> 
  group_by(status, month) |>
  summarise(n = n()) |> # could alternatively use count() here
  slice_max(n)
`summarise()` has grouped output by 'status'. You can override using the
`.groups` argument.
# A tibble: 9 × 3
# Groups:   status [9]
  status                 month     n
  <fct>                  <dbl> <int>
1 disturbance                7    45
2 extratropical              9   732
3 hurricane                  9  2380
4 other low                  9   446
5 subtropical depression     8    36
6 subtropical storm          9    72
7 tropical depression        9  1315
8 tropical storm             9  2448
9 tropical wave              8    55

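For the three statuses the question asks about (hurricanes, tropical depressions, and tropical storms), it does not depend on status: September is the most common month for all three. Some of the less frequent statuses in the table peak earlier (disturbance in July; subtropical depression and tropical wave in August). Note that this table counts observations (rows) rather than unique storms.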
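Counting rows and counting unique storms can give different answers, since a long-lived storm contributes many rows to a single month. As a rough sketch on made-up data (all values hypothetical), the two approaches can be compared like this, using count() in place of group_by() plus summarise():

```r
library(dplyr)

# Toy data (hypothetical): storm A has three September observations,
# while storms B and C each have one August observation.
toy <- tibble(
  name   = c("A", "A", "A", "B", "C"),
  year   = c(2001, 2001, 2001, 2001, 2001),
  status = "hurricane",
  month  = c(9, 9, 9, 8, 8)
)

# Most common month per status, counting observations (rows)
by_rows <- toy |>
  count(status, month) |>
  group_by(status) |>
  slice_max(n, n = 1) |>
  ungroup()

# Most common month per status, counting each storm once per status and month
by_storms <- toy |>
  distinct(name, year, status, month) |>
  count(status, month) |>
  group_by(status) |>
  slice_max(n, n = 1) |>
  ungroup()

by_rows
by_storms
```

On this toy data the row-based count picks September, while the storm-based count picks August. Either approach is defensible as long as the answer states which one was used.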

Exercise 4

Your boss asks for the name, year, and status of all category 5 storms that have happened in the 2000s. Carry out the operations that would deliver what they’re looking for.

storms |>
  filter(category == 5,
         between(year, 2000, 2009)) |>
  select(name, year, status) |>
  distinct(name, year, .keep_all=TRUE)
# A tibble: 8 × 3
  name     year status   
  <chr>   <dbl> <fct>    
1 Isabel   2003 hurricane
2 Ivan     2004 hurricane
3 Emily    2005 hurricane
4 Katrina  2005 hurricane
5 Rita     2005 hurricane
6 Wilma    2005 hurricane
7 Dean     2007 hurricane
8 Felix    2007 hurricane

Exercise 5

Filter these data to only include storms that occurred during your lifetime (your code and results may differ from your classmates!). Among storms that have occurred during your lifetime, what’s the mean and median air pressure across all measurements taken?

my_storms <- storms |>
  filter(between(year, 1988, 2023)) # alternatively filter(year >= 1988)

my_storms |>
  summarise(median_pressure = median(pressure),
            mean_pressure = mean(pressure))
# A tibble: 1 × 2
  median_pressure mean_pressure
            <dbl>         <dbl>
1            1000          993.
  • Median: 1000 millibars
  • Mean: about 993 millibars

Exercise 6

Which decade (of the storms included in the dataset) had the largest number of unique reported storms?

storms |> 
  distinct(name, year) |>
  mutate(decade = year - year %% 10) |> # there are MANY different ways to approach this!
  group_by(decade) |>
  count()
# A tibble: 6 × 2
# Groups:   decade [6]
  decade     n
   <dbl> <int>
1   1970    40
2   1980    90
3   1990   127
4   2000   169
5   2010   163
6   2020    50

The 2000s.

(Note: we want to be sure to only count each storm once. Could also arrange by desc(n) to have 2000 at top.)
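As an illustration of the "many different ways" to compute a decade, the sketch below checks on made-up years that three common formulas agree, and uses count(sort = TRUE) to put the largest decade at the top. The tibble and its values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical): one row per unique storm
years <- tibble(
  name = c("A", "B", "C", "D"),
  year = c(1979, 1980, 1989, 2001)
)

# Three equivalent ways to map a year to the start of its decade
decades <- years |>
  mutate(
    d_mod    = year - year %% 10,      # subtract the remainder
    d_intdiv = (year %/% 10) * 10,     # integer division, then scale back
    d_floor  = floor(year / 10) * 10   # floor of the division, then scale back
  )

# Count storms per decade, largest first
ranked <- decades |>
  count(decade = d_mod, sort = TRUE)

ranked
```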

Exercise 7

Among the subset of storms occurring in your lifetime, which storm lasted the longest? Include your code and explain your answer.

my_storms |>  
  group_by(name, year) |> 
  count() |> 
  arrange(desc(n))
# A tibble: 532 × 3
# Groups:   name, year [532]
   name      year     n
   <chr>    <dbl> <int>
 1 Nadine    2012    96
 2 Ivan      2004    94
 3 Kyle      2002    90
 4 Leslie    2018    89
 5 Paulette  2020    88
 6 Alberto   2000    87
 7 Jose      2017    85
 8 Nicholas  2003    80
 9 Florence  2018    79
10 Marilyn   1995    79
# ℹ 522 more rows

Nadine lasted the longest (unless you were born after 2012).

(Note: The logic here is that storms are reported every six hours, per the description of the dataset, so the storm with the most rows/entries would have lasted the longest.)
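Row counts are a good proxy for duration only when every storm is recorded on the same regular schedule. If some observations are missing or off-schedule, an alternative is to measure the time between a storm's first and last observation. The sketch below does this on made-up data (all values hypothetical), assuming the month, day, and hour columns are available as in storms.

```r
library(dplyr)

# Toy data (hypothetical): storm A has fewer rows but spans more time
obs <- tibble(
  name  = c("A", "A", "A", "B", "B", "B", "B"),
  year  = 2012,
  month = 9,
  day   = c(1, 1, 3, 5, 5, 5, 5),
  hour  = c(0, 6, 0, 0, 6, 12, 18)
)

durations <- obs |>
  # Build a timestamp from the date and hour columns
  mutate(time = ISOdatetime(year, month, day, hour, 0, 0, tz = "UTC")) |>
  group_by(name, year) |>
  summarise(
    n_obs = n(),
    hours = as.numeric(difftime(max(time), min(time), units = "hours")),
    .groups = "drop"
  ) |>
  arrange(desc(hours))

durations
```

Here storm B has more rows, but storm A covers the longer time span, so the two measures rank the storms differently.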