Lab 02 - Wrangling (Ans)
Load packages
library(tidyverse)
Exercise 1
How many unique hurricanes are included in this dataset?
(Note the specific value may differ based on the version of the dataset you’re using, but the code would not change.)
n_unique <- storms |>
  filter(status == "hurricane") |>
  distinct(name, year, .keep_all = TRUE) |>
  count() |>
  pull(n)
# OR
storms |>
  filter(status == "hurricane") |>
  group_by(year, name) |>
  count() |>
  nrow()
[1] 310
There are 310 unique hurricanes.
(Note that we need to group by name *and* year, because some storm names are reused in different years. A storm that extends from December into January could still be double-counted; if you accounted for that, excellent work!)
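One possible way to check for the December-to-January double-counting mentioned above is sketched here. The table `obs` and its values are made up for illustration (they are not taken from storms), and the sketch assumes that a January name/year pair whose name also has December observations in the previous year is the same storm.

```r
library(dplyr)

# Hypothetical toy observations; the values are invented for illustration
obs <- tibble(
  name  = c("Zeta", "Zeta", "Zeta", "Alex", "Alex"),
  year  = c(2005, 2005, 2006, 2016, 2016),
  month = c(12, 12, 1, 1, 1)
)

# name/year pairs that have at least one December observation
december <- obs |>
  filter(month == 12) |>
  distinct(name, year)

# January name/year pairs whose name also appeared the previous December
carried_over <- obs |>
  filter(month == 1) |>
  distinct(name, year) |>
  mutate(prev_year = year - 1) |>
  semi_join(december, by = c("name", "prev_year" = "year"))

# naive count of name/year pairs, then remove the carried-over duplicates
n_naive    <- obs |> distinct(name, year) |> nrow()
n_adjusted <- n_naive - nrow(carried_over)
```

Here the naive count treats the 2005 and 2006 rows of the same storm as two storms, while the adjusted count treats them as one. The same pattern could be applied to the hurricane subset of storms, with the caveat that the assumption above may not hold for every reused name.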
Exercise 2
Note: If you used storms on datahub, the ts_diameter column has missing information and you were likely unable to complete this question. Otherwise, this would have been the approach:
Which tropical storm affected the largest area experiencing tropical storm strength winds? And, what was the maximum sustained wind speed for that storm?
storms |>
  filter(status == "tropical storm",
         !is.na(ts_diameter)) |>
  slice_max(ts_diameter)
OR
storms |>
  filter(status == "tropical storm",
         !is.na(ts_diameter)) |>
  filter(ts_diameter == max(ts_diameter, na.rm = TRUE))
Sandy (2012) had the largest area affected.
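As a rough illustration of why the two approaches above agree, and of one way to get at the second half of the question (the maximum sustained wind speed), here is a sketch on a made-up table. The name `toy` and all of its values are hypothetical. It reads "maximum sustained wind speed for that storm" as the largest wind value over all of that storm's observations, which is only one possible interpretation.

```r
library(dplyr)

# Hypothetical toy data standing in for storms; the values are invented
toy <- tibble(
  name        = c("A", "A", "B", "B", "C"),
  year        = c(2001, 2001, 2002, 2002, 2003),
  wind        = c(40, 55, 45, 60, 50),
  ts_diameter = c(100, 300, 300, 150, NA)
)

# slice_max() keeps ties by default (with_ties = TRUE)
via_slice <- toy |>
  filter(!is.na(ts_diameter)) |>
  slice_max(ts_diameter)

# filtering on the maximum also keeps every tied row
via_filter <- toy |>
  filter(!is.na(ts_diameter)) |>
  filter(ts_diameter == max(ts_diameter))

# largest wind value over all observations of the storm(s) found above
max_wind <- toy |>
  semi_join(via_slice, by = c("name", "year")) |>
  group_by(name, year) |>
  summarise(max_wind = max(wind), .groups = "drop")
```

Both approaches return the same two tied rows in this toy example, and `max_wind` then reports one wind value per storm.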
Exercise 3
Among all storms in this dataset, in which month are storms most common? Does this depend on the status of the storm? (In other words, are hurricanes more common in certain months than tropical depressions? or tropical storms?)
# most common month
storms |>
  distinct(name, year, .keep_all = TRUE) |>
  group_by(month) |>
  summarise(n = n()) |> # could alternatively use count() here
  arrange(desc(n))
# A tibble: 10 × 2
   month     n
   <dbl> <int>
 1     9   208
 2     8   173
 3    10    99
 4     7    67
 5     6    41
 6    11    29
 7     5    13
 8    12     5
 9     1     2
10     4     2
September is the most common month.
# depend on status?
storms |>
  group_by(status, month) |>
  summarise(n = n()) |> # unlike count(), this drops the month grouping, so slice_max() works per status
  slice_max(n)
`summarise()` has grouped output by 'status'. You can override using the
`.groups` argument.
# A tibble: 9 × 3
# Groups:   status [9]
  status                 month     n
  <fct>                  <dbl> <int>
1 disturbance                7    45
2 extratropical              9   732
3 hurricane                  9  2380
4 other low                  9   446
5 subtropical depression     8    36
6 subtropical storm          9    72
7 tropical depression        9  1315
8 tropical storm             9  2448
9 tropical wave              8    55
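For the three storm types asked about (hurricanes, tropical depressions, and tropical storms), it does not depend on status: September is the most common month for all three. (A few of the other statuses in the output, such as disturbances, subtropical depressions, and tropical waves, peak in July or August instead.)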
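The summarise() message in the output above can be avoided by stating the grouping of the result explicitly. The sketch below uses a made-up table (`toy` and its values are hypothetical) and also shows why swapping in count() on the already-grouped data would behave differently before slice_max().

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy <- tibble(
  status = c("hurricane", "hurricane", "hurricane",
             "tropical storm", "tropical storm", "tropical storm"),
  month  = c(9, 9, 8, 8, 8, 9)
)

# .groups = "drop_last" keeps the result grouped by status only,
# so slice_max() picks the busiest month within each status
peak_months <- toy |>
  group_by(status, month) |>
  summarise(n = n(), .groups = "drop_last") |>
  slice_max(n) |>
  ungroup()

# count() on grouped data keeps BOTH grouping variables,
# so slice_max() then returns every status/month row
with_count <- toy |>
  group_by(status, month) |>
  count() |>
  slice_max(n)
```

In this toy example `peak_months` has one row per status, while `with_count` still has one row per status/month combination.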
Exercise 4
Your boss asks for the name, year, and status of all category 5 storms that have happened in the 2000s. Carry out the operations that would deliver what they’re looking for.
storms |>
  filter(category == 5,
         between(year, 2000, 2009)) |>
  select(name, year, status) |>
  distinct(name, year, .keep_all = TRUE)
# A tibble: 8 × 3
  name     year status
  <chr>   <dbl> <fct>
1 Isabel   2003 hurricane
2 Ivan     2004 hurricane
3 Emily    2005 hurricane
4 Katrina  2005 hurricane
5 Rita     2005 hurricane
6 Wilma    2005 hurricane
7 Dean     2007 hurricane
8 Felix    2007 hurricane
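The answer above reads "the 2000s" as the decade 2000 to 2009 and relies on between() including both endpoints. A minimal check of that behaviour, using a made-up vector of years:

```r
library(dplyr)

# Hypothetical years chosen to sit on and around the boundaries
years <- c(1999, 2000, 2005, 2009, 2010)

# between() is inclusive on both ends
in_2000s <- between(years, 2000, 2009)

# equivalent comparison written out by hand
same <- years >= 2000 & years <= 2009
```

If the question were instead read as "since 2000", a condition such as year >= 2000 would be the natural replacement.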
Exercise 5
Filter these data to only include storms that occurred during your lifetime (your code and results may differ from your classmates!). Among storms that have occurred during your lifetime, what’s the mean and median air pressure across all measurements taken?
my_storms <- storms |>
  filter(between(year, 1988, 2023)) # alternatively filter(year >= 1988)
my_storms |>
  summarise(median_pressure = median(pressure),
            mean_pressure = mean(pressure))
# A tibble: 1 × 2
  median_pressure mean_pressure
            <dbl>         <dbl>
1            1000          993.
- Median: 1000 millibars
- Mean: approximately 993 millibars
Exercise 6
Which decade (of the storms included in the dataset) had the largest number of unique reported storms?
storms |>
  distinct(name, year) |>
  mutate(decade = year - year %% 10) |> # there are MANY different ways to approach this!
  group_by(decade) |>
  count()
# A tibble: 6 × 2
# Groups:   decade [6]
  decade     n
   <dbl> <int>
1   1970    40
2   1980    90
3   1990   127
4   2000   169
5   2010   163
6   2020    50
The 2000s.
(Note: we want to be sure to count each storm only once. You could also arrange by desc(n) to put 2000 at the top.)
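As the code comment says, there are many ways to compute the decade. Three equivalent options for non-negative years are sketched below on a made-up vector; the names `d1`, `d2`, and `d3` are only for illustration.

```r
# Hypothetical years on and around decade boundaries
years <- c(1975, 1980, 1999, 2000, 2019, 2020)

d1 <- years - years %% 10      # subtract the remainder (the approach used above)
d2 <- floor(years / 10) * 10   # divide, round down, scale back up
d3 <- years %/% 10 * 10        # integer division, then scale back up
```

All three give 1970, 1980, 1990, 2000, 2010, and 2020 for these inputs, so the choice is mostly a matter of readability.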
Exercise 7
Among the subset of storms occurring in your lifetime, which storm lasted the longest? Include your code and explain your answer.
my_storms |>
  group_by(name, year) |>
  count() |>
  arrange(desc(n))
# A tibble: 532 × 3
# Groups:   name, year [532]
   name      year     n
   <chr>    <dbl> <int>
 1 Nadine    2012    96
 2 Ivan      2004    94
 3 Kyle      2002    90
 4 Leslie    2018    89
 5 Paulette  2020    88
 6 Alberto   2000    87
 7 Jose      2017    85
 8 Nicholas  2003    80
 9 Florence  2018    79
10 Marilyn   1995    79
# ℹ 522 more rows
Nadine lasted the longest (unless you were born after 2012).
(Note: The logic here is that storms are reported every six hours, per the description of the dataset, so the storm with the most rows/entries would have lasted the longest.)
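Counting rows relies on the observations being evenly spaced. An alternative that does not depend on that assumption is to measure the time between each storm's first and last observation. The sketch below does this on a made-up table (`obs` and its values are hypothetical) and assumes the year, month, day, and hour columns describe times in UTC.

```r
library(dplyr)

# Hypothetical toy observations; the values are invented for illustration
obs <- tibble(
  name  = c("A", "A", "A", "B", "B"),
  year  = 2020,
  month = 9,
  day   = c(1, 1, 2, 5, 5),
  hour  = c(0, 6, 0, 12, 18)
)

durations <- obs |>
  # build a timestamp from the date parts (UTC is an assumption here)
  mutate(time = ISOdatetime(year, month, day, hour, 0, 0, tz = "UTC")) |>
  group_by(name, year) |>
  summarise(
    n_obs = n(),
    # elapsed hours between the first and last observation
    hours = as.numeric(difftime(max(time), min(time), units = "hours")),
    .groups = "drop"
  ) |>
  arrange(desc(hours))
```

On this toy data, storm A spans 24 hours and storm B spans 6 hours. Applied to my_storms, the ranking by elapsed time may or may not match the ranking by row count.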