library(tidyverse)
Lab 03 - Data Visualization (Ans)
Load packages
Load data
<- read_csv("data/recent-grads.csv") college_recent_grads
Rows: 173 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): major, major_category
dbl (19): rank, major_code, total, sample_size, men, women, sharewomen, empl...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(college_recent_grads)
Rows: 173
Columns: 21
$ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ major_code <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, …
$ major <chr> "Petroleum Engineering", "Mining And Miner…
$ major_category <chr> "Engineering", "Engineering", "Engineering…
$ total <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 1…
$ sample_size <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, …
$ men <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 8…
$ women <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, …
$ sharewomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132…
$ employed <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 15…
$ employed_fulltime <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 1…
$ employed_parttime <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 1…
$ employed_fulltime_yearround <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 82…
$ unemployed <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, …
$ unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096386, 0.0…
$ p25th <dbl> 95000, 55000, 50000, 43000, 50000, 50000, …
$ median <dbl> 110000, 75000, 73000, 70000, 65000, 65000,…
$ p75th <dbl> 125000, 90000, 105000, 80000, 75000, 10200…
$ college_jobs <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 97…
$ non_college_jobs <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 1…
$ low_wage_jobs <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3…
Data wrangling and visualization
Which major has the lowest unemployment rate?
|>
college_recent_grads arrange(unemployment_rate) |>
select(rank, major, unemployment_rate)
# A tibble: 173 × 3
rank major unemployment_rate
<dbl> <chr> <dbl>
1 53 Mathematics And Computer Science 0
2 74 Military Technologies 0
3 84 Botany 0
4 113 Soil Science 0
5 121 Educational Administration And Supervision 0
6 15 Engineering Mechanics Physics And Science 0.00633
7 20 Court Reporting 0.0117
8 120 Mathematics Teacher Education 0.0162
9 1 Petroleum Engineering 0.0184
10 65 General Agriculture 0.0196
# ℹ 163 more rows
Display only 4 decimal places using mutate
and round
:
|>
college_recent_grads arrange(unemployment_rate) |>
select(rank, major, unemployment_rate) |>
mutate(unemployment_rate = round(unemployment_rate, digits = 4))
# A tibble: 173 × 3
rank major unemployment_rate
<dbl> <chr> <dbl>
1 53 Mathematics And Computer Science 0
2 74 Military Technologies 0
3 84 Botany 0
4 113 Soil Science 0
5 121 Educational Administration And Supervision 0
6 15 Engineering Mechanics Physics And Science 0.0063
7 20 Court Reporting 0.0117
8 120 Mathematics Teacher Education 0.0162
9 1 Petroleum Engineering 0.0184
10 65 General Agriculture 0.0196
# ℹ 163 more rows
Display 2 scientific digits using options
:
options(digits = 2)
Exercise 1
Which of these options, changing the input data or altering the number of digits displayed without touching the input data, is the better option? Explain your reasoning. Then, implement the option you chose.
I prefer the options
approach because it will enforce consistency each time we display variables, without having to make the call to round
each time (without changing the underlying data).
Which major has the highest percentage of women?
Exercise 2
Using what you’ve learned so far, arrange the data in descending order with respect to proportion of women in a major, and display only the major, the total number of people with major, and proportion of women. Show only the top 3 majors by adding head(3)
at the end of the pipeline.
The sharewomen
column lists the proportion of women in each major. So we can arrange
our data in descending order using desc(sharewomen)
, then select
the columns we want: major
, total
, and sharewomen
. We then just display the top 3 majors using head(3)
.
|>
college_recent_grads arrange(desc(sharewomen)) |>
select(major, total, sharewomen) |>
head(3)
# A tibble: 3 × 3
major total sharewomen
<chr> <dbl> <dbl>
1 Early Childhood Education 37589 0.969
2 Communication Disorders Sciences And Services 38279 0.968
3 Medical Assisting Services 11123 0.928
How do the distributions of median income compare across major categories?
Exercise 3
Why do we often choose the median, rather than the mean, to describe the typical income of a group of people?
The mean is more affected by the presence of outliers and by the skew of the distribution. Thus, the presence of a single person with very high income can increase the mean substantively, such that it’s no longer a good impression of the overall distribution. In contrast, the median (or the “middle number”) tells us the income exactly in the middle of the distribution.
Exercise 4
Try binwidths of $1000 and $5000 and choose one. Explain your reasoning for your choice. Note that the binwidth is an argument for the geom_histogram
function. So to specify a binwidth of $1000, you would use geom_histogram(binwidth = 1000)
.
ggplot(data = college_recent_grads, mapping = aes(x = median)) +
geom_histogram(binwidth=5000)
I ended up choosing binwidth=5000
. When binwidth=1000
, there were too many small differences in income that were thus not grouped together, and it was harder to see the overall shape of the distribution.
We can also calculate summary statistics for this distribution using the summarise
function:
|>
college_recent_grads summarise(min = min(median), max = max(median),
mean = mean(median), med = median(median),
sd = sd(median),
q1 = quantile(median, probs = 0.25),
q3 = quantile(median, probs = 0.75))
# A tibble: 1 × 7
min max mean med sd q1 q3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22000 110000 40151. 36000 11470. 33000 45000
Exercise 5
Based on the shape of the histogram you created in the previous exercise, determine which of these summary statistics is useful for describing the distribution. Write up your description (remember shape, center, spread, any unusual observations) and include the summary statistic output as well.
The underlying distribution of median
incomes is somewhat right-skewed, with at least 1-2 majors making a lot of money. Because of this, I’m going to use the median
median income.
It would be probably be fine to use the mean
median income as well—judging by the distribution of median
incomes in each major_category
, which are relatively normal-ish.
Exercise 6
Plot the distribution of median
income using a histogram, faceted by major_category
. Use the binwidth
you chose in the earlier exercise.
ggplot(data = college_recent_grads, mapping = aes(x = median)) +
geom_histogram(binwidth = 5000) +
facet_wrap( ~ major_category, ncol = 4)
Exercise 7
Which major category has the highest typical (you’ll need to decide what this means) median income? Use the partial code below, filling it in with the appropriate statistic and function. Also note that we are looking for the highest statistic, so make sure to arrange in the correct direction.
|>
college_recent_grads group_by(major_category) |>
summarise(median = median(median)) |>
arrange(desc(median))
# A tibble: 16 × 2
major_category median
<chr> <dbl>
1 Engineering 57000
2 Computers & Mathematics 45000
3 Business 40000
4 Physical Sciences 39500
5 Social Science 38000
6 Biology & Life Science 36300
7 Law & Public Policy 36000
8 Agriculture & Natural Resources 35000
9 Communications & Journalism 35000
10 Health 35000
11 Industrial Arts & Consumer Services 35000
12 Interdisciplinary 35000
13 Education 32750
14 Humanities & Liberal Arts 32000
15 Arts 30750
16 Psychology & Social Work 30000
Engineering.
Exercise 8
Which major category is the least popular in this sample? To answer this question we can take advantage of the total
column:
|>
college_recent_grads filter(is.na(total) == FALSE) |>
group_by(major_category) |>
summarise(popularity = sum(total)) |>
arrange(popularity)
# A tibble: 16 × 2
major_category popularity
<chr> <dbl>
1 Interdisciplinary 12296
2 Agriculture & Natural Resources 75620
3 Law & Public Policy 179107
4 Physical Sciences 185479
5 Industrial Arts & Consumer Services 229792
6 Computers & Mathematics 299008
7 Arts 357130
8 Communications & Journalism 392601
9 Biology & Life Science 453862
10 Health 463230
11 Psychology & Social Work 481007
12 Social Science 529966
13 Engineering 537583
14 Education 559129
15 Humanities & Liberal Arts 713468
16 Business 1302376
All STEM fields aren’t the same
First, let’s create a new vector called stem_categories
that lists the major categories that are considered STEM fields.
<- c("Biology & Life Science",
stem_categories "Computers & Mathematics",
"Engineering",
"Physical Sciences")
Then, we can use this to create a new variable in our data frame indicating whether a major is STEM or not.
<- college_recent_grads |>
college_recent_grads mutate(major_type = case_when(
%in% stem_categories ~ "stem",
major_category TRUE ~ "not stem"
))
We can use the logical operators to also filter
our data for STEM majors whose median earnings is less than median for all majors’s median earnings, which we found to be $36,000 earlier.
|>
college_recent_grads filter(
== "stem",
major_type < median(college_recent_grads$median)
median )
# A tibble: 10 × 22
rank major_code major major_category total sample_size men women
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 93 1301 Environment… Biology & Lif… 25965 225 10787 15178
2 98 5098 Multi-Disci… Physical Scie… 62052 427 27015 35037
3 102 3608 Physiology Biology & Lif… 22060 99 8422 13638
4 106 2001 Communicati… Computers & M… 18035 208 11431 6604
5 109 3611 Neuroscience Biology & Lif… 13663 53 4944 8719
6 111 5002 Atmospheric… Physical Scie… 4043 32 2744 1299
7 123 3699 Miscellaneo… Biology & Lif… 10706 63 4747 5959
8 124 3600 Biology Biology & Lif… 280709 1370 111762 168947
9 133 3604 Ecology Biology & Lif… 9154 86 3878 5276
10 169 3609 Zoology Biology & Lif… 8409 47 3050 5359
# ℹ 14 more variables: sharewomen <dbl>, employed <dbl>,
# employed_fulltime <dbl>, employed_parttime <dbl>,
# employed_fulltime_yearround <dbl>, unemployed <dbl>,
# unemployment_rate <dbl>, p25th <dbl>, median <dbl>, p75th <dbl>,
# college_jobs <dbl>, non_college_jobs <dbl>, low_wage_jobs <dbl>,
# major_type <chr>
Exercise 9
Which STEM majors have median salaries equal to or less than the median for all majors’s median earnings? Your output should only show the major name and median, 25th percentile, and 75th percentile earning for that major as and should be sorted such that the major with the highest median earning is on top.
|>
college_recent_grads filter(
== "stem",
major_type < median(college_recent_grads$median)
median |>
) select(major, median, p25th, p75th) |>
arrange(desc(median))
# A tibble: 10 × 4
major median p25th p75th
<chr> <dbl> <dbl> <dbl>
1 Environmental Science 35600 25000 40200
2 Multi-Disciplinary Or General Science 35000 24000 50000
3 Physiology 35000 20000 50000
4 Communication Technologies 35000 25000 45000
5 Neuroscience 35000 30000 44000
6 Atmospheric Sciences And Meteorology 35000 28000 50000
7 Miscellaneous Biology 33500 23000 48000
8 Biology 33400 24000 45000
9 Ecology 33000 23000 42000
10 Zoology 26000 20000 39000
What types of majors do women tend to major in?
Exercise 10
Create a scatterplot of median income vs. proportion of women in that major, colored by whether the major is in a STEM field or not. Describe the association between these three variables.
|>
college_recent_grads drop_na(sharewomen) |> ## This will drop rows for which sharewomen has NA value
ggplot(aes(x = median,
y = sharewomen,
color = major_type)) +
geom_point(alpha = .6) +
labs(x = "Median income for major",
y = "Share of women in major",
color = "STEM major?")
In general, there appears to be a negative relationship between median
income and the proportion of women (sharewomen
) in a major. Both variables are also correlated with major_type
: stem
majors tend to have a lower proportion of women (and lower median
income), while not stem
majors have a higher proportion (and higher median
income).
(Note that the negative relationship with median
income is somewhat clearer if log(median
) is used instead.)