library(tidyverse)
Lab 03 - Data Visualization
Introduction
A note on expectations: For each exercise, include any relevant output (tables, summary statistics, plots) in your answer along with text to guide the reader. Place any relevant R code in a code chunk, any relevant text outside of code chunks, and hit Knit HTML.
Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe raw information - the data. In this lab we explore data on college majors and earnings, specifically the data behind the FiveThirtyEight story “The Economic Guide To Picking A College Major”.
These data originally come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series. While outside the scope of this lab, if you are curious about how raw data from the ACS were cleaned and prepared, see the code FiveThirtyEight authors used.
We should also note that there are many considerations that go into picking a major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.
Getting started
To get started, accept the lab03 assignment (link on Canvas), clone the repo (using SSH) into RStudio on datahub. Update the author name at the top of the .Rmd file in the YAML to be your name. And, then you’re ready to go!
Heads up…this lab is longer than what you can likely accomplish in 1h. This is by design. If you fly through labs b/c you’ve got experience, we want to make sure you’ve got things to explore and continue to learn! If you’re new to R and the tidyverse
, you’re not going to get through the whole lab. That’s OK! Do what you can, incomplete responses are fine, incomplete questions are OK, and just be sure to check the answer key for places you were stuck. That said, be sure to try the visualization exercises (4, 6, & 10)!
Setup
Go ahead and add your name to the YAML at top.
And, in this lab we will work with the tidyverse
package. So we need to load it:
Note that this package is also loaded in your R Markdown document.
Load the data
In this lab we use a new method of loading data into R – read the data in from a comma separated values (CSV) file.
CSV is a delimited text file that uses a comma to separate values, though many implementations of CSV import/export tools allow other separators to be used as well. (Source: Wikipedia
We will use the read_csv
function to do this:
The dataset is in a folder called data
. We read it in with the read_csv
function, and save the result as a new data frame called college_recent_grads
.
<- read_csv("data/recent-grads.csv") college_recent_grads
college_recent_grads
is a tidy data frame, with each row representing an observation and each column representing a variable.
To view the data, click on the name of the data frame in the Environment tab.
You can also take a quick peek at your data frame and view its dimensions with the glimpse
function.
glimpse(college_recent_grads)
The description of the variables, i.e. the codebook, is given below.
Header | Description |
---|---|
rank |
Rank by median earnings |
major_code |
Major code, FO1DP in ACS PUMS |
major |
Major description |
major_category |
Category of major from Carnevale et al |
total |
Total number of people with major |
sample_size |
Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
men |
Male graduates |
women |
Female graduates |
sharewomen |
Women as share of total |
employed |
Number employed (ESR == 1 or 2) |
employed_full_time |
Employed 35 hours or more |
employed_part_time |
Employed less than 35 hours |
employed_full_time_yearround |
Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
unemployed |
Number unemployed (ESR == 3) |
unemployment_rate |
Unemployed / (Unemployed + Employed) |
median |
Median earnings of full-time, year-round workers |
p25th |
25th percentile of earnigns |
p75th |
75th percentile of earnings |
college_jobs |
Number with job requiring a college degree |
non_college_jobs |
Number with job not requiring a college degree |
low_wage_jobs |
Number in low-wage service jobs |
The college_recent_grads
data frame is a trove of information. Let’s think about some questions we might want to answer with these data:
- Which major has the lowest unemployment rate?
- Which major has the highest percentage of women?
- How do the distributions of median income compare across major categories?
- Do women tend to choose majors with lower or higher earnings?
In the next section we aim to answer these questions.
Exercises
Which major has the lowest unemployment rate?
In order to answer this question all we need to do is sort the data. We use the arrange
function to do this, and sort it by the unemployment_rate
variable. By default arrange
sorts in ascending order, which is what we want here – we’re interested in the major with the lowest unemployment rate.
|>
college_recent_grads arrange(unemployment_rate)
This gives us what we wanted, but not in an ideal form. First, the name of the major barely fits on the page. Second, some of the variables are not that useful (e.g. major_code
, major_category
) and some we might want front and center are not easily viewed (e.g. unemployment_rate
).
We can use the select
function to choose which variables to display, and in which order:
Note that your output here likely has a whole bunch of decimal places in the unemployment variable? You likely don’t want all those values to be displayed.
There are two ways we can address this problem. One would be to round the unemployment_rate
variable in the dataset or we can change the number of digits displayed, without touching the input data.
Below are instructions for how you would do both of these:
Round
unemployment_rate
: We create a new variable with themutate
function. In this case, we’re overwriting the existingunemployment_rate
variable, byround
ing it to4
decimal places.For example, the call tomutate
would be:mutate(unemployment_rate = round(unemployment_rate, digits = 4))
Change displayed number of digits without touching data: We can add an option to our R Markdown document to change the displayed number of digits in the output. To do so, add a new chunk, and set:
options(digits = 2)
Note that the digits
in options
is scientific digits, and in round
they are decimal places. If you’re thinking “Wouldn’t it be nice if they were consistent?”, you’re right…
You don’t need to do both of these; that would be redundant. The next exercise asks you to choose one.
Exercise 1
Which of these options, changing the input data or altering the number of digits displayed without touching the input data, is the better option? Explain your reasoning. Then, implement the option you chose.
Which major has the highest percentage of women?
To answer such a question we need to arrange the data in descending order. For example, if earlier we were interested in the major with the highest unemployment rate, we would use the following:
The desc
function specifies that we want unemployment_rate
in descending order.
|>
college_recent_grads arrange(desc(unemployment_rate)) |>
select(rank, major, unemployment_rate)
Exercise 2
Using what you’ve learned so far, arrange the data in descending order with respect to proportion of women in a major, and display only the major, the total number of people with major, and proportion of women. Show only the top 3 majors by adding head(3)
at the end of the pipeline.
How do the distributions of median income compare across major categories?
A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found. (Source: Wikipedia
There are three types of incomes reported in this data frame: p25th
, median
, and p75th
. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major.
Exercise 3
Why do we often choose the median, rather than the mean, to describe the typical income of a group of people?
The question we want to answer “How do the distributions of median income compare across major categories?”. We need to do a few things to answer this question: First, we need to group the data by major_category
. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions. So first, we need to visualize the data.
We use the ggplot
function to do this. The first argument is the data frame, and the next argument gives the mapping of the variables of the data to the aes
thetic elements of the plot.
Let’s start simple and take a look at the distribution of all median incomes, without considering the major categories.
ggplot(data = college_recent_grads, mapping = aes(x = median)) +
geom_histogram()
Along with the plot, we get a message:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is telling us that we might want to reconsider the binwidth we chose for our histogram – or more accurately, the binwidth we didn’t specify. It’s good practice to always think in the context of the data and try out a few binwidths before settling on a binwidth. You might ask yourself: “What would be a meaningful difference in median incomes?” $1 is obviously too little, $10000 might be too high.
Exercise 4
Try binwidths of $1000 and $5000 and choose one. Explain your reasoning for your choice. Note that the binwidth is an argument for the geom_histogram
function. So to specify a binwidth of $1000, you would use geom_histogram(binwidth = 1000)
.
We can also calculate summary statistics for this distribution using the summarise
function. Note here that you can calculate multiple summary statistics within a single summarise
call:
|>
college_recent_grads summarise(min = min(median), max = max(median),
mean = mean(median), med = median(median),
sd = sd(median),
q1 = quantile(median, probs = 0.25),
q3 = quantile(median, probs = 0.75))
Exercise 5
Based on the shape of the histogram you created in the previous exercise, determine which of these summary statistics is useful for describing the distribution. Write up your description (remember shape, center, spread, any unusual observations) and include the summary statistic output as well.
Exercise 6
Now, plot the distribution of median
income using a histogram, faceted by major_category
. Use the binwidth
you chose in the earlier exercise.
Now that we’ve seen the shapes of the distributions of median incomes for each major category, we should have a better idea for which summary statistic to use to quantify the typical median income.
Exercise 7
Which major category has the highest typical (you’ll need to decide what this means) median income? Also note that we are looking for the highest statistic, so make sure if you arrange
to do so in the correct direction.
Exercise 8
Which major category is the least popular in this sample?
All STEM fields aren’t the same
One of the sections of the FiveThirtyEight story is “All STEM fields aren’t the same”. Let’s see if this is true.
First, let’s create a new vector called stem_categories
that lists the major categories that are considered STEM fields.
<- c("Biology & Life Science",
stem_categories "Computers & Mathematics",
"Engineering",
"Physical Sciences")
Then, we can use this to create a new variable in our data frame indicating whether a major is STEM or not.
<- college_recent_grads |>
college_recent_grads mutate(major_type = case_when(major_category %in% stem_categories ~ "stem",
TRUE ~ "not stem"))
Let’s unpack this: with mutate
we create a new variable called major_type
, which is defined as "stem"
if the major_category
is in the vector called stem_categories
we created earlier, and as "not stem"
otherwise.
%in%
is a logical operator. Other logical operators that are commonly used are
Operator | Operation |
---|---|
x < y |
less than |
x > y |
greater than |
x <= y |
less than or equal to |
x >= y |
greater than or equal to |
x != y |
not equal to |
x == y |
equal to |
x %in% y |
contains |
x | y |
or |
x & y |
and |
!x |
not |
We can use the logical operators to also filter
our data for STEM majors whose median earnings is less than median for all majors’s median earnings, which we found to be $36,000 earlier.
|>
college_recent_grads filter(
== "stem",
major_type < 36000
median )
Exercise 9
Which STEM majors have median salaries equal to or less than the median for all majors’ median earnings? Your output should only show the major name and median, 25th percentile, and 75th percentile earning for that major as and should be sorted such that the major with the highest median earning is on top.
What types of majors do women tend to major in?
Exercise 10
Create a scatterplot of median income vs. proportion of women in that major, colored by whether the major is in a STEM field or not. Describe the association between these three variables.
Submit
You’ll always want to knit your RMarkdown document to HTML and review that HTML document to ensure it includes all the information you want and looks as you intended, as we grade from the knit HTML.
Yay, you’re done! To finish up and submit, first knit your file to HTML. Be sure to select both your .Rmd and .html documents when choosing what to commit! Then, commit all remaining changes and push. Before you wrap up the assignment, make sure all documents are updated on your GitHub repo.