library(OCSdata)
library(tidyverse)
library(here)
library(pdftools)
library(tesseract)
library(magick)
library(knitr)
library(Kendall)
library(broom)
CS: Example
Load packages
I would recommend packages be imported after the introduction, but this is very specific and not something anyone would lose credit for.
Introduction
In this case study, we seek to analyze the youth disconnection among minority groups within the United States. The disconnected youth are the young people who are between the ages of 16 and 24 who are neither working nor in school. Following the aftermath of the Great Recession in 2008, we witnessed a peak youth disconnection rate of 14.7 percent in 2010 which has since been trending downwards to as low as a youth disconnection rate of 11.5 percent in 2017. While analyzing racial and ethnic groups is imperative for viewing any evident disparities, we also intend on analyzing how gender may play a role in affecting the rates of disconnection. Seeing as how we can already see existing disparities between genders within the workforce, we also want to see to what extent, if at all, gender plays a role in youth disconnection among groups.
This specific period of young adulthood is one of the most critical in developing the necessary skills and capabilities required of them to be successful in adulthood and their professional careers as youth disconnection during this period stunts development and limits their potential. Measure of America, a nonpartisan project of the nonprofit Social Science Research, claims that “people who experience a period of disconnection as young adults go on to earn less and are less likely to be employed, own a home, or report good health by the time they reach their thirties”. Understanding how these trends among groups arise may be critical for finding solutions to preventing youth disconnection, and analyzing what groups may need more resources in order to prevent further prevent youth disconnection.
This introduction is missing citations! Statements that are not your own original thoughts must be cited. See Effective Communication lecture for possible ways to cite. We don’t care what format you use. We do care that you cite others’ work. Also, definitely feel free to find information on the topic beyond what was presented in class.
Questions
- How have youth disconnection rates in American youth changed since 2008?
- In particular, how has this changed for different gender and ethnic groups? Are any groups particularly disconnected?
- Does removing 2008 (pre-Great Recession time period) data points change our answer?
The third question here was their extension. Note that this is an ok extension all things considered. This was earlier in the quarter, a more straightforward dataset, there wasn’t as much to do as there is in our dataset. Explaining the why this is their focus in introduction would have been great.
The Data
The data set that we will be using for our case study is provided by two reports from the Measure of America project related to youth disconnect. The data provided will be in the form of images.
Data Import
In order to begin our analysis, we imported our raw data from the OCSdata
library on youth disconnection.
# Data is imported so I get overwrite error when not commented - Adrian
# Download raw data files
# load_raw_data("ocs-bp-youth-disconnection", outpath = '.')
Reminder that you can control chunk behavior for each chunk within the curly braces. This also applies for the warnings that are displayed later on. Best to supress those in final HTML.
Since the raw data is in images, we need to convert to a usable R data set. We started with the major racial ethnic groups.
## Import data for major racial/ethnic groups
<- magick::image_read("data/Major_ethnic_groups_screenshot.png")
major_racial_ethnic_groups <- magick::image_ocr(major_racial_ethnic_groups) major_groups
For Asian subgroups, we had to import the raw data for 2017 and 2018.
## Import data for Asian subgroups
# 2017 data
<- image_read("data/asian_subgroups_2017.png")
asian_sub_2017 <- image_read("data/asian_sub_2017_A.png")
asian_sub_2017_A <- image_read("data/asian_sub_2017_B.png")
asian_sub_2017_B <- image_read("data/asian_sub_2017_C.png")
asian_sub_2017_C <- image_ocr(asian_sub_2017)
asian_sub_2017 <- image_ocr(asian_sub_2017_A)
asian_sub_2017_A <- image_ocr(asian_sub_2017_B)
asian_sub_2017_B <- image_ocr(asian_sub_2017_C)
asian_sub_2017_C
# 2018 data
<- image_read("data/asian_sub_2018_A.png")
asian_sub_2018_A <- image_ocr(asian_sub_2018_A)
asian_sub_2018_A <- image_read("data/asian_sub_2018_B.png")
asian_sub_2018_B <- image_ocr(asian_sub_2018_B) asian_sub_2018_B
For Latinx subgroups, we also imported 2017 and 2018 data.
## Import data for Latinx subgroups
# 2017 data
<- image_read("data/latinx_sub_2017_A.png")
latinx_imageA <- image_read("data/latinx_sub_2017_B.png")
latinx_imageB <- image_read("data/latinx_sub_2017_C.png")
latinx_imageC <- image_ocr(latinx_imageA)
latinx_sub_2017_A <- image_ocr(latinx_imageB)
latinx_sub_2017_B <- image_ocr(latinx_imageC)
latinx_sub_2017_C
# 2018 data
<- image_read("data/latinx_subgroups_2018.png")
latinx_sub_2018 <- image_ocr(latinx_sub_2018) latinx_sub_2018
Once the data import is complete, we saved our data.
# Save data
save(
major_groups,
asian_sub_2017,
asian_sub_2017_A,
asian_sub_2017_B,
asian_sub_2017_C,
latinx_sub_2017_A,
latinx_sub_2017_B,
latinx_sub_2017_C,
asian_sub_2018_A,
asian_sub_2018_B,
latinx_sub_2018,file = "data/imported_data.rda")
Data Wrangling
Now that we have our data properly imported, we need to wrangle our data in R so we can begin our analysis.
There is text to guide the viewer, clear code, and code comments as needed. I like this.
Major Group Data
To begin our data wrangling, we took the imported data that is in a single string and separated the data by new line characters and transformed it into a table.
# Separates string by new line and transforms into a table
<- major_groups |>
major_groups ::str_split(pattern = "\n") |>
stringrunlist() |>
::as_tibble() tibble
Now that our data has rows, we created columns Group
and Year
and assigned the corresponding values according to the data as well as normalizing the capitalization throughout the data set.
# Separate into columns Group and Year
<-
major_groups |>
major_groups ::separate(col = value,
tidyrinto = c("Group", "Years"), # Set column names
sep = "(?<=[[:alpha:]])\\s(?=[0-9])") # Separate after letter string; beginning numerics
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [19].
# Make capitalization format the same
<- major_groups |>
major_groups mutate(Group = stringr::str_to_title(Group))
We have our data in two columns but need to separate our Years
column into each individual year. We thus separated the column by every space character, assigned appropriate years for its corresponding values, and made the data values numeric.
# Separate `Years` into columns by individual years
<- major_groups |>
major_groups ::separate(col = Years,
tidyrinto = c("2008", "2010", "2012", "2014", "2016", "2017"), # Set column names
sep = " ") # Separate by spaces
# Remove empty rows
<- major_groups |>
major_groups ::drop_na()
tidyr
# Make data numeric
<- major_groups |>
major_groups mutate(
across(.cols = -Group,
~ str_remove(string = ., pattern = "\\.")), # Remove decimal points from string
across(.cols = -Group, as.numeric), # Convert to numeric
across(.cols = -Group, ~ . * 0.1) # Add decimal point back
)
We have a lot of information in our Group
column, so we separated that information and created two new columns, Race_Ethnicity
and Gender
, to make it easier to analyze our data.
# Create Race_Ethnicity column
<- major_groups |>
major_groups # First, create entries for `All_races`
mutate(Race_Ethnicity = dplyr::recode(Group, "United States" = "All_races",
"Female" = "All_races",
"Male" = "All_races"),
# Second, create entries for all other races by importing data from group column and removing gender
Race_Ethnicity = str_remove(string = Race_Ethnicity,
pattern = "Female|Male"))
# Create Gender column
<- major_groups |>
major_groups # Extract gender from `Group` column
mutate(Gender = str_extract(string = Group,
pattern = "Female|Male")) |>
# Assign `All' if gender was not in `Group`
mutate(Gender = replace_na(Gender, replace = "All"))
Now that we have all of our data, we moved the columns containing years into rows and added a new column Percent
to log the value contained in the, now former, year column.
# Puts years in rows instead of columns with percent values
<- major_groups |>
major_groups_long ::pivot_longer(cols = contains("20"), # any column containing 20
tidyrnames_to = "Year", # Assigns column for year
values_to = "Percent", # Assigns column for percent
names_prefix = "Perc_") |>
::mutate(Year = as.numeric(Year)) dplyr
Subgroup Data
Data Wrangling Functions
We wrangled the major group data but now have to wrangle the subgroup data, which is a little more complicated. We first created a make_rows
function to take the imported raw data and convert it into a table with rows separated by new line characters.
# Create function to separate string into rows
<- function(text){
make_rows |>
text str_split("\n") |>
unlist() |>
as_tibble()
}
We also created two functions clean_table
and clean_table_2018
to clean our subgroup data just like the major group data – separating Group
and Years
columns, converting data to numeric, and creating Race_Ethnicity
and Gender
columns.
# Create function to clean 2017 subgroup data
<- function(table){
clean_table |>
table separate(col = value,
into = c("Group", "Percentage"), # Create column names
sep = "(?<=[[:alpha:]])\\s(?=[0-9])") |> # Split into columns after string of letters
drop_na() |> # Remove NA rows
# Make percentage data numeric
mutate(Group = str_to_title(Group)) |>
mutate(Percentage = str_remove(string = Percentage,
pattern = "\\.")) |> # Remove decimal points from string
separate(Percentage, c("Percent"), sep = " ") |> # Separate by space
mutate(Percent = as.numeric(Percent)) |> # Convert to numeric
mutate(Percent = Percent * 0.1) |> # Add back decimal point
# Create Race_Ethnicity column
mutate(Race_Ethnicity = recode(Group,
# Create entries for All_races
"United States" = "All_races",
"Female" = "All_races",
"Male" = "All_races")) |>
# Create entries for all other races by importing data from group column and removing gender
mutate(Race_Ethnicity = str_remove(string = Race_Ethnicity,
pattern = " Female| Male")) |>
# Create Gender column
mutate(Gender = str_extract(string = Group,
pattern ="Female|Male")) |> # Extract gender from `Group` column
mutate(Gender = replace_na(Gender, replace = "All")) # Assign `All' if gender was not in `Group`
}
# Create function to clean 2018 subgroup data
<- function(table){
clean_table_2018 |>
table separate(col = value,
into = c("Group", "Percent"), # Create column names
sep = "(?<=[[:alpha:]])\\s:\\s|\\s(?=[0-9])") |> # Split into columns after colon
mutate(Group = str_remove(string = Group,
pattern = ":")) |> # Remove colon
drop_na() |> # Remove NA rows
# Make percentage data numeric
mutate(Group = str_to_title(string = Group)) |>
mutate(Percent = str_remove(string = Percent,
pattern = "\\.")) |> # Remove decimal points from string
mutate(Percent = as.numeric(Percent)) |> # Convert to numeric
mutate(Percent = Percent * 0.1) |> # Add back decimal point
# Create Race_Ethnicity column
mutate(Race_Ethnicity = str_replace(string = Group,
pattern = "Men|Women",
replacement = "missing")) |> # Replace gender with missing
mutate(Race_Ethnicity = na_if(Race_Ethnicity, "missing")) |> # Make missing values NA
fill(Race_Ethnicity, .direction = "down") |> # Fill Race/Ethnicity in missing fields
# Create Gender column
mutate(Gender = str_extract(string = Group,
pattern = "Men|Women")) |> # Extract gender from `Group` column
mutate(Gender = replace_na(Gender, replace = "All")) # Assign `All' if gender was not in `Group`
}
Asian Subgroup Data
We ran our make_rows
function on our raw data and combined the initially wrangled 2017 data.
## Asian subgroup data
# 2017 data
# Apply make_rows function to subgroup data
<- make_rows(asian_sub_2017)
asian_sub_2017 <- make_rows(asian_sub_2017_A)
asian_sub_2017_A <- make_rows(asian_sub_2017_B)
asian_sub_2017_B <- make_rows(asian_sub_2017_C)
asian_sub_2017_C
# Combine data
<- bind_rows(asian_sub_2017_A,
asian_sub_2017
asian_sub_2017_B, asian_sub_2017_C)
Similarly, we applied make_rows
to 2018 data.
# 2018 data
# Combine data
<- str_c(asian_sub_2018_A, asian_sub_2018_B)
asian_sub_2018
# Apply make_rows function
<- make_rows(asian_sub_2018) asian_sub_2018
We next cleaned both the 2017 and 2018 data using our clean_table
and clean_table_2018
functions.
# Apply clean table function to both years data
<- clean_table(asian_sub_2017) asian_sub_2017
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [17, 22,
28].
<- clean_table_2018(asian_sub_2018) asian_sub_2018
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [4, 8, 15, 19, 21,
23].
The imported data was missing part of the data, so we manually added it back in as well as created a Year
column.
# Add missing data
<- asian_sub_2018 |>
asian_sub_2018 add_row(Group = "Asian", Percent = 6.2,
Race_Ethnicity = "Asian", Gender = "All") |>
add_row(Group = "Asian", Percent = 6.4,
Race_Ethnicity = "Asian", Gender = "Men") |>
add_row(Group = "Asian", Percent = 6.1,
Race_Ethnicity = "Asian", Gender = "Women")
# Add years to data
<- asian_sub_2017 |>
asian_sub_2017 mutate(Year = 2017)
<- asian_sub_2018 |>
asian_sub_2018 mutate(Year = 2018)
Since the naming convention for males and females different across the 2017 and 2018 data, we standardized the Gender
column to only contain Male
and Female
entries and then combined the data into one data set, and once again moved the years into rows and added a Percent
column, just like our major group data.
# Make Male/Female the values for gender across both datasets
<- asian_sub_2018 |>
asian_sub_2018 mutate(across(.cols = c(Gender, Group),
~ str_replace(string = .,
pattern = "Men",
replacement = "Male")),
across(.cols = c(Gender, Group),
~ str_replace(string = .,
pattern = "Women",
replacement = "Female")))
# Combine 2017 and 2018 data
<- bind_rows(asian_sub_2017, asian_sub_2018)
asian_subgroups
# Add missing categories
<- asian_subgroups |>
asian_subgroups select(-Group) |>
pivot_wider(names_from = Year,
values_from = Percent) |>
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to = "Year",
values_to= "Percent")
Latinx Subgroup Data
To wrangle our Latinx subgroup data, we followed the same steps as our Asian subgroup data – applying make_rows
, clean_table
, and clean_table_2018
functions, standardizing the Gender
naming convention, combining 2017 and 2018 data, and moving the years into rows while adding a Percent
column. For the Latinx data, we also appropriately labeled the Puerto Rican, Dominican, Cuban groups and cleaned the names for Latino/Latina group.
## Latinx subgroup data
# 2017 data
# Combine data
<- stringr::str_c(latinx_sub_2017_A,
latinx_sub_2017
latinx_sub_2017_B,
latinx_sub_2017_C)
# Fix typo
<- latinx_sub_2017 |>
latinx_sub_2017 str_replace(pattern = "DR, Cuban Female 15.7\nPR", # Identify typo
replacement = "DR, Cuban Male 15.7\nPR") # Replace gender
# Apply functions to Latinx data
<- make_rows(latinx_sub_2017)
latinx_sub_2017 <- clean_table(table = latinx_sub_2017) latinx_sub_2017
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [19].
# 2018 data
# Clean the data string
<- str_replace_all(string = latinx_sub_2018,
latinx_sub_2018 pattern = "\\s:\n{2}|\n{2}", #remove two newline characters
replacement = " ")
# Apply make_rows function
<- make_rows(latinx_sub_2018)
latinx_sub_2018
# Apply clean table function
<- clean_table_2018(latinx_sub_2018) latinx_sub_2018
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [12].
# Create function to fix naming issues
<- function(table){
fix_latinx_naming |>
table # Appropriately label Puerto Rican, Dominican, Cuban group
mutate(Group = str_replace(string = Group,
pattern = "Pr, Dr, Cuban",
replacement = "Puerto Rican, Dominican, Cuban"),
Race_Ethnicity = str_replace(string = Race_Ethnicity,
pattern = "Pr, Dr, Cuban",
replacement = "Puerto Rican, Dominican, Cuban"))
}
# Apply function to both data sets
<- fix_latinx_naming(latinx_sub_2017)
latinx_sub_2017 <- fix_latinx_naming(latinx_sub_2018)
latinx_sub_2018
# Add missing data
<- latinx_sub_2018 |>
latinx_sub_2018 add_row(Group = "Latinx", Percent = 12.8,
Race_Ethnicity = "Latinx", Gender = "All") |>
add_row(Group = "Latinx", Percent = 12.3,
Race_Ethnicity = "Latinx", Gender = "Men") |>
add_row(Group = "Latinx", Percent = 13.3,
Race_Ethnicity = "Latinx", Gender = "Women")
# Make Male/Female the values for gender across both datasets
<- latinx_sub_2018 |>
latinx_sub_2018 mutate(across(.cols = c(Gender, Group),
~ str_replace(string = ., pattern = "Men", replacement = "Male")),
across(.cols = c(Gender, Group),
~ str_replace(string = ., pattern = "Women", replacement = "Female")))
# Add years to data
<- latinx_sub_2017 |>
latinx_sub_2017 mutate(Year = 2017)
<- latinx_sub_2018 |>
latinx_sub_2018 mutate(Year = 2018)
# Combine 2017 and 2018 data
<- bind_rows(latinx_sub_2017, latinx_sub_2018)
latinx_subgroups
# Add missing categories
<- latinx_subgroups |>
latinx_subgroups select(-Group) |>
pivot_wider(names_from = Year,
values_from = Percent) |>
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to ="Year" ,
values_to="Percent")
# Clean up Latinx group names
<- latinx_subgroups |>
latinx_subgroups # Convert Latino/Latina to Latinx
mutate(Race_Ethnicity = replace(Race_Ethnicity, Race_Ethnicity == "Latina", "Latinx")) |>
mutate(Race_Ethnicity = replace(Race_Ethnicity, Race_Ethnicity == "Latino", "Latinx")) |>
drop_na()
Exploratory Data Analysis
Note that plots are clear, have titles, it’s clear what’s plotted, and there’s a corresponding interpretation guiding the reader. This is all good! These could be improved by having more informative titles that make clear to the viewer what the take-home message is and by handling years as factors.
To get a glance of our overall data set, we plotted overall youth disconnection over time in the United States.
|>
major_groups_long filter(Gender == "All", Race_Ethnicity == "All_races") |>
ggplot(aes(x = Year, y = Percent)) +
geom_line(size = 0.5) +
geom_point(size = 3) +
labs(
title = "Youth Disconnection over Time",
x = "Year")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
We are able to visualize that from 2008 to 2010 there was an increase in youth disconnect; however, after 2010, there seems to be a consistent downtrend in youth disconnect.
To account for differences in gender, we plotted overall youth disconnection over time by gender.
|>
major_groups_long filter(Gender != "All", Race_Ethnicity == "All_races") |>
ggplot(aes(x = Year, y = Percent, color = Gender)) +
geom_line(size = 0.5) +
geom_point(size = 3) +
scale_color_viridis_d() +
labs(
title = "Youth Disconnection over Time by Gender"
)
Initially in 2008, females faced more disconnect than males. However, throughout the next years, males consistently faced greater disconnect that females. Both male and female disconnect trended downwards 2010 and after.
We then separated major groups by race/ethnicity and visualized youth disconnect over time.
|>
major_groups_long filter(Gender == "All", Group != "United States") |>
ggplot(aes(x = Year, y = Percent, color = Race_Ethnicity)) +
geom_line(size = 0.5) +
geom_point(size = 3) +
scale_color_viridis_d() +
labs(
title = "Youth Disconnection over Time by Race/Ethnicity",
color = "Race/Ethnicity"
)
We can see that Native Americans faced the most youth disconnect at about 25% and Asian faced the least with about 7.5% throughout all years in our data set. Overall, there is a small downtrend throughout the pass years in all races/ethnicities 2010 and after.
We visualized male/female youth disconnection by race/ethnicity across the observed years.
|>
major_groups_long filter(Gender != "All") |> # Filter for only Male and Female
# Combine Latino and Latina into Latinx
mutate(Race_Ethnicity = replace(Race_Ethnicity, Race_Ethnicity == "Latina ", "Latinx")) |>
mutate(Race_Ethnicity = replace(Race_Ethnicity, Race_Ethnicity == "Latino ", "Latinx")) |>
#renaming all races column
mutate(Race_Ethnicity = replace(Race_Ethnicity, Race_Ethnicity == "All_races", "All Races")) |>
# Plot scatter line plot of Percent vs Year by Gender and Ethnicity
ggplot(aes(x = Year, y = Percent, color = Gender)) +
geom_line(size = 0.5) +
geom_point(size = 3) +
facet_wrap(~Race_Ethnicity, nrow = 2) +
scale_color_viridis_d() +
labs(
title = "Youth Disconnection over Time by Race/Ethnicity and Gender"
)
Gender does not seem to affect the percentage much except in Black, Latinx, and Native American groups where there is a marginal difference. However, the main indicator in determining youth disconnection seems to be stronger aligned to Race/Ethnicity as that is where we see the largest disparity, while gender has smaller differences within a group.
Taking a deeper dive into the Latinx group, we visualized youth disconnection of the different subgroups within the Latinx group.
|>
latinx_subgroups filter(Gender == "All", Race_Ethnicity != "Latinx") |> # Filter by overall not specific gender
# Plot Percent vs Year by Ethnicity
ggplot(aes(x = as.numeric(Year), y = Percent, color = Race_Ethnicity)) +
geom_line(size = 0.5) +
geom_point(size = 3) +
scale_color_viridis_d() +
labs(
title = "Youth Disconnection of Latinx Subgroups over Time by Race/Ethnicity",
color = "Race/Ethnicity",
x = "Year"
)
Most groups have a relatively similar youth disconnection percentage, however, the South American group clearly has the lowest rate of youth disconnection at around 8% while Puerto Rico, Cuba, Dominican Republic have the highest with around 14.5% across 2017 and 2018.
Taking a deeper dive into the Asian group, we visualized youth disconnection of the different subgroups within the Asian group.
|>
asian_subgroups filter(Gender == "All", Race_Ethnicity != "All_races") |> # Filter by overall not specific gender
# Plot Percent vs Year by Ethnicity
ggplot(aes(x = as.numeric(Year), y = Percent, color = Race_Ethnicity)) +
geom_line(size = 0.5) +
geom_point(size = 3) +
scale_color_viridis_d() +
labs(
title = "Youth Disconnection of Asian Subgroups over Time by Race/Ethnicity",
color = "Race/Ethnicity",
x = "Year"
)
Warning: Removed 3 rows containing missing values (`geom_line()`).
Warning: Removed 3 rows containing missing values (`geom_point()`).
Most groups are in the 4-7% range for youth disconnection percentage; however, China has the lowest at about 4% and Hmong and Cambodia have the highest at about 13% across 2017 and 2018.
Data Analysis
We graphed youth disconnection over time with a line of best-fit layered on top of it.
|>
major_groups_long # Scatterplot with linear model for overall youth disconnection over time
filter(Gender == "All", Race_Ethnicity == "All_races") |>
ggplot(aes(x = Year, y = Percent)) +
geom_point(size = 3) +
geom_smooth(method = "lm") +
labs(
title = "Youth Disconnection over Time",
subtitle = "Includes linear model and standard error",
x = "Year")
`geom_smooth()` using formula = 'y ~ x'
The line of best-fit exhibits a slight negative relationship, showing there is some correlation with youth disconnection and time; however, there is a decent amount of standard error throughout, which may be due to the 2008 data point defying the trend of the other data points.
To test our hypothesis that youth disconnection does not have a decreasing trend over time, we did a Mann-Kendall test for monotonicity.
# M-K test
|>
major_groups_long filter(Gender == "All", Race_Ethnicity == "All_races") |>
pull(Percent) |>
MannKendall() |>
tidy()
# A tibble: 1 × 5
statistic p.value kendall_score denominator var_kendall_score
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.600 0.133 -9 15.0 28.3
We see there the p-value is 13.3% percent, which is not statistically significant at the traditional 5% level. This means our data is likely not monotonic and we would fail to reject our null hypothesis that youth disconnection is decreasing over time.
Knowing that the Great Recession occurred in 2008 and could provide an anomaly in our data, we ran a linear model on the data for 2010 and after.
# Scatterplot with linear model for overall youth disconnection over time after 2010
|>
major_groups_long filter(Gender == "All", Race_Ethnicity == "All_races", Year != 2008) |>
ggplot(aes(x = Year, y = Percent)) +
geom_point(size = 3) +
geom_smooth(method = "lm") +
labs(
title = "Youth Disconnection over Time (2010 and after)",
subtitle = "Includes linear model and standard error",
x = "Year")
`geom_smooth()` using formula = 'y ~ x'
This graph seems to show a stronger linear relationship between youth disconnection over time as there is a steeper slope and less standard error than our first linear model visualization.
Again, to test our hypothesis that youth disconnection does not have a decreasing trend over time for 2010 and after, we did another Mann-Kendall test.
# M-K test for data after 2008
|>
major_groups_long filter(Gender == "All", Race_Ethnicity == "All_races", Year != 2008) |>
pull(Percent) |>
MannKendall() |>
tidy()
# A tibble: 1 × 5
statistic p.value kendall_score denominator var_kendall_score
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -1 0.0275 -10 10 16.7
This time, the p-value is 2.75% and significant at the 5% level. Thus, this data set is likely monotonic and we would reject the null hypothesis that there is no decreasing trend between youth disconnection and time.
Model is not fully quantified nor explained in the context of the data. Be sure that if you’re stating numbers, you’re contextualizing them. Be sure if you’re stating things that are true, that you’re quantifying them when possible.
Results/Discussion
Through our exploratory data analysis, we observed that there does seem to be a decreasing trend of youth disconnection in the United States over the time frame of our data with the 2008 data being a slight anomaly. The decreasing trend with the exception of the initial 2010 spike remained generally consistent regardless of separation by gender, race/ethnicity, and both as seen through out first four visualizations. Although the trend is fairly consistent, there are additional takeaways from our visualizations.
Race/ethnicity is a major determinant in the magnitude of youth disconnection rates while gender is a much weaker determinant. Through our third visualization we see that the Native American group has a youth disconnection percentage of about 25%, which is over three time the lowest group, Asian, at about 7.5%. The Black group, although not as high as Native American, has a youth disconnection rate higher than the overall average across the years at about 20% versus 13%. Additionally, every group’s rate is 2-5% different from the nearest group with no two groups being similar to each other. This leads us to believe that although a decreasing trend is apparent across all groups, groups still have differences in how disconnected their youth are. Gender, on the other hand, does not show the same disparity. Our second visualization shows that males may be more disconnected, but when broken down into race/ethnicity in our fourth visualization we see the effect of gender depends greatly on the race/ethnic group. Females are more disconnected in Latinx and Asian groups, while the youth disconnection rate is fairly similar across gender overall and in the Asian and White group. The Black group has the most interesting data where males are fairly more disconnected than females but even in other groups such as Latinx and Native American where there is a difference in female and male disconnection, it is not a major difference.
In taking a deeper dive into the Latinx and Asian subgroups in visualizations five and six, it can be seen that there are not big differences in youth disconnection between subgroups except in two cases. The South American subgroup has a substantially lower rate at about 8% versus the 12-15% range that the other Latinx subgroups are in. Oppositely, Hmong and Cambodia groups have a much higher youth disconnection rate at about 13% versus its Asian peers in the 4-7% range.
To quantify our observations in our exploratory data analysis, we created a linear model for our major group data and performed a Mann-Kendall test. The linear model shows a decreasing trend in youth disconnection rate over time but it does not seem to be a very strong relationship and has relatively high standard error. When performing the Mann-Kendall test, the p-value was 13.3%, which is not statistically significant at the 5% level, implying that there is not a monotonic trend in the data. We decided to create a second linear model and Mann-Kendall test without the 2008 data since youth disconnection likely increased in 2010 due to the Great Recession that the United States faced. When doing this, a much stronger decreasing trend can be seen in the linear model and the Mann-Kendall test has a p-value of 2.75%, which is statistically significant at the 5%, implying that there is a monotonic trend in the data. In this case, a decreasing trend over time.
Conclusion
When taking account our entire data set, we cannot mathematically state that youth disconnection rates have changed over time, although visually and when excluding 2008 data, we can see that youth disconnection has been trending downwards starting in 2010 and after. This trend remains consistent across gender and race/ethnicity; however, there are still differences between groups in how disconnected are youth. Native American and Black are the most disconnected groups with Native American clearly have the highest youth disconnection rate. Although gender is not a major factor in determining youth disconnection in most cases, Black males are notably more disconnected than Black females. All other Latinx groups are more disconnected than the South American group and Hmong and Combodian groups are notably more disconnected than the other Asian groups. In conclusion, youth disconnection rates are experiencing a downward trend in 2010 and after and its magnitude is most heavily influenced by race/ethnicity in most cases.