<- function(input){
function_name # operations using input
}
12-cs01-eda
CS01: Biomarkers of Recent Use (EDA)
Q&A (Tu 11/7)
Q: was curious about the sensitivity/cut-offs/specificity, but we mainly discussed it in class already; was still slightly confused about it?
A: We’ll discuss this in detail later this week and early next week!
Q: Are Youden’s indices related to ROC curves?
A: Related, yes! We’ll discuss both soon!
Q: I wasn’t sure how to interpret some of the visuals towards the end of lecture.
A: That’s OK! We’ll be recreating these and discussing them more as we do this case study in class.
Q: Did they actually use intravenous blood draws or just thumb pricks because multiple intravenous would not be fun.
A: It was venous blood from the arm. This unfunness is one of the reasons participants were compensated.
Q: Did this study (or other studies on THC) impairment end up influencing any legislation at the local or state level?
A: Great question! The state is currently reviewing these and other study’s data. The state was definitely aware of this study and waited (im)patiently while we analyzed and worked to publish.
Q: How long should our reports be?
A: It’s hard to say. We’ll discuss an example today so you have a sense!
Q: Regarding the final project, will there be a Google form that we can fill out that will help us form groups if we can’t form one ourselves? A: Yup - I’d say try to find a group using Piazza, in class, or during lab. However, if you’re unable, when you fill out the form to indicate your group next week, you’ll select that you’d like to be placed into a group.
Course Announcements (Tu 11/7)
Due Dates:
- 🧪 No Lab this week (holiday) - all students will receive full credit for Lab05 (MLR)
- ❓ Mid-course survey “due” (for EC) Friday
- 💻 HW03 (MLR) due Mon 11/20
Notes:
- CS01 Groups have been sent out
- email for contact
- GitHub repo <- please accept and open; make sure you have access
- group mate feedback is required
- if you made changes to repo yesterday, be sure to pull to get data in your repo
- Final Project: can use Piazza to help find group mates
. . .
The CS01 data are data for you only. My collaborator is excited that y’all will be working on this…but these are still research data, so please do not share with others or post publicly.
Q&A (Th 11/9)
Q: Are you allowed to share the average midterm scores for the past few quarters?
A: IIRC, they were in the mid-high 80%s
Q: When describing the dataset in our CSs, would it be okay to format it as a data card rather than a paragraph explanation of the variables structure? Additionally, would it be okay to store this as its own read.me in the repo or should it be a part of the main report?
A: Yes - like the data card idea. And a detailed README in the repo is great. A short description should still be included in the report and can point to the readme.
Q: When should we cite in our case study? It is just whenever we look up and use code from the internet?
A: There AND any time you get information elsewhere that’s not general knowledge. For example, in your background section, you’ll likely cite a bunch of sources.
Course Announcements (Th 11/9)
Due Dates:
- 🔬 No Lab this week (holiday) - all students will receive full credit for Lab05 (MLR)
- ❓ Mid-course survey “due” (for EC) Friday
- 💻 HW03 (MLR) due Mon 11/20
- ✅ HW02 Scores/Feedback Posted
Functions in R
- “You should consider writing a function whenever you’ve copied and pasted a block of code more than twice” -Hadley
. . .
For example…
<- function(val){
double_value * 2
val }
. . .
To use/execute:
double_value(3)
[1] 6
. . .
In what we’ve done so far, we’ve seen functions that operate on and return the whole dataframe (DF in DF out) (drop_dups
) and those that carry out operations on each row of a dataframe with a number of inputs (i.e. assign_timepoint
; these require the function to be map
-ed)
Additional resource: https://r4ds.had.co.nz/functions.html
Agenda
- Previous Projects
- Exploring the Data
Previous Case Studies
Example Case Study
See & Discuss: https://cogs137.github.io/website/content/cs/cs-example.html
Feedback & Scores
Feedback to other students here
You cannot see the projects, but can read all of the comments and see the associated score. Also, note that the same row is not the same group.
. . .
Common comments:
- context/explanation/guidance/lacking
- missing citations
- failure to introduce/describe the data
- making statements without evidence
- need to edit for cohesiveness, story, clarity
An (Example) Rubric
This is NOT the rubric for your case study, but it will be similar:
Notes
- Lots of code/plots will be provided here
- You are free to include any of it in your own case study (no attribution needed)
- You probably should NOT include all of them in your final report
- For any of the “basic” plots you include in your report, you’ll want to clean them up/improve their design.
- Your final report should be polished from start to finish
Data & Files
Packages
Two additional packages required for these notes:
library(tidyverse)
library(purrr)
library(rstatix)
Our Datasets
Three matrices:
- Blood (WB): 8 compounds; 190 participants
- Oral Fluid (OF): 7 compounds; 192 participants
- Breath (BR): 1 compound; 191 participants
. . .
Variables:
ID
| participants identifierTreatment
| placebo, 5.90%, 13.40%Group
| Occasional user, Frequent userTimepoint
| indicator of which point in the timeline participant’s collection occurredtime.from.start
| number of minutes from consumption- & measurements for individual compounds
The Data
Reading in the .RData we wrote at the end of the last set of notes…(using load
)
load("data/compounds.RData")
load("data/timepoints.RData")
load("data/data_clean.RData")
This reads the objects stored in these files into your Environment for use.
What to do with all of these functions?
For example…we discussed this function in the last set of notes.
<- function(dataset){
drop_dups <- dataset |>
out filter(!is.na(timepoint_use)) |>
group_by(timepoint_use) |>
distinct(id, .keep_all=TRUE) |>
ungroup()
return(out)
}
. . .
We’re going to have a lot of functions throughout…like this helper function to clean up names
# helper function to clean up name of two compounds
<- function(df){
clean_gluc <- df |>
df mutate(compound=gsub('GLUC', 'gluc',gsub("_","-",toupper(compound))),
compound=gsub('THCOH', '11-OH-THC', compound))
return(df)
}
. . .
Functions can/should be stored in a separate .R
file, probably in a src/
directory.
. . .
To have access to the functions in that file…
source("path/to/file")
. . .
source("src/cs01_functions.R")
EDA
Single Variable (basic) plots
For a single compound…
ggplot(WB, aes(x=thc)) + geom_histogram()
. . .
But, with wide data, that’s not easy to do for all compounds, so you may want to pivot those data….
<- WB |>
WB_long pivot_longer(6:13) |>
rename("fluid"="fluid_type")
. . .
Distribtions across all compounds (WB):
ggplot(WB_long, aes(x=value)) +
geom_histogram() +
facet_wrap(~name)
. . .
Now the same for OF and BR:
<- OF |> pivot_longer(6:12)
OF_long <- BR |> pivot_longer(6) BR_long
. . .
Combining long datasets:
<- bind_rows(WB_long, OF_long, BR_long) df_full
. . .
Plotting some of these data…
|>
df_full mutate(group_compound=paste0(fluid,": ", name)) |>
ggplot(aes(x=value)) +
geom_histogram() +
facet_wrap(~group_compound, scales="free")
Two variables at a time
THC & Frequency
|>
df_full filter(name=="thc") |>
ggplot(aes(x=group, y=value)) +
geom_boxplot() +
facet_wrap(~fluid, scales="free")
THC & Treatment Group
|>
df_full filter(name=="thc") |>
ggplot(aes(x=treatment, y=value)) +
geom_boxplot() +
facet_wrap(~fluid, scales="free")
Focus on a specific timepoint…
|>
df_full filter(name=="thc", timepoint=="T2A") |>
ggplot(aes(x=treatment, y=value)) +
geom_boxplot() +
facet_wrap(~fluid, scales="free")
At this point…
We start to get a sense of the data with these quick and dirty plots, but we’re really only scratching the surface of what’s going on in these data.
. . .
These data require a lot of exploration due to the number of compounds, multiple matrices, and data over time aspects.
Compounds across time
<- function(dataset, compound, timepoints){
compound_scatterplot_group if(max(dataset[,compound],na.rm=TRUE)==0){
print(
|>
dataset filter(!is.na(time_from_start)) |>
ggplot(aes_string(x="time_from_start",
y=compound,
color="group")) +
geom_point() +
geom_vline(data=timepoints, aes(xintercept=as.numeric(stop)),
linetype="dashed",
color="gray28") +
scale_color_manual(values=c("#19831C", "#A27FC9")) +
scale_y_continuous(limits=c(0,3)) +
theme_classic() +
theme(legend.position="bottom",
legend.title=element_blank()) +
labs(x='Time From Start (min)',
y=gsub('GLUC', 'gluc',gsub("_", "-", toupper(compound))))
else{
)}print(
|>
dataset filter(!is.na(time_from_start)) |>
ggplot(aes_string(x="time_from_start",
y=compound,
color="group")) +
geom_point() +
geom_vline(data=timepoints, aes(xintercept=as.numeric(stop)),
linetype="dashed",
color="gray28") +
scale_color_manual(values=c("#19831C", "#A27FC9")) +
theme_classic() +
theme(legend.position="bottom",
legend.title=element_blank()) +
labs(x='Time From Start (min)',
y=gsub('GLUC', 'gluc', gsub("_", "-", toupper(compound))))
)
} }
<- map(compounds_WB, ~ compound_scatterplot_group(
scatter_wb dataset=WB_dups,
compound=.x,
timepoints=timepoints_WB))
<- map(compounds_OF, ~ compound_scatterplot_group(
scatter_of dataset=OF_dups,
compound=.x,
timepoints=timepoints_OF))
<- map(compounds_BR, ~ compound_scatterplot_group(
scatter_br dataset=BR_dups,
compound=.x,
timepoints=timepoints_BR))
Pairplots
pairs(WB[,unlist(compounds_WB)],
pch=19,
cex=0.3,
cex.labels=0.6,
labels=gsub('GLUC','gluc',gsub("_","-",toupper(colnames(WB[,unlist(compounds_WB)])))))
pairs(OF[,unlist(compounds_OF)],
pch=19,
cex=0.4,
cex.labels=0.6,
labels=gsub('GLUC','gluc',gsub("_","-",toupper(colnames(OF[,unlist(compounds_OF)])))))
❓ Why is there no pairplot for Breath?
Group Differences: Frequency of Use
<- function(dataset, compounds, tissue, legend=TRUE, y_lab=TRUE){
compound_boxplot_group_only =levels(dataset$timepoint_use)[1]
timepoint_to_use<- dataset |>
df filter(timepoint_use == timepoint_to_use) |>
select(group, all_of(compounds))
<- df |>
df gather(compound, value, -group) |>
clean_gluc() |>
group_by(compound) |>
mutate(y_max=1.2*max(value)) |>
group_by(compound, group) |>
mutate(n=n(),
my_label=paste0(group, ' N=', n),
my_label= gsub(" ", "\n", my_label))
if(tissue == "Blood"){
$compound=factor(df$compound, levels=c("THC","11-OH-THC","THCCOOH","THCCOOH-gluc"))
df
}
<- df |>
y_pos group_by(compound) |>
summarize(y.position=mean(y_max))
<- df |>
stat.test group_by(compound) |>
t_test(value ~ my_label) |>
adjust_pvalue(method="bonferroni") |>
add_significance()
<- stat.test |>
test left_join(y_pos) |>
mutate(p.adj.signif=ifelse(p.adj.signif=='?', 'ns', p.adj.signif),
p.adj=ifelse(p.adj < 0.001, "<0.001", p.adj))
if(legend){
='right'
leg_positionelse{
}='none'
leg_position
}
if(y_lab){
="Concentration (ng/mL)"
y_textelse{
}=''
y_text
}
<- function(x){
medianFunction return(data.frame(y=round(median(x),1),label=round(median(x,na.rm=T),1)))}
<- ggplot(df, aes(x=my_label, y=value, fill=my_label)) +
p2 geom_jitter(position=position_jitter(width=.3, height=0), size=0.8, color="gray65") +
geom_boxplot(outlier.shape=NA, alpha=0.6) +
stat_summary(fun="median", geom="point", shape=19, size=3, fill="black") +
stat_summary(fun.data=medianFunction, geom ="text", color="black", size=3.5, vjust=-0.65) +
facet_wrap(~compound, scales="free_y", ncol=4) +
geom_blank(aes(y=y_max)) +
scale_y_continuous(limits=c(0, NA), expand=expansion(mult=c(0, 0.1))) +
scale_fill_manual(values=c("#19831C", "#A27FC9")) +
theme_classic() +
theme(text=element_text(size=14),
legend.position=leg_position,
legend.title=element_blank(),
panel.grid=element_blank(),
strip.background=element_blank()) +
labs(title=tissue,
x='',
y=y_text)
<- test |>
ann_text select(compound, p.adj, value=y.position, my_label=group1) |>
filter(p.adj < 0.05) |>
mutate(x1=1, x2=2)
if(tissue == "Whole Blood"){
$compound=factor(ann_text$compound,
ann_textlevels=c("THC","11-OH-THC","THCCOOH","THCCOOH-gluc"))
}
+ geom_text(data=ann_text, label=ann_text$p.adj, nudge_x=0.5) +
p2 geom_segment(data=ann_text, aes(x=x1, xend=x2,
y=value - (0.04 * value),
yend=value - (0.04*value)))
}
compound_boxplot_group_only(WB_dups, compounds=unlist(compounds_WB), tissue="Whole Blood")
compound_boxplot_group_only(OF_dups, compounds=unlist(compounds_OF), tissue="Oral Fluid")
compound_boxplot_group_only(BR_dups, compounds=unlist(compounds_BR), tissue="Breath")
Group Differences: Treatment
<- function(dataset, compounds, tissue){
compound_boxplot_treatment =levels(dataset$timepoint_use)[2]
timepoint_to_use<- dataset |>
df filter(timepoint_use == timepoint_to_use) |>
select(treatment, group, compounds)
<- df |>
df gather(compound, value, -treatment, -group) |>
clean_gluc()
|>
df ggplot(aes(x=treatment, y=value, fill=group)) +
# geom_jitter(color="gray36") +
geom_boxplot(outlier.size=0.1) +
scale_fill_manual(values=c("#19831C", "#A27FC9")) +
facet_wrap(~compound, scales="free_y", ncol=4) +
scale_x_discrete(labels=function(x) str_wrap(x, width=11)) +
theme_classic(base_size=10) +
theme(legend.position="bottom",
legend.title=element_blank(),
panel.grid=element_blank(),
strip.background=element_blank()) +
labs(title=tissue,
x="Treatment",
y="Measurement (ng/mL)")
}
compound_boxplot_treatment(WB_dups, compounds=unlist(compounds_WB), tissue="Whole Blood")
compound_boxplot_treatment(OF_dups, compounds=unlist(compounds_OF), tissue="Oral Fluid")
compound_boxplot_treatment(BR_dups, compounds=unlist(compounds_BR), tissue="Breath")
Compound Summaries
<- function(dataset, compound, timepoint_use=2){
T2A_plot =levels(factor(dataset$timepoint_use))[timepoint_use]
timepoint_to_useif(max(dataset[,compound],na.rm=TRUE)==0){
print(
ggplot(subset(dataset, timepoint_use==timepoint_to_use),
aes_string(x="group",
y=compound,
fill="group")) +
geom_boxplot(outlier.size=0.1) +
scale_fill_manual(values=c("#19831C", "#A27FC9")) +
scale_x_discrete(labels=function(x) str_wrap(x, width=10)) +
scale_y_continuous(limits=c(0,3)) +
facet_grid(~treatment) +
theme_classic() +
theme(legend.position="none",
panel.grid=element_blank(),
strip.background=element_blank(),
plot.title.position="plot") +
labs(title=paste0('Timepoint: ',
levels(dataset$timepoint_use)[timepoint_use],
' post-smoking'),
x='Group',
y=gsub('GLUC', 'gluc',gsub("_", "-", toupper(compound))))
else{
)}print(
ggplot(subset(dataset, timepoint_use==timepoint_to_use),
aes_string(x="group",
y=compound,
fill="group")) +
geom_boxplot(outlier.shape=NA, alpha=0.8) +
scale_fill_manual(values=c("#19831C", "#A27FC9")) +
scale_x_discrete(labels=function(x) str_wrap(x, width=10)) +
facet_grid(~treatment) +
theme_classic() +
theme(legend.position="none",
panel.grid=element_blank(),
strip.background=element_blank(),
plot.title.position="plot") +
labs(title=paste0('Timepoint: ',
levels(dataset$timepoint_use)[timepoint_use],
' post-smoking'),
x='Group',
y=gsub('GLUC', 'gluc',gsub("_","-",toupper(compound))))
)
} }
<- map(compounds_WB, ~ T2A_plot(
post_wb dataset=WB_dups,
compound=.x))
<- map(compounds_OF, ~ T2A_plot(
post_of dataset=OF_dups,
compound=.x))
<- map(compounds_BR, ~ T2A_plot(
post_br dataset=BR_dups,
compound=.x))
HW02 (Revisited)
HW02 : Part II
Imitation is the highest form of flattery
Chess Players
<- read_csv(url("https://raw.githubusercontent.com/fivethirtyeight/covid-19-polls/master/covid_approval_polls.csv"))
COVID_approval
<- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/chess-transfers/transfers.csv")
chess
<- chess |>
chess_trans count(Federation) |>
arrange(desc(n)) |>
slice_head(n = 10)
<- chess_trans |>
chess_trans mutate(rank = row_number()) |>
mutate(Federation = case_when(
== "USA" ~ "United States",
Federation == "GER" ~ "Germany",
Federation == "CAN" ~ "Canada",
Federation == "ESP" ~ "Spain",
Federation == "RUS" ~ "Russia",
Federation == "FRA" ~ "France",
Federation == "BIH" ~ "Bosnia and Herzegovina",
Federation == "CRO" ~ "Croatia",
Federation == "TUR" ~ "Turkey",
Federation == "AUT" ~ "Austria",
Federation
TRUE ~ Federation # Keep the name as-is for other cases
))
<- chess |>
chess_trans count(Federation) |>
arrange(desc(n)) |>
slice_head(n = 10)
<- chess_trans |>
chess_trans mutate(rank = row_number()) |>
mutate(Federation = case_when(
== "USA" ~ "United States",
Federation == "GER" ~ "Germany",
Federation == "CAN" ~ "Canada",
Federation == "ESP" ~ "Spain",
Federation == "RUS" ~ "Russia",
Federation == "FRA" ~ "France",
Federation == "BIH" ~ "Bosnia and Herzegovina",
Federation == "CRO" ~ "Croatia",
Federation == "TUR" ~ "Turkey",
Federation == "AUT" ~ "Austria",
Federation
TRUE ~ Federation # Keep the name as-is for other cases
))
ggplot(chess_trans, aes(y = reorder(Federation, n), x = n)) +
geom_bar(stat = "identity",fill = "#1c9099") +
geom_text(aes(x = -3, y = Federation, label = n), size = 4) + #add count at the left side of the bars
labs(title = bquote(bold("More players transfer to the U.S. than to any other country")),
subtitle = "Nations that received the highest number of player transfers, 2000-17",
x = "NUMBER OF TRANSFERS",
y = "COUNTRY") +
scale_fill_identity() +
theme_minimal() +
theme(plot.title.position = "plot",
panel.grid.major.y = element_blank()) +
theme(axis.text.y = element_text(size = 10, angle = 0, hjust = 0), #align to the left
plot.title = element_text(size=18)) #change font size
Kushi: Fandango
# Kushi also got font working and stars on x-axis that I'd have to spend more time to get working
<- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_score_comparison.csv")
fandango_score_comparison
<- fandango_score_comparison |>
all_scores_comp select(FILM, RT_norm_round, RT_user_norm_round, Metacritic_norm_round, Metacritic_user_norm_round, IMDB_norm_round, Fandango_Stars) |>
pivot_longer(c(RT_norm_round, RT_user_norm_round, Metacritic_norm_round, Metacritic_user_norm_round, IMDB_norm_round, Fandango_Stars), names_to = "Site", values_to = "Score")
<- all_scores_comp |>
counts_of_scores group_by(Site, Score) |>
summarize(Count = n())
<- counts_of_scores |>
counts_of_scores group_by(Site) |>
mutate(Total = sum(Count))
<- counts_of_scores |>
scores_with_percents mutate(Percent = Count / Total * 100)
# making percents 0 for any scores that don't have any values
<- tibble(Site = c("RT_norm_round", "RT_user_norm_round", "Metacritic_norm_round", "Metacritic_user_norm_round", "IMDB_norm_round", "Fandango_Stars"))
sites
<- tibble(Score = seq(0, 5, by = 0.5))
ratings
<- expand.grid(Site = sites$Site, Score = ratings$Score)
all_scores
<- all_scores |>
all_scores full_join(scores_with_percents, by = c("Site", "Score")) |>
mutate(Percent = case_when(is.na(Count) ~ 0,
TRUE ~ Percent))
<- ggplot(all_scores, aes(x = Score, y = Percent, color = Site)) +
scores geom_line() +
geom_hline(yintercept = 0, size = 0.7, color = "black") +
labs(x = NULL, y = NULL,
title = "Fandango LOVES Movies",
subtitle = "Normalized ratings distrubution of 146 films in theaters in 2015 that\n had 30+ reviews on Fandango.com") +
scale_y_continuous(labels = c("0", "10", "20", "30", "40%")) +
scale_x_continuous(labels = c("☆", "★", "★★", "★★★", "★★★★", "★★★★★")) +
scale_color_manual(values = c("Fandango_Stars" = "#fa6d54",
"IMDB_norm_round" = "#e5c66a",
"Metacritic_user_norm_round" = "#aeca91",
"RT_user_norm_round" = "#76bde0",
"Metacritic_norm_round" = "#b87eb5",
"RT_norm_round" = "#a3a3a3")) +
geom_ribbon(data = filter(all_scores, Site != "Fandango_Stars"), aes(ymin = 0, ymax = Percent), alpha = 0.1) +
geom_ribbon(data = filter(all_scores, Site == "Fandango_Stars"), aes(ymin = 0, ymax = Percent, fill = "#fa6d54", alpha = 0.24)) +
guides(color = "none", fill = "none", alpha = "none")
+
scores annotate("text", x = 4.9, y = 35, label = "Fandango", size = 5, color = "#fa6d54", fontface = "bold") +
annotate("text", x = 2.9, y = 37, label = "IMDb users", size = 5, color = "#e5c66a") +
annotate("text", x = 2.7, y = 27, label = "Metacritic\nusers", size = 5, color = "#aeca91") +
annotate("text", x = 2.2, y = 20, label = "Rotten\nTomatoes\nusers", size = 5, color = "#76bde0") +
annotate("text", x = 1.5, y = 16, label = "Metacritic", size = 5, color = "#b87eb5") +
annotate("text", x = 0.7, y = 13, label = "Rotten\nTomatoes", size = 5, color = "#a3a3a3") +
theme(#text = element_text(family = "NimbusSan"),
plot.title = element_text(face = "bold", size = 25, hjust = -0.1),
plot.subtitle = element_text(size = 15, hjust = -0.1),
plot.background = element_rect(fill = "#f0f0f0"),
panel.background = element_rect(fill = "#f0f0f0"),
panel.grid.major = element_line(color = "gray75", size = 0.2),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 14)
)
Markus: Congress
<- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-demographics/data_aging_congress.csv")
congress_data
=
congress_ages |>
congress_data select(congress, chamber, age_years) |>
mutate(congress_year = case_when(TRUE ~ 1787 + (2 * as.integer(congress)))) |>
group_by(congress_year, chamber) |>
mutate(mean_age = case_when(TRUE ~ mean(age_years))) |>
mutate(chamber = fct_recode(chamber,
SENATE = "Senate",
HOUSE = "House",))
ggplot(data = congress_ages,
mapping = aes(y = mean_age,
x = congress_year,
color = fct_rev(chamber),
+
)) geom_step(size = 1) +
guides() +
labs(title = "The House and Senate are older than ever before",
subtitle = "Median age of the U.S. Senate and U.S. House by Congress, 1919 to 2023",
caption = "Data is based on all members who served in either the Senate or House in each Congress, which is notated\nby the year in which it was seated. Any member who served in bothchambers in the same Congress was\nassigned to the chamber in which they cast more votes.\n FiveThirtyEight\nSOURCES: BIOGRAPHICAL DIRECTORY OF THE U.S. CONGRESS, U.S. HOUSE OF REPRESENTATIVES,\nU.S. SENATE, UNITEDSTATES GITHUB, VOTEVIEW.COM",
y = NULL,
x = NULL
+
) scale_color_manual(values=c("#6b4ddd","#29ae53")) +
theme_minimal(base_size = 13) +
theme(plot.title.position = "plot",
plot.title = element_text(face = "bold"),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title=element_blank(),
legend.position = "top")
Voter Demographics
library(ggpubr)
= read.csv("data/nonvoters_data.csv")
voters
#creating subplots
= ggplot(data = voters, mapping = aes(y = factor(race), fill = factor(voter_category))) +
race geom_bar(position = "fill") +
ggtitle("Race") +
scale_fill_manual(values = c("always" = "#ffbc00","sporadic"="#e082ad", "rarely/never" ="#ae4dff")) +
theme_minimal() +
theme(axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks = element_blank(), line = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank())+
guides(fill = guide_legend(override.aes = list(shape = 16, key_width = 1, key_height = 1))) #trying to change the shape of the legend key
= ggplot(data = voters, mapping = aes(y = factor(income_cat), fill = factor(voter_category))) +
income geom_bar(position = "fill") +
ggtitle("Income") +
scale_fill_manual(values = c("always" = "#ffbc00","sporadic"="#e082ad", "rarely/never" ="#ae4dff")) +
theme_minimal() +
theme(axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks = element_blank(), line = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank())
= ggplot(data = voters, mapping = aes(y = factor(race), fill = factor(voter_category))) +
age geom_bar(position = "fill", show.legend = FALSE) +
ggtitle("Age") +
scale_fill_manual(values = c("always" = "#ffbc00","sporadic"="#e082ad", "rarely/never" ="#ae4dff")) +
theme_minimal() +
theme(axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks = element_blank(), line = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank())
= ggplot(data = voters, mapping = aes(y = factor(educ), fill = factor(voter_category))) +
education geom_bar(position = "fill", show.legend = FALSE) +
ggtitle("Education") +
scale_fill_manual(values = c("always" = "#ffbc00","sporadic"="#e082ad", "rarely/never" ="#ae4dff")) +
theme_minimal() +
theme(axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks = element_blank(), line = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank())
= ggplot(data = voters, mapping = aes(y = factor(race), fill = factor(voter_category))) +
partyID geom_bar(position = "fill", show.legend = FALSE) +
ggtitle("party ID") +
scale_fill_manual(values = c("always" = "#ffbc00","sporadic"="#e082ad", "rarely/never" ="#ae4dff")) +
theme_minimal() +
theme(axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks = element_blank(), line = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank())
#combining subplots
= ggarrange(race, income, age, ncol = 3, common.legend = TRUE, legend = "top") +
ria theme(plot.margin = margin(0.5, 0,-2, -1,"cm"))
= ggarrange(education, partyID) +
ep theme(plot.margin = margin(2, 4, -2.5, 1, "cm"))
= ggarrange(ria, ep, nrow = 3)
voterplot
#titles
= annotate_figure(annotate_figure(voterplot,
voterplot top = text_grob("Demographic information of survey respondants, by voting history")),
top = text_grob("Those who always vote and those who sometimes vote aren't that different", face = "bold")
)
voterplot
Fouls
<- "https://raw.githubusercontent.com/fivethirtyeight/data/master/foul-balls/foul-balls.csv"
raw_csv_file <- read.csv(url(raw_csv_file))
foulballs
<- foulballs |>
foulballs mutate(over90 = case_when(
< 90 ~ "no",
exit_velocity >= 90 ~ "yes"
exit_velocity
))
ggplot(foulballs,
aes (y = used_zone)) +
geom_bar(aes(fill = over90),
position = position_stack(reverse = TRUE),
show.legend = FALSE) +
scale_fill_manual(labels = c("< 90 mph", "≥ 90 mph", "Unknown exit velocity"),
values = c("#97c16d", "#63abb0"), na.value = "#d3d3d3") +
scale_y_discrete(limits = rev(unique(foulballs$used_zone)))+
scale_x_continuous(expand = expansion(mult = c(0, 0))) +
theme(panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_line(color = "#ececec"),
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(color = "#9e9e9e"),
plot.caption = element_text(color = "#b5b5b5"),
axis.line.y = element_line(colour = "#343434"),
axis.title.y = element_text(angle = 0, vjust = 0.12, color = "#343434", size = 9),
axis.ticks = element_blank(),
legend.title = element_blank()) +
labs(title = "The hardest-hit fouls seem to land in unprotected areas",
subtitle = str_wrap("Foul balls by the stadium zone they landed in and their exit velocity, among 906 fouls hit this season in the most foul-heavy day at the 10 MLB stadiums that produced the most fouls as of June 5", 85),
x = "", y = "Zone",
caption = "SOURCE: BASEBALL SAVANT") +
annotate("text", x = 75, y = 1, label = "< 90 mph", col = "white") +
annotate("text", x = 140, y = 3.3, label = "≥ 90 mph", col = "#63abb0") +
annotate("text", x = 215, y = 1, label = "Unknown exit velocity", col = "white")
<- foulballs |>
ball_data mutate(category = case_when(
< 90 ~ "<90",
exit_velocity >= 90 ~ ">=90",
exit_velocity is.na(exit_velocity) ~ "Unknown"
))<- ggplot(ball_data, aes(y = fct_rev(as.character(used_zone)), fill = category)) +
my_plot geom_bar(position = position_stack(reverse = TRUE)) +
labs(title = "The hardest-hit fouls seem \nto land in unprotected areas",
subtitle = "The 906 foul balls hit this season from \nthe most foul-heavy day at each of the \n10 MLB stadiums that produced the \nthe most fouls as of June 5, by zone where \nthe balls landed and their exit velocities",
y = "Zone", x = NULL) +
scale_fill_manual(values = c("Unknown" = "#DEDEDE", "<90" = "#9ECE88", ">=90" = "#17AFAD")) +
scale_x_continuous(breaks = c(0, 50, 100, 150, 200, 250),position = "top") +
theme(
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(size = 0.5, linetype="solid", color="#CECECE"),
panel.grid.minor = element_blank(),
axis.text.y = element_text(color = "black", face = "bold"),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(size = 10, margin = margin(b = 20))
)
my_plot
Colin: NBA RAPTOR ratings
library(plotly)
<- read.csv("data/latest_RAPTOR_by_player.csv")
raptor
<- raptor |>
raptor_rounded filter(mp >= 1137) |>
mutate(across(where(is.numeric), round, 1))
<- raptor_rounded |>
raptor_plot ggplot(aes(x=raptor_offense, y = raptor_defense)) +
# Annotations
# The colored rectangles for the 1 & 3 quadrants
annotate("rect", xmin=0, xmax=10, ymin=0, ymax=10, fill = '#c5ecee', alpha = .85) +
annotate("rect", xmin=-10, xmax=0, ymin=-10, ymax=0, fill = '#fecada', alpha = .85) +
# The 3rd quadrant text rectangles
annotate("rect", xmin=-9.8, xmax=-5.5, ymin = -7.8, ymax = -6.8, fill = '#fd97b6') +
annotate("rect", xmin=-9.8, xmax=-5.5, ymin = -9.4, ymax = -8.4, fill = '#fd97b6') +
annotate("text", x = -7.8, y = -7.4, label = " - offense") +
annotate("text", x = -7.8, y = -9, label = " - defense") +
# The 1st quadrant text rectangles
annotate("rect", xmin=5.5, xmax=9.8, ymin = 6.8, ymax = 7.8, fill = '#8cdadf') +
annotate("rect", xmin=5.5, xmax=9.8, ymin = 8.4, ymax = 9.4, fill = '#8cdadf') +
annotate("text", x = 7.5, y = 8.8, label = " + offense") +
annotate("text", x = 7.55, y = 7.1, label = " + defense") +
# The 2nd quadrant text rectangles
annotate("rect", xmin=-9.8, xmax=-5.5, ymin = 6.8, ymax = 7.8, fill = '#8cdadf') +
annotate("rect", xmin=-9.8, xmax=-5.5, ymin = 8.4, ymax = 9.4, fill = '#fd97b6') +
annotate("text", x = -7.8, y = 8.8, label = " - offense") +
annotate("text", x = -7.8, y = 7.1, label = " + defense") +
# The 4th quadrant text rectangles
annotate("rect", xmin=5.5, xmax=9.8, ymin = -7.8, ymax = -6.8, fill = '#8cdadf') +
annotate("rect", xmin=5.5, xmax=9.8, ymin = -9.4, ymax = -8.4, fill = '#fd97b6') +
annotate("text", x = 7.55, y = -7.4, label = " + offense") +
annotate("text", x = 7.5, y = -9, label = " - defense") +
geom_point(shape= 21, colour = "black", fill = "white", size = 4) +
labs(
x = "Offensive RAPTOR rating",
y = "Defensive RAPTOR rating",
title = paste0('Nikola Jokic is the Best NBA Player Based on Overall RAPTOR Rating',
'<br>',
'<sup>',
'An Analytical Approach to the 2022 - 2023 NBA Season','</sup>')
+
)
# Theme settings
theme_light() +
theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "#cdcdcd", linewidth = 0.5),
plot.margin = margin(l = 100, r = 100, b = 20, t = 10),
plot.title = element_text(hjust = 0.5),
axis.text = element_text(color = "#cdcdcd", size = 12),
axis.title.x = element_text(margin = margin(t=3)),
axis.title.y = element_text(margin = margin(r=3)),
+
) coord_fixed(ratio = 1)
# interactive
# ggplotly(raptor_plot, text=player_name, hoverinfo='text')
raptor_plot
Biden Approval
<- read_csv(url("https://raw.githubusercontent.com/fivethirtyeight/covid-19-polls/master/covid_approval_polls.csv"))
COVID_approval
#getting only data from beginning of COVID to Jan 19, 2021 and getting rid of the "all" category in the party column
<-
year_20 |>
COVID_approval separate(end_date, into = c("end_year", "end_month", "end_day"),convert = TRUE) |>
filter(party != "all",
<= 2020)|>
end_year unite(end_date, end_year, end_month, end_day, sep = "-")
<-
year_21 |>
COVID_approval separate(end_date, into = c("end_year", "end_month", "end_day"),convert = TRUE) |>
filter(party != "all",
== 2021 & end_month == 1 & end_day <= 19) |>
end_year unite(end_date, end_year, end_month, end_day, sep = "-")
# if (end_year == 2021 & end_month == 1 & end_day <= 19))
<-
select_COVID_approval full_join(year_20, year_21)
ggplot(select_COVID_approval, aes(x = as.Date(end_date),
y = approve,
color = party)) +
geom_smooth(aes(group = party), span = 0.05, se = FALSE) +
scale_color_manual(values = c("D" = "#008fd5",
"R" = "#ff2700",
"I" = "#a55330")) +
#Important Dates
geom_vline(xintercept = as.Date("2020-02-29"), linetype=3) +
annotate("text", x= as.Date("2020-02-29"), y = 95, label="First U.S. \n Death \n Reported", size=3, color="black") +
geom_vline(xintercept = as.Date("2020-05-20"), linetype=3) +
annotate("text", x= as.Date("2020-05-20"), y = 95, label="U.S. Deaths \n surpass \n 100,000", size=3, color="black") +
geom_vline(xintercept = as.Date("2020-10-02"), linetype=3) + #Trump Diagnosed with COVID-19
annotate("text", x= as.Date("2020-10-02"), y = 95, label="Trump \n Diagnosed \n with \n COVID-19", size=3, color="black") +
geom_vline(xintercept = as.Date("2020-11-07"), linetype=3) +
annotate("text", x= as.Date("2020-11-07"), y = 95, label="Biden \n declared \n election \n winner", size=3, color="black") +
geom_vline(xintercept = as.Date("2021-01-19"), linetype=3) +
annotate("text", x= as.Date("2021-01-19"), y = 95, label="Biden \n sworn \n into \n office", size=3, color="black") +
labs(title = "Approval of Trump’s response varies widely by party",
subtitle = "A calculation of the share of Democrats, Republicans and independents who approve of the president’s \n handling of the coronavirus outbreak") +
scale_y_continuous(limits = c(0, 100)) +
theme(plot.title = element_text(face = "bold", hjust = 0.5 ),
plot.subtitle = element_text(hjust = 0.5),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position="none",
panel.border = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_line(colour="gray")
)
Recap
- Can you explain/describe the plots generated in the context of these data?
- Can you generate EDA plots of your own for these data
- Can you understand/work through the more complicated code provided (even if you couldn’t have come up with it on your own)