13-cs01-analysis
CS01: Biomarkers of Recent Use (Analysis)
Q&A
Q: How extensive does our extension component need to be?
A: A bit hard to answer in certain terms. We’ll discuss some examples today to hopefully set expectations well. To explain in writing here: the most typical extension is students using the data provided to ask and answer a question not directly presented in class. Thus, simply generating a visualization not presented in class would NOT be sufficient. At the other end, finding external data on the topic and analyzing those data, while certainly allowed, would far exceed expectations. In between those extremes is what we expect: add significantly to the analysis, beyond what was presented in class.
Course Announcements
Due Dates:
- HW03 (MLR) due Mon 11/20
- Project Proposal (it will be a Google Form) due 11/20
- CS01 Deadlines:
- Lab06 due Friday - cs01-focused
- Report & “General Communication” due 11/27
- Survey about how working with your group went - due 11/28
. . .
Notes:
Midterm scores & Feedback posted
- overall, did very well
- avg: 13.85/15 (92%)
- 6 perfect scores
- answer key on course website
I am behind on emails and Piazza posts.
Mid-course Survey Summary
- N=73 (~75%)
- Pacing/workload (so far) about right
- Course notes most helpful in the course overall
- Also helpful: completing labs, doing homework
- Many are not checking labs against answer keys; most are not doing suggested readings
- Of those that attend lecture, most find it helpful
Mid-course: Time Spent
Mid-course: What would you change?
Agenda
- Debugging/Understanding Code Strategies
- Sensitivity & Specificity
- Cross-compound correlations
- Extensions
Summary: Figuring out what’s going on in code
Suggestions (as discussed in class):
- Look up documentation (i.e., ?...) / Google the function
- Run it on different inputs; see how the output changes
- Run the code line-by-line, understanding output at each step
- Ask ChatGPT
Question
Which compound, in which matrix, and at what cutoff is the best biomarker of recent use?
. . .
Data & Files
Packages
Three additional packages required for these notes:
library(tidyverse)
library(purrr)
library(rstatix)
library(cowplot)
The Data
Reading in the data from the end of data wrangling notes:
load("data/compounds.RData")
load("data/timepoints.RData")
load("data/data_clean.RData")
And the functions…
source("src/cs01_functions.R")
Analysis
Sensitivity & Specificity
Sensitivity | the ability of a test to correctly identify patients with a disease/trait/condition. [TP/(TP + FN)]
. . .
Specificity | the ability of a test to correctly identify people without the disease/trait/condition. [TN/(TN + FP)]
. . .
❓ For this analysis, do you care more about sensitivity? about specificity? equally about both?
What is a TP here? TN? FP? FN?
Post-smoking (cutoff > 0)
- TP = THC group, value >= cutoff
- FN = THC group, value < cutoff
- FP = Placebo group, value >= cutoff
- TN = Placebo group, value < cutoff
. . .
Post-smoking (cutoff == 0)
Cannot be a TP or FP if zero…
- TP = THC group, value > cutoff
- FN = THC group, value <= cutoff
- FP = Placebo group, value > cutoff
- TN = Placebo group, value < cutoff
. . .
Pre-smoking
Cannot be a TP or FN before consuming…
- TP = 0
- FN = 0
- FP = value >= cutoff
- TN = value < cutoff
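Plugging the definitions above into code: a quick sanity check using one set of counts that appears later in these notes (whole blood THC, 0.5 ng/mL cutoff, 0-30 min post-smoking window).

```r
# Worked example: counts from the whole blood THC results shown later in these
# notes (0.5 ng/mL cutoff, 0-30 min post-smoking window)
TP <- 124  # THC group, value >= cutoff
FN <- 2    # THC group, value < cutoff
FP <- 28   # Placebo group, value >= cutoff
TN <- 33   # Placebo group, value < cutoff

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

round(sensitivity, 2)  # 0.98: nearly every THC-group sample is flagged
round(specificity, 2)  # 0.54: but many placebo samples are flagged too
```

Note the tradeoff already visible here: a low cutoff catches almost all users but misclassifies roughly half of the placebo group.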
ROC
Receiver Operating Characteristic (ROC) Curve: TPR (Sensitivity) vs FPR (1-Specificity)
Image Credit: By cmglee, MartinThoma - Roc-draft-xkcd-style.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=109730045
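To make the ROC idea concrete, here is a minimal, self-contained sketch on simulated data (none of these values come from the study): sweep candidate cutoffs and compute the true positive rate and false positive rate at each, which is exactly the tradeoff the curve displays.

```r
# ROC sketch on simulated concentrations (hypothetical data, base R only)
set.seed(1)
value <- c(rlnorm(50, meanlog = 1),   # hypothetical "THC group" concentrations
           rlnorm(50, meanlog = -1))  # hypothetical "placebo" concentrations
truth <- rep(c(TRUE, FALSE), each = 50)

cutoffs <- sort(unique(value))
tpr <- sapply(cutoffs, function(ct) mean(value[truth]  >= ct))  # sensitivity
fpr <- sapply(cutoffs, function(ct) mean(value[!truth] >= ct))  # 1 - specificity

# Each cutoff contributes one (FPR, TPR) point; tracing them out gives the ROC
plot(fpr, tpr, type = "s",
     xlab = "FPR (1 - Specificity)", ylab = "TPR (Sensitivity)")
```

As the cutoff rises, both rates fall; a good biomarker keeps TPR high while FPR drops quickly.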
Calculating Sensitivity and Specificity
make_calculations <- function(dataset, dataset_removedups, split, compound,
                              start = start, stop = stop, tpt_use = tpt_use){
  ## remove NAs
  df <- dataset_removedups %>%
    dplyr::select(treatment, compound, timepoint_use) %>%
    rename(compound = 2) %>%
    filter(complete.cases(.))
  if(nrow(df) > 0){
    if(stop <= 0){
      output <- df %>%
        summarise(TP = 0,
                  FN = 0,
                  FP = sum(compound >= split),
                  TN = sum(compound < split))
    } else{
      if(split == 0){
        output_pre <- df %>%
          filter(tpt_use == "pre-smoking") %>%
          summarise(TP = 0,
                    FN = 0,
                    FP = sum(compound >= split),
                    TN = sum(compound < split))
        output <- df %>%
          filter(tpt_use != "pre-smoking") %>%
          summarise(TP = sum(treatment != "Placebo" & compound > split),
                    FN = sum(treatment != "Placebo" & compound <= split),
                    FP = sum(treatment == "Placebo" & compound > split),
                    TN = sum(treatment == "Placebo" & compound < split))
        output <- output + output_pre
      } else{
        ## calculate values if pre-smoking
        output_pre <- df %>%
          filter(tpt_use == "pre-smoking") %>%
          summarise(TP = 0,
                    FN = 0,
                    FP = sum(compound >= split),
                    TN = sum(compound < split))
        output <- df %>%
          filter(tpt_use != "pre-smoking") %>%
          summarise(TP = sum(treatment != "Placebo" & compound >= split),
                    FN = sum(treatment != "Placebo" & compound < split),
                    FP = sum(treatment == "Placebo" & compound >= split),
                    TN = sum(treatment == "Placebo" & compound < split))
        output <- output + output_pre
      }
    }
  }
  # clean things up; make calculations on above values
  output <- output %>%
    mutate(detection_limit = split,
           compound = compound,
           time_start = start,
           time_stop = stop,
           time_window = tpt_use,
           NAs = nrow(dataset) - nrow(df),
           N = nrow(dataset_removedups),
           N_removed = nrow(dataset) - nrow(dataset_removedups),
           Sensitivity = (TP/(TP + FN)),
           Specificity = (TN/(TN + FP)),
           PPV = (TP/(TP + FP)),
           NPV = (TN/(TN + FN)),
           Efficiency = ((TP + TN)/(TP + TN + FP + FN))*100
    )
  return(output)
}
- determine what cutoff values to try
- carry out above function on those cutoffs
sens_spec <- function(dataset, compound, start, stop, tpt_use,
                      lowest_value = 0.5, splits = NULL, ...){
  # if it's not all NAs...
  if(sum(is.na(dataset[,compound])) != nrow(dataset)){
    # specify what splits should be used for calculations
    if(is.null(splits)){
      limits <- dataset[is.finite(rowSums(dataset[,compound])),compound]
      ## define lower and upper limits
      lower = min(limits, na.rm=TRUE)
      upper = max(limits, na.rm=TRUE)
      ## determine splits to use for calculations
      tosplit = pull(limits[,1])[limits[,1]>0]
      ## only split if there are detectable limits:
      if(length(tosplit) >= 1){
        splits = c(lowest_value, quantile(tosplit, probs=seq(0, 1, by = 0.01), na.rm=TRUE))
        splits = unique(splits)
      } else{
        splits = 0
      }
    } else{
      splits = splits
    }
    # filter to include timepoint of interest
    dataset <- dataset %>%
      filter(time_from_start > start & time_from_start <= stop & !is.na(timepoint_use))
    dataset_removedups <- dataset %>%
      filter(!is.na(timepoint_use)) %>%
      group_by(timepoint_use) %>%
      distinct(id, .keep_all = TRUE) %>%
      ungroup()
    ## create empty output variable which we'll fill in
    ## iterate through each possible dose and calculate
    output <- map_dfr(as.list(splits), ~make_calculations(dataset,
                                                          dataset_removedups,
                                                          split = .x,
                                                          compound,
                                                          start = start,
                                                          stop = stop,
                                                          tpt_use = tpt_use))
  }
  return(output)
}
Map the above for each matrix…
sens_spec_cpd <- function(dataset, cpd, timepoints, splits = NULL){
  args2 <- list(start = timepoints$start,
                stop = timepoints$stop,
                tpt_use = timepoints$timepoint)
  out <- args2 %>%
    pmap_dfr(sens_spec, dataset, compound = cpd, splits = splits)
  return(out)
}
This takes a few minutes to run… (reminder: cache=TRUE)
output_WB <- map_dfr(compounds_WB,
                     ~sens_spec_cpd(dataset = WB, cpd = all_of(.x),
                                    timepoints = timepoints_WB)) %>% clean_gluc()
output_BR <- map_dfr(compounds_BR,
                     ~sens_spec_cpd(dataset = BR, cpd = all_of(.x),
                                    timepoints = timepoints_BR)) %>% clean_gluc()
output_OF <- map_dfr(compounds_OF,
                     ~sens_spec_cpd(dataset = OF, cpd = all_of(.x),
                                    timepoints = timepoints_OF)) %>% clean_gluc()
ROC
ss_plot <- function(output, tpts=8, tissue){
  to_include = output %>%
    group_by(compound) %>%
    summarize(mean_detection = mean(detection_limit)) %>%
    filter(mean_detection > 0)
  output <- output %>%
    mutate(iszero = ifelse(time_start < 0, TRUE, FALSE),
           Sensitivity = round(Sensitivity*100, 0),
           Specificity = round(Specificity*100, 0)) %>%
    filter(compound %in% to_include$compound,
           time_window != "pre-smoking") %>%
    clean_gluc() %>%
    mutate(compound = fct_relevel(as.factor(compound), "THC"))
  output <- output %>% mutate(
    legend = paste0(time_window, ' (N=', N, ')'))
  blue_colors = c('#C2F8FF', '#A2DDED', '#86BEDC', '#6C9FCA',
                  '#547EB9', '#3F5EA8', '#2D4096', '#1E2385',
                  '#181173', '#180762', '#180051')
  values = c(blue_colors[1:tpts])

  print(ggplot(output, aes(x = detection_limit, y = Sensitivity, group = fct_inorder(legend))) +
          geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +
          geom_path(aes(color=fct_inorder(legend)), size=1.2) +
          facet_grid(~compound, scales = "free_x") +
          labs(x = 'Detection Limit',
               y = 'Sensitivity') +
          ylim(0, 100) + # sensitivity is on a 0-100 scale after the mutate above
          scale_color_manual(values = values, name = 'Time Window') +
          theme_classic(base_size = 12) +
          theme(axis.title = element_text(size=16),
                panel.grid = element_blank(),
                strip.background = element_blank(),
                strip.text.x = element_text(size = 12))
  )
  print(
    ggplot(output, aes(x = detection_limit, y = Specificity, group = fct_inorder(legend))) +
      geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +
      geom_path(aes(color=fct_inorder(legend)), size=1.2) +
      facet_grid(~compound, scales = "free_x") +
      ylim(0, 100) +
      labs(title = tissue,
           x = 'Detection Limit',
           y = 'Specificity') +
      scale_color_manual(values = values, name = 'Time Window') +
      theme_classic(base_size = 12) +
      theme(axis.title = element_text(size=16),
            panel.grid = element_blank(),
            strip.background = element_blank(),
            strip.text.x = element_text(size = 12))
  )
  print(
    ggplot(output, aes(x=(100-Specificity), y = Sensitivity, group = fct_inorder(legend))) +
      geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +
      geom_path(aes(color=fct_inorder(legend)), size=1.2) +
      facet_grid(~compound) +
      xlim(0, 100) +
      ylim(0, 100) +
      labs(title = tissue,
           x = '(100-Specificity)',
           y = 'Sensitivity') +
      scale_color_manual(values = values, name = 'Time Window') +
      theme_classic(base_size = 12) +
      theme(axis.title = element_text(size=16),
            panel.grid = element_blank(),
            strip.background = element_blank(),
            strip.text.x = element_text(size = 12),
            axis.text = element_text(size=12))
  )
}
roc_plot <- function(output, tpts=8, tissue){
  to_include = output %>%
    group_by(compound) %>%
    summarize(mean_detection = mean(detection_limit)) %>%
    filter(mean_detection > 0)
  output <- output %>%
    mutate(iszero = ifelse(time_start < 0, TRUE, FALSE),
           Sensitivity = round(Sensitivity*100, 0),
           Specificity = round(Specificity*100, 0)) %>%
    filter(compound %in% to_include$compound,
           time_window != "pre-smoking") %>%
    clean_gluc() %>%
    mutate(compound = fct_relevel(as.factor(compound), "THC"))
  output <- output %>% mutate(
    legend = paste0(time_window, ' (N=', N, ')'))
  blue_colors = c('#C2F8FF', '#86BEDC',
                  '#547EB9', '#2D4096',
                  '#181173', '#180051')
  values = c(blue_colors[1:tpts])
  print(
    ggplot(output, aes(x=(100-Specificity), y = Sensitivity, group = fct_inorder(legend))) +
      geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +
      geom_path(aes(color=fct_inorder(legend)), size=1.2) +
      facet_grid(~compound) +
      xlim(0, 100) +
      ylim(0, 100) +
      labs(title = tissue,
           x = '(100-Specificity)',
           y = 'Sensitivity') +
      scale_color_manual(values = values, name = 'Time Window') +
      theme_classic(base_size = 12) +
      theme(axis.title = element_text(size=16),
            panel.grid = element_blank(),
            strip.background = element_blank(),
            strip.text.x = element_text(size = 12),
            axis.text = element_text(size=12))
  )
}
ss1_a <- ss_plot(output_WB, tpts = length(unique(output_WB$time_start)), tissue = "Blood")
ss2_a <- ss_plot(output_OF, tpts = length(unique(output_OF$time_start)), tissue = "Oral Fluid")
ss3_a <- roc_plot(output_BR, tpts = length(unique(output_BR$time_start)), tissue = "Breath")
bottom_row <- plot_grid(ss2_a, ss3_a, labels = c('B', 'C'), label_size = 12, ncol = 2, rel_widths = c(0.66, .33))
plot_grid(ss1_a, bottom_row, labels = c('A', ''), label_size = 12, ncol = 1)
Calculate: THC
Reminder: Currently, states have laws on the books ranging from zero tolerance (detection of any level) to 5 ng/mL
cutoffs <- c(0.5, 1, 2, 5, 10)
WB_THC <- sens_spec_cpd(dataset = WB, cpd = 'thc',
                        timepoints = timepoints_WB,
                        splits = cutoffs) %>% clean_gluc()
WB_THC
# A tibble: 50 × 17
TP FN FP TN detection_limit compound time_start time_stop
<dbl> <dbl> <int> <int> <dbl> <chr> <dbl> <dbl>
1 0 0 81 108 0.5 THC -400 0
2 0 0 61 128 1 THC -400 0
3 0 0 45 144 2 THC -400 0
4 0 0 10 179 5 THC -400 0
5 0 0 1 188 10 THC -400 0
6 124 2 28 33 0.5 THC 0 30
7 123 3 22 39 1 THC 0 30
8 119 7 15 46 2 THC 0 30
9 106 20 4 57 5 THC 0 30
10 101 25 0 61 10 THC 0 30
# ℹ 40 more rows
# ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
# Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
# Efficiency <dbl>
OF_THC <- sens_spec_cpd(dataset = OF, cpd = 'thc',
                        timepoints = timepoints_OF,
                        splits = cutoffs) %>% clean_gluc()
OF_THC
# A tibble: 40 × 17
TP FN FP TN detection_limit compound time_start time_stop
<dbl> <dbl> <int> <int> <dbl> <chr> <dbl> <dbl>
1 0 0 35 157 0.5 THC -400 0
2 0 0 20 172 1 THC -400 0
3 0 0 9 183 2 THC -400 0
4 0 0 0 192 5 THC -400 0
5 0 0 0 192 10 THC -400 0
6 129 0 39 24 0.5 THC 0 30
7 129 0 30 33 1 THC 0 30
8 128 1 19 44 2 THC 0 30
9 128 1 3 60 5 THC 0 30
10 125 4 1 62 10 THC 0 30
# ℹ 30 more rows
# ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
# Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
# Efficiency <dbl>
Why is there no calculation for breath with these cutoffs?
Cutoffs
plot_cutoffs <- function(dataset, timepoint_use_variable, tissue, labels = c("A", "B"), vertline, cpd, x_labels){
  col_val = c("#D9D9D9", "#BDBDBD", "#969696", "#636363", "#252525")
  lines = rep("solid", 5)

  df_ss <- dataset %>%
    mutate(time_window = fct_relevel(as.factor(time_window),
                                     levels(timepoint_use_variable)),
           detection_limit = as.factor(detection_limit),
           Sensitivity = round(Sensitivity*100, 0),
           Specificity = round(Specificity*100, 0),
           my_label = paste0(time_window, ' N=', N),
           my_label = gsub(" ", "\n", my_label),
           my_label = fct_relevel(as.factor(my_label), x_labels))
  p1 <- df_ss %>%
    ggplot(aes(x = my_label, y = Sensitivity,
               colour = detection_limit)) +
    geom_line(size = 1.2, aes(group = detection_limit,
                              linetype = detection_limit)) +
    geom_vline(xintercept=vertline, linetype = 'dotted') +
    geom_point(show.legend=FALSE) +
    ylim(0, 100) +
    scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
    scale_linetype_manual(values=lines) +
    scale_color_manual(values = col_val, name = "Cutoff \n (ng/mL)",
                       guide = guide_legend(override.aes = list(linetype = c(1),
                                                                shape = rep(NA, length(lines))) )) +
    theme_classic() +
    theme(axis.title = element_text(size=16),
          axis.text = element_text(size=10),
          legend.position = c(0.08, 0.4),
          panel.grid = element_blank(),
          strip.background = element_blank()) +
    guides(linetype = FALSE) +
    labs(x = "Time Window",
         y = "Sensitivity",
         title = paste0(tissue, ": ", cpd))
  p2 <- df_ss %>%
    ggplot(aes(x = my_label, y = Specificity,
               group = detection_limit,
               colour = detection_limit,
               linetype = detection_limit)) +
    geom_line(size = 1.2) +
    geom_vline(xintercept=vertline, linetype = 'dotted') +
    geom_point() +
    ylim(0, 100) +
    scale_color_manual(values = col_val) +
    scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
    scale_linetype_manual(values = lines,
                          guide = guide_legend(override.aes = list(linetype = "solid",
                                                                   shape = rep(NA, length(lines))) )) +
    theme_classic() +
    theme(axis.title = element_text(size=16),
          axis.text = element_text(size=10),
          legend.position = "none",
          panel.grid = element_blank(),
          strip.background = element_blank()) +
    labs(x = "Time Window",
         y = "Specificity",
         title = "")
  title <- ggdraw() +
    draw_label(
      tissue,
      x = 0.05,
      hjust = 0
    )
  plot_row <- plot_grid(p1, p2, labels = labels, label_size = 12)
  plot_grid(
    title, plot_row,
    ncol = 1,
    # rel_heights values control vertical title margins
    rel_heights = c(0.1, 1)
  )
  return(list(plot_row, df_ss))
}
blood_levels <- c("pre-smoking\nN=189", "0-30\nmin\nN=187", "31-70\nmin\nN=165",
                  "71-100\nmin\nN=157", "101-180\nmin\nN=168", "181-210\nmin\nN=103",
                  "211-240\nmin\nN=127", "241-270\nmin\nN=137", "271-300\nmin\nN=120",
                  "301+\nmin\nN=88")
plot_cutoffs(dataset=WB_THC,
             timepoint_use_variable=WB$timepoint_use,
             tissue="Blood",
             vertline=levels(WB$timepoint_use)[5],
             cpd="THC",
             x_labels=blood_levels)
[[1]]
[[2]]
# A tibble: 50 × 18
TP FN FP TN detection_limit compound time_start time_stop
<dbl> <dbl> <int> <int> <fct> <chr> <dbl> <dbl>
1 0 0 81 108 0.5 THC -400 0
2 0 0 61 128 1 THC -400 0
3 0 0 45 144 2 THC -400 0
4 0 0 10 179 5 THC -400 0
5 0 0 1 188 10 THC -400 0
6 124 2 28 33 0.5 THC 0 30
7 123 3 22 39 1 THC 0 30
8 119 7 15 46 2 THC 0 30
9 106 20 4 57 5 THC 0 30
10 101 25 0 61 10 THC 0 30
# ℹ 40 more rows
# ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
# Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
# Efficiency <dbl>, my_label <fct>
of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
               "91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
               "241-270\nmin\nN=90", "271+\nmin\nN=76")
plot_cutoffs(OF_THC, OF$timepoint_use, tissue = "Oral Fluid", labels = c("A", "B"), vertline=levels(OF$timepoint_use)[4], cpd="THC", x_labels=of_levels)
[[1]]
[[2]]
# A tibble: 40 × 18
TP FN FP TN detection_limit compound time_start time_stop
<dbl> <dbl> <int> <int> <fct> <chr> <dbl> <dbl>
1 0 0 35 157 0.5 THC -400 0
2 0 0 20 172 1 THC -400 0
3 0 0 9 183 2 THC -400 0
4 0 0 0 192 5 THC -400 0
5 0 0 0 192 10 THC -400 0
6 129 0 39 24 0.5 THC 0 30
7 129 0 30 33 1 THC 0 30
8 128 1 19 44 2 THC 0 30
9 128 1 3 60 5 THC 0 30
10 125 4 1 62 10 THC 0 30
# ℹ 30 more rows
# ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
# Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
# Efficiency <dbl>, my_label <fct>
Calculate: CBN
Reminder: Currently, states have laws on the books ranging from zero tolerance (detection of any level) to 5 ng/mL
WB_CBN <- sens_spec_cpd(dataset = WB, cpd = 'cbn',
                        timepoints = timepoints_WB,
                        splits = cutoffs) %>% clean_gluc()
blood_levels <- c("pre-smoking\nN=189", "0-30\nmin\nN=187", "31-70\nmin\nN=165",
                  "71-100\nmin\nN=157", "101-180\nmin\nN=168", "181-210\nmin\nN=103",
                  "211-240\nmin\nN=127", "241-270\nmin\nN=137", "271-300\nmin\nN=120",
                  "301+\nmin\nN=88")
plot_cutoffs(WB_CBN, WB$timepoint_use, tissue = "Blood", vertline=levels(WB$timepoint_use)[5], cpd="CBN", x_labels=blood_levels)
[[1]]
[[2]]
# A tibble: 50 × 18
TP FN FP TN detection_limit compound time_start time_stop
<dbl> <dbl> <int> <int> <fct> <chr> <dbl> <dbl>
1 0 0 1 188 0.5 CBN -400 0
2 0 0 0 189 1 CBN -400 0
3 0 0 0 189 2 CBN -400 0
4 0 0 0 189 5 CBN -400 0
5 0 0 0 189 10 CBN -400 0
6 106 20 7 54 0.5 CBN 0 30
7 97 29 0 61 1 CBN 0 30
8 82 44 0 61 2 CBN 0 30
9 40 86 0 61 5 CBN 0 30
10 9 117 0 61 10 CBN 0 30
# ℹ 40 more rows
# ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
# Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
# Efficiency <dbl>, my_label <fct>
OF_CBN <- sens_spec_cpd(dataset = OF, cpd = 'cbn',
                        timepoints = timepoints_OF,
                        splits = cutoffs) %>% clean_gluc()
of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
               "91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
               "241-270\nmin\nN=90", "271+\nmin\nN=76")
plot_cutoffs(OF_CBN, OF$timepoint_use, tissue = "Oral Fluid", labels = c("A", "B"), vertline=levels(OF$timepoint_use)[4], cpd="CBN", x_labels=of_levels)
[[1]]
[[2]]
# A tibble: 40 × 18
TP FN FP TN detection_limit compound time_start time_stop
<dbl> <dbl> <int> <int> <fct> <chr> <dbl> <dbl>
1 0 0 5 187 0.5 CBN -400 0
2 0 0 1 191 1 CBN -400 0
3 0 0 1 191 2 CBN -400 0
4 0 0 1 191 5 CBN -400 0
5 0 0 0 192 10 CBN -400 0
6 127 2 41 22 0.5 CBN 0 30
7 125 4 32 31 1 CBN 0 30
8 122 7 18 45 2 CBN 0 30
9 116 13 7 56 5 CBN 0 30
10 107 22 3 60 10 CBN 0 30
# ℹ 30 more rows
# ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
# Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
# Efficiency <dbl>, my_label <fct>
Compound Correlations
ggplotRegression <- function(x, y, xlab, ylab, x_text, y_text, y_text2, title) {
  fit <- lm(y ~ x)
  if(max(fit$model[,1], na.rm=TRUE) != 0){
    ggplot(fit$model, aes_string(x = names(fit$model)[2],
                                 y = names(fit$model)[1])) +
      geom_point() +
      stat_smooth(method = "lm", col = "#B73239", size = 1.5, se = FALSE) +
      annotate("text", x=x_text, y=y_text,
               label = paste("R^2 == ", format(signif(summary(fit)$adj.r.squared, 5),
                                               digits=2)),
               vjust=1, hjust=0, parse=TRUE, size=4.5) +
      labs(x = xlab,
           y = ylab,
           title = title) +
      annotate("text", x=x_text, y=y_text2, label = paste(
        "y = ", format(signif(fit$coef[[2]], 5), digits=2),
        "x + ",
        format(signif(fit$coef[[1]], 5), digits=2),
        paste0("\nN = ", length(x))),
        vjust=1, hjust=0, size=4.5) +
      theme_minimal(base_size=14) +
      theme(panel.grid = element_blank(),
            axis.line = element_line(size = 0.5, linetype = "solid",
                                     colour = "black"),
            legend.position="none")
  } else{
    ggplot(fit$model, aes_string(x = names(fit$model)[2],
                                 y = names(fit$model)[1])) +
      geom_point() +
      scale_y_continuous(limits = c(0,3)) +
      stat_smooth(method = "lm", col = "#B73239", size = 1.5, se = FALSE) +
      annotate("text", x=x_text, y=y_text,
               label = paste("R^2 == ", format(signif(summary(fit)$adj.r.squared, 5), digits=2)),
               vjust=1, hjust=1, parse=TRUE, size=4.5) +
      labs(x = xlab,
           y = ylab,
           title = title) +
      annotate("text", x=x_text, y=y_text2, label = paste(
        "y = ", format(signif(fit$coef[[2]], 5), digits=2),
        "x + ",
        format(signif(fit$coef[[1]], 5), digits=2),
        paste0("\nN = ", length(x))), vjust=1, hjust=1, size=4.5) +
      theme_minimal(base_size = 14) +
      theme(panel.grid = element_blank(),
            axis.line = element_line(size = 0.5, linetype = "solid",
                                     colour = "black"),
            legend.position="none")
  }
}
wb_reg <- ggplotRegression(WB$thc, WB$cbn, xlab = 'THC (ng/mL)', ylab = 'CBN (ng/mL)', x_text = 150, y_text = 7, y_text2 = 5, title = "Blood")
of_reg <- ggplotRegression(OF$thc, OF$cbn, xlab = 'THC (ng/mL)', ylab = 'CBN (ng/mL)', x_text = 12500, y_text = 750, y_text2 = 500, title = "Oral Fluid")
plot_grid(wb_reg, of_reg, labels = 'AUTO', label_size = 12, ncol = 2, scale = 1)
Possible Extensions
Our current question asks for a single compound…and you’ll need to decide which one.
. . .
…but you could imagine a world where more than one compound or more than one matrix could be measured at the roadside.
. . .
So:
- Is there a combination of oral fluid and blood that would better predict recent use? (For example, if an officer stopped a driver and got a high oral fluid result, but could not get a blood sample for a couple of hours and that sample came back relatively low, would this combination predict recent use better than blood (or OF) alone?)
- Is there a ratio of OF/blood that predicts recent use?
- Could a machine learning model determine an optimal combination of measurements/cutoffs to detect recent use?
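As a loose illustration of the first and third ideas, here is a hedged sketch on simulated data (all variable names and values here are hypothetical, not from the study): a logistic regression combining a blood measurement and an oral fluid measurement into a single predictor of recent use. A real version would need to join the WB and OF data by participant and timepoint first.

```r
# Hypothetical sketch: combine two matrices in one logistic regression.
# All data below are simulated for illustration only.
set.seed(42)
n       <- 200
recent  <- rbinom(n, 1, 0.5)                               # 1 = recent use
thc_wb  <- rlnorm(n, meanlog = ifelse(recent == 1, 2, 0))  # "blood THC"
thc_of  <- rlnorm(n, meanlog = ifelse(recent == 1, 3, 0))  # "oral fluid THC"

# Model recent use as a function of both measurements
fit  <- glm(recent ~ thc_wb + thc_of, family = binomial)
pred <- as.numeric(predict(fit, type = "response") >= 0.5)
mean(pred == recent)  # in-sample accuracy of the combined model
```

An extension along these lines would compare this combined model's sensitivity/specificity against the single-matrix cutoffs analyzed above, ideally with held-out data rather than in-sample accuracy.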
. . .
Things to keep in mind:
- some matrices are easier to get at the roadside
- time from use matters (trying to detect recent use)
- we may not care equally about sensitivity and specificity
cs01: what to do now?
- Communicate with your group!
- Discuss possible extensions
- Make a plan; figure out who’s doing what; set deadlines
- Implement the plan!
What has to be done:
- Question | include in Rmd; add extension if applicable
- Background | summarize and add to what was discussed in class
- Data
- Describe data & variables
- Data wrangling | likely copy + paste from notes; add explanation as you go
- Analysis
- EDA | likely borrowing parts from notes and adding more in; be sure to include interpretations of output & guide the reader
- Analysis | likely borrowing most/all from class; interpretations/guiding reader/contextualizing is essential
- Extension | must be completed
- Conclusion | summarize
- Proofread | ensure it makes sense from top to bottom
- General Audience communication (submit on Canvas; 1 submission per group)
Collaborating on GitHub
- Be sure to pull changes every time you sit down to work
- Avoid working on the same part of the same file as another teammate OR work in separate files and combine at the end
- Push your changes once you’re ready to share them with the group
Recap
- Can you describe sensitivity? Specificity?
- Can you explain how TP, TN, FP, and FN were calculated/defined in this experiment?
- Can you describe the code used to carry out the calculations?
- Can you interpret the results from these data?