Welcome to Practical 2, Advanced Plotting. In this practical we will:
As in the previous practical the answers to the task sets are at the bottom of the file. Just click on the arrows to see them, but don’t look until you’ve attempted the questions yourself. If you haven’t finished practical 1 I would recommend going back to that first and completing those tasks
The CO2 data is located in the data
directory. It
contains monthly CO2 measurements from Mauna Loa in Hawaii from NOAA.
The three columns are year, month, and CO2 value in parts per
million.
Tasks:
Load in the data (make sure it’s the .rds
file) and
create a quick ggplot.
Remove the missing values and then use scale_x_date
to create a nice ggplot of the data with date breaks/labels every 5
years and CO2 on the y axis. Make sure to give it nice labels (if you
want to be fancy make sure to write \(\mbox{CO}_2\) correctly).
Create a ggiraph
plot that, when you hover over the
points, show the date and the value of the CO2 (hint: have a close look
at the crimes example in the high_dim_code.R
file for some
ways of doing this)
Create linear and quadratic regression models of CO2 on year (and
year^2 for the quadratic model). Plot the data and the two fitted models
on an interactive ggiraph
plot that identifies which model
you are hovering over and also makes the other lines more transparent
(hint: see the time series plot in high_dim_code.R
for how
to do this).
We will use the tidymodels
package to fit a machine
learning model to the pollen data, and then use some of the DALEX tools
to create variable importance and partial dependence plots.
Tasks:
Load in the pollen data. Use ggpairs
and/or
corrplot
to look at the relationship between MTCO and the 7
pollen taxa counts.
Use the tidymodels
commands in class 6 to create
training and test sets, and thus train a ranger
random
forest model to predict MTCO from the 7 pollen varieties. Evaluate the
performance of the model by plotting the predicted values in the test
set against the true MTCO values.
Use the DALEX package to create partial dependence plots for the
different pollen varieties and some instance level explanations. Note
down any interesting relationships you find. (Hint: use the function
explain_tidymodels
just as we used
explain_mlr3
).
# 1 and 2. Load in the data and plot
library(latex2exp)
CO2 <- readRDS(file = 'data/CO2.rds')
CO2 %>% glimpse
ggplot(CO2 %>% filter(CO2_ppm > 0),
aes(x = date, y = CO2_ppm)) +
geom_line() +
scale_x_date(date_breaks = "5 years",
date_labels = "%Y") +
labs(x = "Year", y = TeX("$CO_2$ (ppm)"),
title = TeX("$CO_2$ in Mauna Loa, Hawaii")) +
theme_minimal()
# 3. ggiraph
p <- ggplot(
data = CO2 %>% filter(CO2_ppm > 0) %>%
unite("dat", c("date", "CO2_ppm"),
sep = ", ",
remove = FALSE),
aes(
x = date,
y = CO2_ppm
)
) +
geom_line() +
geom_point_interactive(
aes(tooltip = dat,
data_id = dat),
size = 0.5) + # Make small to
labs(x = "Year", y = TeX("$CO_2$ (ppm)"),
title = TeX("$CO_2$ in Mauna Loa, Hawaii"))
girafe(ggobj = p, width = 10)
# 4. Linear regression diagnostics
CO2_mod_1 <- lm(CO2_ppm ~ date,
data = CO2%>% filter(CO2_ppm > 0))
CO2_mod_2 <- lm(CO2_ppm ~ poly(date, 2),
data = CO2%>% filter(CO2_ppm > 0))
p <- ggplot(data = CO2 %>% filter(CO2_ppm > 0) %>%
mutate(lin_fit = fitted(CO2_mod_1),
quad_fit = fitted(CO2_mod_2)) %>%
pivot_longer(-date,
names_to = "Type",
values_to = "Value"),
aes(x = date, y = Value,
colour = Type, tooltip = Type,
data_id = Type)) +
geom_line_interactive() +
theme(legend.position = 'none')
girafe(ggobj = p, width = 10,
options = list(
opts_hover_inv(css = "opacity:0.4;"),
opts_hover(css = "stroke-width:1.5;")
))
# 1. Load in and use ggpairs and corrplot
library(GGally)
pollen <- read_csv("data/pollen.csv")
ggpairs(pollen,
upper = list(continuous = "density"),
lower = list(
continuous = "smooth",
combo = "facetdensity"
)
) # Pinus D looks like the interesting one
corrplot(cor(pollen)) # Though Picea has the strongest linear correlation
# 2. Fit the ranger model
library(tidymodels)
library(ranger)
# Split the data into training and testing sets
set.seed(123)
# DON'T FORGET TO GET RID OF GDD5!
pollen_split <- initial_split(pollen %>%
select(-GDD5),
prop = 0.8)
pollen_train <- training(pollen_split)
pollen_test <- testing(pollen_split)
# Define the model specification
rf_pollen <-
rand_forest(trees = 1000) %>%
set_engine("ranger") %>%
set_mode("regression")
# Fit the model to the training data
rf_fit <- rf_pollen %>%
fit(MTCO ~ .,
data = pollen_train)
# Make predictions on the test data
rf_preds <- rf_fit %>% predict(pollen_test)
# Evaluate the model performance
rf_preds %>%
bind_cols(pollen_test) %>%
dplyr::select(.pred, MTCO) %>%
ggplot(aes(x = MTCO, y = .pred)) +
labs(x = "True", y = "Predicted",)
geom_point() +
geom_abline()
#3. Use DALEX
library(DALEX)
library(DALEXtra)
ranger_exp = explain_tidymodels(rf_fit,
data = pollen %>% select(-MTCO, -GDD5),
y = pollen$MTCO,
label = "Ranger RF",
colorize = TRUE)
ranger_vi <- model_parts(ranger_exp)
plot(ranger_vi) # Feature importance
# Partial dependence
num_features = names(pollen)[-c(1:2)]
pollen_pd <- model_profile(ranger_exp,
variables = num_features)$agr_profiles
plot(pollen_pd)
# Instance-level explanations
pollen_case1 = pollen[1, ]
ile_ranger = predict_parts(ranger_exp,
new_observation = pollen_case1)
plot(ile_ranger)