Introduction

Welcome to Practical 2, Advanced Plotting. In this practical we will:

  • Create some more interactive plots
  • Fit some machine learning models and plot the output

As in the previous practical, the answers to the task sets are at the bottom of the file. Just click on the arrows to see them, but don’t look until you’ve attempted the questions yourself. If you haven’t finished Practical 1, I would recommend going back and completing those tasks first.

Task set 1: CO2 data

The CO2 data is located in the data directory. It contains NOAA’s monthly CO2 measurements from Mauna Loa, Hawaii. The three columns are year, month, and CO2 concentration in parts per million.

Tasks:

  1. Load in the data (make sure it’s the .rds file) and create a quick ggplot.

  2. Remove the missing values and then use scale_x_date to create a nice ggplot of the data with date breaks/labels every 5 years and CO2 on the y-axis. Make sure to give it nice labels (if you want to be fancy, make sure to write \(\mbox{CO}_2\) correctly).

  3. Create a ggiraph plot that, when you hover over the points, shows the date and the CO2 value (hint: have a close look at the crimes example in the high_dim_code.R file for some ways of doing this).

  4. Create linear and quadratic regression models of CO2 on year (and year^2 for the quadratic model). Plot the data and the two fitted models on an interactive ggiraph plot that identifies which model you are hovering over and also makes the other lines more transparent (hint: see the time series plot in high_dim_code.R for how to do this).

Task set 2: pollen

We will use the tidymodels package to fit a machine learning model to the pollen data, and then use some of the DALEX tools to create variable importance and partial dependence plots.

Tasks:

  1. Load in the pollen data. Use ggpairs and/or corrplot to look at the relationship between MTCO and the 7 pollen taxa counts.

  2. Use the tidymodels commands in class 6 to create training and test sets, and thus train a ranger random forest model to predict MTCO from the 7 pollen varieties. Evaluate the performance of the model by plotting the predicted values in the test set against the true MTCO values.

  3. Use the DALEX package to create partial dependence plots for the different pollen varieties and some instance level explanations. Note down any interesting relationships you find. (Hint: use the function explain_tidymodels just as we used explain_mlr3).

Answers

Task set 1
# 1 and 2. Load in the data and plot
library(tidyverse) # for ggplot2, dplyr, tidyr, readr
library(ggiraph)   # for the interactive plots
library(latex2exp)
CO2 <- readRDS(file = 'data/CO2.rds')
CO2 %>% glimpse

ggplot(CO2 %>% filter(CO2_ppm > 0), 
       aes(x = date, y = CO2_ppm)) + 
  geom_line() + 
  scale_x_date(date_breaks = "5 years",
               date_labels = "%Y") +
  labs(x = "Year", y = TeX("$CO_2$ (ppm)"),
       title = TeX("$CO_2$ in Mauna Loa, Hawaii")) + 
  theme_minimal()
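
A quick sanity check before moving on: count how many rows the filter removes. This sketch assumes (as the filter above implies) that missing CO2 values are coded as negative numbers rather than NA.

# Sanity check: how many rows does the missing-value filter drop?
# Assumes missing values are coded as negative numbers (or NA)
CO2 %>% filter(CO2_ppm <= 0 | is.na(CO2_ppm)) %>% nrow()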

# 3. ggiraph
p <- ggplot(
  data = CO2 %>% filter(CO2_ppm > 0) %>% 
    unite("dat", c("date", "CO2_ppm"), 
          sep = ", ",
          remove = FALSE),
  aes(
    x = date,
    y = CO2_ppm
  )
) +
  geom_line() +
  geom_point_interactive(
    aes(tooltip = dat,
        data_id = dat),
    size = 0.5) + # Keep the points small so they don't swamp the line
  labs(x = "Year", y = TeX("$CO_2$ (ppm)"),
       title = TeX("$CO_2$ in Mauna Loa, Hawaii"))
girafe(ggobj = p, width = 10)
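
If you prefer not to create a combined column with unite(), a rough alternative is to build the tooltip string directly inside aes() with paste0(); p_alt below is just an illustrative name.

# Alternative sketch: build the tooltip text inside aes() with paste0()
p_alt <- ggplot(CO2 %>% filter(CO2_ppm > 0),
                aes(x = date, y = CO2_ppm)) +
  geom_line() +
  geom_point_interactive(
    aes(tooltip = paste0(date, ", ", CO2_ppm),
        data_id = as.character(date)),
    size = 0.5) +
  labs(x = "Year", y = TeX("$CO_2$ (ppm)"))
girafe(ggobj = p_alt, width = 10)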

# 4. Linear and quadratic regression models
CO2_mod_1 <- lm(CO2_ppm ~ date, 
                data = CO2 %>% filter(CO2_ppm > 0))
# poly() needs a numeric input, so convert the date first
CO2_mod_2 <- lm(CO2_ppm ~ poly(as.numeric(date), 2), 
                data = CO2 %>% filter(CO2_ppm > 0))

p <- ggplot(data = CO2 %>% filter(CO2_ppm > 0) %>%
         mutate(lin_fit = fitted(CO2_mod_1),
                quad_fit = fitted(CO2_mod_2)) %>% 
         pivot_longer(c(CO2_ppm, lin_fit, quad_fit),
                      names_to = "Type",
                      values_to = "Value"),
       aes(x = date, y = Value, 
           colour = Type, tooltip = Type,
           data_id = Type)) + 
  geom_line_interactive() + 
  theme(legend.position = 'none')
girafe(ggobj = p, width = 10,
       options = list(
         opts_hover_inv(css = "opacity:0.4;"),
         opts_hover(css = "stroke-width:1.5;")
       ))
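
If you also want a numeric comparison of the two fits to go with the interactive plot, here is a short sketch:

# Numeric comparison of the two models (lower AIC is better)
AIC(CO2_mod_1, CO2_mod_2)
summary(CO2_mod_1)$r.squared
summary(CO2_mod_2)$r.squared
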
Task set 2
# 1. Load in and use ggpairs and corrplot
library(GGally)
library(corrplot)
pollen <- read_csv("data/pollen.csv")
ggpairs(pollen,
  upper = list(continuous = "density"),
  lower = list(
    continuous = "smooth",
    combo = "facetdensity"
  )
) # Pinus D looks like the interesting one

corrplot(cor(pollen)) # Though Picea has the strongest linear correlation
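
Optionally, corrplot can print the coefficients and reorder the variables so that similar taxa sit together; method and order are standard corrplot arguments.

# Optional: show the coefficients and cluster similar variables
corrplot(cor(pollen), method = "number", order = "hclust")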

# 2. Fit the ranger model
library(tidymodels)
library(ranger)

# Split the data into training and testing sets
set.seed(123)
# DON'T FORGET TO GET RID OF GDD5!
pollen_split <- initial_split(pollen %>%
                                select(-GDD5), 
                              prop = 0.8)
pollen_train <- training(pollen_split)
pollen_test <- testing(pollen_split)

# Define the model specification
rf_pollen <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")
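
If you also want ranger's own impurity-based importance scores (handy to compare against the DALEX results in task 3), you can pass importance = "impurity" through set_engine(). This is an optional variation on the spec above; rf_pollen_imp is just an illustrative name, and you would refit with it before extracting the scores.

# Optional variation: record impurity-based importance inside ranger
rf_pollen_imp <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")
# After refitting with this spec, the scores are available via
# extract_fit_engine(<fitted model>)$variable.importance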

# Fit the model to the training data
rf_fit <- rf_pollen %>% 
  fit(MTCO ~ ., 
      data = pollen_train)

# Make predictions on the test data
rf_preds <- rf_fit %>% predict(pollen_test)

# Evaluate the model performance
rf_preds %>% 
  bind_cols(pollen_test) %>% 
  dplyr::select(.pred, MTCO) %>% 
  ggplot(aes(x = MTCO, y = .pred)) + 
  geom_point() + 
  geom_abline() +
  labs(x = "True", y = "Predicted")
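
Alongside the plot, yardstick (loaded with tidymodels) can summarise the test-set performance numerically; this sketch reuses the rf_preds and pollen_test objects from above.

# Numeric test-set performance (RMSE, R-squared, MAE)
rf_preds %>% 
  bind_cols(pollen_test) %>% 
  metrics(truth = MTCO, estimate = .pred)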
  
# 3. Use DALEX
library(DALEX)
library(DALEXtra)
ranger_exp <- explain_tidymodels(rf_fit,
  data     = pollen %>% select(-MTCO, -GDD5),
  y        = pollen$MTCO,
  label    = "Ranger RF",
  colorize = TRUE)
ranger_vi <- model_parts(ranger_exp)
plot(ranger_vi) # Feature importance

# Partial dependence
num_features <- names(pollen)[-c(1:2)] # assumes GDD5 and MTCO are the first two columns
pollen_pd <- model_profile(ranger_exp,
  variables = num_features)$agr_profiles
plot(pollen_pd)

# Instance-level explanations
pollen_case1 <- pollen[1, ]
ile_ranger <- predict_parts(ranger_exp,
  new_observation = pollen_case1)
plot(ile_ranger)
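
predict_parts() defaults to a break-down decomposition; as a sketch, you can also ask for SHAP-style contributions for the same observation via its type argument (ile_shap is just an illustrative name).

# SHAP-style contributions for the same case
ile_shap <- predict_parts(ranger_exp,
  new_observation = pollen_case1,
  type = "shap")
plot(ile_shap)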