MATH 427: Bagging and Boosting

Eric Friedlander

Computational Set-Up

library(tidyverse)
library(tidymodels)
library(rpart.plot)
library(knitr)
library(kableExtra)

tidymodels_prefer()

set.seed(427)

Ensemble Methods

  • Single regression or classification trees usually have poor predictive performance.
  • Ensemble Methods: use a collection of models (in this case, decision trees) to improve predictive performance
    • Downside: reduced interpretability
  • Today:
    • Bagging
    • Random Forests
    • Boosting

Flexibility vs. Interpretability

Adapted from ISLR, James et al.

Bagging

  • Bootstrap aggregation or bagging is a general-purpose procedure for reducing the variance of a statistical learning method.
  • Idea: Build multiple trees and average their results.
  • Key fact: Given a set of \(n\) independent observations (random variables) \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\), the variance of their mean \(\bar{Z} = \dfrac{Z_1 + Z_2 + \cdots + Z_n}{n}\) is \(\sigma^2/n\).
    • In other words, averaging a set of observations reduces variance (see the short simulation after this list).
  • In reality, we do not have multiple training datasets, so we create them by bootstrapping (next slide).
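A minimal simulation sketch of this variance-reduction fact, assuming standard normal observations purely for illustration:
# Sketch: the variance of the mean of n independent observations is sigma^2 / n.
# Here Z_i ~ N(0, 1), so sigma^2 = 1 and Var(Z-bar) should be close to 1/n.
set.seed(427)

n <- 25
zbar <- replicate(10000, mean(rnorm(n)))  # 10,000 simulated means of n observations

var(zbar)  # approximately 1/n
1 / n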

Bootstrapping

Adapted from ISLR, James et al.

Bagging

  • Take repeated bootstrap samples (say \(B\)) from the original dataset.
  • Build tree on each bootstrap sample and obtain predictions \(\hat{f}^{*b}(x), \ b=1, 2, \ldots, B\).
  • Average all the predictions: \[\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)\]
  • Trees are not pruned: they have high variance but low bias.
  • Classification: take a majority vote; the overall prediction is the most commonly occurring class among the \(B\) predictions (a by-hand sketch follows this list).
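A minimal by-hand sketch of the bagging procedure for classification, using rpart and the built-in iris data purely for illustration (the voter data example below uses the ranger engine instead):
library(rpart)

set.seed(427)
B <- 100
n <- nrow(iris)

# Grow one deep, unpruned tree per bootstrap sample and predict the class of a new point
new_x <- iris[1, -5]
preds <- replicate(B, {
  boot_idx <- sample(n, n, replace = TRUE)
  tree <- rpart(Species ~ ., data = iris[boot_idx, ],
                control = rpart.control(cp = 0, minsplit = 2))
  as.character(predict(tree, new_x, type = "class"))
})

# Majority vote across the B trees
names(sort(table(preds), decreasing = TRUE))[1]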

Out-of-Bag Error Estimation

  • Bagging \(\Rightarrow\) fitting lots of models \(\Rightarrow\) computationally taxing
  • Bagging + Cross-validation \(\Rightarrow\) EXTREMELY COMPUTATIONALLY TAXING
  • On average, each bagged tree (constructed on each bootstrap sample) makes use of around two-thirds of the observations.
  • The remaining third of the observations are referred to as out-of-bag (OOB) observations
  • For the \(i^{th}\) observation, use only the trees for which that observation was OOB. This yields around \(B/3\) predictions for the \(i^{th}\) observation; average them (or take a majority vote) to obtain a single prediction (see the sketch after this list)
  • With \(B\) sufficiently large, OOB error is roughly equivalent to LOOCV error
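A minimal sketch of computing OOB error by hand, again with rpart and iris purely for illustration:
library(rpart)

set.seed(427)
B <- 100
n <- nrow(iris)

# Store each tree's OOB predictions in an n x B character matrix
oob_pred <- matrix(NA_character_, nrow = n, ncol = B)
for (b in seq_len(B)) {
  boot_idx <- sample(n, n, replace = TRUE)
  oob_idx <- setdiff(seq_len(n), boot_idx)  # roughly 1/3 of rows are out-of-bag
  tree <- rpart(Species ~ ., data = iris[boot_idx, ],
                control = rpart.control(cp = 0, minsplit = 2))
  oob_pred[oob_idx, b] <- as.character(predict(tree, iris[oob_idx, ], type = "class"))
}

# Majority vote over each observation's OOB predictions, then compare to the truth
vote <- apply(oob_pred, 1, \(p) names(sort(table(p), decreasing = TRUE))[1])
mean(vote != iris$Species)  # OOB misclassification rate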

Variable Importance Measures

  • Bagging improves prediction accuracy at the expense of interpretability
  • We can still obtain an overall summary of the importance of each predictor (see the sketch after this list)
    • The reduction in the loss function (e.g., SSE for regression, Gini impurity for classification) attributed to each variable at each split is tabulated
    • A single variable can be used multiple times in a tree
    • The total reduction in the loss function across all splits on a variable is summed up and used as that variable's total feature importance
    • A large value indicates an important predictor.
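A minimal sketch of extracting impurity-based importance scores, using ranger on the built-in iris data purely for illustration (the voter data example follows in the next section):
library(ranger)

set.seed(427)

# mtry set to the number of predictors makes this bagging; importance = "impurity"
# accumulates the (Gini) impurity decrease attributed to each variable
bag_iris <- ranger(Species ~ ., data = iris, num.trees = 500,
                   mtry = ncol(iris) - 1, importance = "impurity")

sort(importance(bag_iris), decreasing = TRUE)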

Bagging Implementation in R

Data: Voter Frequency

  • Data: FiveThirtyEight’s non-voters survey data (loaded from the URL below)
  • Goal: Identify individuals who are unlikely to vote to help organization target “get out the vote” effort.
voter_data <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv')

voter_clean <- voter_data |> 
  select(-RespId, -weight, -Q1) |>
  mutate(
    educ = factor(educ, levels = c("High school or less", "Some college", "College")),
    income_cat = factor(income_cat, levels = c("Less than $40k", "$40-75k ",
                                               "$75-125k", "$125k or more")),
    voter_category = factor(voter_category, levels = c("rarely/never", "sporadic", "always"))
  ) |> 
  filter(Q22 != 5 | is.na(Q22)) |> 
  mutate(# Recode -1 to 0, convert to factors, and replace NAs with "Not Asked"
         Q22 = as_factor(Q22),
         Q22 = if_else(is.na(Q22), "Not Asked", Q22),
         across(Q28_1:Q28_8, ~if_else(.x == -1, 0, .x)),
         across(Q28_1:Q28_8, ~as_factor(.x)),
         across(Q28_1:Q28_8, ~if_else(is.na(.x), "Not Asked", .x)),
         across(Q29_1:Q29_10, ~if_else(.x == -1, 0, .x)),
         across(Q29_1:Q29_10, ~as_factor(.x)),
         across(Q29_1:Q29_10, ~if_else(is.na(.x), "Not Asked", .x)),
        Party_ID = as_factor(case_when(
          Q31 == 1 ~ "Strong Republican",
          Q31 == 2 ~ "Republican",
          Q32 == 1  ~ "Strong Democrat",
          Q32 == 2 ~ "Democrat",
          Q33 == 1 ~ "Lean Republican",
          Q33 == 2 ~ "Lean Democrat",
          TRUE ~ "Other"
        )),
        Party_ID = factor(Party_ID, levels =c("Strong Republican", "Republican", "Lean Republican",
                                                "Other", "Lean Democrat", "Democrat", "Strong Democrat")),
        across(!ppage, ~as_factor(if_else(.x == -1, NA, .x))))

Split Data

set.seed(427)

voter_splits <- initial_split(voter_clean, prop = 0.7, strata = voter_category)
voter_train <- training(voter_splits)
voter_test <- testing(voter_splits)

Define Model

  • trees is only loosely a tuning parameter: we want a value large enough for performance to stabilize; beyond that point, adding more trees changes little (a small check is sketched after the model specification below)
# mtry = .cols() uses every predictor at each split, which makes this bagging
bagged_model <- rand_forest(trees = 500, mtry = .cols()) |> 
  set_engine("ranger", importance = "impurity") |>  # impurity-based variable importance
  set_mode("classification")

Define Recipe

bag_recipe <- recipe(voter_category ~ . , data = voter_train) |> 
  step_indicate_na(all_predictors()) |>  # add indicator columns flagging missing values
  step_zv(all_predictors()) |>  # drop zero-variance predictors
  step_integer(educ, income_cat, Party_ID, Q2_2:Q4_6, Q6, Q8_1:Q9_4, Q14:Q17_4,
               Q25:Q26) |>  # encode these ordered factors as integers
  step_impute_median(all_numeric_predictors()) |>  # fill numeric NAs with the median
  step_impute_mode(all_nominal_predictors()) |>  # fill factor NAs with the most common level
  step_dummy(all_nominal_predictors(), one_hot = TRUE)  # one-hot encode remaining factors

Define Workflow and Fit

# need to install ranger
bag_wf <- workflow() |> 
  add_model(bagged_model) |> 
  add_recipe(bag_recipe)

bagged_fit <- bag_wf |> 
  fit(voter_train)

Evaluate performance

augment(bagged_fit, new_data = voter_test) |> 
  accuracy(truth = voter_category, estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.648
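Because the ranger engine tracks out-of-bag predictions while fitting, an OOB error estimate is also available without touching the test set (a small sketch; extract_fit_engine() returns the underlying ranger object):
# OOB misclassification rate reported by ranger (compare with the test accuracy above)
bagged_fit |> 
  extract_fit_engine() |> 
  pluck("prediction.error")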

Variable Importance

library(vip)
vip(bagged_fit)

Bagging: Disadvantages

  • Improves prediction performance but reduces interpretability
  • Trees in bagging are not completely independent of each other, since all the original features are considered at every split of every tree
  • Tree correlation: trees from different bootstrap samples typically have similar structure to each other (especially at the top of the tree)
    • This prevents bagging from further reducing the variance of the individual models
  • Random forests extend and improve upon bagged decision trees by reducing this correlation and thereby improving the accuracy of the overall ensemble.

Bagging: Disadvantages

Adapted from Hands-On Machine Learning, Boehmke & Greenwell

Random Forests

Random Forests

  • De-correlates bagged trees \(\Rightarrow\) reducing variance
  • As in bagging, we build a number of decision trees on bootstrapped training samples.
  • Algorithm: do the following to build each tree
    • Generate bootstrapped sample of training data
    • Randomly select \(m\) predictors
    • Pick the best variable/split point from these \(m\)
    • Split the node into two child nodes
    • Stop when the typical stopping criteria are hit (no pruning)
  • Note: A fresh sample of \(m\) predictors is taken at each split
  • Typically \(m = p/3\) for regression and \(m = \sqrt{p}\) for classification, but this should be treated as a tuning parameter (a small sketch of these rules of thumb follows this list)
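A minimal sketch of those rules of thumb, fitting a small classification forest with ranger on the built-in iris data purely for illustration:
library(ranger)

set.seed(427)

p <- ncol(iris) - 1  # number of predictors
m_class <- floor(sqrt(p))  # ~ sqrt(p) for classification
m_reg <- floor(p / 3)  # ~ p/3 for regression

# Classification forest with the sqrt(p) rule; in practice mtry is tuned (as below)
rf_iris <- ranger(Species ~ ., data = iris, num.trees = 500, mtry = m_class)
rf_iris$prediction.error  # OOB misclassification rate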

Define Model

  • trees is only loosely a tuning parameter: we pick a value large enough for performance to stabilize (as sketched earlier); here mtry is the parameter we actually tune
rf_model <- rand_forest(trees = 100, mtry = tune()) |> 
  set_engine("ranger", importance = "impurity") |> 
  set_mode("classification")

Define Recipe

rf_recipe <- recipe(voter_category ~ . , data = voter_train) |> 
  step_indicate_na(all_predictors()) |> 
  step_zv(all_predictors()) |> 
  step_integer(educ, income_cat, Party_ID, Q2_2:Q4_6, Q6, Q8_1:Q9_4, Q14:Q17_4,
               Q25:Q26) |> 
  step_impute_median(all_numeric_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |> 
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

Define Workflow and Fit

# need to install ranger
rf_wf <- workflow() |> 
  add_model(rf_model) |> 
  add_recipe(rf_recipe)

Upper Limit for mtry

rf_model |> extract_parameter_set_dials()
Collection of 1 parameters for tuning

 identifier type    object
       mtry mtry nparam[?]

Model parameters needing finalization:
   # Randomly Selected Predictors ('mtry')

See `?dials::finalize` or `?dials::update.parameters` for more information.

Extract Number of Features

rf_param <- rf_model |> extract_parameter_set_dials() |> finalize(voter_train)
rf_param
Collection of 1 parameters for tuning

 identifier type    object
       mtry mtry nparam[+]

Tune mtry

voter_folds <- vfold_cv(voter_train, v = 5, repeats = 10, strata = voter_category)

mtry_grid <- grid_regular(rf_param, levels = 10)

tuning_results <- tune_grid(
  rf_wf,
  resamples= voter_folds,
  grid = mtry_grid
)

Results

autoplot(tuning_results)
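To see the top candidates as numbers rather than a plot, show_best() can be used (a small sketch; it assumes accuracy is among the metrics collected by tune_grid(), which it is by default for classification):
# Best-performing mtry values, ranked by cross-validated accuracy
show_best(tuning_results, metric = "accuracy", n = 5)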

Evaluate performance

rf_fit <-  rf_wf |> 
  finalize_workflow(select_best(tuning_results)) |> 
  fit(voter_train)

augment(rf_fit, new_data = voter_test) |> 
  accuracy(truth = voter_category, estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.658

Variable Importance

library(vip)
vip(rf_fit)