MATH 427: Regularization & Model Tuning

Eric Friedlander

Computational Set-Up

library(tidyverse)
library(tidymodels)
library(knitr)
library(fivethirtyeight) # for candy rankings data

tidymodels_prefer()

set.seed(427)

Data: Candy

The data for this lecture comes from the FiveThirtyEight article The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect the data, Hickey and collaborators at FiveThirtyEight set up an experiment in which people voted on a series of randomly generated candy match-ups (e.g. Reese’s vs. Skittles). Click here to check out some of the match-ups.

The data set contains 12 characteristics and the win percentage for the 85 candies in the experiment.

Data: Candy

glimpse(candy_rankings)
Rows: 85
Columns: 13
$ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate        <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Data Cleaning

candy_rankings_clean <- candy_rankings |> 
  select(-competitorname) |> 
  mutate(sugarpercent = sugarpercent*100, # convert proportions into percentages
         pricepercent = pricepercent*100, # convert proportions into percentages
         across(where(is.logical), ~ factor(.x, levels = c("FALSE", "TRUE")))) # convert logicals into factors

Data Cleaning

glimpse(candy_rankings_clean)
Rows: 85
Columns: 12
$ chocolate        <fct> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <fct> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <fct> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <fct> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <fct> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <fct> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 73.2, 60.4, 1.1, 1.1, 90.6, 46.5, 60.4, 31.3, 90.6, 6…
$ pricepercent     <dbl> 86.0, 51.1, 11.6, 51.1, 51.1, 76.7, 76.7, 51.1, 32.5,…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Data Splitting

candy_split <- initial_split(candy_rankings_clean, strata = winpercent) # default 75/25 split, stratified on winpercent
candy_train <- training(candy_split)
candy_test <- testing(candy_split)

Model Tuning

Tuning Parameters

  • Tuning Parameters or Hyperparameters are parameters that cannot (or should not) be estimated when a model is being trained
  • These parameters control something about the learning process and changing them will result in a different model when fit to the full training data (i.e. after cross-validation)
  • Frequently: tuning parameters control model complexity
  • Example: \(\lambda\) in LASSO and Ridge regression (the objectives are written out below)
  • Today: How to choose our tuning parameters?
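
For concreteness, here are the penalized least-squares objectives in their standard form, with \(\lambda \geq 0\) as the tuning parameter:

\[
\text{Ridge:} \quad \min_{\beta} \; \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p \beta_j^2
\]

\[
\text{LASSO:} \quad \min_{\beta} \; \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p |\beta_j|
\]

Larger \(\lambda\) shrinks the coefficients more, so \(\lambda\) directly controls model complexity.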

Question

  • Which of the following are tuning parameters:
    • \(\beta_0\): the intercept of linear regression
    • \(k\) in KNN
    • step size in gradient descent
    • The number of folds in cross-validation
    • Type of distance to use in KNN (e.g. rectangular vs. Gower’s vs. weighted, etc.)

Basic Idea

  • Use CV to try out a bunch of different tuning parameters and choose the “best” one
  • How do we choose which tuning parameters to try?
  • Two general approaches:
    • Grid Search
    • Iterative Search

Grid Search

  • Create a grid of tuning parameters and try out each combination
  • Types of grids:
    • Regular Grid: tuning parameter values are spaced deterministically using a linear or logarithmic scale and all combinations of parameters are used (mostly what you want to use in this class)
    • Irregular Grids: tuning parameter values are chosen stochastically
      • Use when you have A LOT of parameters (a sketch of both grid types follows this list)
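
A minimal sketch of the two grid types using dials (the range and sizes here are arbitrary, for illustration only):

library(dials)

# Regular grid: 5 deterministic, log10-spaced penalty values
grid_regular(penalty(range = c(-5, 0)), levels = 5)

# Irregular grid: 5 penalty values drawn stochastically from the same range
grid_random(penalty(range = c(-5, 0)), size = 5)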

Grid Search in R

  • Take advantage of the package dials, which is part of tidymodels
    • Set every tuning variable equal to tune()

LASSO and Ridge in R

ols <- linear_reg() |> 
  set_engine("lm")

ridge <- linear_reg(mixture = 0, penalty = tune()) |> # penalty sets our lambda
  set_engine("glmnet")

lasso <- linear_reg(mixture = 1, penalty = tune()) |> # penalty sets our lambda
  set_engine("glmnet")

Create Recipe

  • Note: no tuning variables in this case
lm_preproc <- recipe(winpercent ~ . , data = candy_train) |> 
  step_dummy(all_nominal_predictors()) |>    # factors -> indicator columns
  step_normalize(all_numeric_predictors())   # center and scale (important for penalized regression)

Create Workflows

generic_wf <- workflow() |> add_recipe(lm_preproc)
ols_wf <- generic_wf |> add_model(ols)
ridge_wf <- generic_wf |> add_model(ridge)
lasso_wf <- generic_wf |> add_model(lasso)

Create Metric Set

candy_metrics <- metric_set(rmse, rsq)

Create Folds

# Since sample size is so small, not using stratification
candy_folds <- vfold_cv(candy_train, v = 2, repeats = 10) # 2 folds x 10 repeats = 20 resamples

Grid Search in R

  • Take advantage of the package dials, which is part of tidymodels
    • Set every tuning variable equal to tune()
    • Generate grid for hyperparameters

Generate Grid

# note that dials treats penalty on a log10 scale
penalty_grid <- grid_regular(penalty(range = c(-10, 2)), # I had to play around with these 
                             levels = 10)
penalty_grid |> head() |>  kable(digits = 11, format.args = list(scientific = TRUE))
penalty
1.000000e-10
2.150000e-09
4.642000e-08
1.000000e-06
2.154435e-05
4.641589e-04
penalty_grid |> tail() |>  kable()
penalty
0.0000215
0.0004642
0.0100000
0.2154435
4.6415888
100.0000000

Regular Grid Search in R

  • Take advantage of the package dials, which is part of tidymodels
    • Set every tuning variable equal to tune()
    • Generate grid for hyperparameters using grid_regular
    • Tune your model: fit all hyperparameter combinations on resamples using tune_grid

Tune Models

tuning_ridge_results <- tune_grid(
  ridge_wf,
  resamples = candy_folds,
  grid = penalty_grid
)

tuning_lasso_results <- tune_grid(
  lasso_wf,
  resamples = candy_folds,
  grid = penalty_grid
)

Regular Grid Search in R

  • Take advantage of the package dials, which is part of tidymodels
    • Set every tuning variable equal to tune()
    • Generate grid for hyperparameters using grid_regular
    • Tune your model: fit all hyperparameter combinations on resamples using tune_grid
    • Visualize and choose the final model

Visualizing Results: Ridge

autoplot(tuning_ridge_results)

Selecting Best Model: Ridge

Best RMSE

best_rmse_ridge <- tuning_ridge_results |> 
  select_best(metric = "rmse")
best_rmse_ridge |> kable(digits = 2)
penalty .config
4.64 Preprocessor1_Model09

Best \(R^2\)

best_rsq_ridge <- tuning_ridge_results |> 
  select_best(metric = "rsq",)
best_rsq_ridge |> kable(digits = 2)
penalty .config
100 Preprocessor1_Model10
  • Which should we use?

Finalize Model

  • Choosing by RMSE (the smaller penalty) seems to sacrifice less \(R^2\) than choosing by \(R^2\) would sacrifice RMSE, so we finalize the RMSE pick
ridge_rmse_final <- finalize_workflow(ridge_wf, best_rmse_ridge)
ridge_rmse_fit <- fit(ridge_rmse_final, data = candy_train)
tidy(ridge_rmse_fit) |> 
  kable()
term estimate penalty
(Intercept) 49.8221817 4.641589
sugarpercent 1.6105235 4.641589
pricepercent -0.8755351 4.641589
chocolate_TRUE. 5.5626786 4.641589
fruity_TRUE. 0.8290067 4.641589
caramel_TRUE. 0.9577679 4.641589
peanutyalmondy_TRUE. 2.8412296 4.641589
nougat_TRUE. 0.5018516 4.641589
crispedricewafer_TRUE. 1.6202899 4.641589
hard_TRUE. -1.2156466 4.641589
bar_TRUE. 1.4669984 4.641589
pluribus_TRUE. 0.4265386 4.641589

Visualizing Results: LASSO

autoplot(tuning_lasso_results)

Selecting Best Model: LASSO

Best RMSE

best_rmse_lasso <- tuning_lasso_results |> 
  select_best(metric = "rmse")
best_rmse_lasso |> kable(digits = 2)
penalty .config
4.64 Preprocessor1_Model09

Best \(R^2\)

best_rsq_lasso <- tuning_lasso_results |> 
  select_best(metric = "rsq")
best_rsq_lasso |> kable(digits = 2)
penalty .config
4.64 Preprocessor1_Model09
  • Which should we use?

Finalize Model

  • Here RMSE and \(R^2\) select the same penalty, so there is no trade-off; we finalize the RMSE pick
lasso_rmse_final <- finalize_workflow(lasso_wf, best_rmse_lasso)
lasso_rmse_fit <- fit(lasso_rmse_final, data = candy_train)
tidy(lasso_rmse_fit) |> 
  filter(estimate != 0) |> 
  kable()
term estimate penalty
(Intercept) 49.822182 4.641589
chocolate_TRUE. 5.165031 4.641589

Using Parsimony as a Tie-Breaker

  • Good heuristic: One-Standard Error Rule
    • Use resampling to estimate the error metrics
    • Compute the standard error of those estimates
    • Select the most parsimonious model whose performance is within one standard error of the best (a rough manual sketch follows)
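
For intuition, here is a rough manual version of the rule (select_by_one_std_err(), used on the next slides, handles this for you):

tuning_ridge_results |> 
  collect_metrics() |>                  # mean and std_err for each penalty/metric
  filter(.metric == "rmse") |> 
  # keep penalties within one standard error of the best mean RMSE
  filter(mean <= min(mean) + std_err[which.min(mean)]) |> 
  # most parsimonious candidate = largest penalty (most shrinkage)
  slice_max(penalty)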

Selecting Best Model: Ridge

Best RMSE

best_ose_rmse_ridge <- tuning_ridge_results |> 
  select_by_one_std_err(metric = "rmse", desc(penalty)) # why are we descending
best_ose_rmse_ridge |> kable(digits = 2)
penalty .config
4.64 Preprocessor1_Model09

Best \(R^2\)

best_ose_rsq_ridge <- tuning_ridge_results |> 
  select_by_one_std_err(metric = "rsq", desc(penalty))
best_ose_rsq_ridge |> kable(digits = 2)
penalty .config
100 Preprocessor1_Model10
  • Which should we use?

Visualizing Results: Ridge

autoplot(tuning_ridge_results)

Finalize Model

  • Choosing by RMSE (the smaller penalty) seems to sacrifice less \(R^2\) than choosing by \(R^2\) would sacrifice RMSE, so we finalize the RMSE pick
ridge_rmse_final <- finalize_workflow(ridge_wf, best_ose_rmse_ridge)
ridge_rmse_fit <- fit(ridge_rmse_final, data = candy_train)
tidy(ridge_rmse_fit) |> 
  kable()
term estimate penalty
(Intercept) 49.8221817 4.641589
sugarpercent 1.6105235 4.641589
pricepercent -0.8755351 4.641589
chocolate_TRUE. 5.5626786 4.641589
fruity_TRUE. 0.8290067 4.641589
caramel_TRUE. 0.9577679 4.641589
peanutyalmondy_TRUE. 2.8412296 4.641589
nougat_TRUE. 0.5018516 4.641589
crispedricewafer_TRUE. 1.6202899 4.641589
hard_TRUE. -1.2156466 4.641589
bar_TRUE. 1.4669984 4.641589
pluribus_TRUE. 0.4265386 4.641589

Visualizing Results: LASSO

autoplot(tuning_lasso_results)

Selecting Best Model: LASSO

Best RMSE

best_ose_rmse_lasso <- tuning_lasso_results |> 
  select_by_one_std_err(metric = "rmse", desc(penalty)) # why are we descending?
best_ose_rmse_lasso |> kable(digits = 2)
penalty .config
4.64 Preprocessor1_Model09

Best \(R^2\)

best_ose_rsq_lasso <- tuning_lasso_results |> 
  select_by_one_std_err(metric = "rsq", desc(penalty)) # why are we descending?
best_ose_rsq_lasso |> kable(digits = 2)
penalty .config
4.64 Preprocessor1_Model09
  • Which should we use?

Finalize Model

  • Here RMSE and \(R^2\) select the same penalty, so there is no trade-off; we finalize the RMSE pick
lasso_rmse_final <- finalize_workflow(lasso_wf, best_ose_rmse_lasso)
lasso_rmse_fit <- fit(lasso_rmse_final, data = candy_train)
tidy(lasso_rmse_fit) |> 
  filter(estimate != 0) |> 
  kable()
term estimate penalty
(Intercept) 49.822182 4.641589
chocolate_TRUE. 5.165031 4.641589

Using the Test Set as a Tie-Breaker

  • Once you’ve found your “best” candidate from several different classes of model, it’s OK to compare them on the test set
  • In this case, we have our best ridge and our best LASSO model
  • Main thing to avoid… LOTS of comparisons on your test set

Using the Test Set as a Tie-Breaker

candy_test_wpreds <- candy_test |> 
  mutate(ridge_preds = predict(ridge_rmse_fit, new_data = candy_test)$.pred,
         lasso_preds = predict(lasso_rmse_fit, new_data = candy_test)$.pred)
candy_test_wpreds |> rmse(estimate = ridge_preds, truth = winpercent)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        10.6
candy_test_wpreds |> rmse(estimate = lasso_preds, truth = winpercent)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        12.0
candy_test_wpreds |> rsq(estimate = ridge_preds, truth = winpercent)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.476
candy_test_wpreds |> rsq(estimate = lasso_preds, truth = winpercent)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.341
  • Ridge Wins!

Good Strategies

  • First use a coarse grid with a large range to find the general area (see the sketch below)
    • Then use a fine grid with a small range to fine-tune
    • Or use an iterative method to fine-tune
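
A minimal sketch of the coarse-then-fine strategy (the ranges below are illustrative, not tuned values):

# Coarse pass: wide, log10-spaced range to locate the promising region
coarse_grid <- grid_regular(penalty(range = c(-10, 2)), levels = 10)

# ...tune on coarse_grid and suppose the best penalties land near 10^0 to 10^1...

# Fine pass: zoom in on that region with a denser grid
fine_grid <- grid_regular(penalty(range = c(0, 1)), levels = 20)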

Iterative Methods (Bonus Content)

Brief Overview

  • Basic Idea: iteratively select new parameter values to try based on how previous ones have performed
  • TMWR gives two examples:
    • Bayesian search
    • Simulated annealing
  • Both require an understanding of multivariate probability distributions

Bayesian Search: GP

  • Assume performance metrics follow a Gaussian process… dafuq
  • Consider two sets of tuning parameters: \(x_1 = (\lambda_1, k_1)\) and \(x_2 = (\lambda_2, k_2)\)
  • Let \(Y_1\) and \(Y_2\) be random variables representing the performance metric at these \(x\)’s
  • Assume \(\text{Cov}(Y_1, Y_2)\) depends on how far apart \(x_1\) and \(x_2\) are
    • Covariance should decrease the further apart \(x_1\) and \(x_2\) are
    • This covariance function is typically parameterized and estimated from an initial tuning grid (one common choice is shown below)
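
For example, one common choice of covariance function (an illustration; tune’s internals may use a different kernel) is the squared-exponential:

\[
\text{Cov}(Y_1, Y_2) = \sigma^2 \exp\left(-\frac{\|x_1 - x_2\|^2}{2\ell^2}\right)
\]

where the scale \(\sigma^2\) and length-scale \(\ell\) are estimated from the initial tuning results. As \(\|x_1 - x_2\| \to \infty\) the covariance decays to 0, exactly the behavior described above.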

Bayesian Search: Acquisition Function

  • The GP gives us a predicted mean and variance for the performance metric at any candidate parameter combination
  • Imagine two scenarios:
    • 1: Predicted mean is slightly better than the current parameter choice and variance is low
      • Low-Risk, Low-Reward
    • 2: Predicted mean is slightly worse than the current parameter choice but variance is high
      • High-Risk, High-Reward
  • How do we balance?
  • Two competing goals: Exploration (go toward high variance) and Exploitation (go toward the best mean)
  • Acquisition function: balances these two goals (several to choose from; expected improvement is sketched below)
  • Figure source: TMWR
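
One widely used acquisition function, and the default for tune_bayes() (its objective argument defaults to exp_improve()), is expected improvement. For a metric we minimize, like RMSE, with GP predictive mean \(\mu(x)\), standard deviation \(\sigma(x)\), and current best value \(f^*\):

\[
\text{EI}(x) = \left(f^* - \mu(x)\right)\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{f^* - \mu(x)}{\sigma(x)}
\]

where \(\Phi\) and \(\phi\) are the standard normal CDF and PDF. Both a small \(\mu(x)\) (exploitation) and a large \(\sigma(x)\) (exploration) increase EI, which is exactly the balance we want.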

Bayesian Search: tidymodels

ridge_params <- ridge_wf |> 
  extract_parameter_set_dials() |> 
  update(penalty = penalty(c(-2, 1)))

bayes_ridge <- ridge_wf |> 
  tune_bayes(
    resamples = candy_folds,
    metrics = candy_metrics, # first metric is what's optimized (rmse in this case)
    initial = tuning_ridge_results,
    param_info = ridge_params,
    iter = 25
  )

Bayesian Search: tidymodels

autoplot(bayes_ridge, type = "performance")

Bayesian Search: tidymodels

autoplot(bayes_ridge, type = "parameters")

Bayesian Search: tidymodels

show_best(bayes_ridge) |> kable()
penalty .metric .estimator mean n std_err .config .iter
9.999186 rmse standard 13.20055 20 0.2720582 Iter11 11
9.999034 rmse standard 13.20055 20 0.2720587 Iter8 8
9.996720 rmse standard 13.20057 20 0.2720676 Iter7 7
9.996350 rmse standard 13.20057 20 0.2720690 Iter9 9
9.995326 rmse standard 13.20058 20 0.2720729 Iter10 10

Bayesian Search: tidymodels

lasso_params <- lasso_wf |> 
  extract_parameter_set_dials() |> 
  update(penalty = penalty(c(-2,1)))

bayes_lasso <- lasso_wf |> 
  tune_bayes(
    resamples = candy_folds,
    metrics = candy_metrics, # first metric is what's optimized (rmse in this case)
    initial = tuning_lasso_results,
    param_info = lasso_params,
    iter = 25
  )

Bayesian Search: tidymodels

autoplot(bayes_lasso, type = "performance")

Bayesian Search: tidymodels

autoplot(bayes_lasso, type = "parameters")

Bayesian Search: tidymodels

show_best(bayes_lasso) |> kable()
penalty .metric .estimator mean n std_err .config .iter
2.936401 rmse standard 12.76520 20 0.3059454 Iter16 16
2.937813 rmse standard 12.76520 20 0.3059406 Iter20 20
2.934824 rmse standard 12.76520 20 0.3059506 Iter19 19
2.940832 rmse standard 12.76521 20 0.3059305 Iter25 25
2.932606 rmse standard 12.76521 20 0.3059580 Iter22 22

Simulated Annealing

  • Does anyone know what annealing is in Physics/Material Science?
  • Start with an initial parameter combination \(x_0\) (think gradient descent)
  • Embark on a random walk… in each step \(i\):
    • Consider a small perturbation of \(x_{i-1}\)… let’s call it \(x_{i-1}^{\epsilon}\)
    • Get performance for \(x_{i-1}^\epsilon\)
    • If performance is better, set \(x_i = x_{i-1}^\epsilon\)
    • If performance is worse, set \(x_i = x_{i-1}^\epsilon\) with probability \(\exp(-c\times D_i\times i)\)
      • \(c\) is a user-specified constant called the cooling coefficient
      • \(D_i\) is the percent difference between the old and new performance
    • Otherwise set \(x_i = x_{i-1}\)
  • Idea:
    • Start hot: likely to accept suboptimal parameter combinations and move around
    • Cool over time: accept fewer sub-optimal moves (a quick numeric illustration follows)
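
A quick numeric illustration of the acceptance rule, using made-up values for \(c\), \(D_i\), and \(i\) (in finetune, the cooling coefficient is set via control_sim_anneal()):

c <- 0.02  # cooling coefficient (hypothetical value)
D_i <- 2   # candidate performance is 2% worse than the current one
i <- 10    # iteration number
exp(-c * D_i * i)  # probability of accepting the worse candidate
[1] 0.67032

The same 2% deficit at iteration 50 would only be accepted with probability \(\exp(-0.02 \times 2 \times 50) \approx 0.14\): the walk “cools” and accepts fewer bad moves over time.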

Simulated Annealing: tidymodels

library(finetune)
ridge_sim_anneal <- ridge_wf |> 
  tune_sim_anneal(
    resamples = candy_folds,
    metrics = candy_metrics, # first metric is what's optimized (rmse in this case)
    initial = tuning_ridge_results,
    param_info = ridge_params,
    iter = 50
  )

Simulated Annealing: tidymodels

autoplot(ridge_sim_anneal, type = "performance")

Simulated Annealing: tidymodels

autoplot(ridge_sim_anneal, type = "parameters")

Simulated Annealing: tidymodels

show_best(ridge_sim_anneal) |> kable()
penalty .metric .estimator mean n std_err .config .iter
10.000000 rmse standard 13.20054 20 0.2720551 Iter19 19
9.513817 rmse standard 13.20542 20 0.2739433 Iter42 42
9.079220 rmse standard 13.21026 20 0.2758191 Iter31 31
8.818478 rmse standard 13.21363 20 0.2770621 Iter9 9
8.415657 rmse standard 13.21916 20 0.2790967 Iter45 45

Simulated Annealing: tidymodels

lasso_sim_anneal <- lasso_wf |> 
  tune_sim_anneal(
    resamples = candy_folds,
    metrics = candy_metrics, # first metric is what's optimized (rmse in this case)
    initial = tuning_lasso_results,
    param_info = lasso_params,
    iter = 50
  )

Simulated Annealing: tidymodels

autoplot(lasso_sim_anneal, type = "performance")

Simulated Annealing: tidymodels

autoplot(lasso_sim_anneal, type = "parameters") +
  scale_x_continuous(trans = 'log10')

Simulated Annealing: tidymodels

show_best(lasso_sim_anneal) |> kable()
penalty .metric .estimator mean n std_err .config .iter
4.641589 rmse standard 13.09389 20 0.3774972 initial_Preprocessor1_Model09 0
9.934398 rmse standard 13.20114 20 0.2723052 Iter39 39
9.189444 rmse standard 13.20896 20 0.2752983 Iter25 25
8.786187 rmse standard 13.21407 20 0.2772191 Iter37 37
8.333199 rmse standard 13.22037 20 0.2795509 Iter26 26