MATH 427: Feature Selection and Regularization

Eric Friedlander

Computational Set-Up

library(tidyverse)
library(tidymodels)
library(knitr)
library(fivethirtyeight) # for candy rankings data

tidymodels_prefer()

set.seed(427)

Data: Candy

The data for this lecture comes from the FiveThirtyEight article The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect data, Hickey and collaborators at FiveThirtyEight set up an experiment in which people could vote on a series of randomly generated candy matchups (e.g., Reese’s vs. Skittles). Click here to check out some of the matchups.

The data set contains 12 characteristics and the win percentage for the 85 candies in the experiment.

Data: Candy

glimpse(candy_rankings)
Rows: 85
Columns: 13
$ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate        <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Data Cleaning

candy_rankings_clean <- candy_rankings |> 
  select(-competitorname) |> 
  mutate(sugarpercent = sugarpercent*100, # convert proportions into percentages
         pricepercent = pricepercent*100, # convert proportions into percentages
         across(where(is.logical), ~ factor(.x, levels = c("FALSE", "TRUE")))) # convert logicals into factors

Data Cleaning

glimpse(candy_rankings_clean)
Rows: 85
Columns: 12
$ chocolate        <fct> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <fct> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <fct> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <fct> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <fct> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <fct> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 73.2, 60.4, 1.1, 1.1, 90.6, 46.5, 60.4, 31.3, 90.6, 6…
$ pricepercent     <dbl> 86.0, 51.1, 11.6, 51.1, 51.1, 76.7, 76.7, 51.1, 32.5,…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Data Splitting

candy_split <- initial_split(candy_rankings_clean, prop = 0.75, strata = winpercent)
candy_train <- training(candy_split)
candy_test <- testing(candy_split)

Feature Selection for Linear Regression

What is feature selection?

  • How do we choose what variables to include in our model?
  • Up to now… we’ve included all of them… probably not the best approach
  • Advantage of linear regression: interpretability
  • Including every feature decreases interpretability
  • Reasons for feature selection:
    • Improve model performance
    • Improve model interpretability
  • Parsimony: simpler models are called more parsimonious
  • Occam’s Razor: more parsimonious models are better than less parsimonious models, holding all else constant

Types of feature selection

  • Subset selection: Forward/Backward/Best-Subset Selection
  • Shrinkage-based methods: LASSO and Ridge Regression
  • Dimension reduction: consider linear combinations of predictors

Subset Selection

Exercise

  • With your group, write out the steps for the following algorithms on the board
    • Group 1: Forward selection
    • Group 2: Backward elimination
    • Group 3: Step-wise selection
    • Group 4: Best-subset selection

Subset Selection in R

  • tidymodels does not have a built-in implementation of any subset selection technique
  • Regularization (shrinkage-based) methods almost always perform better
  • The colino package provides a tidymodels implementation
  • Other options (see the sketch after this list)
    • caret package
    • olsrr and blorr packages if you don’t care about cross-validation
    • implement it yourself
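
As a hedged illustration (not from the original slides), here is one way forward selection with cross-validation could look using caret; the settings (5-fold CV, subset sizes 1–11) are arbitrary choices for this data set.

library(caret)   # note: caret overlaps with some tidymodels function names (e.g., sensitivity)

# Forward selection (leaps under the hood), choosing the subset size by 5-fold CV
forward_cv <- train(
  winpercent ~ .,
  data = candy_train,
  method = "leapForward",
  tuneGrid = data.frame(nvmax = 1:11),                  # candidate subset sizes
  trControl = trainControl(method = "cv", number = 5)
)

forward_cv$bestTune                                      # subset size with the lowest CV RMSE
coef(forward_cv$finalModel, forward_cv$bestTune$nvmax)   # coefficients of the selected subset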

Feature Selection in R

  • When creating your recipe, you don’t always need to include every variable in the formula:
int_recipe <- recipe(resp ~ var1 + var2, data = training_data) |>  # placeholder names
  step_interact(~ var1:var2) |>  # interactions go in a step, not in the recipe formula
  step_x(...)

Re-using a Workflow but Changing the Formula

  • remove_formula() and add_formula() come from the workflows package, so they operate on a workflow object rather than on the recipe itself:

noint_wflow <- int_wflow |>  # placeholder: a workflow whose preprocessor is currently a formula
  remove_formula() |> 
  add_formula(resp ~ var1 + var2)

Shrinkage/Penalized/Regularization Methods

Question

  • What criteria do we use to fit a linear regression model? Write down an expression with \(\hat{\beta_i}\)’s, \(x_{ij}\)’s, and \(y_j\)’s in it.

OLS

  • Ordinary Least Squares Regression:

\[ \begin{aligned} \hat{\beta} =\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}SSE(\hat{\beta}) &= \underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\sum_{j=1}^n(y_j-\hat{y}_j)^2\\ &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\sum_{j=1}^n(y_j-\hat{\beta}_0-\hat{\beta}_1x_{1j} - \cdots - \hat{\beta}_px_{pj})^2 \end{aligned} \]

  • \(\hat{\beta} = (\hat{\beta}_0,\ldots, \hat{\beta}_p)\) is the vector of all my coefficients
  • \(\operatorname{argmin}\) is a function (operator) that returns the arguments that minimize the quantity it’s being applied to (a quick lm() illustration follows below)
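
As a quick illustration (not part of the original slides), lm() computes exactly this argmin on the candy training data:

# OLS on the candy training data: lm() returns the beta-hats that minimize the SSE above
ols_fit <- lm(winpercent ~ ., data = candy_train)
coef(ols_fit)               # the fitted coefficient vector beta-hat
sum(residuals(ols_fit)^2)   # the minimized SSE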

Ridge Regression

\[ \begin{aligned} \hat{\beta} &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}} \left(SSE(\hat{\beta}) + \lambda\|\hat{\beta}\|_2^2\right) \\ &= \underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{y}_j)^2 + \lambda\sum_{i=1}^p \hat{\beta}_i^2\right)\\ &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{\beta}_0-\hat{\beta}_1x_{1j} - \cdots - \hat{\beta}_px_{pj})^2 + \lambda\sum_{i=1}^p \hat{\beta}_i^2\right) \end{aligned} \]

  • \(\lambda\) is called a “tuning parameter” and is not estimated from the data (more on this later; see the sketch after this list)
  • Idea: penalize large coefficients
  • If we penalize large coefficients, what’s going to happen to our estimated coefficients? THEY SHRINK!
  • \(\|\cdot\|_2\) is called the \(L_2\)-norm
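
A minimal tidymodels sketch of ridge regression on the candy data, assuming the glmnet package is installed; the object names and the fixed penalty of 0.1 are placeholders rather than course code:

# Ridge regression via glmnet: mixture = 0 gives the pure L2 penalty
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) |>   # penalty = lambda; 0.1 is arbitrary here
  set_engine("glmnet")

ridge_recipe <- recipe(winpercent ~ ., data = candy_train) |> 
  step_dummy(all_nominal_predictors()) |>      # glmnet needs numeric predictors
  step_normalize(all_numeric_predictors())     # put predictors on a common scale before penalizing

ridge_wflow <- workflow() |> 
  add_recipe(ridge_recipe) |> 
  add_model(ridge_spec)

ridge_fit <- fit(ridge_wflow, data = candy_train)
extract_fit_parsnip(ridge_fit) |> tidy()       # shrunken (but non-zero) coefficient estimates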

But… but… but why?!?

  • Recall the Bias-Variance Trade-Off
  • Our reducible error can be partitioned into:
    • Bias: how much \(\hat{f}\) misses \(f\) by on average
    • Variance: how much \(\hat{f}\) moves around from sample to sample
  • Ridge: increase bias a little bit in exchange for large decrease in variance
  • As we increase \(\lambda\) do we increase or decrease the penalty for large coefficients?
  • As we increase \(\lambda\) do we increase or decrease the flexibility of our model?

LASSO

\[ \begin{aligned} \hat{\beta} &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}} \left(SSE(\hat{\beta}) + \lambda\|\hat{\beta}\|_1\right) \\ &= \underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{y}_j)^2 + \lambda\sum_{i=1}^p |\hat{\beta}_i|\right)\\ &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{\beta}_0-\hat{\beta}_1x_{1j} - \cdots - \hat{\beta}_px_{pj})^2 + \lambda\sum_{i=1}^p |\hat{\beta}_i|\right) \end{aligned} \]

  • LASSO: Least Absolute Shrinkage and Selection Operator
  • \(\lambda\) is called a “tuning parameter” and is not estimated from the data (more on this later; see the sketch after this list)
  • Idea: penalize large coefficients
  • If we penalize large coefficients, what’s going to happen to our estimated coefficients? THEY SHRINK!
  • \(\|\cdot\|_1\) is called the \(L_1\)-norm
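
A hedged sketch of the corresponding LASSO workflow, this time previewing the “more on this later” by tuning \(\lambda\) with cross-validation; the object names, the 5-fold CV, and the 30-value grid are illustrative choices, not course code:

# LASSO via glmnet: mixture = 1 gives the pure L1 penalty; tune lambda by CV
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |> 
  set_engine("glmnet")

lasso_recipe <- recipe(winpercent ~ ., data = candy_train) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_normalize(all_numeric_predictors())

lasso_wflow <- workflow() |> 
  add_recipe(lasso_recipe) |> 
  add_model(lasso_spec)

candy_folds <- vfold_cv(candy_train, v = 5)      # 5 folds given the small sample

lasso_tuned <- tune_grid(
  lasso_wflow,
  resamples = candy_folds,
  grid = grid_regular(penalty(), levels = 30)    # 30 candidate lambdas
)

show_best(lasso_tuned, metric = "rmse")

# Finalize with the best lambda and refit on the full training set
best_lambda <- select_best(lasso_tuned, metric = "rmse")
lasso_fit <- finalize_workflow(lasso_wflow, best_lambda) |> 
  fit(data = candy_train)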

Question

  • What should happen to our coefficients as we increase \(\lambda\)?

Ridge vs. LASSO

  • Ridge has a closed-form solution… how might we calculate it?
  • Ridge has some nice linear algebra properties that make it EXTREMELY FAST to fit
  • LASSO has no closed-form solution… why?
  • LASSO coefficients are estimated numerically… how?
    • Gradient descent works (kind of), but something called coordinate descent is typically better (toy sketch after this list)
  • MOST IMPORTANT PROPERTY OF LASSO: it induces sparsity, while Ridge does not
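
For intuition only, here is a toy coordinate-descent loop for the objective as written above (SSE plus \(\lambda\sum_i|\hat{\beta}_i|\)); it assumes a centered response and centered predictors so the intercept can be dropped, and every name in it is a made-up placeholder:

# Toy coordinate descent for:  minimize  SSE(beta) + lambda * sum(|beta_i|)
# Assumes y and the columns of X are centered, so there is no intercept to fit
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, lambda, n_sweeps = 200) {
  beta <- rep(0, ncol(X))
  for (sweep in seq_len(n_sweeps)) {
    for (i in seq_len(ncol(X))) {
      # partial residual: take out every predictor's contribution except the i-th
      r_i <- y - X[, -i, drop = FALSE] %*% beta[-i]
      # exact one-dimensional minimizer in beta_i: least squares fit, then soft-threshold
      beta[i] <- soft_threshold(sum(X[, i] * r_i), lambda / 2) / sum(X[, i]^2)
    }
  }
  beta
}

Each inner update solves a one-dimensional LASSO problem exactly via soft-thresholding, which is how coordinate descent copes with the non-differentiable \(L_1\) penalty; packaged solvers such as glmnet scale the objective differently, so their \(\lambda\) values are not directly comparable to this toy version.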

Sparsity in Applied Math

  • Sparse typically means “most things are zero”
  • Example: sparse matrices are matrices where most entries are zero
    • for large matrices this can provide HUGE performance gains
  • LASSO induces sparsity by setting many of the coefficient estimates exactly to zero (see the check after this list)
    • this means it fits the model and does feature selection SIMULTANEOUSLY
  • Let’s do some board work to see why this is…
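
A quick empirical check of this claim, reusing lasso_fit and ridge_fit from the earlier hypothetical sketches; the exact number of zeroed coefficients will depend on the split and the chosen \(\lambda\):

# Terms the LASSO has set exactly to zero: these features are dropped from the model
extract_fit_parsnip(lasso_fit) |> 
  tidy() |> 
  filter(estimate == 0)

# Ridge, by contrast, shrinks coefficients toward zero but does not zero them out
extract_fit_parsnip(ridge_fit) |> 
  tidy() |> 
  filter(estimate == 0)    # typically returns no rows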