MATH 427: Feature Selection and Regularization

Eric Friedlander

Computational Set-Up

library(tidyverse)
library(tidymodels)
library(knitr)
library(fivethirtyeight) # for candy rankings data

tidymodels_prefer()

set.seed(427)

Data: Candy

The data for this lecture comes from the FiveThirtyEight article The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect data, Hickey and collaborators at FiveThirtyEight set up an experiment in which people could vote on a series of randomly generated candy matchups (e.g., Reese’s vs. Skittles). Click here to check out some of the matchups.

The data set contains 12 characteristics and the win percentage for the 85 candies in the experiment.

Data: Candy

glimpse(candy_rankings)
Rows: 85
Columns: 13
$ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate        <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Data Cleaning

candy_rankings_clean <- candy_rankings |> 
  select(-competitorname) |> 
  mutate(sugarpercent = sugarpercent*100, # convert proportions into percentages
         pricepercent = pricepercent*100, # convert proportions into percentages
         across(where(is.logical), ~ factor(.x, levels = c("FALSE", "TRUE")))) # convert logicals into factors

Data Cleaning

glimpse(candy_rankings_clean)
Rows: 85
Columns: 12
$ chocolate        <fct> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <fct> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <fct> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <fct> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <fct> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <fct> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 73.2, 60.4, 1.1, 1.1, 90.6, 46.5, 60.4, 31.3, 90.6, 6…
$ pricepercent     <dbl> 86.0, 51.1, 11.6, 51.1, 51.1, 76.7, 76.7, 51.1, 32.5,…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Data Splitting

candy_split <- initial_split(candy_rankings_clean, prop = 0.75, strata = winpercent)
candy_train <- training(candy_split)
candy_test <- testing(candy_split)

Feature Selection for Linear Regression

What is feature selection?

  • How do we choose what variables to include in our model?
  • Up to now… we’ve included all of them… probably not the best approach
  • Advantage of linear regression: interpretability
  • Including every feature decreases interpretability
  • Reasons for feature selection:
    • Improve model performance
    • Improve model interpretability
  • Parsimony: simpler models are called more parsimonious
  • Occam’s Razor: more parsimonious models are better than less parsimonious models, holding all else constant

Types of feature selection

  • Subset selection: Forward/Backward/Best-Subset Selection
  • Shrinkage-based methods: LASSO and Ridge Regression
  • Dimension reduction: consider linear combinations of predictors

Subset Selection

Exercise

  • With your group, write out the steps for the following algorithms on the board
    • Group 1: Forward selection
    • Group 2: Backward elimination
    • Group 3: Step-wise selection
    • Group 4: Best-subset selection

Subset Selection in R

  • tidymodels does not have a built-in implementation of any subset selection technique
  • Regularization (shrinkage-based) methods almost always perform better
  • The colino package provides a tidymodels implementation
  • Other options (see the sketch after this list)
    • caret package
    • olsrr and blorr packages if you don’t care about cross-validation
    • implement it yourself
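
As a hedged illustration (not from the original slides), here is one way forward selection with cross-validation could look using caret; the settings (5-fold CV, subset sizes 1–11) are arbitrary choices for this data set.

library(caret)   # note: caret overlaps with some tidymodels function names (e.g., sensitivity)

# Forward selection (leaps under the hood), choosing the subset size by 5-fold CV
forward_cv <- train(
  winpercent ~ .,
  data = candy_train,
  method = "leapForward",
  tuneGrid = data.frame(nvmax = 1:11),                  # candidate subset sizes
  trControl = trainControl(method = "cv", number = 5)
)

forward_cv$bestTune                                      # subset size with the lowest CV RMSE
coef(forward_cv$finalModel, forward_cv$bestTune$nvmax)   # coefficients of the selected subset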

Feature Selection in R

  • When creating your recipe, you don’t always need to include every variable in the formula:
int_recipe <- recipe(resp ~ var1 + var2, data = training_data) |>  # placeholder names
  step_interact(~ var1:var2) |>  # interactions go in a step, not in the recipe formula
  step_x(...)

Re-using a Workflow but Changing the Formula

  • remove_formula() and add_formula() come from the workflows package, so they operate on a workflow object rather than on the recipe itself:

noint_wflow <- int_wflow |>  # placeholder: a workflow whose preprocessor is currently a formula
  remove_formula() |> 
  add_formula(resp ~ var1 + var2)

Shrinkage/Penalized/Regularization Methods

Question

  • What criteria do we use to fit a linear regression model? Write down an expression with \(\hat{\beta_i}\)’s, \(x_{ij}\)’s, and \(y_j\)’s in it.

OLS

  • Ordinary Least Squares Regression:

\[ \begin{aligned} \hat{\beta} =\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}SSE(\hat{\beta}) &= \underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\sum_{j=1}^n(y_j-\hat{y}_j)^2\\ &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\sum_{j=1}^n(y_j-\hat{\beta}_0-\hat{\beta}_1x_{1j} - \cdots - \hat{\beta}_px_{pj})^2 \end{aligned} \]

  • \(\hat{\beta} = (\hat{\beta}_0,\ldots, \hat{\beta}_p)\) is the vector of all my coefficients
  • \(\operatorname{argmin}\) is a function (operator) that returns the arguments that minimize the quantity it’s being applied to (a quick lm() illustration follows below)
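
As a quick illustration (not part of the original slides), lm() computes exactly this argmin on the candy training data:

# OLS on the candy training data: lm() returns the beta-hats that minimize the SSE above
ols_fit <- lm(winpercent ~ ., data = candy_train)
coef(ols_fit)               # the fitted coefficient vector beta-hat
sum(residuals(ols_fit)^2)   # the minimized SSE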

Ridge Regression

\[ \begin{aligned} \hat{\beta} &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}} \left(SSE(\hat{\beta}) + \lambda\|\hat{\beta}\|_2^2\right) \\ &= \underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{y}_j)^2 + \lambda\sum_{i=1}^p \hat{\beta}_i^2\right)\\ &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{\beta}_0-\hat{\beta}_1x_{1j} - \cdots - \hat{\beta}_px_{pj})^2 + \lambda\sum_{i=1}^p \hat{\beta}_i^2\right) \end{aligned} \]

  • \(\lambda\) is called a “tuning parameter” and is not estimated from the data (more on this later; see the sketch after this list)
  • Idea: penalize large coefficients
  • If we penalize large coefficients, what’s going to happen to our estimated coefficients? THEY SHRINK!
  • \(\|\cdot\|_2\) is called the \(L_2\)-norm
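
A minimal tidymodels sketch of ridge regression on the candy data, assuming the glmnet package is installed; the object names and the fixed penalty of 0.1 are placeholders rather than course code:

# Ridge regression via glmnet: mixture = 0 gives the pure L2 penalty
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) |>   # penalty = lambda; 0.1 is arbitrary here
  set_engine("glmnet")

ridge_recipe <- recipe(winpercent ~ ., data = candy_train) |> 
  step_dummy(all_nominal_predictors()) |>      # glmnet needs numeric predictors
  step_normalize(all_numeric_predictors())     # put predictors on a common scale before penalizing

ridge_wflow <- workflow() |> 
  add_recipe(ridge_recipe) |> 
  add_model(ridge_spec)

ridge_fit <- fit(ridge_wflow, data = candy_train)
extract_fit_parsnip(ridge_fit) |> tidy()       # shrunken (but non-zero) coefficient estimates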

But… but… but why?!?

  • Recall the Bias-Variance Trade-Off
  • Our reducible error can be partitioned into:
    • Bias: how much \(\hat{f}\) misses \(f\) by on average
    • Variance: how much \(\hat{f}\) moves around from sample to sample
  • Ridge: increase bias a little bit in exchange for large decrease in variance
  • As we increase \(\lambda\) do we increase or decrease the penalty for large coefficients?
  • As we increase \(\lambda\) do we increase or decrease the flexibility of our model?

LASSO

\[ \begin{aligned} \hat{\beta} &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}} \left(SSE(\hat{\beta}) + \lambda\|\hat{\beta}\|_1\right) \\ &= \underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{y}_j)^2 + \lambda\sum_{i=1}^p |\hat{\beta}_i|\right)\\ &=\underset{\hat{\beta}_0,\ldots, \hat{\beta}_p}{\operatorname{argmin}}\left(\sum_{j=1}^n(y_j-\hat{\beta}_0-\hat{\beta}_1x_{1j} - \cdots - \hat{\beta}_px_{pj})^2 + \lambda\sum_{i=1}^p |\hat{\beta}_i|\right) \end{aligned} \]

  • LASSO: Least Absolute Shrinkage and Selection Operator
  • \(\lambda\) is called a “tuning parameter” and is not estimated from the data (more on this later; see the sketch after this list)
  • Idea: penalize large coefficients
  • If we penalize large coefficients, what’s going to happen to our estimated coefficients? THEY SHRINK!
  • \(\|\cdot\|_1\) is called the \(L_1\)-norm
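
A hedged sketch of the corresponding LASSO workflow, this time previewing the “more on this later” by tuning \(\lambda\) with cross-validation; the object names, the 5-fold CV, and the 30-value grid are illustrative choices, not course code:

# LASSO via glmnet: mixture = 1 gives the pure L1 penalty; tune lambda by CV
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |> 
  set_engine("glmnet")

lasso_recipe <- recipe(winpercent ~ ., data = candy_train) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_normalize(all_numeric_predictors())

lasso_wflow <- workflow() |> 
  add_recipe(lasso_recipe) |> 
  add_model(lasso_spec)

candy_folds <- vfold_cv(candy_train, v = 5)      # 5 folds given the small sample

lasso_tuned <- tune_grid(
  lasso_wflow,
  resamples = candy_folds,
  grid = grid_regular(penalty(), levels = 30)    # 30 candidate lambdas
)

show_best(lasso_tuned, metric = "rmse")

# Finalize with the best lambda and refit on the full training set
best_lambda <- select_best(lasso_tuned, metric = "rmse")
lasso_fit <- finalize_workflow(lasso_wflow, best_lambda) |> 
  fit(data = candy_train)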

Question

  • What should happen to our coefficients as we increase \(\lambda\)?

Ridge vs. LASSO

  • Ridge has a closed-form solution… how might we calculate it?
  • Ridge has some nice linear algebra properties that make it EXTREMELY FAST to fit
  • LASSO has no closed-form solution… why?
  • LASSO coefficients are estimated numerically… how?
    • Gradient descent works (kind of), but something called coordinate descent is typically better (toy sketch after this list)
  • MOST IMPORTANT PROPERTY OF LASSO: it induces sparsity, while Ridge does not
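
For intuition only, here is a toy coordinate-descent loop for the objective as written above (SSE plus \(\lambda\sum_i|\hat{\beta}_i|\)); it assumes a centered response and centered predictors so the intercept can be dropped, and every name in it is a made-up placeholder:

# Toy coordinate descent for:  minimize  SSE(beta) + lambda * sum(|beta_i|)
# Assumes y and the columns of X are centered, so there is no intercept to fit
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, lambda, n_sweeps = 200) {
  beta <- rep(0, ncol(X))
  for (sweep in seq_len(n_sweeps)) {
    for (i in seq_len(ncol(X))) {
      # partial residual: take out every predictor's contribution except the i-th
      r_i <- y - X[, -i, drop = FALSE] %*% beta[-i]
      # exact one-dimensional minimizer in beta_i: least squares fit, then soft-threshold
      beta[i] <- soft_threshold(sum(X[, i] * r_i), lambda / 2) / sum(X[, i]^2)
    }
  }
  beta
}

Each inner update solves a one-dimensional LASSO problem exactly via soft-thresholding, which is how coordinate descent copes with the non-differentiable \(L_1\) penalty; packaged solvers such as glmnet scale the objective differently, so their \(\lambda\) values are not directly comparable to this toy version.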

Sparsity in Applied Math

  • Sparse typically means “most things are zero”
  • Example: sparse matrices are matrices where most entries are zero
    • for large matrices this can provide HUGE performance gains
  • LASSO induces sparsity by setting many of the coefficient estimates exactly to zero (see the check after this list)
    • this means it fits the model and does feature selection SIMULTANEOUSLY
  • Let’s do some board work to see why this is…
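
A quick empirical check of this claim, reusing lasso_fit and ridge_fit from the earlier hypothetical sketches; the exact number of zeroed coefficients will depend on the split and the chosen \(\lambda\):

# Terms the LASSO has set exactly to zero: these features are dropped from the model
extract_fit_parsnip(lasso_fit) |> 
  tidy() |> 
  filter(estimate == 0)

# Ridge, by contrast, shrinks coefficients toward zero but does not zero them out
extract_fit_parsnip(ridge_fit) |> 
  tidy() |> 
  filter(estimate == 0)    # typically returns no rows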