MATH 427: More More on Classification

Eric Friedlander

Computational Set-Up

library(tidyverse)
library(tidymodels)
library(knitr)
library(kableExtra)

tidymodels_prefer()

set.seed(427)

Data: Voter Frequency

Info about data
Goal: Identify individuals who are unlikely to vote to help organization target “get out the vote” effort.

voter_data <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv')

voter_clean <- voter_data |>
  select(-RespId, -weight, -Q1) |>
  mutate(
    educ = factor(educ, levels = c("High school or less", "Some college", "College")),
    income_cat = factor(income_cat, levels = c("Less than $40k", "$40-75k ",
                                               "$75-125k", "$125k or more")),
    voter_category = factor(voter_category, levels = c("rarely/never", "sporadic", "always"))
  ) |>
  filter(Q22 != 5 | is.na(Q22)) |>
  mutate(Q22 = as_factor(Q22),
         Q22 = if_else(is.na(Q22), "Not Asked", Q22),
         across(Q28_1:Q28_8, ~if_else(.x == -1, 0, .x)),
         across(Q28_1:Q28_8, ~ as_factor(.x)),
         across(Q28_1:Q28_8, ~if_else(is.na(.x) , "Not Asked", .x)),
         across(Q29_1:Q29_10, ~if_else(.x == -1, 0, .x)),
         across(Q29_1:Q29_10, ~ as_factor(.x)),
         across(Q29_1:Q29_8, ~if_else(is.na(.x) , "Not Asked", .x)),
        Party_ID = as_factor(case_when(
          Q31 == 1 ~ "Strong Republican",
          Q31 == 2 ~ "Republican",
          Q32 == 1  ~ "Strong Democrat",
          Q32 == 2 ~ "Democrat",
          Q33 == 1 ~ "Lean Republican",
          Q33 == 2 ~ "Lean Democrat",
          TRUE ~ "Other"
        )),
        Party_ID = factor(Party_ID, levels =c("Strong Republican", "Republican", "Lean Republican",
                                                "Other", "Lean Democrat", "Democrat", "Strong Democrat")),
        across(!ppage, ~as_factor(if_else(.x == -1, NA, .x))))

Split Data

set.seed(427)

voter_splits <- initial_split(voter_clean, prop = 0.7, strata = voter_category)
voter_train <- training(voter_splits)
voter_test <- testing(voter_splits)

Problem: More than two categories

voter_train |> 
  ggplot(aes(x = voter_category)) +
  geom_bar()

Define Model: Multinomial Regression

mn_reg_model <- multinom_reg(mixture = 1, penalty = 0.005) |> # I chose this penalty arbitrarily
  set_engine("glmnet", family = "multinomial") |> 
  set_mode("classification")

Define Recipe

mr_recipe <- recipe(voter_category ~ . , data = voter_train) |>
  step_zv(all_predictors()) |>
  step_integer(educ, income_cat, Party_ID, Q2_2:Q4_6, Q6, Q8_1:Q9_4, Q14:Q17_4,
               Q25:Q26) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |> 
  step_normalize(all_numeric_predictors())

Define Workflow and Fit

mr_fit <- workflow() |>
  add_model(mn_reg_model) |>
  add_recipe(mr_recipe) |> 
  fit(voter_train)

Look at Predictions

mr_fit |> augment(new_data = voter_test) |> slice_sample(n=10) |> select(1:4) |> head() |>  kable()

.pred_class	.pred_rarely/never	.pred_sporadic	.pred_always
sporadic	0.0317402	0.5122327	0.4560271
always	0.0463060	0.4464935	0.5072005
sporadic	0.0553279	0.6529393	0.2917328
always	0.0388506	0.4351109	0.5260385
always	0.0361359	0.4164662	0.5473979
always	0.0991499	0.4310622	0.4697879

Confusion Matrix

mr_fit |> augment(new_data = voter_test) |> 
  conf_mat(truth = voter_category, estimate = .pred_class) |> autoplot("heatmap")

Last Time

No “Positive” and “Negative” anymore
Most of our metrics were based on having “Positive” vs. “Negative”
Solution 1: 1-vs-all metrics

Evaluating Multiclass Models

Solution 2: Average metrics across labels
- Macro-averaging average one-versus-all metrics
  - Recall: \(\frac{0.66+0.72+0.41}{3} \approx 0.60\)
- Macro-weighted averaging same but weight by class size
  - Recall: \(\frac{430\times 0.66+773\times 0.72+543\times 0.41}{1746} \approx 0.61\)
- Micro-averaging compute contribution for each class, aggregates them, then computes a single metric
  - Recall: \(\frac{285+559+221}{430+779+543} \approx 0.61\)

Questions From Last Time

Multinomial Logistic regression:
- Each class \(k\): \[\log \frac{\text{Prob. class } k}{\text{Prob. class } K} = \beta_{0k} + \beta_{1k}X_1 + \cdots + \beta_{pk}X_p\]
Shuba: Micro-averaged recall is the same as accuracy… CORRECT!
- Same is true of precision AND \(F_1\)
- Useful if you have a “multi-label” classification problem (not covered in this class)

Rabin’s Question

“What does a medium sized data set mean?”
Idea behind ML: identify patterns in data and use them to make predictions
As data gets “bigger”:
- Patterns become clearer
- Computational complexity increases
Informal definitions:
- Small data: not enough data to fully represent patterns
- Big data: all the info is there but special approaches need to be taken to handle all the data

Thinking about small data

Patterns not fully represented in data \(\Rightarrow\) restrict the set of possible patterns and give model less flexibility and freedom
- Logistic regression (probably with regularization)
- Support vector machines (we haven’t talked about these yet)
- To a lesser extent: Decision Trees
- NOT KNN

Thinking about big data

Patterns are definitely there but size introduces computational problems
- Data set can’t fit in memory
  - Solution 1: Use a high-memory HPC cluster node
  - Solution 2: Modify algorithms to use parts of data instead of full data set (e.g. Stochastic gradient descent)
- Algorithm scales with size of data and will take too long to run/fit
  - Solution 1: Run in parallel if possible using HPC cluster
  - Solution 2: Develop faster algorithms
    - Implement in a compiled language like C
    - Develop (faster) approximate solution
Curse of dimensionalality
- If \(p\) is too-big \(\Rightarrow\) too much space and things are too far apart \(\Rightarrow\) similar impact to small data but without computational benefit
Note: big \(n\) vs. big \(p\) can present different issues

Medium Data

Data that’s big enough to have (most) of the information you need but not so big that it presents computational issues
KNN sweet spot
- Enough data the “nearest-neighbors” are actually “near”
- Not so much data that it takes forever to make predictions

Exploring with App

App 1
App 2

Macro-Averaging in R

voter_metrics <- metric_set(accuracy, precision, recall)

mr_fit |> augment(new_data = voter_test) |> 
  voter_metrics(truth = voter_category, estimate = .pred_class, estimator = "macro") |> 
  kable()

.metric	.estimator	.estimate
accuracy	multiclass	0.6099656
precision	macro	0.6375886
recall	macro	0.5976485

Macro-Weighted Averaging in R

mr_fit |> augment(new_data = voter_test) |> 
  voter_metrics(truth = voter_category, estimate = .pred_class, estimator = "macro_weighted") |> 
  kable()

.metric	.estimator	.estimate
accuracy	multiclass	0.6099656
precision	macro_weighted	0.6176222
recall	macro_weighted	0.6099656

Micro-Averaging in R

mr_fit |> augment(new_data = voter_test) |> 
  voter_metrics(truth = voter_category, estimate = .pred_class, estimator = "micro") |> 
  kable()

.metric	.estimator	.estimate
accuracy	multiclass	0.6099656
precision	micro	0.6099656
recall	micro	0.6099656

What if output is probability?

For binary case we used ROC curve and AUC…
Similar ideas apply here:
- One vs. all
- Macro Averaging
- NO MICRO AVERAGING!
- Hand and Till extension of AUC

Plotting one-vs.-all

mr_fit |> augment(new_data = voter_test) |> 
  roc_curve(truth = voter_category, `.pred_rarely/never`, .pred_sporadic, .pred_always) |> 
  autoplot()

Macro Averaged AUC

mr_fit |> augment(new_data = voter_test) |> 
  roc_auc(truth = voter_category, `.pred_rarely/never`, .pred_sporadic, .pred_always, 
          estimator = "macro") |> 
  kable()

.metric	.estimator	.estimate
roc_auc	macro	0.7802495

Macro-Weighted Averaged AUC

mr_fit |> augment(new_data = voter_test) |> 
  roc_auc(truth = voter_category, `.pred_rarely/never`, .pred_sporadic, .pred_always, 
          estimator = "macro_weighted") |> 
  kable()

.metric	.estimator	.estimate
roc_auc	macro_weighted	0.7636223

Hand and Till AUC

Idea behind traditional AUC: “How well are my classes being separated?”
Hand and Till: Extend this idea to multiple-classes
Paper
- Basic Idea: Do pairwise comparison of classes and average

mr_fit |> augment(new_data = voter_test) |> 
  roc_auc(truth = voter_category, `.pred_rarely/never`, .pred_sporadic, .pred_always) |> 
  kable()

.metric	.estimator	.estimate
roc_auc	hand_till	0.7967478

Discussion

Why do you compute averages/means?

How would heavily imbalanced classes impact each type of averaging?
- Which type(s) of averaging weight(s) each class equally regardless of balance?
- Which type(s) of averaging favor(s) larger classes?
- Does this imply that one is better than the others?

Exploring with App

App
- Break into groups
- Investigate how your performance metrics change between balanced data and unbalanced data
- Additional Considerations:
  - Impact of boundaries/models?
  - Impact of sample size?
  - Impact of noise level?
- Please write down observations so we can discuss