MATH 427: Preprocessing, Missing Data, and Resampling Continued

Eric Friedlander

Announcements

Computational Set-Up

library(tidyverse)
library(tidymodels)
library(knitr)
library(janitor) # for contingency tables
library(ISLR2)
library(readODS)

tidymodels_prefer()

set.seed(427)

Pre-processing

Data: Different Ames Housing Prices

Goal: Predict Sale_Price.

ames <- read_rds("../data/AmesHousing.rds")
ames |> glimpse()
Rows: 881
Columns: 20
$ Sale_Price    <int> 244000, 213500, 185000, 394432, 190000, 149000, 149900, …
$ Gr_Liv_Area   <int> 2110, 1338, 1187, 1856, 1844, NA, NA, 1069, 1940, 1544, …
$ Garage_Type   <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, …
$ Garage_Cars   <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2,…
$ Garage_Area   <dbl> 522, 582, 420, 834, 546, 480, 500, 440, 606, 868, 532, 7…
$ Street        <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pa…
$ Utilities     <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
$ Pool_Area     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Neighborhood  <fct> North_Ames, Stone_Brook, Gilbert, Stone_Brook, Northwest…
$ Screen_Porch  <int> 0, 0, 0, 0, 0, 0, 0, 165, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Overall_Qual  <fct> Good, Very_Good, Above_Average, Excellent, Above_Average…
$ Lot_Area      <int> 11160, 4920, 7980, 11394, 11751, 11241, 12537, 4043, 101…
$ Lot_Frontage  <dbl> 93, 41, 0, 88, 105, 0, 0, 53, 83, 94, 95, 90, 105, 61, 6…
$ MS_SubClass   <fct> One_Story_1946_and_Newer_All_Styles, One_Story_PUD_1946_…
$ Misc_Val      <int> 0, 0, 500, 0, 0, 700, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Open_Porch_SF <int> 0, 0, 21, 0, 122, 0, 0, 55, 95, 35, 70, 74, 130, 82, 48,…
$ TotRms_AbvGrd <int> 8, 6, 6, 8, 7, 5, 6, 4, 8, 7, 7, 7, 7, 6, 7, 7, 10, 7, 7…
$ First_Flr_SF  <int> 2110, 1338, 1187, 1856, 1844, 1004, 1078, 1069, 1940, 15…
$ Second_Flr_SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 563, 0, 886, 656, 11…
$ Year_Built    <int> 1968, 2001, 1992, 2010, 1977, 1970, 1971, 1977, 2009, 20…

Today

We'll cover some common pre-processing tasks:

  • Dealing with zero-variance (zv) and/or near-zero variance (nzv) variables
  • Imputing missing entries
  • Label encoding ordinal categorical variables
  • Standardizing (centering and scaling) numeric predictors
  • Lumping predictors
  • One-hot/dummy encoding categorical predictors

Pre-Split Cleaning

  • Before you split your data, make sure the data are in the correct format
  • This may mean different things for different data sets
  • Common examples:
    • Fix the names of columns
    • Ensure all variable types are correct
    • Ensure all factor levels are correct and in order (if applicable)
    • Remove any variables that are unimportant for (or harmful to) your analysis
    • Ensure missing values are coded as such (i.e. as NA instead of 0, -1, or “missing”)
    • Fill in missing values where you know what the answer should be (i.e. when a missing value really means 0 rather than missing)
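
For instance, here is a minimal, hypothetical sketch of this kind of pre-split cleaning on toy data (the column names and missing-value codes below are made up for illustration, not taken from the Ames data):

raw <- tibble(
  `Sale Price`   = c(200000, 185000, 150000),
  `Lot Frontage` = c(80, -1, 65),   # -1 was used to mean "missing"
  `Misc Val`     = c(NA, 500, NA)   # NA here really means 0
)

cleaned <- raw |> 
  clean_names() |>                             # fix column names
  mutate(
    lot_frontage = na_if(lot_frontage, -1),    # recode -1 as a true NA
    misc_val     = replace_na(misc_val, 0)     # fill in values we know should be 0
  )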

Example: Factor Levels in Wrong Order

ames |> pull(Overall_Qual) |> levels()
 [1] "Above_Average"  "Average"        "Below_Average"  "Excellent"     
 [5] "Fair"           "Good"           "Poor"           "Very_Excellent"
 [9] "Very_Good"      "Very_Poor"     

Re-Factoring

ames <- ames |> 
  mutate(Overall_Qual = factor(Overall_Qual, levels = c("Very_Poor", "Poor", 
                                                        "Fair", "Below_Average",
                                                        "Average", "Above_Average", 
                                                        "Good", "Very_Good",
                                                        "Excellent", "Very_Excellent")))
ames |> pull(Overall_Qual) |> levels()
 [1] "Very_Poor"      "Poor"           "Fair"           "Below_Average" 
 [5] "Average"        "Above_Average"  "Good"           "Very_Good"     
 [9] "Excellent"      "Very_Excellent"

Zero-Variance (zv) and/or Near-Zero Variance (nzv) Variables

A heuristic for detecting near-zero variance features:

  • The fraction of unique values over the sample size is low (say ≤ 10%).
  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say ≥ 20).

library(caret)
nearZeroVar(ames, saveMetrics = TRUE) |> kable()
freqRatio percentUnique zeroVar nzv
Sale_Price 1.000000 55.7321226 FALSE FALSE
Gr_Liv_Area 1.333333 62.9965948 FALSE FALSE
Garage_Type 2.196581 0.6810443 FALSE FALSE
Garage_Cars 1.970213 0.5675369 FALSE FALSE
Garage_Area 2.250000 38.0249716 FALSE FALSE
Street 219.250000 0.2270148 FALSE TRUE
Utilities 880.000000 0.2270148 FALSE TRUE
Pool_Area 876.000000 0.6810443 FALSE TRUE
Neighborhood 1.476744 2.9511918 FALSE FALSE
Screen_Porch 199.750000 6.6969353 FALSE TRUE
Overall_Qual 1.119816 1.1350738 FALSE FALSE
Lot_Area 1.071429 79.7956867 FALSE FALSE
Lot_Frontage 1.617021 11.5777526 FALSE FALSE
MS_SubClass 1.959064 1.7026107 FALSE FALSE
Misc_Val 141.833333 1.9296254 FALSE TRUE
Open_Porch_SF 23.176471 19.2962543 FALSE FALSE
TotRms_AbvGrd 1.311225 1.2485812 FALSE FALSE
First_Flr_SF 1.777778 63.7911464 FALSE FALSE
Second_Flr_SF 64.250000 31.3280363 FALSE FALSE
Year_Built 1.125000 12.0317821 FALSE FALSE

Recipe: Near-Zero Variance

preproc <- recipe(Sale_Price ~ ., data = ames) |> 
  step_nzv(all_predictors()) # remove zero or near-zero variance predictors

Missing Data

  • Many times, you can’t just drop missing data
  • Even if you can, dropping missing values can generate biased data/models
  • Sometimes missing data gives you more information
  • Types of missing data:
    • Missing completely at random (MCAR): there is no pattern to your missing values
    • Missing at random (MAR): missing values are dependent on other values in the data set
    • Missing not at random (MNAR): missing values are dependent on the value that is missing
  • Structured missingness (SM): when the missingness of certain values depends on one another, regardless of whether the missing values are MCAR, MAR, or MNAR
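
To build intuition, here is a hedged sketch on simulated data (the variables and missingness rates are made up): under MCAR every row has the same chance of a missing income, while under MAR the chance of missingness depends on another observed variable (here, age).

sim <- tibble(
  age    = runif(1000, 20, 80),
  income = 30000 + 500 * age + rnorm(1000, sd = 5000)
)

sim_missing <- sim |> 
  mutate(
    income_mcar = if_else(runif(n()) < 0.10, NA_real_, income),                # 10% missing, no pattern
    income_mar  = if_else(runif(n()) < 0.02 + 0.003 * age, NA_real_, income)   # older respondents missing more often
  )

# The MAR pattern shows up when we look at missingness rates by age group
sim_missing |> 
  mutate(age_group = cut(age, breaks = c(20, 40, 60, 80))) |> 
  group_by(age_group) |> 
  summarize(across(c(income_mcar, income_mar), ~ mean(is.na(.))))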

MCAR: Examples

  • Sensor data: occasionally sensors break so you’re missing data randomly
  • Survey data: sometimes people just randomly skip questions
  • Survey data: customers are randomly given 5 questions from a bank of 100 questions

MAR: Examples

  • Men are less likely to respond to surveys about depression
  • Medical study: patients who miss follow-up appointments are more likely to be young
  • Survey responses: ESL respondents may be more likely to skip certain questions that are difficult to interpret (only MAR if you know they are ESL)
  • Measure of student performance: students who score lower are more likely to skip questions

MNAR: Examples

  • Survey on income: respondent may be less likely to report their income if they are poor
  • Survey about political beliefs: respondent may be more likely to skip questions when their answer is perceived as undesirable
  • Customer satisfaction: only customers who feel strongly respond
  • Medical study: patients refuse to report unhealthy habits

Structurally Missing: Examples

  • Health survey: all questions related to pregnancy are left blank by males
  • Bank data set: combination of home, auto, and credit cards… not all customers have all three, so certain portions of their records are missing
  • Survey: many respondents stop the survey early, so all questions after a certain point are missing
  • Netflix: customers may only watch similar movies and TV shows

Remedies for Missing Data

  • Lots of complicated methods exist that you can read about
  • Can drop a column if too much of its data is missing
  • Imputing:
    • step_impute_median: used for numeric (especially discrete) variables
    • step_impute_mean: used for numeric variables
    • step_impute_knn: used for both numeric and categorical variables (computationally expensive)
    • step_impute_mode: used for nominal (having no order) categorical variables
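
As an illustration only (this is not the recipe we end up using, and it assumes Garage_Type still has NAs at this point), the simpler imputation steps look like this:

alt_preproc <- recipe(Sale_Price ~ ., data = ames) |> 
  step_impute_median(Year_Built) |>   # median: a robust choice for a discrete numeric predictor
  step_impute_mean(Gr_Liv_Area) |>    # mean: reasonable for a roughly symmetric numeric predictor
  step_impute_mode(Garage_Type)       # mode: for a nominal categorical predictor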

Exploring Missing Data

ames |> 
  summarize(across(everything(), ~ sum(is.na(.)))) |> 
  pivot_longer(everything()) |> 
  filter(value > 0) |> 
  kable()
name value
Gr_Liv_Area 113
Garage_Type 54
Year_Built 41

Missing Data: Garage_Type

  • The reason that Garage_Type is missing is that there is no garage
    • Solution: replace NAs with No_Garage
    • Do this before data splitting

Fixing Garage_Type

ames <- ames |> 
  mutate(Garage_Type = as_factor(if_else(is.na(Garage_Type), "No_Garage", Garage_Type)))

Missing Data: Year_Built

  • MCAR
    • Solution 1: Impute with mean or median
    • Solution 2: Impute with KNN… maybe we can infer what the values are based on other values in the data set?

Missing Data: Gr_Liv_Area

  • MCAR
    • Solution 1: Impute with mean or median
    • Solution 2: Impute with KNN… maybe we can infer what the values are based on other values in the data set?

Recipe: Missing Data

preproc <- recipe(Sale_Price ~ ., data = ames) |> 
  step_nzv(all_predictors()) |>  # remove zero or near-zero variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) # impute missing values in Year_Built and Gr_Liv_Area

  • Note: step_impute_knn uses Gower's distance, so we don't need to worry about normalizing first

Encoding Ordinal Features

Two types of categorical features:

  • Ordinal (order is important)
  • Nominal (order is not important)

Encoding Ordinal Features

ames |> pull(Overall_Qual) |> levels()
 [1] "Very_Poor"      "Poor"           "Fair"           "Below_Average" 
 [5] "Average"        "Above_Average"  "Good"           "Very_Good"     
 [9] "Excellent"      "Very_Excellent"
  • Very_Poor = 1, Poor = 2, Fair = 3, etc…

Recipe: Encoding Ordinal Features

preproc <- recipe(Sale_Price ~ ., data = ames) |> 
  step_nzv(all_predictors()) |>  # remove zero or near-zero variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |>  # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) # convert Overall_Qual into ordinal encoding
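
One way to sanity-check the encoding (a quick sketch, not part of the final workflow) is to estimate the recipe with prep() and inspect the processed training data with bake():

preproc |> 
  prep() |> 
  bake(new_data = NULL) |> 
  count(Overall_Qual)   # Overall_Qual should now be integers 1 through 10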

Lump Small Categories Together

ames |> count(Neighborhood) |> kable()
Neighborhood n
North_Ames 127
College_Creek 86
Old_Town 83
Edwards 49
Somerset 50
Northridge_Heights 52
Gilbert 47
Sawyer 49
Northwest_Ames 41
Sawyer_West 31
Mitchell 33
Brookside 33
Crawford 22
Iowa_DOT_and_Rail_Road 28
Timberland 21
Northridge 22
Stone_Brook 17
South_and_West_of_Iowa_State_University 21
Clear_Creek 16
Meadow_Village 14
Briardale 10
Bloomington_Heights 10
Veenker 9
Northpark_Villa 3
Blueste 3
Greens 4

Lump Small Categories Together

ames |> mutate(Neighborhood = fct_lump_prop(Neighborhood, 0.05)) |> 
  count(Neighborhood) |>  kable()
Neighborhood n
North_Ames 127
College_Creek 86
Old_Town 83
Edwards 49
Somerset 50
Northridge_Heights 52
Gilbert 47
Sawyer 49
Other 338

Recipe: Lumping Small Factors Together

preproc <- recipe(Sale_Price ~ ., data = ames) |> 
  step_nzv(all_predictors()) |>  # remove zero or near-zero variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |>  # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) |> # convert Overall_Qual into ordinal encoding
  step_other(Neighborhood, threshold = 0.01, other = "Other") # lump all categories with less than 1% representation into a category called Other for each variable

One-hot/dummy encoding categorical predictors

Figure 3.9: Machine Learning with R
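
A tiny, made-up example (the color variable is purely illustrative) contrasting the two encodings:

toy <- tibble(color = factor(c("red", "green", "blue", "green")))

# Dummy coding: k - 1 indicator columns (reference level dropped), typical for linear regression
recipe(~ color, data = toy) |> 
  step_dummy(color, one_hot = FALSE) |> 
  prep() |> bake(new_data = NULL)

# One-hot encoding: one indicator column per level
recipe(~ color, data = toy) |> 
  step_dummy(color, one_hot = TRUE) |> 
  prep() |> bake(new_data = NULL)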

Recipe: Dummy Variables

preproc <- recipe(Sale_Price ~ ., data = ames) |> 
  step_nzv(all_predictors()) |>  # remove zero or near-zero variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |>  # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) |> # convert Overall_Qual into ordinal encoding
  step_other(Neighborhood, threshold = 0.01, other = "Other") |> # lump all categories with less than 1% representation into a category called Other for each variable
  step_dummy(all_nominal_predictors(), one_hot = TRUE)  # in general use one_hot unless doing linear regression

Recipe: Center and scale

preproc <- recipe(Sale_Price ~ ., data = ames) |> 
  step_nzv(all_predictors()) |>  # remove zero or near-zero variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |>  # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) |> # convert Overall_Qual into ordinal encoding
  step_other(Neighborhood, threshold = 0.01, other = "Other") |> # lump all categories with less than 1% representation into a category called Other for each variable
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>  # in general use one_hot unless doing linear regression
  step_normalize(all_numeric_predictors())
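
A quick sketch (assuming a train/test split like the one later in these slides) showing that the means and standard deviations used by step_normalize are learned with prep() and then applied to new data with bake():

trained <- preproc |> prep(training = ames)                        # here ames stands in for the training set
baked_new <- trained |> bake(new_data = slice_head(ames, n = 5))   # applies the *training* statistics to new rows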

Order of Preprocessing Steps

Questions to ask:

  1. Should this be done before or after data splitting?
  2. If I do step_A first what is the impact on step_B? For example, do you want to encode categorical variables before or after normalizing?
  3. What data format is required by the model I’m fitting and how will my model react to these changes?
  4. Is this step part of my “model”? I.e. is this a decision I’m making based on the data or based on subject matter expertise?
  5. Do I have access to my test predictors?

Questions

  • Should I lump before or after dummy coding?
  • Should I dummy code before or after normalizing?
  • Should I lump before my initial split?
  • How does ordinal encoding impact linear regression vs. KNN?

Final R Workflow

Clean Data Set

ames <- ames |> 
  mutate(Overall_Qual = factor(Overall_Qual, levels = c("Very_Poor", "Poor", 
                                                        "Fair", "Below_Average",
                                                        "Average", "Above_Average", 
                                                        "Good", "Very_Good",
                                                        "Excellent", "Very_Excellent")),
         Garage_Type = if_else(is.na(Garage_Type), "No_Garage", Garage_Type),
         Garage_Type = as_factor(Garage_Type)
         )

Initial Data Split

set.seed(427)

data_split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test  <- testing(data_split)

Define Folds

ames_folds <- vfold_cv(ames_train, v = 10, repeats = 10)
ames_folds
#  10-fold cross-validation repeated 10 times 
# A tibble: 100 × 3
   splits           id       id2   
   <list>           <chr>    <chr> 
 1 <split [594/66]> Repeat01 Fold01
 2 <split [594/66]> Repeat01 Fold02
 3 <split [594/66]> Repeat01 Fold03
 4 <split [594/66]> Repeat01 Fold04
 5 <split [594/66]> Repeat01 Fold05
 6 <split [594/66]> Repeat01 Fold06
 7 <split [594/66]> Repeat01 Fold07
 8 <split [594/66]> Repeat01 Fold08
 9 <split [594/66]> Repeat01 Fold09
10 <split [594/66]> Repeat01 Fold10
# ℹ 90 more rows

Define Model(s)

lm_model <- linear_reg() |>
  set_engine('lm')

knn5_model <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

knn10_model <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("regression")

Define Preprocessing: Linear regression

lm_preproc <- recipe(Sale_Price ~ ., data = ames_train) |> 
  step_nzv(all_predictors()) |>  # remove zero or near-zero variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |>  # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) |> # convert Overall_Qual into ordinal encoding
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Other") |> # lump all categories with less than 1% representation into a category called Other for each variable
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>  # in general use one_hot unless doing linear regression
  step_corr(all_numeric_predictors(), threshold = 0.5) |> # remove highly correlated predictors
  step_lincomb(all_numeric_predictors()) # remove variables that have exact linear combinations

Define Preprocessing: KNN

knn_preproc <- recipe(Sale_Price ~ ., data = ames_train) |> 
  step_nzv(all_predictors()) |>  # remove zero or near-zero variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |>  # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) |> # convert Overall_Qual into ordinal encoding
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Other") |> # lump all categories with less than 1% representation into a category called Other for each variable
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>  # in general use one_hot unless doing linear regression
  step_nzv(all_predictors()) |> 
  step_normalize(all_numeric_predictors())

Define Workflows

lm_wf <- workflow() |> add_model(lm_model) |> add_recipe(lm_preproc)
knn5_wf <- workflow() |> add_model(knn5_model) |> add_recipe(knn_preproc)
knn10_wf <- workflow() |> add_model(knn10_model) |> add_recipe(knn_preproc)

Define Metrics

ames_metrics <- metric_set(rmse, rsq)

Fit and Assess Models

lm_results <- lm_wf |> fit_resamples(resamples = ames_folds, metrics = ames_metrics)
knn5_results <- knn5_wf |> fit_resamples(resamples = ames_folds, metrics = ames_metrics)
knn10_results <- knn10_wf |> fit_resamples(resamples = ames_folds, metrics = ames_metrics)

Collecting Metrics

collect_metrics(lm_results) |> kable()
.metric .estimator mean n std_err .config
rmse standard 3.979336e+04 100 768.846396 Preprocessor1_Model1
rsq standard 7.548961e-01 100 0.008863 Preprocessor1_Model1
collect_metrics(knn5_results) |> kable()
.metric .estimator mean n std_err .config
rmse standard 40063.355604 100 963.9908286 Preprocessor1_Model1
rsq standard 0.758572 100 0.0063228 Preprocessor1_Model1
collect_metrics(knn10_results) |> kable()
.metric .estimator mean n std_err .config
rmse standard 3.957217e+04 100 1011.2439632 Preprocessor1_Model1
rsq standard 7.673612e-01 100 0.0065484 Preprocessor1_Model1
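
To compare the three workflows at a glance, one convenient pattern (a sketch, not from the original workflow) is to stack the collected metrics:

bind_rows(
  collect_metrics(lm_results)    |> mutate(model = "lm"),
  collect_metrics(knn5_results)  |> mutate(model = "knn5"),
  collect_metrics(knn10_results) |> mutate(model = "knn10")
) |> 
  select(model, .metric, mean, std_err) |> 
  arrange(.metric, mean) |> 
  kable()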

Fit & Evaluate Final Model

  • After choosing best model/workflow, fit on full training set and assess on test set
final_fit <- knn10_wf |> fit(data = ames_train)
final_fit_perf <- final_fit |> 
  predict(new_data = ames_test) |> 
  bind_cols(ames_test) |> 
  ames_metrics(truth = Sale_Price, estimate = .pred)

final_fit_perf |> kable()
.metric .estimator .estimate
rmse standard 4.947411e+04
rsq standard 7.174733e-01

Tips

  • Can try out different pre-processing to see if it improves your model!
  • The process can be intense for your computer, so it might take a while
  • No 100% correct way to do it, although there are some 100% incorrect ways to do it