MAT-427: Data Splitting + KNN

Eric Friedlander

Computational Setup

library(tidyverse)
library(tidymodels)
library(knitr)
library(readODS)
library(modeldata) # contains ames dataset

tidymodels_prefer()

mlr_model <- linear_reg() |> 
  set_engine("lm")

Comparing Models: Data Splitting

  • Split ames data set into two parts
    • Training set: randomly selected proportion \(p\) (typically 50-90%) of data used for fitting model
    • Test set: randomly selected proportion \(1-p\) of data used for estimating prediction error
  • If comparing A LOT of models, split into three parts to prevent information leakage (see the sketch after this list)
    • Training set: randomly selected proportion \(p\) (typically 50-90%) of data used for fitting model
    • Validation set: randomly selected proportion \(q\) (typically 20-30%) of data used for choosing tuning parameters
    • Test set: randomly selected proportion \(1-p-q\) of data used for estimating prediction error
  • Idea: use data your model hasn’t seen to get more accurate estimate of error and prevent overfitting
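
A minimal sketch of the three-way split, assuming a recent version of rsample (loaded with tidymodels), which provides initial_validation_split(); the object names here are illustrative:

set.seed(427)
ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2)) # 60/20/20
train_part <- training(ames_val_split)   # fit candidate models here
val_part <- validation(ames_val_split)   # choose among models / tuning parameters here
test_part <- testing(ames_val_split)     # estimate prediction error once, at the end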

Comparing Models: Data Splitting with tidymodels

set.seed(427) # Why?

ames_split <- initial_split(ames, prop = 0.70, strata = Sale_Price) # initialize 70/30
ames_split
<Training/Testing/Total>
<2049/881/2930>
ames_train <- training(ames_split) # get training data
ames_test <- testing(ames_split) # get test data
  • strata is not necessary, but it is good practice
    • strata performs stratified sampling on the variable you specify (very little downside)

Linear Regression: Comparing Models

  • Let’s create four models with Sale_Price as the response:
    • fit1: a linear regression model with Bedroom_AbvGr as the only predictor
    • fit2: a linear regression model with Gr_Liv_Area as the only predictor
    • fit3 (similar to model in previous slides): a multiple regression model with Gr_Liv_Area and Bedroom_AbvGr as predictors
    • fit4: super flexible model which fits a 10th degree polynomial to Gr_Liv_Area and a 2nd degree polynomial to Bedroom_AbvGr
fit1 <- mlr_model |> fit(Sale_Price ~ Bedroom_AbvGr, data = ames_train) # Use only training set
fit2 <- mlr_model |> fit(Sale_Price ~ Gr_Liv_Area, data = ames_train)
fit3 <- mlr_model |> fit(Sale_Price ~ Gr_Liv_Area + Bedroom_AbvGr, data = ames_train)
fit4 <- mlr_model |> fit(Sale_Price ~ poly(Gr_Liv_Area, degree = 10) + poly(Bedroom_AbvGr, degree = 2), data = ames_train)

Computing MSE

# Fit 1
fit1_train_mse <- mean((ames_train$Sale_Price - predict(fit1, new_data = ames_train)$.pred)^2)
fit1_test_mse <- mean((ames_test$Sale_Price - predict(fit1, new_data = ames_test)$.pred)^2)

# Fit 2
fit2_train_mse <- mean((ames_train$Sale_Price - predict(fit2, new_data = ames_train)$.pred)^2)
fit2_test_mse <- mean((ames_test$Sale_Price - predict(fit2, new_data = ames_test)$.pred)^2)

# Fit 3
fit3_train_mse <- mean((ames_train$Sale_Price - predict(fit3, new_data = ames_train)$.pred)^2)
fit3_test_mse <- mean((ames_test$Sale_Price - predict(fit3, new_data = ames_test)$.pred)^2)

# Fit 4
fit4_train_mse <- mean((ames_train$Sale_Price - predict(fit4, new_data = ames_train)$.pred)^2)
fit4_test_mse <- mean((ames_test$Sale_Price - predict(fit4, new_data = ames_test)$.pred)^2)
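
As an aside (not part of the original workflow), the same quantities can be computed more compactly with yardstick; a minimal sketch, assuming parsnip's augment() method:

# yardstick alternative: test-set RMSE for fit3 (MSE is just RMSE squared)
augment(fit3, new_data = ames_test) |> 
  rmse(truth = Sale_Price, estimate = .pred)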

Question

Without looking at the numbers:

  1. Do we know which of the following is the smallest: fit1_train_mse, fit2_train_mse, fit3_train_mse, fit4_train_mse? Yes, fit4_train_mse: the other three models are nested inside fit4, so the most flexible model must achieve the lowest training error.
  2. Do we know which of the following is the smallest: fit1_test_mse, fit2_test_mse, fit3_test_mse, fit4_test_mse? No

Choosing a Model

# Training Errors
c(fit1_train_mse, fit2_train_mse, 
  fit3_train_mse, fit4_train_mse)
[1] 6213135279 3188099910 2781293767 2472424544
which.min(c(fit1_train_mse, fit2_train_mse, 
            fit3_train_mse, fit4_train_mse))
[1] 4
# Test Errors
c(fit1_test_mse, fit2_test_mse, 
  fit3_test_mse, fit4_test_mse)
[1] 6.329031e+09 3.203895e+09 2.732389e+09 2.726084e+12
which.min(c(fit1_test_mse, fit2_test_mse, 
            fit3_test_mse, fit4_test_mse))
[1] 3
  • fit4 has the lowest training MSE (to be expected)
  • fit3 has the lowest test MSE
    • We would choose fit3
  • Anything else interesting? Note that fit4’s test MSE is roughly a thousand times its training MSE, a classic symptom of overfitting (a side-by-side summary follows).
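
A tidier side-by-side view of the errors (illustrative only; this table is not in the slides):

# Collect all eight errors into one tibble, sorted by test MSE
tibble(model = paste0("fit", 1:4),
       train_mse = c(fit1_train_mse, fit2_train_mse, fit3_train_mse, fit4_train_mse),
       test_mse  = c(fit1_test_mse, fit2_test_mse, fit3_test_mse, fit4_test_mse)) |> 
  arrange(test_mse) |> 
  kable()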

K-Nearest Neighbors

Regression: Conditional Averaging

[Scatterplot: Restaurant Outlets Profit dataset, profit vs. population]

What is a good value of \(\hat{f}(x)\) (expected profit), say at \(x=6\)?

A possible choice is the average of the observed responses at \(x=6\). But we may not observe any responses at exactly the \(x\) values we care about.

K-Nearest Neighbors (KNN) Regression

  • Non-parametric approach
  • Formally: Given a value for \(K\) and a test data point \(x_0\), \[\hat{f}(x_0)=\dfrac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i=\text{Average} \ \left(y_i \ \text{for all} \ i:\ x_i \in \mathcal{N}_0\right) \] where \(\mathcal{N}_0\) is the set of the \(K\) training observations closest to \(x_0\).
  • Informally, average together the \(K\) “closest” observations in your training set (a hand-rolled sketch follows this list)
  • “Closeness”: usually use the Euclidean metric to measure distance
  • Euclidean distance between \(\mathbf{x}_i=(x_{i1}, x_{i2}, \ldots, x_{ip})\) and \(\mathbf{x}_j=(x_{j1}, x_{j2}, \ldots, x_{jp})\): \[||\mathbf{x}_i-\mathbf{x}_j||_2 = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \ldots + (x_{ip}-x_{jp})^2}\]
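
A hand-rolled version of this definition (a sketch only; it assumes the outlets data with population and profit columns used on the next slide):

# Manual KNN prediction at x0 = 6 with K = 5
x0 <- 6
outlets |> 
  mutate(dist = abs(population - x0)) |>  # Euclidean distance in one dimension
  slice_min(dist, n = 5) |>               # the neighborhood N_0
  summarize(pred = mean(profit))          # average the K responses

One caveat: the kknn engine used on the next slide applies distance-based kernel weighting by default, so its predictions can differ slightly from this unweighted average.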

KNN Regression (single predictor): Fit

\(K=1\)

knnfit1 <- nearest_neighbor(neighbors = 1) |> 
  set_engine("kknn") |> 
  set_mode("regression") |> 
  fit(profit ~ population, data = outlets)   # 1-nn regression
predict(knnfit1, new_data = tibble(population = 6)) |> kable()  # 1-nn prediction
.pred
0.92695

\(K=5\)

knnfit5 <- nearest_neighbor(neighbors = 5) |> 
  set_engine("kknn") |> 
  set_mode("regression") |> 
  fit(profit ~ population, data = outlets)   # 5-nn regression
predict(knnfit5, new_data = tibble(population = 6)) |> kable()  # 5-nn prediction
.pred
4.113736

Regression Methods: Comparison

Question!!!

As \(K\) in KNN regression increases (see the exploratory sketch after this list):

  • the flexibility of the fit (decreases/increases)
  • the bias of the fit (decreases/increases)
  • the variance of the fit (decreases/increases)
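
One way to build intuition empirically (an illustrative sketch reusing knnfit1 and knnfit5 from the earlier slide; the prediction grid is not from the slides):

# Trace each fitted curve over a fine grid of population values to compare flexibility
grid <- tibble(population = seq(min(outlets$population),
                                max(outlets$population), length.out = 200))
bind_rows(`K = 1` = mutate(grid, .pred = predict(knnfit1, new_data = grid)$.pred),
          `K = 5` = mutate(grid, .pred = predict(knnfit5, new_data = grid)$.pred),
          .id = "K") |> 
  ggplot(aes(population, .pred, color = K)) + 
  geom_line()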

K-Nearest Neighbors Regression (multiple predictors)

  • Let’s look at the ames data
ames |>
  select(Sale_Price, Gr_Liv_Area, Bedroom_AbvGr) |>
  head() |> 
  kable()
Sale_Price Gr_Liv_Area Bedroom_AbvGr
215000 1656 3
105000 896 2
172000 1329 3
244000 2110 3
189900 1629 3
195500 1604 3
  • Should 1 square foot count the same as 1 bedroom?
  • Need to center and scale (frequently just called “scaling”)
    • subtract each predictor’s mean
    • divide by each predictor’s standard deviation
    • puts all predictors on a comparable, apples-to-apples scale

Scaling in R

# scale predictors
ames_scaled <- tibble(size_scaled = scale(ames$Gr_Liv_Area),
                      num_bedrooms_scaled = scale(ames$Bedroom_AbvGr),
                      price = ames$Sale_Price)

head(ames_scaled) |> kable()  # first six observations
size_scaled num_bedrooms_scaled price
0.3092123 0.1760642 215000
-1.1942232 -1.0320576 105000
-0.3376606 0.1760642 172000
1.2073172 0.1760642 244000
0.2558008 0.1760642 189900
0.2063456 0.1760642 195500

Question…

  • What about the training and test sets?
  • Need to scale BOTH sets based on the mean and standard deviation of the training set…
  • Discussion: Why?
  • Discussion: Why don’t I need to center and scale Sale_Price?

Scaling Revisited

ames_train_scaled <- tibble(size_scaled = scale(ames_train$Gr_Liv_Area),
                            num_bedrooms_scaled = scale(ames_train$Bedroom_AbvGr),
                            price = ames_train$Sale_Price)

ames_test_scaled <- tibble(size_scaled = (ames_test$Gr_Liv_Area - mean(ames_train$Gr_Liv_Area))/sd(ames_train$Gr_Liv_Area),
                           num_bedrooms_scaled = (ames_test$Bedroom_AbvGr - mean(ames_train$Bedroom_AbvGr))/sd(ames_train$Bedroom_AbvGr),
                           price = ames_test$Sale_Price)
  • Next time: using recipes in tidymodels to simplify this process (a quick preview follows)
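
A quick preview (a sketch only, assuming the recipes package loaded with tidymodels):

# The same center-and-scale logic expressed as a recipe step
ames_rec <- recipe(Sale_Price ~ Gr_Liv_Area + Bedroom_AbvGr, data = ames_train) |> 
  step_normalize(all_numeric_predictors())  # centers and scales each predictor

ames_rec_prepped <- prep(ames_rec)               # estimates means/SDs from training data only
bake(ames_rec_prepped, new_data = ames_test) |>  # applies those training-set statistics to the test set
  head()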

K-Nearest Neighbors Regression (multiple predictors)

knnfit10 <- nearest_neighbor(neighbors = 10) |>   # 10-nn regression
  set_engine("kknn") |> 
  set_mode("regression") |> 
  fit(price ~ size_scaled + num_bedrooms_scaled, data = ames_train_scaled)
  • Test point: Gr_Liv_Area = 2000 square feet and Bedroom_AbvGr = 3; scale with the training-set statistics, then predict:
# obtain 10-nn prediction

predict(knnfit10, new_data = tibble(size_scaled = (2000 - mean(ames_train$Gr_Liv_Area))/sd(ames_train$Gr_Liv_Area),
                                     num_bedrooms_scaled = (3 - mean(ames_train$Bedroom_AbvGr))/sd(ames_train$Bedroom_AbvGr)))
# A tibble: 1 × 1
   .pred
   <dbl>
1 256380
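
For comparison (illustrative; output not shown), the same test point can be run through the linear model fit3, which needs no manual scaling:

# fit3 prediction at the same raw test point
predict(fit3, new_data = tibble(Gr_Liv_Area = 2000, Bedroom_AbvGr = 3))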