Split the ames data set into two parts with tidymodels:
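library(tidymodels)
set.seed(427) # Why?
ames_split <- initial_split(ames, prop = 0.70, strata = Sale_Price) # 70/30 train/test split
ames_train <- training(ames_split) # training set, used to fit the models
ames_test <- testing(ames_split) # test set, held out for evaluation
ames_split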
<Training/Testing/Total>
<2049/881/2930>
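The "Why?" next to set.seed() has a short answer: the split is random, so without a fixed seed every run would give a different training set and different MSE values. A minimal base R sketch of the idea, using sample() as a stand-in for the random split (the vector 1:100 and the sample size are arbitrary choices, not part of the ames analysis):

```r
# Same seed, same "random" draw
set.seed(427)
first_draw <- sample(1:100, 5)

set.seed(427)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE: resetting the seed reproduces the draw
```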
- strata: not necessary but good practice
- strata will use stratified sampling on the variable you specify (very little downside)
Consider four models with Sale_Price as the response:
- Bedroom_AbvGr as the only predictor
- Gr_Liv_Area as the only predictor
- Gr_Liv_Area and Bedroom_AbvGr as predictors
- a 10th degree polynomial in Gr_Liv_Area and a 2nd degree polynomial in Bedroom_AbvGr
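# mlr_model is assumed to be a linear regression specification defined earlier, e.g. linear_reg() |> set_engine("lm")
fit1 <- mlr_model |> fit(Sale_Price ~ Bedroom_AbvGr, data = ames_train) # Use only training set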
fit2 <- mlr_model |> fit(Sale_Price ~ Gr_Liv_Area, data = ames_train)
fit3 <- mlr_model |> fit(Sale_Price ~ Gr_Liv_Area + Bedroom_AbvGr, data = ames_train)
fit4 <- mlr_model |> fit(Sale_Price ~ poly(Gr_Liv_Area, degree = 10) + poly(Bedroom_AbvGr, degree = 2), data = ames_train)
# Fit 1
fit1_train_mse <- mean((ames_train$Sale_Price - predict(fit1, new_data = ames_train)$.pred)^2)
fit1_test_mse <- mean((ames_test$Sale_Price - predict(fit1, new_data = ames_test)$.pred)^2)
# Fit 2
fit2_train_mse <- mean((ames_train$Sale_Price - predict(fit2, new_data = ames_train)$.pred)^2)
fit2_test_mse <- mean((ames_test$Sale_Price - predict(fit2, new_data = ames_test)$.pred)^2)
# Fit 3
fit3_train_mse <- mean((ames_train$Sale_Price - predict(fit3, new_data = ames_train)$.pred)^2)
fit3_test_mse <- mean((ames_test$Sale_Price - predict(fit3, new_data = ames_test)$.pred)^2)
# Fit 4
fit4_train_mse <- mean((ames_train$Sale_Price - predict(fit4, new_data = ames_train)$.pred)^2)
fit4_test_mse <- mean((ames_test$Sale_Price - predict(fit4, new_data = ames_test)$.pred)^2)
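The eight MSE lines above all apply one formula: the mean of the squared differences between observed and predicted values. A small helper makes that explicit. This is only a sketch; mse() is a hypothetical name introduced here, not a tidymodels function, and the numbers in the example are made up.

```r
# Mean squared error between observed values and predictions
mse <- function(truth, estimate) {
  stopifnot(length(truth) == length(estimate))
  mean((truth - estimate)^2)
}

# Made-up example: squared errors are 1, 0 and 4, so the MSE is 5/3
example_mse <- mse(truth = c(3, 5, 7), estimate = c(2, 5, 9))
example_mse
```

With such a helper, fit1_train_mse could be written as mse(ames_train$Sale_Price, predict(fit1, new_data = ames_train)$.pred).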
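Without looking at the numbers, can we tell which is smallest?
- Among fit1_train_mse, fit2_train_mse, fit3_train_mse, fit4_train_mse? Yes, fit4_train_mse: fit4 nests the other three models, so its training MSE cannot be larger than theirs.
- Among fit1_test_mse, fit2_test_mse, fit3_test_mse, fit4_test_mse? No
Training MSEs for fit1 through fit4, followed by the index of the smallest:
[1] 6213135279 3188099910 2781293767 2472424544
[1] 4
Test MSEs for fit1 through fit4, followed by the index of the smallest:
[1] 6.329031e+09 3.203895e+09 2.732389e+09 2.726084e+12
[1] 3
- fit4 has the lowest training MSE (to be expected)
- fit3 has the lowest test MSE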
fit3
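The same pattern, where added flexibility always helps on the training set but not necessarily on the test set, can be checked on simulated data. The sketch below uses base R only; the seed, sample size, noise level, and polynomial degrees are arbitrary assumptions and have nothing to do with the ames results.

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 5)         # simulated data: the true relationship is linear
train <- data.frame(x = x[1:140], y = y[1:140])
test  <- data.frame(x = x[141:n], y = y[141:n])

mse <- function(truth, estimate) mean((truth - estimate)^2)

# Fit polynomials of increasing degree on the training set only
degrees <- c(1, 2, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, degree = d), data = train)
  c(degree    = d,
    train_mse = mse(train$y, predict(fit, newdata = train)),
    test_mse  = mse(test$y,  predict(fit, newdata = test)))
}))
results  # training MSE never increases with degree; test MSE carries no such guarantee
```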
Restaurant Outlets Profit dataset
What is a good value of \(\hat{f}(x)\) (expected profit), say at \(x=6\)?
A possible choice is the average of the observed responses at \(x=6\). But we may not observe responses for certain \(x\) values.
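One way around missing responses at a given \(x\) is to average the responses of the \(K\) observations whose \(x\) values are closest, which is the idea behind KNN regression. A minimal base R sketch with one predictor; knn_predict() is a hypothetical helper and the data are made up, not the restaurant profit data.

```r
# Predict at x0 by averaging the responses of the k nearest training points
knn_predict <- function(x_train, y_train, x0, k) {
  nearest <- order(abs(x_train - x0))[seq_len(k)]  # indices of the k closest x values
  mean(y_train[nearest])
}

# Made-up data with no observation at x = 6
x <- c(4, 5.2, 5.5, 7, 8, 10)
y <- c(10, 12, 13, 18, 20, 25)

pred_k2 <- knn_predict(x, y, x0 = 6, k = 2)          # averages y at x = 5.5 and x = 5.2
pred_kn <- knn_predict(x, y, x0 = 6, k = length(x))  # uses every point: the overall mean of y
c(pred_k2, pred_kn)
```

With k equal to the number of observations, the prediction is the same at every \(x\), which is the least flexible fit KNN can give.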
As \(K\) in KNN regression increases:
ames data
Sale_Price | Gr_Liv_Area | Bedroom_AbvGr |
---|---|---|
215000 | 1656 | 3 |
105000 | 896 | 2 |
172000 | 1329 | 3 |
244000 | 2110 | 3 |
189900 | 1629 | 3 |
195500 | 1604 | 3 |
# scale predictors
ames_scaled <- tibble(size_scaled = scale(ames$Gr_Liv_Area),
num_bedrooms_scaled = scale(ames$Bedroom_AbvGr),
price = ames$Sale_Price)
head(ames_scaled) |> kable() # first six observations
size_scaled | num_bedrooms_scaled | price |
---|---|---|
0.3092123 | 0.1760642 | 215000 |
-1.1942232 | -1.0320576 | 105000 |
-0.3376606 | 0.1760642 | 172000 |
1.2073172 | 0.1760642 | 244000 |
0.2558008 | 0.1760642 | 189900 |
0.2063456 | 0.1760642 | 195500 |
Sale_Price?
ames_train_scaled <- tibble(size_scaled = scale(ames_train$Gr_Liv_Area),
num_bedrooms_scaled = scale(ames_train$Bedroom_AbvGr),
price = ames_train$Sale_Price)
# scale the test set with the training means and standard deviations
ames_test_scaled <- tibble(size_scaled = (ames_test$Gr_Liv_Area - mean(ames_train$Gr_Liv_Area))/sd(ames_train$Gr_Liv_Area),
num_bedrooms_scaled = (ames_test$Bedroom_AbvGr - mean(ames_train$Bedroom_AbvGr))/sd(ames_train$Bedroom_AbvGr),
price = ames_test$Sale_Price)
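The manual scaling above can also be written as a recipe, which estimates the means and standard deviations on the training data and then applies them to any new data. The sketch below is one possible way to do this, not necessarily the course's own code. It assumes the recipes package is installed, uses the six ames rows shown in the table earlier as a stand-in training set, and scales a hypothetical new home; toy_train, toy_new, and scale_rec are names introduced only for illustration.

```r
library(recipes)

# Stand-in training set: the six ames rows shown earlier
toy_train <- data.frame(
  Sale_Price    = c(215000, 105000, 172000, 244000, 189900, 195500),
  Gr_Liv_Area   = c(1656, 896, 1329, 2110, 1629, 1604),
  Bedroom_AbvGr = c(3, 2, 3, 3, 3, 3)
)
# Hypothetical new home to be scaled with the training statistics
toy_new <- data.frame(Gr_Liv_Area = 2000, Bedroom_AbvGr = 3)

scale_rec <- recipe(Sale_Price ~ Gr_Liv_Area + Bedroom_AbvGr, data = toy_train) |>
  step_normalize(all_numeric_predictors()) |>  # center and scale the predictors
  prep()                                       # estimates means and sds from toy_train only

train_scaled <- bake(scale_rec, new_data = NULL)     # scaled training data
new_scaled   <- bake(scale_rec, new_data = toy_new)  # new data, scaled with training statistics
new_scaled
```

In the ames analysis, ames_train would take the place of toy_train and ames_test the place of toy_new.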
Use recipes in tidymodels to simplify this process
If Gr_Liv_Area = 2000 square feet, and Bedroom_AbvGr = 3, then