Split the ames data set into two parts, using tidymodels:
set.seed(427) # Why?
ames_split <- initial_split(ames, prop = 0.70, strata = Sale_Price) # initialize 70/30
ames_split
<Training/Testing/Total>
<2049/881/2930>
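The `strata` argument in the call above asks for the split to be drawn within groups of the response rather than from the whole data set at once. A rough base-R sketch of that idea follows, assuming a numeric response binned into quartiles. The `toy` data frame, its simulated prices, and all variable names are hypothetical stand-ins, and this is not the rsample implementation.

```r
set.seed(427)

# Hypothetical stand-in for ames: 200 simulated, right-skewed "prices"
toy <- data.frame(id = 1:200,
                  price = exp(rnorm(200, mean = 12, sd = 0.4)))

# Bin the response into quartiles so each bin holds about 25% of the rows
breaks <- quantile(toy$price, probs = seq(0, 1, by = 0.25))
toy$bin <- cut(toy$price, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Draw 70% of the rows separately within each bin
train_ids <- unlist(lapply(split(toy$id, toy$bin), function(ids) {
  ids[sample.int(length(ids), size = round(0.70 * length(ids)))]
}))

toy_train <- toy[toy$id %in% train_ids, ]
toy_test  <- toy[!toy$id %in% train_ids, ]

# Share of each price bin that landed in the training set
table(toy_train$bin) / table(toy$bin)
```

Because the 70% draw happens inside every bin, cheap and expensive houses are represented in the training and test sets in roughly the same proportions.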
strata is not required, but it is good practice
strata applies stratified sampling on the variable you specify (very little downside)

Consider four models with Sale_Price as the response:
Bedroom_AbvGr as the only predictor
Gr_Liv_Area as the only predictor
Gr_Liv_Area and Bedroom_AbvGr as predictors
a 10th-degree polynomial in Gr_Liv_Area and a 2nd-degree polynomial in Bedroom_AbvGr

fit1 <- mlr_model |> fit(Sale_Price ~ Bedroom_AbvGr, data = ames_train) # Use only training set
fit2 <- mlr_model |> fit(Sale_Price ~ Gr_Liv_Area, data = ames_train)
fit3 <- mlr_model |> fit(Sale_Price ~ Gr_Liv_Area + Bedroom_AbvGr, data = ames_train)
fit4 <- mlr_model |> fit(Sale_Price ~ poly(Gr_Liv_Area, degree = 10) + poly(Bedroom_AbvGr, degree = 2), data = ames_train)

# Fit 1
fit1_train_mse <- mean((ames_train$Sale_Price - predict(fit1, new_data = ames_train)$.pred)^2)
fit1_test_mse <- mean((ames_test$Sale_Price - predict(fit1, new_data = ames_test)$.pred)^2)
# Fit 2
fit2_train_mse <- mean((ames_train$Sale_Price - predict(fit2, new_data = ames_train)$.pred)^2)
fit2_test_mse <- mean((ames_test$Sale_Price - predict(fit2, new_data = ames_test)$.pred)^2)
# Fit 3
fit3_train_mse <- mean((ames_train$Sale_Price - predict(fit3, new_data = ames_train)$.pred)^2)
fit3_test_mse <- mean((ames_test$Sale_Price - predict(fit3, new_data = ames_test)$.pred)^2)
# Fit 4
fit4_train_mse <- mean((ames_train$Sale_Price - predict(fit4, new_data = ames_train)$.pred)^2)
fit4_test_mse <- mean((ames_test$Sale_Price - predict(fit4, new_data = ames_test)$.pred)^2)

Without looking at the numbers:
Can we tell which is the smallest of fit1_train_mse, fit2_train_mse, fit3_train_mse, fit4_train_mse? Yes, fit4_train_mse
Can we tell which is the smallest of fit1_test_mse, fit2_test_mse, fit3_test_mse, fit4_test_mse? No

Training MSEs (fit1 to fit4) and the index of the smallest, followed by the same for the test MSEs:
[1] 6213135279 3188099910 2781293767 2472424544
[1] 4
[1] 6.329031e+09 3.203895e+09 2.732389e+09 2.726084e+12
[1] 3
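The same pattern can be reproduced on simulated data with base R's `lm()`. This is only a sketch: the data frame `sim`, the helper `mse()`, and the three nested models are hypothetical and are not the fits above. It shows why the most flexible of a set of nested least-squares models must have the smallest training MSE, while nothing similar is guaranteed on the test set.

```r
set.seed(1)

# Hypothetical simulated data: y depends linearly on x1 and x2
n <- 300
sim <- data.frame(x1 = runif(n, 0, 10),
                  x2 = sample(1:5, n, replace = TRUE))
sim$y <- 3 + 2 * sim$x1 + 1.5 * sim$x2 + rnorm(n, sd = 2)

train <- sim[1:200, ]
test  <- sim[201:300, ]

# Mean squared error of a fitted lm on a given data set
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Three nested models, fitted on the training set only
models <- list(
  m1 = lm(y ~ x1, data = train),
  m2 = lm(y ~ x1 + x2, data = train),
  m3 = lm(y ~ poly(x1, 10) + poly(x2, 2), data = train)
)

train_mse <- sapply(models, mse, data = train)
test_mse  <- sapply(models, mse, data = test)

train_mse
which.min(train_mse)  # always the most flexible model
test_mse
which.min(test_mse)   # depends on the data
```

Each model contains the previous one as a special case, so least squares can never do worse on the training data by adding terms. The test MSE carries no such guarantee, which is why it has to be computed rather than reasoned out.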
fit4 has the lowest training MSE (to be expected)
fit3 has the lowest test MSE
Choose fit3

Restaurant Outlets Profit dataset
What is a good value of \(\hat{f}(x)\) (expected profit), say at \(x=6\)?
A possible choice is the average of the observed responses at \(x=6\). But we may not observe responses for certain \(x\) values.
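One way around missing \(x\) values is to average the responses of the \(K\) observations whose \(x\) is closest to the target, which is the idea behind KNN regression. A minimal base-R sketch follows. The vectors `x` and `y` are made-up numbers, not the restaurant data, and `knn_predict()` is a hypothetical helper written for illustration.

```r
# Hypothetical toy data: note that x = 6 is never observed
x <- c(1, 2, 3, 4, 5, 7, 8, 9, 10)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 14.2, 15.9, 18.1, 20.3)

# Predict at x0 by averaging the responses of the k nearest x values
knn_predict <- function(x0, x, y, k) {
  nearest <- order(abs(x - x0))[seq_len(k)]
  mean(y[nearest])
}

knn_predict(6, x, y, k = 2)          # averages the responses at x = 5 and x = 7
knn_predict(6, x, y, k = length(x))  # uses every point, so equals mean(y)
```

With a small `k` the prediction follows the nearby points closely. Once `k` reaches the sample size, every prediction collapses to the overall mean of `y`, whatever the value of `x0`.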
As \(K\) in KNN regression increases:
ames data

| Sale_Price | Gr_Liv_Area | Bedroom_AbvGr |
|---|---|---|
| 215000 | 1656 | 3 |
| 105000 | 896 | 2 |
| 172000 | 1329 | 3 |
| 244000 | 2110 | 3 |
| 189900 | 1629 | 3 |
| 195500 | 1604 | 3 |
# scale predictors
ames_scaled <- tibble(size_scaled = scale(ames$Gr_Liv_Area),
num_bedrooms_scaled = scale(ames$Bedroom_AbvGr),
price = ames$Sale_Price)
head(ames_scaled) |> kable() # first six observations

| size_scaled | num_bedrooms_scaled | price |
|---|---|---|
| 0.3092123 | 0.1760642 | 215000 |
| -1.1942232 | -1.0320576 | 105000 |
| -0.3376606 | 0.1760642 | 172000 |
| 1.2073172 | 0.1760642 | 244000 |
| 0.2558008 | 0.1760642 | 189900 |
| 0.2063456 | 0.1760642 | 195500 |
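Scaling matters for KNN because neighbours are found by distance, and a predictor measured in large units can dominate that distance. The sketch below reuses the six rows shown in the two tables above and compares Euclidean distances from the first house, before and after scaling. The helper `dist_to_first()` and the data frame name `ames_head` are hypothetical.

```r
# The six rows from the tables above (raw and scaled predictors)
ames_head <- data.frame(
  Gr_Liv_Area         = c(1656, 896, 1329, 2110, 1629, 1604),
  Bedroom_AbvGr       = c(3, 2, 3, 3, 3, 3),
  size_scaled         = c(0.3092123, -1.1942232, -0.3376606,
                          1.2073172, 0.2558008, 0.2063456),
  num_bedrooms_scaled = c(0.1760642, -1.0320576, 0.1760642,
                          0.1760642, 0.1760642, 0.1760642)
)

# Euclidean distance from every row to the first row, using the given columns
dist_to_first <- function(cols) {
  m <- as.matrix(ames_head[, cols])
  sqrt(rowSums(sweep(m, 2, m[1, ])^2))
}

raw_dist    <- dist_to_first(c("Gr_Liv_Area", "Bedroom_AbvGr"))
scaled_dist <- dist_to_first(c("size_scaled", "num_bedrooms_scaled"))

round(raw_dist, 2)
round(scaled_dist, 2)
```

For the second house, the one-bedroom difference adds almost nothing to the raw distance next to the 760 square foot gap in living area. After scaling, both predictors contribute on a comparable footing.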
How should the training and test sets be scaled before predicting Sale_Price?

ames_train_scaled <- tibble(size_scaled = scale(ames_train$Gr_Liv_Area),
num_bedrooms_scaled = scale(ames_train$Bedroom_AbvGr),
price = ames_train$Sale_Price)
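The test set has to be scaled with the training mean and standard deviation, not its own, so that both sets live on the same scale and no test information leaks into the preprocessing. In base R, one way to sketch this is to reuse the centring and scaling attributes that `scale()` stores on its result. The data frames `train` and `test` below are small hypothetical examples, not the ames split.

```r
# Hypothetical toy training and test sets
train <- data.frame(area = c(1000, 1500, 2000, 2500), beds = c(2, 3, 3, 4))
test  <- data.frame(area = c(1200, 3000), beds = c(2, 5))

# Scale the training set; scale() records the means and sds it used
train_scaled <- scale(train)
train_center <- attr(train_scaled, "scaled:center")
train_sd     <- attr(train_scaled, "scaled:scale")

# Apply the training means and sds to the test set
test_scaled <- scale(test, center = train_center, scale = train_sd)

train_center
train_sd
test_scaled
```

The scaled test columns will generally not have mean 0 and standard deviation 1, which is expected since they were standardised against the training data.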
ames_test_scaled <- tibble(size_scaled = (ames_test$Gr_Liv_Area - mean(ames_train$Gr_Liv_Area))/sd(ames_train$Gr_Liv_Area), # scale with the training mean and sd
num_bedrooms_scaled = (ames_test$Bedroom_AbvGr - mean(ames_train$Bedroom_AbvGr))/sd(ames_train$Bedroom_AbvGr),
price = ames_test$Sale_Price)

Use recipes in tidymodels to simplify this process

If Gr_Liv_Area = 2000 square feet, and Bedroom_AbvGr = 3, then