Goal: Predict Sale_Price.
```
Rows: 881
Columns: 20
$ Sale_Price    <int> 244000, 213500, 185000, 394432, 190000, 149000, 149900, …
$ Gr_Liv_Area   <int> 2110, 1338, 1187, 1856, 1844, NA, NA, 1069, 1940, 1544, …
$ Garage_Type   <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, …
$ Garage_Cars   <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2,…
$ Garage_Area   <dbl> 522, 582, 420, 834, 546, 480, 500, 440, 606, 868, 532, 7…
$ Street        <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pa…
$ Utilities     <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
$ Pool_Area     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Neighborhood  <fct> North_Ames, Stone_Brook, Gilbert, Stone_Brook, Northwest…
$ Screen_Porch  <int> 0, 0, 0, 0, 0, 0, 0, 165, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Overall_Qual  <fct> Good, Very_Good, Above_Average, Excellent, Above_Average…
$ Lot_Area      <int> 11160, 4920, 7980, 11394, 11751, 11241, 12537, 4043, 101…
$ Lot_Frontage  <dbl> 93, 41, 0, 88, 105, 0, 0, 53, 83, 94, 95, 90, 105, 61, 6…
$ MS_SubClass   <fct> One_Story_1946_and_Newer_All_Styles, One_Story_PUD_1946_…
$ Misc_Val      <int> 0, 0, 500, 0, 0, 700, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Open_Porch_SF <int> 0, 0, 21, 0, 122, 0, 0, 55, 95, 35, 70, 74, 130, 82, 48,…
$ TotRms_AbvGrd <int> 8, 6, 6, 8, 7, 5, 6, 4, 8, 7, 7, 7, 7, 6, 7, 7, 10, 7, 7…
$ First_Flr_SF  <int> 2110, 1338, 1187, 1856, 1844, 1004, 1078, 1069, 1940, 15…
$ Second_Flr_SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 563, 0, 886, 656, 11…
$ Year_Built    <int> 1968, 2001, 1992, 2010, 1977, 1970, 1971, 1977, 2009, 20…
```
We'll cover some common pre-processing tasks: encoding missing values as `NA` (instead of 0, -1, or "missing"), imputing them, removing near-zero-variance features, and encoding categorical features.

First, give `Overall_Qual` its natural level ordering:

```r
ames <- ames |>
  mutate(Overall_Qual = factor(Overall_Qual, levels = c("Very_Poor", "Poor",
                                                        "Fair", "Below_Average",
                                                        "Average", "Above_Average",
                                                        "Good", "Very_Good",
                                                        "Excellent", "Very_Excellent")))
```
```r
ames |> pull(Overall_Qual) |> levels()
```

```
[1] "Very_Poor"      "Poor"           "Fair"           "Below_Average"
[5] "Average"        "Above_Average"  "Good"           "Very_Good"
[9] "Excellent"      "Very_Excellent"
```
A common heuristic for detecting near-zero-variance features (the default in `caret::nearZeroVar()`): flag a feature when the frequency ratio of its most common value to its second most common value exceeds 95/5 = 19 *and* the percentage of unique values is below 10%. The sketch after the table shows the call that likely produced these metrics.

| variable | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| Sale_Price | 1.000000 | 55.7321226 | FALSE | FALSE |
| Gr_Liv_Area | 1.333333 | 62.9965948 | FALSE | FALSE |
| Garage_Type | 2.196581 | 0.6810443 | FALSE | FALSE |
| Garage_Cars | 1.970213 | 0.5675369 | FALSE | FALSE |
| Garage_Area | 2.250000 | 38.0249716 | FALSE | FALSE |
| Street | 219.250000 | 0.2270148 | FALSE | TRUE |
| Utilities | 880.000000 | 0.2270148 | FALSE | TRUE |
| Pool_Area | 876.000000 | 0.6810443 | FALSE | TRUE |
| Neighborhood | 1.476744 | 2.9511918 | FALSE | FALSE |
| Screen_Porch | 199.750000 | 6.6969353 | FALSE | TRUE |
| Overall_Qual | 1.119816 | 1.1350738 | FALSE | FALSE |
| Lot_Area | 1.071429 | 79.7956867 | FALSE | FALSE |
| Lot_Frontage | 1.617021 | 11.5777526 | FALSE | FALSE |
| MS_SubClass | 1.959064 | 1.7026107 | FALSE | FALSE |
| Misc_Val | 141.833333 | 1.9296254 | FALSE | TRUE |
| Open_Porch_SF | 23.176471 | 19.2962543 | FALSE | FALSE |
| TotRms_AbvGrd | 1.311225 | 1.2485812 | FALSE | FALSE |
| First_Flr_SF | 1.777778 | 63.7911464 | FALSE | FALSE |
| Second_Flr_SF | 64.250000 | 31.3280363 | FALSE | FALSE |
| Year_Built | 1.125000 | 12.0317821 | FALSE | FALSE |
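A minimal sketch of the call that likely produced the table above, assuming the `caret` package:

```r
library(caret)

# freqCut = 95/5 and uniqueCut = 10 are caret's defaults;
# saveMetrics = TRUE returns the full metrics table instead of column indices
nearZeroVar(ames, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
```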
The `recipes` package provides several imputation steps (see the sketch after this list):

- `step_impute_median`: used for numeric (especially discrete) variables
- `step_impute_mean`: used for numeric variables
- `step_impute_knn`: used for both numeric and categorical variables (computationally expensive)
- `step_impute_mode`: used for nominal (unordered) categorical variables
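A minimal sketch of how these steps slot into a recipe; the variable/step pairings here are illustrative, not taken from the original:

```r
library(recipes)

impute_demo <- recipe(Sale_Price ~ ., data = ames) |>
  step_impute_median(Garage_Cars) |>           # numeric, discrete
  step_impute_mean(Lot_Frontage) |>            # numeric, continuous
  step_impute_mode(Garage_Type) |>             # nominal categorical
  step_impute_knn(Gr_Liv_Area, neighbors = 5)  # numeric or categorical; slower
```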
`Garage_Type` is missing when the house has no garage, so we replace those `NA`s with `No_Garage` (see the `mutate()` call further down). `step_impute_knn` uses Gower's distance, so there is no need to normalize the predictors first.

There are two types of categorical features: **ordinal** (ordered) and **nominal** (unordered). `Overall_Qual` is ordinal:
[1] "Very_Poor" "Poor" "Fair" "Below_Average"
[5] "Average" "Above_Average" "Good" "Very_Good"
[9] "Excellent" "Very_Excellent"
`step_integer` encodes the levels in order: Very_Poor = 1, Poor = 2, Fair = 3, etc. `Neighborhood`, in contrast, is nominal, and several of its levels are rare:

| Neighborhood | n |
|---|---|
| North_Ames | 127 |
| College_Creek | 86 |
| Old_Town | 83 |
| Edwards | 49 |
| Somerset | 50 |
| Northridge_Heights | 52 |
| Gilbert | 47 |
| Sawyer | 49 |
| Northwest_Ames | 41 |
| Sawyer_West | 31 |
| Mitchell | 33 |
| Brookside | 33 |
| Crawford | 22 |
| Iowa_DOT_and_Rail_Road | 28 |
| Timberland | 21 |
| Northridge | 22 |
| Stone_Brook | 17 |
| South_and_West_of_Iowa_State_University | 21 |
| Clear_Creek | 16 |
| Meadow_Village | 14 |
| Briardale | 10 |
| Bloomington_Heights | 10 |
| Veenker | 9 |
| Northpark_Villa | 3 |
| Blueste | 3 |
| Greens | 4 |
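The counts above can presumably be reproduced with a simple `count()`; the rare levels are what motivate `step_other()` below:

```r
# one row per neighborhood, in factor-level order
ames |> count(Neighborhood)
```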
```r
preproc <- recipe(Sale_Price ~ ., data = ames) |>
  step_nzv(all_predictors()) |>               # remove zero- or near-zero-variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |> # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) |>               # convert Overall_Qual into an ordinal (integer) encoding
  step_other(Neighborhood, threshold = 0.01, other = "Other") # lump levels with < 1% representation into "Other"
```

*Figure 3.9: Machine Learning with R*
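One way (not shown in the original) to sanity-check what the recipe does is to prep it and bake the training data:

```r
preproc |>
  prep() |>                 # estimate each step from the data
  bake(new_data = NULL) |>  # apply the steps to the training data
  glimpse()
```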
```r
preproc <- recipe(Sale_Price ~ ., data = ames) |>
  step_nzv(all_predictors()) |>
  step_impute_knn(Year_Built, Gr_Liv_Area) |>
  step_integer(Overall_Qual) |>
  step_other(Neighborhood, threshold = 0.01, other = "Other") |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) # in general use one_hot = TRUE unless doing linear regression
```
```r
preproc <- recipe(Sale_Price ~ ., data = ames) |>
  step_nzv(all_predictors()) |>
  step_impute_knn(Year_Built, Gr_Liv_Area) |>
  step_integer(Overall_Qual) |>
  step_other(Neighborhood, threshold = 0.01, other = "Other") |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_normalize(all_numeric_predictors()) # center and scale all numeric predictors
```

Questions to ask:
```r
ames <- ames |>
  mutate(Overall_Qual = factor(Overall_Qual, levels = c("Very_Poor", "Poor",
                                                        "Fair", "Below_Average",
                                                        "Average", "Above_Average",
                                                        "Good", "Very_Good",
                                                        "Excellent", "Very_Excellent")),
         # if_else() is type-strict, so convert the factor to character
         # before substituting the "No_Garage" level
         Garage_Type = if_else(is.na(Garage_Type), "No_Garage", as.character(Garage_Type)),
         Garage_Type = as_factor(Garage_Type))
```
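Each split below holds 594 + 66 = 660 rows, consistent with a 75% initial split of the 881 observations. The call that built the folds isn't shown; a minimal sketch, assuming `rsample` and the `ames_train` name used in the recipes below:

```r
library(rsample)

set.seed(123)                                   # hypothetical seed
ames_split <- initial_split(ames, prop = 0.75)  # assumed proportion
ames_train <- training(ames_split)

# 10-fold cross-validation repeated 10 times
ames_folds <- vfold_cv(ames_train, v = 10, repeats = 10)
ames_folds
```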
```
# A tibble: 100 × 3
   splits           id       id2
   <list>           <chr>    <chr>
 1 <split [594/66]> Repeat01 Fold01
 2 <split [594/66]> Repeat01 Fold02
 3 <split [594/66]> Repeat01 Fold03
 4 <split [594/66]> Repeat01 Fold04
 5 <split [594/66]> Repeat01 Fold05
 6 <split [594/66]> Repeat01 Fold06
 7 <split [594/66]> Repeat01 Fold07
 8 <split [594/66]> Repeat01 Fold08
 9 <split [594/66]> Repeat01 Fold09
10 <split [594/66]> Repeat01 Fold10
# ℹ 90 more rows
```
```r
lm_preproc <- recipe(Sale_Price ~ ., data = ames_train) |>
  step_nzv(all_predictors()) |>               # remove zero- or near-zero-variance predictors
  step_impute_knn(Year_Built, Gr_Liv_Area) |> # impute missing values in Year_Built and Gr_Liv_Area
  step_integer(Overall_Qual) |>               # convert Overall_Qual into an ordinal (integer) encoding
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Other") |> # lump levels with < 1% representation into "Other", per variable
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |> # one_hot = FALSE for linear regression, which needs a reference level
  step_corr(all_numeric_predictors(), threshold = 0.5) |>  # remove highly correlated predictors
  step_lincomb(all_numeric_predictors())      # remove exact linear combinations
```
```r
knn_preproc <- recipe(Sale_Price ~ ., data = ames_train) |>
  step_nzv(all_predictors()) |>
  step_impute_knn(Year_Built, Gr_Liv_Area) |>
  step_integer(Overall_Qual) |>
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Other") |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |> # one_hot = TRUE is fine for KNN
  step_nzv(all_predictors()) |>            # run again: dummying can create new near-zero-variance columns
  step_normalize(all_numeric_predictors()) # KNN is distance-based, so center and scale
```

| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| rmse | standard | 39793.36 | 100 | 768.846396 | Preprocessor1_Model1 |
| rsq | standard | 0.7548961 | 100 | 0.008863 | Preprocessor1_Model1 |

| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| rmse | standard | 40063.355604 | 100 | 963.9908286 | Preprocessor1_Model1 |
| rsq | standard | 0.758572 | 100 | 0.0063228 | Preprocessor1_Model1 |

| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| rmse | standard | 39572.17 | 100 | 1011.2439632 | Preprocessor1_Model1 |
| rsq | standard | 0.7673612 | 100 | 0.0065484 | Preprocessor1_Model1 |
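The metric tables above were presumably produced by fitting workflows over the resamples and collecting metrics; a minimal sketch for the linear model, assuming the `ames_folds` object from earlier (the original does not identify which model each table belongs to):

```r
library(tidymodels)

# pair the linear-model recipe with a linear regression spec
lm_wf <- workflow() |>
  add_recipe(lm_preproc) |>
  add_model(linear_reg())

# fit and evaluate on each of the 100 resamples
lm_res <- fit_resamples(lm_wf, resamples = ames_folds)
collect_metrics(lm_res)  # mean RMSE and R-squared across resamples
```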