MATH 427: Decision Trees Continued

Eric Friedlander

Announcements

  • Job Application Discussion
  • Job Interviews
  • Admissions Data Project

Job Application Discussion

  • We missed the mark a bit.
  • Resumes and CVs were good. Sample analysis were… not.
  • Biggest issue: professionalism and editing.
  • If the majority of your analysis was code, you probably got a bad grade.
  • Next time:
    • Explain everything that you are doing and justify it!
    • Proofread your document!
    • Make it look nice!

Job Application Discussion

  • Some common themes:
    • Y’all love the word “hone”
    • Y’all love the phrase “actionable insights”… so do I… but still
    • I think some of you are overselling yourselves
    • In cover letter, use fewer “buzz-words” and include more substance

Job Interviews

  • Link to Resources
  • Two Rounds
    • First: Screening interview with Dani (Schedule by Wednesday)
      • Due next Friday, April 11th
    • Second: Technical interview with me (Due last day of class)
      • Note that we’ll be doing a lot of stuff between now and then so try to get this done sooner rather than later

Project

  • Project Instructions
  • Brian Bava visiting on Friday to discuss
  • DO ASAP: Sign data agreement
  • Do by Wednesday: Load the data and review instructions. We will have a discussion about cleaning the data on Wednesday. You are all expected to contribute.
  • Do by Friday: Explore the data. Your team is expected to bring at least three substantial questions for Brian.

Computational Set-Up

library(tidyverse)
library(tidymodels)
library(dsbox) # dcbikeshare data
library(knitr)

tidymodels_prefer()

set.seed(427)

Last Time

  • Regression Trees in R
  • Warm-up question: What is cost complexity tuning?

Regression Trees in R

Data: dcbikeshare

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. As of May 2018, there are about over 1600 bike-sharing programs around the world, providing more than 18 million bicycles for public use. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues. Documentation

glimpse(dcbikeshare)
Rows: 731
Columns: 16
$ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ dteday     <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
$ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ weekday    <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
$ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
$ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
$ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
$ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
$ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
$ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
$ casual     <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
$ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
$ cnt        <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…

Cleaning the Data

dcbikeshare_clean <- dcbikeshare |> 
  select(-instant, -dteday, -casual, -registered, -yr) |> 
  mutate(
    season = as_factor(case_when(
      season == 1 ~ "winter",
      season == 2 ~ "spring",
      season == 3 ~ "summer",
      season == 4 ~ "fall"
    )),
    mnth = as_factor(mnth),
    weekday = as_factor(weekday),
    weathersit = as_factor(weathersit)
  )

Split the Data

set.seed(427)

bike_split <- initial_split(dcbikeshare_clean, prop = 0.7, strata = cnt)
bike_train <- training(bike_split)
bike_test <- testing(bike_split)

Recipe

bike_recipe <- recipe(cnt ~ ., data = bike_train) |>   # set up recipe
  step_integer(season, mnth, weekday) |>   # numeric conversion of levels of the predictors
  step_dummy(all_nominal(), one_hot = TRUE)  # one-hot/dummy encode nominal categorical predictors

Define Model Workflow

dec_tree_lowcc <- decision_tree(cost_complexity = 10^(-4)) |> 
  set_engine("rpart") |> 
  set_mode("regression")

dec_tree_highcc <- decision_tree(cost_complexity = 0.1) |> 
  set_engine("rpart") |> 
  set_mode("regression")

Visualize

library(rpart.plot)
workflow() |> 
  add_recipe(bike_recipe) |> 
  add_model(dec_tree_lowcc) |> 
  fit(bike_train) |>
  extract_fit_engine() |> 
  rpart.plot()

Visualize

Visualize

library(rpart.plot)
workflow() |> 
  add_recipe(bike_recipe) |> 
  add_model(dec_tree_highcc) |> 
  fit(bike_train) |>
  extract_fit_engine() |> 
  rpart.plot()

Visualize

Defin Model Workflow with Tuning

dec_tree <- decision_tree(cost_complexity = tune()) |> 
  set_engine("rpart") |> 
  set_mode("regression")

dt_wf <- workflow() |> 
  add_recipe(bike_recipe) |> 
  add_model(dec_tree)

Define Folds and Tuning Grid

bike_folds <- vfold_cv(bike_train, v = 5, repeats = 10)

cp_grid <- grid_regular(cost_complexity(range = c(-4, -1)), # I had to play around with these 
                             levels = 20)

Tuning CP

tuning_cp_results <- tune_grid(
  dt_wf,
  resamples= bike_folds,
  grid = cp_grid
)

Plot Results

autoplot(tuning_cp_results)

Select Best Trees

best_tree <- select_best(tuning_cp_results)
best_tree |> kable()
cost_complexity .config
0.0054556 Preprocessor1_Model12
ose_tree <- select_by_one_std_err(tuning_cp_results, desc(cost_complexity))
ose_tree |> kable()
cost_complexity .config
0.0078476 Preprocessor1_Model13

Fit Best Tree

best_tree <- select_best(tuning_cp_results)
best_wf <- finalize_workflow(dt_wf, best_tree)
best_model <- best_wf |> fit(bike_train)
best_model |> 
  extract_fit_engine() |> 
  rpart.plot()

Fit OSE Tree

ose_tree <- select_best(tuning_cp_results)
ose_wf <- finalize_workflow(dt_wf, ose_tree)
ose_model <- ose_wf |> fit(bike_train)
ose_model |> 
  extract_fit_engine() |> 
  rpart.plot()

Questions

  • Why are both models the same but have different RMSE estimates from CV?
  • What’s the difference between encoding mnth as an ordinal variable vs. a one-hot encoding?

Decision Trees

  • Advantages
    • Easy to explain and interpret
    • Closely mirror human decision-making
    • Can be displayed graphically, and are easily interpreted by non-experts
    • Does not require standardization of predictors
    • Can handle missing data directly
    • Can easily capture non-linear patterns
  • Disadvantages
    • Do not have same level of prediction accuracy
    • Not very robust