As a model’s flexibility increases:
Suppose the CEO of a restaurant franchise is considering opening new outlets in different cities. They would like to expand their business to cities that give them higher profits with the assumption that highly populated cities will probably yield higher profits.
They have data on the population (in 100,000) and profit (in $1,000) at 97 cities where they currently have outlets.
tidyverse
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
tidyverse
is for manipulating and visualizing datatidyverse
is a meta-package meaning it is a collection of a bunch of other packagestidymodels
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
tidymodels
creates a unified framework for building models in Rscikit-learn
in PythonThis corresponds to the model:
\[ \begin{aligned} \text{Profit} &= -3.90 + 1.19\times\text{Population}\\ \hat{Y}_i &= -3.90 + 1.19X_i \end{aligned} \] i.e. \(\hat{\beta}_0 = -3.90\) and \(\hat{\beta}_1 = 1.19\)
tidymodels
Specify mathematical structure of model (e.g. linear regression, logistic regression)
Specify the engine for fitting the model. (e.g. lm
, stan
, glmnet
).
When required, declare the mode of the model (i.e. regression or classification).
tidymodels
# Usually put these at the top
library(tidymodels) # load tidymodels package
tidymodels_prefer() # avoid common conflicts
lm_model <- linear_reg() |> # Step 1
set_engine("lm") # Step 2
# Step 3 not required since linear regression can't be used for classification
# Fit the model
lm_model_fit <- lm_model |>
fit(profit ~ population, data = outlets)
tidymodels
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.895781 | 0.7194828 | -5.414696 | 5e-07 |
population | 1.193034 | 0.0797439 | 14.960806 | 0e+00 |
Same model as before:
\[ \begin{aligned} \text{Profit} &= -3.90 + 1.19\times\text{Population}\\ \hat{Y}_i &= -3.90 + 1.19X_i \end{aligned} \]
new_cities <- tibble(population = rnorm(100, 7, 3))
lm_model_fit |>
predict(new_data = new_cities) |>
kable()
.pred |
---|
4.2987846 |
4.3494028 |
1.6358812 |
2.8371943 |
8.6964415 |
1.2610114 |
4.4093826 |
7.0132467 |
8.9081681 |
-3.7261162 |
-0.3152552 |
-0.0578742 |
3.1103344 |
-1.4511631 |
3.2056127 |
7.2094157 |
5.2694317 |
9.8741950 |
5.3986612 |
5.7894116 |
6.5817409 |
6.4489003 |
1.5635853 |
-5.8145557 |
0.3096059 |
4.7217335 |
8.0782299 |
6.4065191 |
7.0495725 |
10.5119132 |
3.2281542 |
5.3756384 |
10.5424872 |
2.6444235 |
3.1483294 |
5.7659904 |
3.3551593 |
-1.1774409 |
9.5361487 |
-0.3336581 |
4.1063193 |
2.7105932 |
6.1253040 |
3.6662178 |
6.1309169 |
9.2088409 |
-2.1660582 |
4.9800410 |
5.5140171 |
3.3714057 |
5.0758817 |
2.9767839 |
4.4891785 |
-2.2982990 |
-2.0242922 |
4.4403419 |
4.0335077 |
6.8410161 |
3.6835670 |
-0.6670421 |
6.1270222 |
7.1611647 |
3.2563676 |
11.3135720 |
-0.8720586 |
7.4979481 |
5.4139803 |
7.3006650 |
3.3803037 |
5.1760916 |
4.7255493 |
5.3377241 |
8.0237716 |
5.0616701 |
-1.7427836 |
5.5563838 |
9.1823534 |
3.3094120 |
4.7223157 |
3.3083360 |
4.0882384 |
-0.3834378 |
0.9586050 |
6.1677633 |
-0.9134436 |
6.9813206 |
-1.8743103 |
10.6155073 |
1.7708313 |
10.9242269 |
4.7897426 |
5.2228631 |
5.1455980 |
-0.9701960 |
5.7970445 |
0.3254730 |
1.5828592 |
-1.7630027 |
-3.9350027 |
8.2198551 |
Note: New data must be a data frame with the same columns names as the training data
\[ \begin{aligned} Y&=f(\mathbf{X}) + \epsilon\\ &=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon \end{aligned} \] where \(\beta_j\) quantifies the association between the \(j^{th}\) predictor and the response.
size
is in square feetnum_bedrooms
is a countprice
is in $1,000’sSome Exploratory Data Analysis (EDA)
mlr_model <- linear_reg() |>
set_engine("lm")
house_price_mlr <- mlr_model |>
fit(price ~ size + num_bedrooms, data = house_prices) # fit the model
house_price_mlr |>
tidy() |> # produce result summaries of the model
kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 89.5977660 | 41.7674230 | 2.1451591 | 0.0374991 |
size | 0.1392106 | 0.0147951 | 9.4092391 | 0.0000000 |
num_bedrooms | -8.7379154 | 15.4506975 | -0.5655353 | 0.5745825 |
num_bedrooms
remaining fixed, an additional 1 square foot of size
leads to an increase in price
by approximately $139.20.size
remaining fixed, an additional bedroom leads to an decrease in price
of approximately $8,737.90.price
when size
is 2000 square feet for a house with 3 bedroomsSale_Price
: in dollarsGr_Liv_Area
: size in square feetBedroom_AbvGr
: number of bedrooms above grade