default | student | balance | income |
---|---|---|---|
No | No | 729.5265 | 44361.625 |
No | Yes | 817.1804 | 12106.135 |
No | No | 1073.5492 | 31767.139 |
No | No | 529.2506 | 35704.494 |
No | No | 785.6559 | 38463.496 |
No | Yes | 919.5885 | 7491.559 |
A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
We will consider `default` as the response variable.
Class counts and proportions in the full data:

default | n | percent |
---|---|---|
No | 9667 | 0.9667 |
Yes | 333 | 0.0333 |

In the training set:

default | n | percent |
---|---|---|
No | 5796 | 0.966 |
Yes | 204 | 0.034 |

And in the test set:

default | n | percent |
---|---|---|
No | 3871 | 0.96775 |
Yes | 129 | 0.03225 |
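These counts are consistent with a 60/40 train/test split of the 10,000 observations. A minimal sketch of how such a split and count tables might be produced, assuming the data frame is `Default` from the ISLR2 package (the seed is hypothetical, not from the original):

```r
library(tidymodels)
library(ISLR2)

set.seed(1)  # hypothetical seed; the original seed is unknown
default_split <- initial_split(Default, prop = 0.6)  # 6,000 train / 4,000 test
default_train <- training(default_split)
default_test  <- testing(default_split)

# class counts and proportions, as in the tables above
default_train |>
  count(default) |>
  mutate(percent = n / sum(n))
```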
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
- `fct` = factor, which is the data type you want to use for categorical data
- `as_factor()` will typically transform things (including numbers) into factors for you
- `chr` can also be used, but `factor`s are better because they store all possible levels for your categorical data
- `factor`s are helpful for plotting because you can reorder the levels to help you plot things
- `predict()` with a categorical response: documentation
- `kknn` actually takes a weighted average of the nearest neighbors; setting `weight_func = "rectangular"` gives the usual unweighted average
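A quick illustration of `as_factor()` (from the forcats package, loaded with the tidyverse); the vector here is made up for the example:

```r
library(forcats)

x <- c(10, 20, 10, 30)
as_factor(x)  # converts the numeric vector to a factor, keeping each distinct value as a level
```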
knn_default_unw_fit <- nearest_neighbor(neighbors = 10, weight_func = "rectangular") |>
set_engine("kknn") |>
set_mode("classification") |>
fit(default ~ balance, data = default_train) # fit 10-nn model
knn_uw_prob_preds <- predict(knn_default_unw_fit, new_data = default_test, type = "prob") # obtain predictions as probabilities
knn_uw_prob_preds |> filter(.pred_No * .pred_Yes > 0) |> head() |> kable()
.pred_No | .pred_Yes |
---|---|
0.9 | 0.1 |
0.9 | 0.1 |
0.9 | 0.1 |
0.9 | 0.1 |
0.7 | 0.3 |
0.5 | 0.5 |
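Hard class predictions (the majority vote among the 10 neighbors, i.e. the class with probability above 0.5) can be obtained from the same fitted object; this call is a sketch using the objects defined above:

```r
# class labels instead of probabilities
predict(knn_default_unw_fit, new_data = default_test, type = "class") |> head()
```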
Default_lr <- default_train |>
mutate(default_0_1 = if_else(default == "Yes", 1, 0))
lrfit <- linear_reg() |>
set_engine("lm") |>
fit(default_0_1 ~ balance, data = Default_lr) # fit SLR
lrfit |> predict(new_data = default_train) |> head() |> kable()
.pred |
---|
0.0316011 |
0.0655518 |
-0.0065293 |
0.0274263 |
0.0451629 |
0.0327046 |
Suppose we have a response, \[Y=\begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}\]
This coding implies an ordering of the outcomes and equal gaps between them, so a linear regression on \(Y\) is not appropriate for a nominal (unordered) response.
Consider a one-dimensional binary classification problem:
Fitting a logistic regression model with `default` as the response and `balance` as the predictor:
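The fitted object `logregfit` used below is not shown; a minimal sketch, assuming the same tidymodels workflow as the k-NN fit and the same training data:

```r
# hedged sketch: assumes default_train from earlier and the tidymodels packages
logregfit <- logistic_reg() |>
  set_engine("glm") |>  # standard glm engine for logistic regression
  set_mode("classification") |>
  fit(default ~ balance, data = default_train)
```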
For `balance` = $700:
predict(logregfit, new_data = tibble(balance = 700), type = "class") |> kable() # obtain class predictions
.pred_class |
---|
No |
predict(logregfit, new_data = tibble(balance = 700), type = "raw") |> kable() # obtain log-odds predictions
x |
---|
-6.819727 |
predict(logregfit, new_data = tibble(balance = 700), type = "prob") |> kable() # obtain probability predictions
.pred_No | .pred_Yes |
---|---|
0.9989092 | 0.0010908 |
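As a check, the probability prediction is the inverse logit of the log-odds prediction above:

```r
# inverse logit: p = 1 / (1 + exp(-x)); plogis() is base R's logistic CDF
plogis(-6.819727)  # approximately 0.0010908, matching .pred_Yes above
```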