| default | student | balance | income |
|---|---|---|---|
| No | No | 729.5265 | 44361.625 |
| No | Yes | 817.1804 | 12106.135 |
| No | No | 1073.5492 | 31767.139 |
| No | No | 529.2506 | 35704.494 |
| No | No | 785.6559 | 38463.496 |
| No | Yes | 919.5885 | 7491.559 |
A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
We will consider default as the response variable.
Class proportions of default in the full data set (n = 10,000):

| default | n | percent |
|---|---|---|
| No | 9667 | 0.9667 |
| Yes | 333 | 0.0333 |
In the training set (n = 6,000):

| default | n | percent |
|---|---|---|
| No | 5796 | 0.966 |
| Yes | 204 | 0.034 |
In the test set (n = 4,000):

| default | n | percent |
|---|---|---|
| No | 3871 | 0.96775 |
| Yes | 129 | 0.03225 |
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
- `fct` = factor, which is the data type you want to use for categorical data
- `as_factor()` will typically transform things (including numbers) into factors for you
- `chr` can also be used, but factors are better because they store all possible levels for your categorical data
- factors are helpful for plotting because you can reorder the levels to help you plot things
- `predict()` with a categorical response: documentation
- `kknn` actually takes a weighted average of the nearest neighbors
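A minimal sketch of the factor behavior described above (the vector `x` here is a made-up example, not from the Default data):

```r
library(forcats)

x <- c("Yes", "No", "No", "Yes", "No")
f <- as_factor(x)     # convert character to factor
levels(f)             # the factor stores its possible levels: "Yes", "No"
fct_relevel(f, "No")  # reorder levels, e.g. to control plotting order
```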
To make the 10 nearest neighbors count equally (unweighted), set `weight_func = "rectangular"`:

knn_default_unw_fit <- nearest_neighbor(neighbors = 10, weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(default ~ balance, data = default_train) # fit 10-nn model
knn_uw_prob_preds <- predict(knn_default_unw_fit, new_data = default_test, type = "prob") # obtain predictions as probabilities
knn_uw_prob_preds |> filter(.pred_No * .pred_Yes > 0) |> head() |> kable()

With k = 10 unweighted neighbors, each predicted probability is a count out of 10, so the predictions are multiples of 0.1.

| .pred_No | .pred_Yes |
|---|---|
| 0.9 | 0.1 |
| 0.9 | 0.1 |
| 0.9 | 0.1 |
| 0.9 | 0.1 |
| 0.7 | 0.3 |
| 0.5 | 0.5 |
Default_lr <- default_train |>
  mutate(default_0_1 = if_else(default == "Yes", 1, 0)) # code default as 0/1
lrfit <- linear_reg() |>
  set_engine("lm") |>
  fit(default_0_1 ~ balance, data = Default_lr) # fit simple linear regression
lrfit |> predict(new_data = default_train) |> head() |> kable()

| .pred |
|---|
| 0.0316011 |
| 0.0655518 |
| -0.0065293 |
| 0.0274263 |
| 0.0451629 |
| 0.0327046 |
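Note the negative value in the third row: a linear model fit to a 0/1 response can produce "probabilities" below 0 or above 1. A quick check, reusing the `lrfit` object above (a sketch; the exact count will depend on the data split):

```r
lrfit |>
  predict(new_data = Default_lr) |>
  filter(.pred < 0 | .pred > 1) |> # predictions outside [0, 1]
  nrow()                           # how many invalid "probabilities"?
```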
Suppose we have a response, \[Y=\begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}\]
Consider a one-dimensional binary classification problem:
Fitting a logistic regression model with default as the response and balance as the predictor:
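The fitted object `logregfit` used below is not shown in these notes; a sketch of how it would be created with tidymodels, assuming the same `default_train` split as above:

```r
logregfit <- logistic_reg() |>
  set_engine("glm") |>              # standard glm logistic regression
  set_mode("classification") |>
  fit(default ~ balance, data = default_train)
```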
For balance = $700:

predict(logregfit, new_data = tibble(balance = 700), type = "class") |> kable() # obtain class predictions

| .pred_class |
|---|
| No |
predict(logregfit, new_data = tibble(balance = 700), type = "raw") |> kable() # obtain log-odds predictions

| x |
|---|
| -6.819727 |
predict(logregfit, new_data = tibble(balance = 700), type = "prob") |> kable() # obtain probability predictions

| .pred_No | .pred_Yes |
|---|---|
| 0.9989092 | 0.0010908 |
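The probability prediction is consistent with the log-odds prediction above: applying the logistic function to -6.819727 recovers .pred_Yes. A quick check in base R:

```r
log_odds <- -6.819727
plogis(log_odds)         # logistic function: 1 / (1 + exp(-log_odds)) ~ 0.0010908
1 / (1 + exp(-log_odds)) # same computation written out
```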