MATH 427: Classification + Logistic Regression

Eric Friedlander

Classification Problems

  • Response \(Y\) is qualitative (categorical).
  • Objective: build a classifier \(\hat{Y}=\hat{C}(\mathbf{X})\)
    • assigns a class label to future unlabeled (unseen) observations
    • understand the relationship between the predictors and response
  • Two ways to make predictions
    • Class probabilities
    • Class labels

Default Dataset

A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.

library(ISLR2)        # Default data set
library(tidymodels)   # initial_split(), parsnip model functions, tidy()
library(knitr)        # kable() for printing tables

head(Default) |> kable()  # print first six observations
default student balance income
No No 729.5265 44361.625
No Yes 817.1804 12106.135
No No 1073.5492 31767.139
No No 529.2506 35704.494
No No 785.6559 38463.496
No Yes 919.5885 7491.559

We will consider default as the response variable.

Split the data

set.seed(427)

default_split <- initial_split(Default, prop = 0.6, strata = default)
default_split
<Training/Testing/Total>
<6000/4000/10000>
default_train <- training(default_split)
default_test <- testing(default_split)

Summarizing our response variable

library(janitor)
Default |> tabyl(default) |> kable()  # class frequencies
default n percent
No 9667 0.9667
Yes 333 0.0333
default_train |> tabyl(default) |> kable()
default n percent
No 5796 0.966
Yes 204 0.034
default_test |> tabyl(default) |> kable()
default n percent
No 3871 0.96775
Yes 129 0.03225

Data Types in R

Default |> glimpse()
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
  • fct = factor, the data type you want to use for categorical data
  • as_factor() will typically transform things (including numbers) into factors for you
  • chr (character) can also be used, but factors are better because they store all possible levels of your categorical variable
  • factors are helpful for plotting because you can reorder their levels to control how categories are displayed (see the short sketch below)
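
A minimal sketch of that workflow using forcats (part of the tidyverse); the grades vector here is a made-up example, not part of the Default data:

library(forcats)   # as_factor(), fct_relevel()

grades <- c("low", "high", "medium", "low", "high")   # hypothetical character vector

grades_fct <- as_factor(grades)                    # becomes a factor; levels in order of appearance
fct_relevel(grades_fct, "low", "medium", "high")   # reorder levels, e.g. for plotting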

K-Nearest Neighbors Classifier

  • Given a value for \(K\) and a test data point \(x_0\): \[P(Y=j | X=x_0)=\dfrac{1}{K} \sum_{x_i \in \mathcal{N}_0} I(y_i = j)\] where \(\mathcal{N}_0\) is the set of the \(K\) “closest” neighbors.
  • For classification: neighbors “vote” for a class (unlike in regression, where predictions are obtained by averaging; a hand-rolled example follows below): \[P(Y=j | X=x_0)=\text{Proportion of neighbors in class }j\]
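
To make the vote concrete, here is a hand-rolled sketch of an unweighted 10-NN prediction for a single hypothetical test point (balance = 1500); this is for intuition only, not the kknn implementation used later:

library(dplyr)   # loaded with tidymodels; mutate(), slice_min(), count()

x0 <- 1500   # hypothetical test point: balance of $1,500

default_train |>
  mutate(dist = abs(balance - x0)) |>             # distance from each training point to x0
  slice_min(dist, n = 10, with_ties = FALSE) |>   # the K = 10 closest neighbors
  count(default, .drop = FALSE) |>                # votes for each class
  mutate(prob = n / sum(n))                       # estimated class probabilities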

K-Nearest Neighbors Classifier: Build Model

knn_default_fit <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(default ~ balance, data = default_train)   # fit 10-nn model
  • Why don’t I need to worry about centering and scaling?

K-Nearest Neighbors Classifier: Predictions

  • predict() with a categorical response: see the documentation for predict() on parsnip model fits
  • Two different ways of making predictions

Predicting a class

knn_class_preds <- predict(knn_default_fit, new_data = default_test, type = "class")   # obtain default class label predictions

knn_class_preds |> head() |> kable()
.pred_class
No
No
No
No
No
No

Predicting a probability

  • Can anyone pick out what’s wrong here? Hint: \(K = 10\)
knn_prob_preds <- predict(knn_default_fit, new_data = default_test, type = "prob")   # obtain predictions as probabilities
knn_prob_preds |> filter(.pred_No * .pred_Yes > 0) |> head() |> kable()   # show rows where both classes received nonzero probability
.pred_No .pred_Yes
0.8685 0.1315
0.8685 0.1315
0.8685 0.1315
0.8505 0.1495
0.7105 0.2895
0.4775 0.5225

I’ve been lying to you

  • kknn actually takes a weighted average of the nearest neighbors
    • I.e. closer observations get more weight
  • To use unweighted KNN, set weight_func = "rectangular"

Unweighted KNN

knn_default_unw_fit <- nearest_neighbor(neighbors = 10, weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(default ~ balance, data = default_train)   # fit 10-nn model

knn_uw_prob_preds <- predict(knn_default_unw_fit, new_data = default_test, type = "prob")   # obtain predictions as probabilities
knn_uw_prob_preds |> filter(.pred_No * .pred_Yes > 0) |> head() |> kable()   # show rows where both classes received nonzero probability
.pred_No .pred_Yes
0.9 0.1
0.9 0.1
0.9 0.1
0.9 0.1
0.7 0.3
0.5 0.5

Logistic Regression

Why Not Linear Regression?

Default_lr <- default_train |> 
  mutate(default_0_1 = if_else(default == "Yes", 1, 0))

lrfit <- linear_reg() |> 
  set_engine("lm") |> 
  fit(default_0_1 ~ balance, data = Default_lr)   # fit SLR

lrfit |> predict(new_data = default_train) |> head() |> kable()
.pred
0.0316011
0.0655518
-0.0065293
0.0274263
0.0451629
0.0327046

Why Not Linear Regression?

  • Linear regression does not model probabilities well
    • it might produce probabilities less than zero or bigger than one (see the check below)
    • it treats an increase from 0.41 to 0.50 the same as an increase from 0.01 to 0.10, even though the latter is a much larger relative change (bad)
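
A quick check of the first point using the linear fit above; the balance of 0 is just an illustrative input:

predict(lrfit, new_data = tibble(balance = 0))   # fitted "probability" is negative, like the third prediction above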

Why Not Linear Regression?

Suppose we have a response, \[Y=\begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}\]

  • Linear regression suggests an ordering, and in fact implies that the differences between classes have meaning
    • e.g. drug overdose \(-\) stroke \(= 1\)? 🤔

Logistic Regression

Consider a one-dimensional binary classification problem:

  • Transform the linear model \(\beta_0 + \beta_1 \ X\) so that the output is a probability
  • Use the logistic function (checked numerically below): \[g(t)=\dfrac{e^t}{1+e^t} \ \ \ \text{for} \ t \in \mathbb{R}\]
  • Then: \[p(X)=P(Y=1|X)=g\left(\beta_0 + \beta_1 \ X\right)=\dfrac{e^{\beta_0 + \beta_1 \ X}}{1+e^{\beta_0 + \beta_1 \ X}}\]
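
As a quick numerical check, base R’s plogis() is the logistic CDF, i.e. exactly the function \(g\) above:

t <- c(-2, 0, 3)
exp(t) / (1 + exp(t))   # g(t) computed directly
plogis(t)               # identical values from the built-in logistic CDF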

Other important quantities

  • Odds: \(\dfrac{p(x)}{1-p(x)}\)
  • Log-Odds (logit): \(\log\left(\dfrac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 \ X\)
    • Linear function of the predictors (derivation below)
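
The log-odds fact follows from one line of algebra: dividing \(p(X)\) by \(1-p(X)\) cancels the common denominator. \[\dfrac{p(X)}{1-p(X)} = \dfrac{e^{\beta_0 + \beta_1 X}\big/\left(1+e^{\beta_0 + \beta_1 X}\right)}{1\big/\left(1+e^{\beta_0 + \beta_1 X}\right)} = e^{\beta_0 + \beta_1 X} \quad\Rightarrow\quad \log\left(\dfrac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X\]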

Logistic Regression

Fitting the model

Fitting a logistic regression model with default as the response and balance as the predictor:

logregfit <- logistic_reg() |> 
  set_engine("glm") |> 
  fit(default ~ balance, data = default_train)   # fit logistic regression model

tidy(logregfit) |> kable()  # obtain results
term estimate std.error statistic p.value
(Intercept) -10.6926385 0.4659035 -22.95033 0
balance 0.0055327 0.0002841 19.47329 0

Interpreting Coefficients

  • As \(X\) increases by 1, the log-odds increase by \(\hat{\beta}_1\) (equivalently, the odds are multiplied by \(e^{\hat{\beta}_1}\))
    • I.e. the probability of default increases, but NOT linearly
    • The change in the probability of default due to a one-unit change in balance depends on the current balance value (see the sketch below)
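
To see the non-linearity concretely, here is a small sketch comparing how the same $1 increase in balance shifts the predicted probability at two different (arbitrary) starting balances:

predict(logregfit,
        new_data = tibble(balance = c(1000, 1001, 2000, 2001)),
        type = "prob")   # same $1 step, different change in .pred_Yes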

Making predictions: Theory

For balance = $700:

  • \[\hat{p}(X)=\dfrac{e^{\hat{\beta}_0+\hat{\beta}_1 X}}{1+e^{\hat{\beta}_0+\hat{\beta}_1 X}}=\dfrac{e^{-10.69 + (0.005533 \times 700)}}{1+e^{-10.69 + (0.005533 \times 700)}}=0.0011\]
  • \[\textbf{Odds}(X) = \dfrac{\hat{p}(X)}{1-\hat{p}(X)} = \dfrac{0.0011}{1-0.0011}\approx 0.0011\]
  • \[\textbf{Log-Odds}(X)=\log\left(\dfrac{\hat{p}(X)}{1-\hat{p}(X)}\right) = \log(0.0011) \approx -6.81\]

Making predictions in R

predict(logregfit, new_data = tibble(balance = 700), type = "class") |> kable()   # obtain class predictions
.pred_class
No
predict(logregfit, new_data = tibble(balance = 700), type = "raw") |> kable()   # obtain log-odds predictions
x
-6.819727
predict(logregfit, new_data = tibble(balance = 700), type = "prob") |> kable()  # obtain probability predictions
.pred_No .pred_Yes
0.9989092 0.0010908
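
As a consistency check, applying the logistic function to the raw (log-odds) prediction recovers the probability prediction above:

plogis(-6.819727)   # logistic of the log-odds; matches .pred_Yes of about 0.0011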