MATH 427: Simple Linear Regression + Gradient Descent

Eric Friedlander

Concept Check

What do we call these:

  • \(\mathbf{X}\)
  • \(Y\)

What is our goal in supervised learning?

Concept Check

  • Features: \(\mathbf{X}\)
  • Target: \(Y\)
  • Goal: Predict \(Y\) using \(\mathbf{X}\)
  • What do we call this process if \(Y\) is numerical? What about categorical?

Concept Check

  • What do we call this process if \(Y\) is numerical? What about categorical?
    • Numerical: regression
    • Categorical: classification
  • What is the difference between a training and a test set?
  • How do we evaluate the performance of a regression model?

Bias-Variance Trade-Off

Supervised Learning: Assessing Model Performance

  • Labeled training data \((x_1,y_1), (x_2, y_2), \ldots, (x_n,y_n)\)
    • i.e. \(n\) training observations
  • Fit/train a model from training data
    • \(\hat{y}=\hat{f}(x)\), regression
    • \(\hat{y}=\hat{C}(x)\), classification
  • Obtain estimates \(\hat{f}(x_1), \hat{f}(x_2), \ldots, \hat{f}(x_n)\) (or, \(\hat{C}(x_1), \hat{C}(x_2), \ldots, \hat{C}(x_n)\)) of training data
  • Compute error:
    • Regression \[\text{Training MSE}=\text{Average}_{Training} \left(y-\hat{f}(x)\right)^2 = \frac{1}{n} \displaystyle \sum_{i=1}^{n} \left(y_i-\hat{f}(x_i)\right)^2\]
    • Classification \[ \begin{aligned} \text{Training Error Rate} &=\text{Average}_{Training} \ \left[I \left(y\ne\hat{C}(x)\right) \right]\\ &= \frac{1}{n} \displaystyle \sum_{i=1}^{n} \ I\left(y_i \ne \hat{C}(x_i)\right) \end{aligned} \]

Supervised Learning: Assessing Model Performance

  • In general, not interested in performance on training data
  • Want: performance on unseen test data… why?
  • Fresh test data: \((x_1^{test},y_1^{test}), (x_2^{test},y_2^{test}), \ldots, (x_m^{test},y_m^{test})\).
  • Compute test error:
    • Regression \[\text{Test MSE}=\text{Average}_{Test} \left(y-\hat{f}(x)\right)^2 = \frac{1}{m} \displaystyle \sum_{i=1}^{m} \left(y_i^{test}-\hat{f}(x_i^{test})\right)^2\]
    • Classification \[\text{Test Error Rate}=\text{Average}_{Test} \ \left[I \left(y\ne\hat{C}(x)\right) \right]= \frac{1}{m} \displaystyle \sum_{i=1}^{m} \ I\left(y_i^{test} \ne \hat{C}(x_i^{test})\right)\]

Supervised Learning: Bias-Variance Trade-off

  • Model fit on training data \(\hat{f}(x)\)
  • “True” relationship: \(Y=f(x)+\epsilon\)
  • \((x_0^{test}, y_0^{test})\): test observation
  • Bias-Variance Trade-Off (Theoretical) \[\underbrace{E\left(y_0^{test}-\hat{f}(x_0^{test})\right)^2}_{total \ error}=\underbrace{Var\left(\hat{f}(x_0^{test})\right)}_{source \ 1} + \underbrace{\left[Bias\left(\hat{f}(x_0^{test})\right)\right]^2}_{source \ 2}+\underbrace{Var(\epsilon)}_{source \ 3}\] where \(Bias\left(\hat{f}(x_0)\right)=E\left(\hat{f}(x_0)\right)-f(x_0)\)
  • Question: Where is \(\hat{y}_0^{test}\)?

Supervised Learning: Bias-Variance Trade-off

  • Reducible Error:
    • Source 1: how \(\hat{f}(x)\) varies among different randomly selected possible training data (Variance)
    • Source 2: how \(\hat{f}(x)\) (when predicting the test data) differs from its target \(f(x)\) (Bias)
  • Irreducible Error:
    • Source 3: how \(y\) differs from “true” \(f(x)\)

Bias-Variance Trade-off: Example

  • For now: focus on regression problems (ideas extend to classification)
  • Consider: three different examples of simulated “toy” datasets and three types of models (\(\hat{f}_i(.)\))
    • Linear Regression orange
    • Smoothing Spline 1 blue
    • More flexible Smoothing Spline 2 green
  • “True” (simulated) function \(f(.)\) black
  • Training Error
  • Test Error

Bias-Variance Trade-off: Example

From ISLR2

Bias-Variance Trade-off: Example

From ISLR2

Bias-Variance Trade-off: Example

From ISLR2

Bias-Variance Trade-off: Example

From ISLR2

Linear Regression

A Familiar Supervised Learning Model

  • Assume relationship between \(\mathbf{X}\) and \(Y\) is: \[Y=f(\mathbf{X}) + \epsilon\] where \(\epsilon\) is a random error term (includes measurement error, other discrepancies) independent of \(\mathbf{X}\) and has mean zero.
  • Objective: To approximate/estimate \(f(\mathbf{X})\)
  • Linear Regression: assume that \(f(\mathbf{X})\) is a linear function of \(\mathbf{X}\)
    • For \(p=1\): \(f(\mathbf{X}) = \beta_0 + \beta_1 X_1\)
    • For \(p > 1\): \(f(\mathbf{X}) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\)

Linear Regression

Suppose the CEO of a restaurant franchise is considering opening new outlets in different cities. They would like to expand their business to cities that give them higher profits with the assumption that highly populated cities will probably yield higher profits.

They have data on the population (in 100,000) and profit (in $1,000) at 97 cities where they currently have outlets.

outlets <- readRDS("../data/outlets.rds")   # load dataset
head(outlets) |> kable() # first six observations of the dataset
population profit
6.1101 17.5920
5.5277 9.1302
8.5186 13.6620
7.0032 11.8540
5.8598 6.8233
8.3829 11.8860

Linear Regression

  • Objective: choose “best” \(\beta_0\) and \(\beta_1\) if we assume \[\text{profit} = \beta_0 + \beta_1\times\text{population}\]
  • Once this is done, we can:
    • predict the profit for a new city with a given population
    • understand the relationship between population and profit better

Outlet EDA

outlets |> ggpairs() # ggpairs is from GGAlley package

Linear Regression: Estimating Parameters

  • Suppose we have \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
  • Observed response: \(y_i\) for \(i=1,\ldots,n\)
  • Predicted response: \(\hat{y}_i\) for \(i=1, \ldots, n\)
  • Residual: \(e_i = \hat{y}_i - y_i\) for \(i=1, \ldots, n\)
  • Mean Squared Error (MSE): \(MSE =\dfrac{e^2_1+e^2_2+\ldots+e^2_n}{n}\) also known as the loss/cost function
  • GOAL: Find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) which minimizes \(MSE\)

Gradient Descent Algorithm

  • How do we minimize the \(MSE\)?
    • Can be done “analytically” but most ML models can’t be fit that way
  • Today: popular optimization algorithm called gradient descent.
  • NOTE: Gradient Descent is not a machine learning model/algorithm. It is an optimization technique that helps to fit machine learning models.

Gradient Descent Algorithm

  • Think of the MSE as a function of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
    • \(\text{MSE} = \frac{\sum (y_i - \hat{\beta}_0 - \hat{\beta}_1)^2}{n}\)

Gradient Descent Algorithm

Updates to the parameter: \[ \begin{aligned} \text{new value of parameters} &= \text{old value of parameters}\\ &\qquad- \text{step size} \times \text{gradient of function w.r.t. parameters} \end{aligned} \]

Gradient Descent Algorithm

Gradient Descent Algorithm

Step size too small

Step size too too big

Gradient Descent for Linear Regression

  • Objective: We want to find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) which minimizes
  • \[\begin{aligned} MSE &= \dfrac{e^2_1+e^2_2+\ldots+e^2_n}{n}\\ &= \dfrac{(\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \ldots + (\hat{y}_n - y_n)^2}{n} \end{aligned}\]
  • \[MSE = \dfrac{(\hat{\beta}_0 + \hat{\beta}_1 \ x_1 - y_1)^2 + (\hat{\beta}_0 + \hat{\beta}_1 \ x_2 - y_2)^2 + \ldots + (\hat{\beta}_0 + \hat{\beta}_1 \ x_n - y_n)^2}{n}\]
  • \[MSE = \displaystyle \dfrac{1}{n}\sum_{i=1}^{n} (\hat{\beta}_0 + \hat{\beta}_1 \ x_i - y_i)^2 = \displaystyle \dfrac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \displaystyle \dfrac{1}{n}\sum_{i=1}^{n} (e_i)^2\]

Gradient Descent for Linear Regression

  • To compute gradient, need partial derivatives of \(MSE\) with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
  • \[\text{gradient of MSE with respect to} \ \hat{\beta}_0 = \dfrac{2}{n} \displaystyle \sum_{i=1}^{n} (\hat{\beta}_0 + \hat{\beta}_1 \ x_i - y_i) = \dfrac{2}{n} \displaystyle \sum_{i=1}^{n} (\hat{y}_i - y_i)\]
  • \[\text{gradient of MSE with respect to} \ \hat{\beta}_1 = \dfrac{2}{n} \displaystyle \sum_{i=1}^{n} x_i (\hat{\beta}_0 + \hat{\beta}_1 \ x_i - y_i) = \dfrac{2}{n} \displaystyle \sum_{i=1}^{n} x_i (\hat{y}_i - y_i)\]

Gradient Descent for Linear Regression

  • Gradient descent update:
    • For \(\hat{\beta}_0\): \[ \begin{aligned} \hat{\beta}_0 \ \text{(new)} &= \hat{\beta}_0 \ \text{(old)}- \bigg(\text{step size} \times \text{derivative w.r.t} \ \hat{\beta}_0\bigg)\\ &= \hat{\beta}_0 \ \text{(old)}- \bigg(\text{step size} \times \dfrac{2}{n} \displaystyle \sum_{i=1}^{n} (\hat{y}_i - y_i)\bigg) \end{aligned} \]
    • For \(\hat{\beta}_1\): \[ \begin{aligned} \hat{\beta}_1 \ \text{(new)} &= \hat{\beta}_1 \ \text{(old)} - \bigg(\text{step size} \times \text{derivative w.r.t} \ \hat{\beta}_1 \bigg)\\ &= \hat{\beta}_1 \ \text{(old)} - \bigg(\text{step size} \times \dfrac{2}{n} \displaystyle \sum_{i=1}^{n} x_i (\hat{y}_i - y_i)\bigg) \end{aligned} \]