Homework 2: Intro to Regression
Introduction
In this homework, you will practice multiple linear regression by working with the data set Carseats
from the package ISLR2
. In addition, you will practice collaborating with a team over GitHub.
Learning goals
By the end of the homework, you will…
- Be able to collaborate with teammates on the same document using GitHub
- Gain practice writing a reproducible report using Quarto
- Fit and interpret linear models
- Split data using
tidymodels
- Compare and evaluate different linear models
Getting Started
The process of collaborating over GitHub is similar to when you completed your work alone. You will clone the repo to you computer, do your work, and then stage, commit, and push changes to GitHub. Once you push changes to GitHub, your partners should be able to then pull your changes to their computer.
Teams
You can find your team for this assignment on Canvas in the People section. The group set is called HW2. Your group will consist of 2-3 people and has been randomly generated. But first, some rules:
- You and your team members should be in the same physical room when you complete this assignment.
- You should all contribute to each problem, but to receive credit you must rotate who actually writes down the answers.
- In order to receive credit, you must have a commit after each exercise by the correct member of your team.
- For now, don’t try to edit the same document at the same time. We will cover that in later homeworks.
Clone the repo & start new RStudio project
The following directions will guide you through the process of setting up your homework to work as a group.
Exercise 0.1
In your group, decide on a team name. Then have one member of your group:
- Click this link to accept the assignment and enter your team name.
- Repeat the directions for creating a project from HW 1 with the HW 2 repository.
Once this is complete, the other two members can do the same thing, being careful to join the already created team on GitHub classroom.
Exercise 0.2
Have the first member of your group fill in the blanks below and then render, commit, and push the changes back to your GitHub repository.
- Team Name: [Make one up and enter here]
- Member 1: [Insert Name]
- Member 2: [Insert Name]
- Member 3: [Insert Name/Delete line if you only have two members]
The Data
This data set contains simulated data of child car seat sales at 400 different stores. Your prediction task in this homework will be to predict Sales
from some combination of the other variables. Look at the help documentation for this data to explore the available variables.
Exercise 1
All members of your group should install the ISLR2
package if they haven’t. Remember that you only need to do this once on your computer.
Have the second member of your group pull the changes made by member 1. Before completing this question.
Load the ISLR2
package. You should be able to access the dataset Carseats
. Based on your knowledge of the world, which features do you think will be most predictive of Sales
. Hint: ?Carseats
will give you more information on the data set and the variables.
Once you have completed this, have the second member render, commit, and push the changes to GitHub.
Data Splitting
Exercise 2
Have the second member of your group pull the changes made by member 1 and member 2. Before completing this question.
Use tidymodels
to create a training and test set from the Carseats
data using a 70-30 split. Note that this is a random process so you will get different partitions every time you split your data. As a result, it is considered good practice to set your seed so that the results a reproducible. For this homework please use the seed 427. For the training set, what quantitative variable is most highly correlated with Sales
, our target variable?
Once you have completed this, have the third member render, commit, and push the changes to GitHub.
Committing Changes
You should continue in this manner, rotating who completes each question in order from member 1 to member 2 to member 3 and back to member 1 for the remainder of the assignment. In order to receive credit, you must have a commit after each exercise by the correct member of your team.
Fitting Our First Model
In predictive modeling, we often begin with a simple baseline model, to which we compare other models. Any more complicated model must outperform the baseline model to be considered useful.
Exercise 3
Fit a linear regression model predicting Sales
using the variable you identified in Exercise 2. Write down the resulting model in the form: \[Price = \beta_0 + \beta_1\times Variable\] Don’t forget to use your training set rather than the full data to train your model.
Gradient Descent
Exercise 4 (Hard)
Re-estimate the parameters (i.e. the intercept and slope) for the model above using gradient descent. Your estimates should be similar to those in Exercise 3 but likely won’t be EXACTLY the same. Your solution should include the following:
- Initialize the values of your \(\beta\)’s.
- Initialize the step size.
- Initialize your tolerance (i.e. the stopping criteria).
- Enter a loop. For each iteration in the loop:
- Compute the partial derivatives for \(\beta_0\) and \(\beta_1\).
- Update the values of \(\beta_0\) and \(\beta_1\).
- Check to see if the change in your estimates is smaller than your tolerance. If it is, exit the loop, otherwise continue.
Evaluating Our Model
Let’s now see how our model performs.
Exercise 5
Compute the RMSE for your baseline model on both the training and test set and report them. Try and report your results inline.
The second primary metric that we can use to assess the accuracy of regression models is called the coefficient of determination, denoted \(R^2\). \(R^2\) is the proportion of variance (information) in our target variable that is explained by our model and can be computed by squaring \(R\), the correlation coefficient between the target variable \(y\) and the predicted target \(\hat{y}\). The lm
function actually computes the \(R^2\) of our training data for us which we can access using the glance
function from the broom
package which is included in tidymodels
so you don’t need to load it.
Exercise 6
What proportion of the variation in Sales
is explained by our baseline model for the training and validation sets?
Categorical Predictors
Let’s start to expand our model a bit by adding the predictor US
.
Exercise 7
Using tidymodels
, build a two-input linear model for Sales
by adding US
in addition to the variable you selected above. Save your model as lmfit1
. Use tidy
and kable
to output the model. Is the coefficient for Price
the same or different than it was in our baseline model?
Notice that the only coefficient added for US
is called USYes
. When you build a linear model with a categorical variable, R will introduce dummy variables which encode each category as a vector of 0’s and 1’s. In data science, this is sometimes called one-hot encoding. One level is always lumped into the intercept coefficient and is called the reference level. In this case, the reference level is No
. When including a categorical variable in a linear model, you can interpret the resulting line being shifted up or down based on the category of a given observation.
Let’s now assess the accuracy of our new model. To make computation of RMSE and \(R^2\) easier let’s take advantage of the rmse
and r2
functions in the yardstick
package (also included in tidymodels
).
Exercise 8
Use the rmse
and rsq
functions from theyardstick
package to compute the RMSE and \(R^2\) values for this new model on both the training and validation sets. How do these compare to the baseline model?
Overfitting and the Bias-Variance Trade-Off
Now let’s see what happens if we add in ALL of the predictors to our model. This is sometimes referred to as the full model. To include all of your predictors in a model you can use the syntax fit(Y ~ ., data)
.
Exercise 9
Fit a model using all of the predictors in your training data. Call the model lmfull
. Assess the model’s accuracy on the training data and the test data, comparing it to the previous models we’ve fit, and comment on your results.
You should notice that while the accuracy metrics on the training data drastically improve, there is little to no difference in the metrics for the test set. This is because of a phenomenon known as overfitting. Overfitting occurs when your model starts matching the training TOO well. A good visualization of an overfit model is Figure 2 in the Wikipedia article for overfitting. As you include more variables/information in your model, your performance will ALWAYS increase on your training data. This is one of the reasons we always use holdout sets. Eventually your model will begin to over-align to the noise in your training data and the accuracy on holdout sets will be level off and in most cases begin to degrade.
When modeling there are two related trade-offs that you need to consider. The first is the trade-off between prediction accuracy and interpretability. In general, one can typically create models with better prediction accuracy by sacrificing interpretability (e.g. by including more variables in your model, transforming these variables, etc.). Another trade-off is something called the bias-variance trade-off. As we increase the complexity of a model, we allow it to account for more and more intricate patterns in our data. In theory, this allows the model to mimic more complex relationships between our predictors and our target variables, reducing bias. On the other hand, more complex models typically have more parameters which need to be estimated which require more data to estimate accurately. When you increase the complexity of a model you usually increase the variance of the estimates of model parameters and the predictions the model makes. In other words, the model will be much more sensitive to the noise in the data that you have. Bias and variance will both decrease the accuracy of your model so you should try to minimize both. However, past a certain point, it will be a trade-off between the two.
Exercise 10
Find and fit a model on the training data which outperforms all the models we fit so far when evaluated on the test set. Briefly summarize your results. The “best” model will receive a high-five from Dr. Friedlander. Feel free to use any techniques we’ve learned in this class, up to this point. I encourage you to try out different data transformations like polynomials.