Homework 3: \(K\)-Nearest Neighbors
Introduction
In this homework, you will practice applying the \(K\)-Nearest Neighbors (KNN) method, which is capable of performing both classification and regression. You will also practice collaborating with a team over GitHub.
Learning goals
By the end of the homework, you will…
- Be able to work simultaneously with teammates on the same document using GitHub
- Fit and interpret KNN models in both regression and classification settings
- Compare and evaluate different KNN models
Getting Started
In last week's homework, you learned how to share your work using GitHub, but only while working on the same document at different times. This week we will learn how to work on the same document simultaneously.
Teams & Rules
You can find your team for this assignment on Canvas in the People section. The group set is called HW3. Your group will consist of 2-3 people and has been randomly generated. But first, some rules:
- You and your team members should be in the same physical room for Exercises 0.1 and 0.2. After that, you are welcome to divide up the work as you see fit. Please note that most of these problems must be done sequentially, except for Exercises 10-12, which can be done independently of Exercises 1-9.
- You are all responsible for understanding the work that you turn in.
- In order to receive credit, each team member must make a roughly equal contribution. Since this project has 12 exercises (not including Exercise 0), each team member should commit and push at least 4 different exercises.
- If you are working on the same document simultaneously, make sure to render, commit, push, and pull FREQUENTLY.
- If you encounter a merge error that you don’t know how to fix, contact Dr. Friedlander as soon as possible. I recommend starting this assignment early so there is time for Dr. Friedlander to help you resolve any problems before the deadline.
Clone the repo & start new RStudio project
The following directions will guide you through the process of setting up your homework to work as a group.
Exercise 0.1
In your group, decide on a team name. Then have one member of your group:
- Click this link to accept the assignment and enter your team name.
- Repeat the directions for creating a project from HW 1 with the HW 3 repository.
Once this is complete, the other members can do the same thing, being careful to join the already created team on GitHub classroom.
Exercise 0.2
We will now learn how to collaborate on the same document at the same time by creating a merge conflict. You can read more about it here and here.
- Have Members 1, 2, and 3 each write a different team name below and render the document.
- Have Member 1 add, commit, and push changes.
- Have Member 2 add, commit, and push changes. This should cause an error.
- Have Member 2 pull changes from the remote repo, which should generate something like this:

      CONFLICT (content): Merge conflict in 03-hw-knn.qmd

- Have Member 2 open the `.qmd` file. You should see something like the third block of code in Section 22.4 of this link.
- Have Member 2 edit the document so that it has only their team name, then render, commit, and push. This should not cause an error.
- Have Member 3 repeat steps 3-6.
- Agree on what you want your team name to actually be and have Member 1 repeat steps 3-6.
- Note that you will only generate a merge conflict if you make edits to the same line of code. If you are working on the same document simultaneously but are editing different portions, git should automatically merge for you.
- Team Name: [Insert Name]
- Member 1: [Insert Name]
- Member 2: [Insert Name]
- Member 3: [Insert Name/Delete line if you only have two members]
\(K\)-Nearest Neighbors
The basic idea is that predictions are made based on the \(K\) observations in the training data which are “closest” to the observation that we’re making a prediction for. While many different metrics (i.e., measures of distance) can be used, we will work exclusively with the Euclidean metric: \[\text{dist}(x, y) = \sqrt{\sum_{i=1}^p(x_i-y_i)^2}\] for vectors \(x = (x_1,\ldots, x_p)\) and \(y = (y_1,\ldots,y_p)\).
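For example, here is the distance between the (hypothetical) points \((3, 1)\) and \((2.8, 1.2)\) computed in R, just to illustrate the formula:

```r
x <- c(3.0, 1.0)
y <- c(2.8, 1.2)
sqrt(sum((x - y)^2))  # Euclidean distance: 0.2828427
```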
KNN for Classification
We’ll start by using KNN for classification.
Data
We will be working with the famous `iris` data set, which consists of four measurements (in centimeters) for 150 plants belonging to three species of iris. This data set was first published in a classic 1936 paper by English statistician, and notable racist/eugenicist, Ronald Fisher. In that paper, multivariate linear models were applied to classify these plants. Of course, back then, model fitting was an extremely laborious process that was done without the aid of calculators or statistical software.
Exercise 1
Import the `datasets` package, take a look at the columns of `iris`, and split your data into training and test sets using a 70-30 split. IMPORTANT: Make sure that each species is represented proportionally in the training set by using the `strata` argument in the `initial_split` function! Once again, set your seed to 427.
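A minimal sketch of one way to do this with the tidymodels packages (object names like `iris_split` are our own choices, not required by the assignment):

```r
library(tidymodels)

set.seed(427)
# Stratify on Species so each species appears proportionally in both sets
iris_split <- initial_split(iris, prop = 0.7, strata = Species)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)
```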
Exercise 2
Create a scatter plot of your training set with `Sepal.Width` and `Petal.Width` on the x- and y-axes, respectively, and color the points by `Species`.
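Assuming the `iris_train` object from the sketch above, the plot might look something like:

```r
ggplot(iris_train, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point()
```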
Exercise 3
As the name suggests, the \(K\)-nearest neighbors (KNN) method classifies a point based on the classification of the observations in the training set that are nearest to that point. If \(k > 1\), then the neighbors essentially “vote” on the classification of the point. Using only your graph, if \(k = 1\), how would KNN classify a flower that had sepal width 3cm and petal width 1cm?
Exercise 4
Just to verify that we are correct, find the sepal width, petal width, and species of the observation in your training set that is closest to our flower with sepal width 3 cm and petal width 1 cm. This should be done by computing the Euclidean distance of (3, 1) to each observation and then sorting the resulting tibble to get the row with the smallest distance. Don't worry about normalizing the data.
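One possible approach, assuming `iris_train` from the earlier sketch:

```r
# Distance from each training observation to the point (3, 1)
iris_train |>
  mutate(dist = sqrt((Sepal.Width - 3)^2 + (Petal.Width - 1)^2)) |>
  arrange(dist) |>
  slice(1) |>
  select(Sepal.Width, Petal.Width, Species, dist)
```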
Exercise 5
Create a `recipe` to center and scale your data sets using the mean and standard deviation from your training set.
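A sketch using `step_center()` and `step_scale()`, assuming the model will use the two predictors from your plot (`iris_rec` is a hypothetical name):

```r
iris_rec <- recipe(Species ~ Sepal.Width + Petal.Width, data = iris_train) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
```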
Exercise 6
Create a `workflow` that includes the recipe above and a KNN model using `weight_func = "rectangular"` with 1 neighbor, then fit the model to your training set.
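A sketch, assuming `iris_rec` and `iris_train` from the earlier sketches and that the `kknn` engine package is installed:

```r
knn_spec <- nearest_neighbor(mode = "classification",
                             neighbors = 1,
                             weight_func = "rectangular") |>
  set_engine("kknn")

knn_fit <- workflow() |>
  add_recipe(iris_rec) |>
  add_model(knn_spec) |>
  fit(data = iris_train)
```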
We would like to understand how the method of \(K\)-nearest neighbors will classify points in the plane. That is, we would like to view the decision boundaries of this model. To do this, we will use our model to classify a large grid of points in the plane and color them by their classification. The code below creates a data frame called `grid` consisting of 62,750 points in the plane.
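```r
# g1 <- rep((200:450) * (1/100), 250)
# g2 <- rep((0:250) * (1/100), each = 250)
# grid <- tibble(x1 = g1,
#                x2 = g2)
```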
Exercise 7
Uncomment the code above, and change the variable names so that it will work with your model. Classify the points in `grid` using your training data and \(k = 1\). Then, plot the points in `grid` colored by their classification. Make sure your code is written so that the grid points are being centered and scaled before predictions are made for them.
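One way to do this, assuming `knn_fit` from the Exercise 6 sketch: because the recipe is part of the fitted workflow, `augment()` centers and scales the grid using the training-set statistics automatically before predicting.

```r
# Assuming grid's columns have been renamed to match the model's predictors:
# grid <- grid |> rename(Sepal.Width = x1, Petal.Width = x2)
grid_preds <- augment(knn_fit, new_data = grid)

ggplot(grid_preds, aes(x = Sepal.Width, y = Petal.Width, color = .pred_class)) +
  geom_point(size = 0.5)
```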
Exercise 8
Notice that the decision boundary between `versicolor` and `virginica` looks a little strange. What do you observe? Why do you think this is happening? Does using \(k = 2\) make things better or worse? Why do you think that is?
Exercise 9
Determine which value of \(k\), the number of neighbors selected, gives the highest accuracy on the test set. Test all \(k\)s between 1 and 40. Note that there may be ties because our data set is a little bit too small. To break ties, just choose the smallest \(k\) among the ones which are tied. Hint: A for loop may be helpful. What is the accuracy of the model you ended up choosing?
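A sketch of the loop, assuming the objects from the earlier sketches; note that `which.max()` returns the smallest index among ties, which matches the tie-breaking rule above:

```r
accuracies <- numeric(40)
for (k in 1:40) {
  spec_k <- nearest_neighbor(mode = "classification",
                             neighbors = k,
                             weight_func = "rectangular") |>
    set_engine("kknn")
  fit_k <- workflow() |>
    add_recipe(iris_rec) |>
    add_model(spec_k) |>
    fit(data = iris_train)
  accuracies[k] <- augment(fit_k, new_data = iris_test) |>
    accuracy(truth = Species, estimate = .pred_class) |>
    pull(.estimate)
}
which.max(accuracies)  # smallest k among the tied best values
```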
Awesome!! Your model probably did pretty well, because KNN performs really well on the `iris` data set. However, this isn't a very challenging data set for most classification methods. More challenging data sets have variables on different scales and class imbalance, where very few observations belong to a particular class.
KNN for Regression
For regression, we can predict the response variable for our point to be the average (or sometimes median) of the response variable for the \(K\)-nearest neighbors.
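In symbols, if \(\mathcal{N}_K(x)\) denotes the set of \(K\) training observations closest to \(x\), the mean-based prediction is \[\hat{f}(x) = \frac{1}{K}\sum_{i \in \mathcal{N}_K(x)} y_i.\]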
Data & Dummy Variables
For this portion of the homework, we'll use the `Carseats` data from the `ISLR2` package. Frequently, when working with categorical data, you will be required to transform that data into dummy variables. Namely, you'll create a new variable for each category, which gets a 1 if the corresponding observation is from that category and a 0 otherwise. In data science, this format is sometimes referred to as one-hot encoding.
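For a concrete illustration using the `ShelveLoc` column of `Carseats` (which has levels Bad, Good, and Medium):

```r
library(ISLR2)

# step_dummy() replaces ShelveLoc with indicator columns; by default the
# first level (Bad) is dropped as redundant, so Bad is the reference level.
recipe(Sales ~ ShelveLoc, data = Carseats) |>
  step_dummy(ShelveLoc) |>
  prep() |>
  bake(new_data = NULL) |>
  head()
```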
Exercise 10
Look at the `Carseats` data from the `ISLR2` package. Then, split the data into a training and test set using a 70-30 split and a seed of 427 (as usual).
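A sketch, mirroring Exercise 1 (object names are hypothetical):

```r
library(ISLR2)

set.seed(427)
carseats_split <- initial_split(Carseats, prop = 0.7)
carseats_train <- training(carseats_split)
carseats_test  <- testing(carseats_split)
```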
Exercise 11
Create a recipe which first converts all categorical variables into dummy variables using `step_dummy()`, then centers and scales all of the predictors based on the training data.
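A sketch, assuming `carseats_train` from the previous sketch; `step_normalize()` is equivalent to centering and then scaling:

```r
carseats_rec <- recipe(Sales ~ ., data = carseats_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())
```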
Exercise 12
Fit a KNN model to predict `Sales` from the data we have. Fit your model on the training data and use the test set to choose the appropriate variables and the number of neighbors to include. You may find it useful to plot the \(R^2\) and RMSE against the number of neighbors you include in your model. You may find that the RMSE and \(R^2\) disagree on what the best model is, in which case you will have to make a judgment call about which model is "best". One thing that can be helpful is looking at plots of your target variable (`Sales` in this case) against the model residuals.
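A sketch of one way to compare values of \(k\), assuming the objects from the sketches above and using all predictors (you may well choose a different subset):

```r
results <- tibble(k = integer(), rmse = numeric(), rsq = numeric())
for (k in 1:40) {
  spec_k <- nearest_neighbor(mode = "regression",
                             neighbors = k,
                             weight_func = "rectangular") |>
    set_engine("kknn")
  fit_k <- workflow() |>
    add_recipe(carseats_rec) |>
    add_model(spec_k) |>
    fit(data = carseats_train)
  preds <- augment(fit_k, new_data = carseats_test)
  results <- results |>
    add_row(k = k,
            rmse = rmse(preds, truth = Sales, estimate = .pred)$.estimate,
            rsq  = rsq(preds, truth = Sales, estimate = .pred)$.estimate)
}

# Plot both metrics against k to inform the judgment call
ggplot(results, aes(x = k, y = rmse)) + geom_line()
ggplot(results, aes(x = k, y = rsq)) + geom_line()
```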