Homework 3: \(K\)-Nearest Neighbors
Introduction
In this homework, you will practice applying the \(K\)-Nearest Neighbors (KNN) method, which is capable of performing both classification and regression. You will also practice collaborating with a team over GitHub.
Learning goals
By the end of the homework, you will…
- Be able to work simultaneously with teammates on the same document using GitHub
- Fit and interpret KNN models in both regression and classification settings
- Compare and evaluate different KNN models
Getting Started
In last week's homework, you learned how to share your work using GitHub, but only while working on the same document at different times. This week we will learn how to work on the same document simultaneously.
Teams & Rules
You can find your team for this assignment on Canvas in the People section. The group set is called HW3. Your group will consist of 2-3 people and has been randomly generated. But first, some rules:
- You and your team members should be in the same physical room for Exercises 0.1 and 0.2. After that, you are welcome to divide up the work as you see fit. Please note that most of these problems must be done sequentially, except for Exercises 10-12, which can be done independently of Exercises 1-9.
- You are all responsible for understanding the work that you turn in.
- In order to receive credit, each team member must make a roughly equal contribution. Since this project has 12 exercises (not including Exercise 0), each team member should commit and push at least 4 different exercises.
- If you are working on the same document simultaneously, make sure to render, commit, push, and pull FREQUENTLY.
- If you encounter a merge error that you don’t know how to fix, contact Dr. Friedlander as soon as possible. I recommend starting this assignment early so there is time for Dr. Friedlander to help you resolve any problems before the deadline.
Clone the repo & start new RStudio project
The following directions will guide you through the process of setting up your homework to work as a group.
Exercise 0.1
In your group, decide on a team name. Then have one member of your group:
- Click this link to accept the assignment and enter your team name.
- Repeat the directions for creating a project from HW 1 with the HW 3 repository.
Once this is complete, the other members can do the same thing, being careful to join the already created team on GitHub classroom.
Exercise 0.2
We will now learn how to collaborate on the same document at the same time by creating a merge conflict. You can read more about it here and here.
- Have Members 1, 2, and 3 each write a different team name below and render the document.
- Have Member 1 add, commit, and push changes.
- Have Member 2 add, commit, and push changes. This should cause an error.
- Have Member 2 pull changes from the remote repo, which should generate something like this:

      CONFLICT (content): Merge conflict in 03-hw-knn.qmd

- Have Member 2 open the `.qmd` file. You should see something like the third block of code in Section 22.4 of this link.
- Have Member 2 edit the document so that it has only their team name, then render, commit, and push. This should not cause an error.
- Have Member 3 repeat steps 3-6.
- Agree on what you want your team name to actually be and have Member 1 repeat steps 3-6.
- Note that you will only generate a merge conflict if you make edits to the same line of code. If you are working on the same document simultaneously but are editing different portions, git should automatically merge for you.
- Team Name: [Insert Name]
- Member 1: [Insert Name]
- Member 2: [Insert Name]
- Member 3: [Insert Name/Delete line if you only have two members]
\(K\)-Nearest Neighbors
The basic idea is that predictions are made based on the \(K\) observations in the training data which are “closest” to the observation that we’re making a prediction for. While many different metrics (i.e., measures of distance) can be used, we will work exclusively with the Euclidean metric: \[\text{dist}(x, y) = \sqrt{\sum_{i=1}^p(x_i-y_i)^2}\] for vectors \(x = (x_1,\ldots, x_p)\) and \(y = (y_1,\ldots,y_p)\).
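For example, here is the distance between the (hypothetical) points \((3, 1)\) and \((2.8, 1.2)\) computed in R, just to illustrate the formula:

```r
x <- c(3.0, 1.0)
y <- c(2.8, 1.2)
sqrt(sum((x - y)^2))  # Euclidean distance: 0.2828427
```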
KNN for Classification
We’ll start by using KNN for classification.
Data
We will be working with the famous `iris` data set, which consists of four measurements (in centimeters) for 150 plants belonging to three species of iris. This data set was first published in a classic 1936 paper by English statistician, and notable racist/eugenicist, Ronald Fisher. In that paper, multivariate linear models were applied to classify these plants. Of course, back then, model fitting was an extremely laborious process that was done without the aid of calculators or statistical software.
Exercise 1
Import the `datasets` package, take a look at the columns of `iris`, and split your data into training and test sets using a 70-30 split. IMPORTANT: Make sure that each species is represented proportionally in the training set by using the `strata` argument in the `initial_split` function! Once again, set your seed to 427.
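A minimal sketch of one way to do this with the tidymodels packages (object names like `iris_split` are our own choices, not required by the assignment):

```r
library(tidymodels)

set.seed(427)
# Stratify on Species so each species appears proportionally in both sets
iris_split <- initial_split(iris, prop = 0.7, strata = Species)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)
```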
Exercise 2
Create a scatter plot of your training set with `Sepal.Width` and `Petal.Width` on the x- and y-axes, respectively, and color the points by `Species`.
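Assuming the `iris_train` object from the sketch above, the plot might look something like:

```r
ggplot(iris_train, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point()
```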
Exercise 3
As the name suggests, the \(K\)-nearest neighbors (KNN) method classifies a point based on the classification of the observations in the training set that are nearest to that point. If \(k > 1\), then the neighbors essentially “vote” on the classification of the point. Using only your graph, if \(k = 1\), how would KNN classify a flower that had sepal width 3cm and petal width 1cm?
Exercise 4
Just to verify that we are correct, find the sepal width, petal width, and species of the observation in your training set that is closest to our flower with sepal width 3 cm and petal width 1 cm. This should be done by computing the Euclidean distance of (3, 1) to each observation and then sorting the resulting tibble to get the row with the smallest distance. Don't worry about normalizing the data.
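One possible approach, assuming `iris_train` from the earlier sketch:

```r
# Distance from each training observation to the point (3, 1)
iris_train |>
  mutate(dist = sqrt((Sepal.Width - 3)^2 + (Petal.Width - 1)^2)) |>
  arrange(dist) |>
  slice(1) |>
  select(Sepal.Width, Petal.Width, Species, dist)
```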
Exercise 5
Create a `recipe` to center and scale your data sets using the mean and standard deviation from your training set.
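A sketch using `step_center()` and `step_scale()`, assuming the model will use the two predictors from your plot (`iris_rec` is a hypothetical name):

```r
iris_rec <- recipe(Species ~ Sepal.Width + Petal.Width, data = iris_train) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
```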
Exercise 6
Create a `workflow` that includes the recipe above and a KNN model using `weight_func = "rectangular"` with 1 neighbor, then fit the model to your training set.
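A sketch, assuming `iris_rec` and `iris_train` from the earlier sketches and that the `kknn` engine package is installed:

```r
knn_spec <- nearest_neighbor(mode = "classification",
                             neighbors = 1,
                             weight_func = "rectangular") |>
  set_engine("kknn")

knn_fit <- workflow() |>
  add_recipe(iris_rec) |>
  add_model(knn_spec) |>
  fit(data = iris_train)
```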
We would like to understand how the method of \(K\)-nearest neighbors will classify points in the plane. That is, we would like to view the decision boundaries of this model. To do this, we will use our model to classify a large grid of points in the plane and color them by their classification. The code below creates a data frame called `grid` consisting of 62,750 points in the plane.
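```r
# g1 <- rep((200:450) * (1/100), 250)
# g2 <- rep((0:250) * (1/100), each = 250)
# grid <- tibble(x1 = g1,
#                x2 = g2)
```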
Exercise 7
Uncomment the code above, and change the variable names so that it will work with your model. Classify the points in `grid` using your training data and \(k = 1\). Then, plot the points in `grid` colored by their classification. Make sure your code is written so that the grid points are being centered and scaled before predictions are made for them.
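One way to do this, assuming `knn_fit` from the Exercise 6 sketch: because the recipe is part of the fitted workflow, `augment()` centers and scales the grid using the training-set statistics automatically before predicting.

```r
# Assuming grid's columns have been renamed to match the model's predictors:
# grid <- grid |> rename(Sepal.Width = x1, Petal.Width = x2)
grid_preds <- augment(knn_fit, new_data = grid)

ggplot(grid_preds, aes(x = Sepal.Width, y = Petal.Width, color = .pred_class)) +
  geom_point(size = 0.5)
```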
Exercise 8
Notice that the decision boundary between `versicolor` and `virginica` looks a little strange. What do you observe? Why do you think this is happening? Does using \(k = 2\) make things better or worse? Why do you think that is?
Exercise 9
Determine which value of \(k\), the number of neighbors selected, gives the highest accuracy on the test set. Test all \(k\)s between 1 and 40. Note that there may be ties because our data set is a little bit too small. To break ties, just choose the smallest \(k\) among the ones which are tied. Hint: A for loop may be helpful. What is the accuracy of the model you ended up choosing?
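A sketch of the loop, assuming the objects from the earlier sketches; note that `which.max()` returns the smallest index among ties, which matches the tie-breaking rule above:

```r
accuracies <- numeric(40)
for (k in 1:40) {
  spec_k <- nearest_neighbor(mode = "classification",
                             neighbors = k,
                             weight_func = "rectangular") |>
    set_engine("kknn")
  fit_k <- workflow() |>
    add_recipe(iris_rec) |>
    add_model(spec_k) |>
    fit(data = iris_train)
  accuracies[k] <- augment(fit_k, new_data = iris_test) |>
    accuracy(truth = Species, estimate = .pred_class) |>
    pull(.estimate)
}
which.max(accuracies)  # smallest k among the tied best values
```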
Awesome!! Your model probably did pretty well, because KNN performs really well on the `iris` data set. However, this isn't a very challenging data set for most classification methods. More challenging data sets have variables on different scales and class imbalance, where very few observations belong to a particular class.
KNN for Regression
For regression, we can predict the response variable for our point to be the average (or sometimes median) of the response variable for the \(K\)-nearest neighbors.
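In symbols, if \(\mathcal{N}_K(x)\) denotes the set of \(K\) training observations closest to \(x\), the mean-based prediction is \[\hat{f}(x) = \frac{1}{K}\sum_{i \in \mathcal{N}_K(x)} y_i.\]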
Data & Dummy Variables
For this portion of the homework, we'll use the `Carseats` data from the `ISLR2` package. Frequently, when working with categorical data, you will be required to transform that data into dummy variables. Namely, you'll create a new variable for each category, which gets a 1 if the corresponding observation is from that category and a 0 otherwise. In data science, this format is sometimes referred to as one-hot encoding.
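For a concrete illustration using the `ShelveLoc` column of `Carseats` (which has levels Bad, Good, and Medium):

```r
library(ISLR2)

# step_dummy() replaces ShelveLoc with indicator columns; by default the
# first level (Bad) is dropped as redundant, so Bad is the reference level.
recipe(Sales ~ ShelveLoc, data = Carseats) |>
  step_dummy(ShelveLoc) |>
  prep() |>
  bake(new_data = NULL) |>
  head()
```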
Exercise 10
Look at the `Carseats` data from the `ISLR2` package. Then, split the data into a training and test set using a 70-30 split and a seed of 427 (as usual).
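A sketch, mirroring Exercise 1 (object names are hypothetical):

```r
library(ISLR2)

set.seed(427)
carseats_split <- initial_split(Carseats, prop = 0.7)
carseats_train <- training(carseats_split)
carseats_test  <- testing(carseats_split)
```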
Exercise 11
Create a recipe which first converts all categorical variables into dummy variables using `step_dummy()`, then centers and scales all of the predictors based on the training data.
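A sketch, assuming `carseats_train` from the previous sketch; `step_normalize()` is equivalent to centering and then scaling:

```r
carseats_rec <- recipe(Sales ~ ., data = carseats_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())
```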
Exercise 12
Fit a KNN model to predict `Sales` from the data we have. Fit your model on the training data and use the test set to choose the appropriate variables and the number of neighbors to include. You may find it useful to plot the \(R^2\) and RMSE against the number of neighbors you include in your model. You may find that the RMSE and \(R^2\) disagree on what the best model is, in which case you will have to make a judgment call about which model is "best". One thing that can be helpful is looking at plots of your target variable (`Sales` in this case) against the model residuals.
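A sketch of one way to compare values of \(k\), assuming the objects from the sketches above and using all predictors (you may well choose a different subset):

```r
results <- tibble(k = integer(), rmse = numeric(), rsq = numeric())
for (k in 1:40) {
  spec_k <- nearest_neighbor(mode = "regression",
                             neighbors = k,
                             weight_func = "rectangular") |>
    set_engine("kknn")
  fit_k <- workflow() |>
    add_recipe(carseats_rec) |>
    add_model(spec_k) |>
    fit(data = carseats_train)
  preds <- augment(fit_k, new_data = carseats_test)
  results <- results |>
    add_row(k = k,
            rmse = rmse(preds, truth = Sales, estimate = .pred)$.estimate,
            rsq  = rsq(preds, truth = Sales, estimate = .pred)$.estimate)
}

# Plot both metrics against k to inform the judgment call
ggplot(results, aes(x = k, y = rmse)) + geom_line()
ggplot(results, aes(x = k, y = rsq)) + geom_line()
```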