Homework 1: Exploratory Data Analysis

Author

My name

Adapted from “Start teaching with R,” created by R Pruim, N J Horton, and D Kaplan, 2013, “Interactive and Dynamic Graphics for Data Analysis,” by Dianne Cook and Deborah F. Swayne, Colby Long’s DATA 325 Course at Wooster College and Maria Tackett’s STA-210 course at Duke University.

Introduction

In this homework we will familiarize ourselves with the tools that we’ll use throughout the course and refresh ourselves on topic related to exploratory data analysis.

Course Toolkit

The primary tools we’ll be using in this course are R, RStudio, git, and GitHub. We will be using them throughout the course both to learn the concepts discussed in the course and to analyze real data and come to informed conclusions.

Note

R is the name of the programming language itself and RStudio is a convenient interface.

Note

Git is a version control system (like “Track Changes” features from Microsoft Word but more powerful) and GitHub is the home for your Git-based projects on the internet (like Dropbox but much better for code).

To make versioning simpler, this homework will be completed individually. In the future, you’ll learn about collaborating on GitHub and producing a single homework for your team, but for now, concentrate on getting the basics down.

Exploratory Data Analysis

One of the most important components of data science is exploratory data analysis. I really like the following definition, which comes from this article (though it’s probably not the original source).

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.

Before you begin your exploratory analysis, you may already have a particular question in mind. For example, you might work for an online retailer and want to develop a model to predict which purchased items will be returned. Or, you may not have a particular question in mind. Instead, you might just be asked to look at browsing data for several customers and figure out some way to increase purchases. In either case, before you construct a fancy model, you need to explore and understand your data. This is how you gain new insights and determine if an idea is worth pursuing.

Learning goals

By the end of the homework, you will…

Be familiar with the workflow using RStudio and GitHub
Gain practice writing a reproducible report using Quarto
Practice version control using GitHub
Be able to create numerical and visual summaries of data
Use those summaries

Getting Started

Accessing R and RStudio

Follow the directions here to install R and RStudio on your computer
If, for some reason, you are unable to use R or RStudio on your personal computer, you may use the College of Idaho’s RStudio Server. However, I do not recommend you do this as I want you to practice installing packages in this course and you do not have the permissions to do that on the server. If you must use the RStudio Server, please notify Dr. Friedlander immediately.

Set up your SSH Key

You will authenticate GitHub using SSH. Below are an outline of the authentication steps.

Note

You only need to do this authentication process one time on a single system.

Step 0: Type install.packages("credentials") into your console.
Step 1: Type credentials::ssh_setup_github() into your console.
Step 2: R will ask “No SSH key found. Generate one now?” Click 1 for yes.
Step 3: You will generate a key. It will begin with “ssh-rsa….” R will then ask “Would you like to open a browser now?” Click 1 for yes.
Step 4: You may be asked to provide your username and password to log into GitHub. This would be the ones associated with your account that you set up. After entering this information, paste the key in and give it a name. You might name it in a way that indicates where the key will be used, e.g., mat427)

You can find more detailed instructions here if you’re interested.

Installing git

You must now install git on your computer. Please follow the directions here to install git on your computer.

Configure git

There is one more thing we need to do before getting started on the assignment. Specifically, we need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your name and email address.

To do so, you will use the use_git_config() function from the usethis package. You will need to install the usethis package in the same way you installed the credentials packages above.

Type the following lines of code in the console in RStudio filling in your name and the email address associated with your GitHub account.

usethis::use_git_config(
  user.name = "Your name", 
  user.email = "Email associated with your GitHub account")

For example, mine would be

usethis::use_git_config(
  user.name = "EricFriedlander",
  user.email = "efriedlander@collegeofidaho.edu")

You are now ready interact between GitHub and RStudio!

Clone the repo & start new RStudio project

Go to the course organization at github.com/mat427sp25 organization on GitHub. Click on the repo with the prefix hw-01-. It contains the starter documents you need to complete the lab.
- If you do not see your hw-01 repo, click here to create your repo.
Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
In RStudio, go to File \(\rightarrow\) New Project \(\rightarrow\) Version Control \(\rightarrow\) Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click 01-hw-eda.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.

R and R Studio

Below are the components of the RStudio IDE.

Below are the components of a Quarto (.qmd) file.

YAML

The top portion of your Quarto file (between the three dashed lines) is called YAML. It is rumored that it stood for “Yet Another Markup Language” but it officially stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.

Important

Open the Quarto (.qmd) file in your project, change the author name to your name, and render the document. Examine the rendered document.

Committing changes

Now, go to the Git pane in your RStudio instance. This will be in the top right hand corner in a separate tab.

If you have made changes to your Quarto (.qmd) file, you should see it listed here. Click on it to select it in this list and then click on Diff. This shows you the difference between the last committed state of the document and its current state including changes. You should see deletions in red and additions in green.

If you’re happy with these changes, we’ll prepare the changes to be pushed to your remote repository. First, stage your changes by checking the appropriate box on the files you want to prepare. Next, write a meaningful commit message (for instance, “updated author name”) in the Commit message box. Finally, click Commit. Note that every commit needs to have a commit message associated with it.

You don’t have to commit after every change, as this would get quite tedious. You should commit states that are meaningful to you for inspection, comparison, or restoration.

In the first few assignments I will nudge you when to commit and in some cases, what commit message to use. As the semester progresses we will let you make these decisions.

Now let’s make sure all the changes went to GitHub. Go to your GitHub repo and refresh the page. You should see your commit message next to the updated files. If you see this, all your changes are on GitHub and you’re good to go!

Push changes

Now that you have made an update and committed this change, it’s time to push these changes to your repo on GitHub.

In order to push your changes to GitHub, you must have staged your commit to be pushed. click on Push.

Understanding your data

Today we will be working with the TIPS data set which is in the regclass package. The data in the TIPS dataset is information recorded by one waiter about each tip he received over a period of a few months working in a restaurant. We would like to use this data to address the question, “What factors affect tipping behavior?”

Exercise 1

Question

Install the regclass package by either typing install.packages("regclass") in the console or by clicking “Tools > Install Packages” and selecting the package. Once you have done this, the code chunk below will load the package and data set. Notice that a bunch of unnecessary output is included when you knit the document. Change the Quarto chunk options so that this is not displayed.

library(regclass)

Loading required package: bestglm

Loading required package: leaps

Loading required package: VGAM

Loading required package: stats4

Loading required package: splines

Loading required package: rpart

Loading required package: randomForest

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin

Important regclass change from 1.3:
All functions that had a . in the name now have an _
all.correlations -> all_correlations, cor.demo -> cor_demo, etc.

data("TIPS")

When exploring a new data set, it’s important to first understand the basics. What format is our data in? What types of information are included in the data set? How many observations are there?

Exercise 2

Question

In R, data sets are usually stored in a 2-dimensional structure called a data frame. The tidyverse provides a lot of useful functions for a variety of applications including data exploration and the particular flavor of data frame that the tidyverse uses is called a tibble. After loading the tidyverse library, you can get an idea of the structure of a data set using the syntax str(dataset) or glimpse(data), and you can peak at the first few rows and columns with head(dataset). Create a code chunk below, and use these functions (and others) in the R chunk below to better understand the data. How many tips are recorded in this data set? Which days of the week did the waiter work?

Often, a data set will come with a code book which gives more complete information about the structure of the data, the meaning of variables, and how the data were collected. In this case, most of the column names are pretty self explanatory.

Variable	Description
`TipPercentage`	the gratuity, as a percentage of the bill
`Bill`	the cost of the meal in US dollars
`Tip`	the tip in US dollars
`Gender`	gender of the bill payer
`Smoker`	whether the party included smokers
`Weekday`	day of the week
`Time`	time the bill was paid
`PartySize`	size of the party

Exercise 3

Question

Even though the column names are self-explanatory, we might have more questions about the data. For example, we might conjecture that people tip differently for breakfast and lunch, but our data only tells us if the bill was paid at “Day” or “Night.” State another reasonable conjecture about a factor that might affect tipping behavior. What additional information would be helpful to explore that conjecture?

Render-Commit-Push

This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 1 - 3”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Numerical Summaries

Now we’d like to start looking closely at the data set to develop some ideas about what factors might affect tipping. The basic descriptive statistics have obvious names, like mean, median, sd, IQR, quantile, etc. When using the tidyverse you use these in conjunction with the function summarize. Other options for data exploration include the function summary(), which computes several numerical summaries all at once, and the skimr package which includes many useful functions for taking a quick look at your data. We can apply these functions to an entire data frame or a specific column of the data fame.

Exercise 4

Question

Use some of these summaries to answer the following. How many smokers are in the data set? How fancy do you think restaurant is? Is it possible to tell from this summary how many different shifts the waiter worked? Why or why not?

As we start to explore different questions, we might want to know things about interactions between variables. Like, are tips larger during the day or at night? Or does gender or smoking status matter for how much people spend and how much they tip? You can calculate statistics within groups by including grouping variables and using group_by or aggregate like this:

# Tidyverse
TIPS |> 
  group_by(Time) |> 
  summarize(median(Tip))

# A tibble: 2 × 2
  Time  `median(Tip)`
  <fct>         <dbl>
1 Day            2.25
2 Night          3

TIPS |> 
  group_by(Gender, Smoker) |> 
  summarize(avg_bill = mean(Bill), avg_TipPerc = mean(TipPercentage))

`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.

# A tibble: 4 × 4
# Groups:   Gender [2]
  Gender Smoker avg_bill avg_TipPerc
  <fct>  <fct>     <dbl>       <dbl>
1 Female No         18.1        15.7
2 Female Yes        18.0        18.2
3 Male   No         19.8        16.1
4 Male   Yes        22.3        15.3

# Using aggregate
aggregate(Tip ~ Time, data = TIPS, FUN = median)

   Time  Tip
1   Day 2.25
2 Night 3.00

aggregate(cbind(Bill, TipPercentage) ~ Gender + Smoker, data = TIPS, FUN = mean)

  Gender Smoker     Bill TipPercentage
1 Female     No 18.10519      15.69296
2   Male     No 19.79124      16.06701
3 Female    Yes 17.97788      18.21606
4   Male    Yes 22.28450      15.27967

We can also use the kable function from the knitr package to make our tables look pretty:

library(knitr)

TIPS |> 
  group_by(Time) |> 
  summarize(median(Tip)) |> 
  kable()

Time	median(Tip)
Day	2.25
Night	3.00

The ~ (tilde) symbol appears in a lot of functions. In R, a formula is an expression involving ~ that provides slots for laying out how you want to relate variables: y ~ x means “\(y\) versus \(x\)”, “\(y\) depends on \(x\)”, or “break down \(y\) by \(x\)”. In the first case above, you’re saying “break Tip down by Time” or “perform this function on the Tip, conditioned on Time.”

Exercise 5

Question

Calculate the variance of the tip percentage broken down by day of the week. Do you notice anything unusual? Explore the data and determine a possible cause for this.

For categorical variables, we can create tables as follows:

# Using table
table(TIPS$Smoker, TIPS$Gender) |> 
  kable()

	Female	Male
No	54	97
Yes	33	60

# Using xtabs
xtabs(~ Smoker + Gender, data = TIPS) |> 
  kable()

	Female	Male
No	54	97
Yes	33	60

# Using the janitor package (my favorite)
library(janitor)
TIPS |> 
  tabyl(Smoker, Gender) |>  # creates table
  adorn_totals(where = c("row", "col")) |>  # add margin totals if you want
  kable()

Smoker	Female	Male	Total
No	54	97	151
Yes	33	60	93
Total	87	157	244

Exercise 6

Question

Which day of the week has the highest percentage of tables that are smokers? Hint: look at documentation and use Google to figure out how to create table proportions.

Render-Commit-Push

This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 4 - 6”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Graphical Summaries

Graphical summaries are a key tool in exploratory data analysis to to help you understand your data. They also help you communicate insights about your data to others. For example, we might want to display relationships about some of our categorical variables. So we could start by graphing different party sizes in our data set.

TIPS |> 
  ggplot(aes(x = PartySize)) +
  geom_bar()

Or we could explore the question about the percentage of tables that are smokers on different days of the week visually.

TIPS |> 
  ggplot(aes(x = Weekday, fill = Smoker)) +
    geom_bar()

TIPS |> 
  ggplot(aes(x = Weekday, fill = Smoker)) +
    geom_bar(position = "fill")

We might summarize a numerical variable with a histogram. For example, here is a histogram of all of the tips in the data set.

TIPS |> 
  ggplot(aes(x = Tip)) +
    geom_histogram(bins = 100)

Exercise 7

Question

Notice that there are a few “spikes” in the histogram above. What do you think is causing this?

We can also summarize this numerical data broken down by one of the categorical variables using boxplots, violin plots, or sina plots. Note that to create sina plots we need the ggforce package.

TIPS |> 
  ggplot(aes(x=Weekday, y=Tip)) +
  geom_boxplot() +
  labs(title = "Tips by Day of the Week", 
       x = "Day of the Week",
       y = "Tips")

TIPS |> 
  ggplot(aes(x=Weekday, y=Tip)) +
  geom_boxplot() +
  geom_jitter() +
  labs(title = "Tips by Day of the Week", 
       x = "Day of the Week",
       y = "Tips")

TIPS |> 
  ggplot(aes(x=Weekday, y=Tip)) +
  geom_violin() +
  labs(title = "Tips by Day of the Week", 
       x = "Day of the Week",
       y = "Tips")

library(ggforce)
TIPS |> 
  ggplot(aes(x=Weekday, y=Tip)) +
  geom_sina() +
  labs(title = "Tips by Day of the Week", 
       x = "Day of the Week",
       y = "Tips")

Or we can visualize the relationship between a lot of our numerical variables at once.

# Using pairs (only numerical allowed)
pairs(~ Bill + TipPercentage + Tip
    , data = TIPS
    , main="Scatterplot Matrix for TIPS")

# Using ggpairs from GGally package (preferable even though more syntax)
library(GGally)
TIPS |> 
  select(Bill, TipPercentage, Tip, Weekday) |> 
  ggpairs()

Exercise 8

Question

Are there any clear linear relationships in the scatterplots above? What do you think is the explanation for these relationships?

There are lots of other interesting graphical summaries available for interpreting and displaying data. In addition, there are lots of R packages that allow you to draw these graphics and to further customize some of the ones we discussed here. In your projects, you are welcome to use any of these that you think are appropriate.

Render-Commit-Push

This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 7 - 8”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Exercise 9

Question

State a reasonable conjecture about tipping behavior that you would like to explore in the data set. For example, you might think that people on dates tip more or that the waiter gets smaller tips when he has too many tables. Give at least one numerical and one graphical summary to explore this conjecture. Is there any evidence to support your conjecture?

It’s okay if your conjecture is not supported or if you are just wrong–that’s often the case in exploratory data analysis.

Render-Commit-Push

This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 9”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Caution

Before you’re done with you work, make sure you look it over one last time to make sure the rendered document looks like you want it to! I can’t tell you how often students turn in work and their output doesn’t match their prose or the output definitely doesn’t look the way they wanted it to.