::use_git_config(
usethisuser.name = "Your name",
user.email = "Email associated with your GitHub account")
Homework 1: Exploratory Data Analysis
Adapted from “Start teaching with R,” created by R Pruim, N J Horton, and D Kaplan, 2013, “Interactive and Dynamic Graphics for Data Analysis,” by Dianne Cook and Deborah F. Swayne, Colby Long’s DATA 325 Course at Wooster College and Maria Tackett’s STA-210 course at Duke University.
Introduction
In this homework we will familiarize ourselves with the tools that we’ll use throughout the course and refresh ourselves on topic related to exploratory data analysis.
Course Toolkit
The primary tools we’ll be using in this course are R, RStudio, git, and GitHub. We will be using them throughout the course both to learn the concepts discussed in the course and to analyze real data and come to informed conclusions.
R is the name of the programming language itself and RStudio is a convenient interface.
Git is a version control system (like “Track Changes” features from Microsoft Word but more powerful) and GitHub is the home for your Git-based projects on the internet (like Dropbox but much better for code).
To make versioning simpler, this homework will be completed individually. In the future, you’ll learn about collaborating on GitHub and producing a single homework for your team, but for now, concentrate on getting the basics down.
Exploratory Data Analysis
One of the most important components of data science is exploratory data analysis. I really like the following definition, which comes from this article (though it’s probably not the original source).
Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
Before you begin your exploratory analysis, you may already have a particular question in mind. For example, you might work for an online retailer and want to develop a model to predict which purchased items will be returned. Or, you may not have a particular question in mind. Instead, you might just be asked to look at browsing data for several customers and figure out some way to increase purchases. In either case, before you construct a fancy model, you need to explore and understand your data. This is how you gain new insights and determine if an idea is worth pursuing.
Learning goals
By the end of the homework, you will…
- Be familiar with the workflow using RStudio and GitHub
- Gain practice writing a reproducible report using Quarto
- Practice version control using GitHub
- Be able to create numerical and visual summaries of data
- Use those summaries
Getting Started
Accessing R and RStudio
Follow the directions here to install R and RStudio on your computer
If, for some reason, you are unable to use R or RStudio on your personal computer, you may use the College of Idaho’s RStudio Server. However, I do not recommend you do this as I want you to practice installing packages in this course and you do not have the permissions to do that on the server. If you must use the RStudio Server, please notify Dr. Friedlander immediately.
Set up your SSH Key
You will authenticate GitHub using SSH. Below are an outline of the authentication steps.
You only need to do this authentication process one time on a single system.
- Step 0: Type
install.packages("credentials")
into your console. - Step 1: Type
credentials::ssh_setup_github()
into your console. - Step 2: R will ask “No SSH key found. Generate one now?” Click 1 for yes.
- Step 3: You will generate a key. It will begin with “ssh-rsa….” R will then ask “Would you like to open a browser now?” Click 1 for yes.
- Step 4: You may be asked to provide your username and password to log into GitHub. This would be the ones associated with your account that you set up. After entering this information, paste the key in and give it a name. You might name it in a way that indicates where the key will be used, e.g.,
mat427
)
You can find more detailed instructions here if you’re interested.
Installing git
You must now install git on your computer. Please follow the directions here to install git on your computer.
Configure git
There is one more thing we need to do before getting started on the assignment. Specifically, we need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your name and email address.
To do so, you will use the use_git_config()
function from the usethis
package. You will need to install the usethis
package in the same way you installed the credentials
packages above.
Type the following lines of code in the console in RStudio filling in your name and the email address associated with your GitHub account.
For example, mine would be
::use_git_config(
usethisuser.name = "EricFriedlander",
user.email = "efriedlander@collegeofidaho.edu")
You are now ready interact between GitHub and RStudio!
Clone the repo & start new RStudio project
- Go to the course organization at github.com/mat427sp25 organization on GitHub. Click on the repo with the prefix hw-01-. It contains the starter documents you need to complete the lab.
- If you do not see your hw-01 repo, click here to create your repo.
- Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
- In RStudio, go to File \(\rightarrow\) New Project \(\rightarrow\) Version Control \(\rightarrow\) Git.
- Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
- Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
- Click 01-hw-eda.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.
R and R Studio
Below are the components of the RStudio IDE.
Below are the components of a Quarto (.qmd) file.
YAML
The top portion of your Quarto file (between the three dashed lines) is called YAML. It is rumored that it stood for “Yet Another Markup Language” but it officially stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Open the Quarto (.qmd) file in your project, change the author name to your name, and render the document. Examine the rendered document.
Committing changes
Now, go to the Git pane in your RStudio instance. This will be in the top right hand corner in a separate tab.
If you have made changes to your Quarto (.qmd) file, you should see it listed here. Click on it to select it in this list and then click on Diff. This shows you the difference between the last committed state of the document and its current state including changes. You should see deletions in red and additions in green.
If you’re happy with these changes, we’ll prepare the changes to be pushed to your remote repository. First, stage your changes by checking the appropriate box on the files you want to prepare. Next, write a meaningful commit message (for instance, “updated author name”) in the Commit message box. Finally, click Commit. Note that every commit needs to have a commit message associated with it.
You don’t have to commit after every change, as this would get quite tedious. You should commit states that are meaningful to you for inspection, comparison, or restoration.
In the first few assignments I will nudge you when to commit and in some cases, what commit message to use. As the semester progresses we will let you make these decisions.
Now let’s make sure all the changes went to GitHub. Go to your GitHub repo and refresh the page. You should see your commit message next to the updated files. If you see this, all your changes are on GitHub and you’re good to go!
Push changes
Now that you have made an update and committed this change, it’s time to push these changes to your repo on GitHub.
In order to push your changes to GitHub, you must have staged your commit to be pushed. click on Push.
Understanding your data
Today we will be working with the TIPS
data set which is in the regclass
package. The data in the TIPS
dataset is information recorded by one waiter about each tip he received over a period of a few months working in a restaurant. We would like to use this data to address the question, “What factors affect tipping behavior?”
Exercise 1
Install the regclass
package by either typing install.packages("regclass")
in the console or by clicking “Tools > Install Packages” and selecting the package. Once you have done this, the code chunk below will load the package and data set. Notice that a bunch of unnecessary output is included when you knit the document. Change the Quarto chunk options so that this is not displayed.
library(regclass)
Loading required package: bestglm
Loading required package: leaps
Loading required package: VGAM
Loading required package: stats4
Loading required package: splines
Loading required package: rpart
Loading required package: randomForest
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
Important regclass change from 1.3:
All functions that had a . in the name now have an _
all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
data("TIPS")
When exploring a new data set, it’s important to first understand the basics. What format is our data in? What types of information are included in the data set? How many observations are there?
Exercise 2
In R, data sets are usually stored in a 2-dimensional structure called a data frame. The tidyverse
provides a lot of useful functions for a variety of applications including data exploration and the particular flavor of data frame that the tidyverse uses is called a tibble
. After loading the tidyverse
library, you can get an idea of the structure of a data set using the syntax str(dataset)
or glimpse(data)
, and you can peak at the first few rows and columns with head(dataset)
. Create a code chunk below, and use these functions (and others) in the R chunk below to better understand the data. How many tips are recorded in this data set? Which days of the week did the waiter work?
Often, a data set will come with a code book which gives more complete information about the structure of the data, the meaning of variables, and how the data were collected. In this case, most of the column names are pretty self explanatory.
Variable | Description |
---|---|
TipPercentage |
the gratuity, as a percentage of the bill |
Bill |
the cost of the meal in US dollars |
Tip |
the tip in US dollars |
Gender |
gender of the bill payer |
Smoker |
whether the party included smokers |
Weekday |
day of the week |
Time |
time the bill was paid |
PartySize |
size of the party |
Exercise 3
Even though the column names are self-explanatory, we might have more questions about the data. For example, we might conjecture that people tip differently for breakfast and lunch, but our data only tells us if the bill was paid at “Day” or “Night.” State another reasonable conjecture about a factor that might affect tipping behavior. What additional information would be helpful to explore that conjecture?
This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 1 - 3”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Numerical Summaries
Now we’d like to start looking closely at the data set to develop some ideas about what factors might affect tipping. The basic descriptive statistics have obvious names, like mean, median, sd, IQR, quantile
, etc. When using the tidyverse
you use these in conjunction with the function summarize
. Other options for data exploration include the function summary()
, which computes several numerical summaries all at once, and the skimr
package which includes many useful functions for taking a quick look at your data. We can apply these functions to an entire data frame or a specific column of the data fame.
Exercise 4
Use some of these summaries to answer the following. How many smokers are in the data set? How fancy do you think restaurant is? Is it possible to tell from this summary how many different shifts the waiter worked? Why or why not?
As we start to explore different questions, we might want to know things about interactions between variables. Like, are tips larger during the day or at night? Or does gender or smoking status matter for how much people spend and how much they tip? You can calculate statistics within groups by including grouping variables and using group_by
or aggregate
like this:
# Tidyverse
|>
TIPS group_by(Time) |>
summarize(median(Tip))
# A tibble: 2 × 2
Time `median(Tip)`
<fct> <dbl>
1 Day 2.25
2 Night 3
|>
TIPS group_by(Gender, Smoker) |>
summarize(avg_bill = mean(Bill), avg_TipPerc = mean(TipPercentage))
`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.
# A tibble: 4 × 4
# Groups: Gender [2]
Gender Smoker avg_bill avg_TipPerc
<fct> <fct> <dbl> <dbl>
1 Female No 18.1 15.7
2 Female Yes 18.0 18.2
3 Male No 19.8 16.1
4 Male Yes 22.3 15.3
# Using aggregate
aggregate(Tip ~ Time, data = TIPS, FUN = median)
Time Tip
1 Day 2.25
2 Night 3.00
aggregate(cbind(Bill, TipPercentage) ~ Gender + Smoker, data = TIPS, FUN = mean)
Gender Smoker Bill TipPercentage
1 Female No 18.10519 15.69296
2 Male No 19.79124 16.06701
3 Female Yes 17.97788 18.21606
4 Male Yes 22.28450 15.27967
We can also use the kable
function from the knitr
package to make our tables look pretty:
library(knitr)
|>
TIPS group_by(Time) |>
summarize(median(Tip)) |>
kable()
Time | median(Tip) |
---|---|
Day | 2.25 |
Night | 3.00 |
The ~
(tilde) symbol appears in a lot of functions. In R, a formula is an expression involving ~
that provides slots for laying out how you want to relate variables: y ~ x
means “\(y\) versus \(x\)”, “\(y\) depends on \(x\)”, or “break down \(y\) by \(x\)”. In the first case above, you’re saying “break Tip down by Time” or “perform this function on the Tip, conditioned on Time.”
Exercise 5
Calculate the variance of the tip percentage broken down by day of the week. Do you notice anything unusual? Explore the data and determine a possible cause for this.
For categorical variables, we can create tables as follows:
# Using table
table(TIPS$Smoker, TIPS$Gender) |>
kable()
Female | Male | |
---|---|---|
No | 54 | 97 |
Yes | 33 | 60 |
# Using xtabs
xtabs(~ Smoker + Gender, data = TIPS) |>
kable()
Female | Male | |
---|---|---|
No | 54 | 97 |
Yes | 33 | 60 |
# Using the janitor package (my favorite)
library(janitor)
|>
TIPS tabyl(Smoker, Gender) |> # creates table
adorn_totals(where = c("row", "col")) |> # add margin totals if you want
kable()
Smoker | Female | Male | Total |
---|---|---|---|
No | 54 | 97 | 151 |
Yes | 33 | 60 | 93 |
Total | 87 | 157 | 244 |
Exercise 6
Which day of the week has the highest percentage of tables that are smokers? Hint: look at documentation and use Google to figure out how to create table proportions.
This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 4 - 6”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Graphical Summaries
Graphical summaries are a key tool in exploratory data analysis to to help you understand your data. They also help you communicate insights about your data to others. For example, we might want to display relationships about some of our categorical variables. So we could start by graphing different party sizes in our data set.
|>
TIPS ggplot(aes(x = PartySize)) +
geom_bar()
Or we could explore the question about the percentage of tables that are smokers on different days of the week visually.
|>
TIPS ggplot(aes(x = Weekday, fill = Smoker)) +
geom_bar()
|>
TIPS ggplot(aes(x = Weekday, fill = Smoker)) +
geom_bar(position = "fill")
We might summarize a numerical variable with a histogram. For example, here is a histogram of all of the tips in the data set.
|>
TIPS ggplot(aes(x = Tip)) +
geom_histogram(bins = 100)
Exercise 7
Notice that there are a few “spikes” in the histogram above. What do you think is causing this?
We can also summarize this numerical data broken down by one of the categorical variables using boxplots, violin plots, or sina plots. Note that to create sina plots we need the ggforce
package.
|>
TIPS ggplot(aes(x=Weekday, y=Tip)) +
geom_boxplot() +
labs(title = "Tips by Day of the Week",
x = "Day of the Week",
y = "Tips")
|>
TIPS ggplot(aes(x=Weekday, y=Tip)) +
geom_boxplot() +
geom_jitter() +
labs(title = "Tips by Day of the Week",
x = "Day of the Week",
y = "Tips")
|>
TIPS ggplot(aes(x=Weekday, y=Tip)) +
geom_violin() +
labs(title = "Tips by Day of the Week",
x = "Day of the Week",
y = "Tips")
library(ggforce)
|>
TIPS ggplot(aes(x=Weekday, y=Tip)) +
geom_sina() +
labs(title = "Tips by Day of the Week",
x = "Day of the Week",
y = "Tips")
Or we can visualize the relationship between a lot of our numerical variables at once.
# Using pairs (only numerical allowed)
pairs(~ Bill + TipPercentage + Tip
data = TIPS
, main="Scatterplot Matrix for TIPS") ,
# Using ggpairs from GGally package (preferable even though more syntax)
library(GGally)
|>
TIPS select(Bill, TipPercentage, Tip, Weekday) |>
ggpairs()
Exercise 8
Are there any clear linear relationships in the scatterplots above? What do you think is the explanation for these relationships?
There are lots of other interesting graphical summaries available for interpreting and displaying data. In addition, there are lots of R packages that allow you to draw these graphics and to further customize some of the ones we discussed here. In your projects, you are welcome to use any of these that you think are appropriate.
This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 7 - 8”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Exercise 9
State a reasonable conjecture about tipping behavior that you would like to explore in the data set. For example, you might think that people on dates tip more or that the waiter gets smaller tips when he has too many tables. Give at least one numerical and one graphical summary to explore this conjecture. Is there any evidence to support your conjecture?
It’s okay if your conjecture is not supported or if you are just wrong–that’s often the case in exploratory data analysis.
This is a good place to render, commit, and push changes to your hw-eda repo on GitHub. Write an informative commit message (e.g. “Completed exercises 9”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Before you’re done with you work, make sure you look it over one last time to make sure the rendered document looks like you want it to! I can’t tell you how often students turn in work and their output doesn’t match their prose or the output definitely doesn’t look the way they wanted it to.