write.csv(prediction_vector, "my_predictions.csv", row.names = FALSE)
Hack-A-Thon Instructions
Census Income
The U.S. Constitution requires that a census be conducted every ten years in order to allocate congressional representatives. The first census was conducted in 1790 and the U.S. Census Bureau is still compiling the data from the 2020 census. While the Constitution only requires an “actual enumeration” of the population, the census has expanded to include a number of demographic questions. As described by the Census Bureau, the results of the 2020 census will,
“…determine congressional representation, inform hundreds of billions in federal funding, and provide data that will impact communities for the next decade.”
Project Overview
In this hack-a-thon, you will use census data to predict whether or not someone has an annual income of more than $50,000. The data for making your predictions are contained in two files:
- census_train.csv contains 35,000 rows, each representing a unique individual, and 15 columns of demographic information about those individuals (including whether their income is above or below $50,000).
- census_test.csv contains 13,840 rows, but only 14 columns since the income column has been removed.
A complete description of the variables in the data set is below.
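As a starting point, here is a minimal sketch of loading the two files in R, assuming they have been downloaded into the working directory. The helper name load_census is hypothetical, and the only thing it checks is the shape of each file.

```r
# Hypothetical helper: read one of the census files and report its shape.
load_census <- function(path) {
  # stringsAsFactors = FALSE keeps text columns as character for later cleaning
  dat <- read.csv(path, stringsAsFactors = FALSE)
  message(basename(path), ": ", nrow(dat), " rows, ", ncol(dat), " columns")
  dat
}

# Only read the files if they are present in the working directory.
if (file.exists("census_train.csv") && file.exists("census_test.csv")) {
  train <- load_census("census_train.csv")
  test  <- load_census("census_test.csv")
  # Columns that appear in train but not in test (expected: the income column)
  print(setdiff(names(train), names(test)))
}
```

Comparing the reported dimensions against the counts given above is a quick way to confirm the files were read correctly.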
Deliverables
There are two deliverables for this project:
- An HTML file outlining your modeling process. This should follow the same guidelines as the Job Application 2 Resources. It should be pushed to the GitHub classroom assignment, which is also where you can get all of the data.
- A .csv file, uploaded to Canvas, containing a vector of your predictions for whether the individuals in the test set make more than $50,000. That is, you will create a vector of 0s and 1s with one entry for each of the 13,840 rows in census_test.csv, where 1 indicates an individual makes MORE than $50K, and write it to a file using write.csv(prediction_vector, "my_predictions.csv", row.names = FALSE).
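A minimal sketch of preparing and writing the prediction file might look like the following. The probabilities are made up, the 0.5 cutoff is an assumption rather than a requirement, and the file is written to a temporary directory here only so the example runs anywhere.

```r
# Hypothetical example: turn predicted probabilities into a 0/1 vector.
# 'predicted_prob' stands in for the output of whatever model you fit.
predicted_prob <- c(0.91, 0.12, 0.55, 0.40)

# 1 = predicted to make MORE than $50K, 0 = otherwise (0.5 cutoff is an assumption)
prediction_vector <- as.integer(predicted_prob > 0.5)

# Sanity checks before writing: only 0s and 1s, no missing values
stopifnot(all(prediction_vector %in% c(0L, 1L)), !anyNA(prediction_vector))

# Use your own path (e.g. "my_predictions.csv") when submitting
out_file <- file.path(tempdir(), "my_predictions.csv")
write.csv(prediction_vector, out_file, row.names = FALSE)
```

With real predictions, it is also worth confirming that the length of the vector equals the number of rows in the test set before uploading.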
Your submission will be evaluated for predictive quality (accuracy), writing quality, and clarity. Not all columns in this data set contain numerical values, so some will need to be translated into appropriate forms before beginning data analysis. There are also instances of missing or incomplete data, and some issues with how the data have been entered that you will need to address. You are also welcome to use any other techniques or packages you would like, but make sure that you can explain your analysis well.
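To illustrate the kind of preparation described above, here is a rough sketch on a made-up data frame. The values, the stray whitespace, and the idea that gaps appear as empty strings are all assumptions for illustration; the real files may encode problems differently, so inspect them before borrowing any of this.

```r
# Toy data standing in for a few census-style columns (values are made up).
toy <- data.frame(
  age       = c(39, 50, NA, 28),
  workclass = c(" Private", "State-gov ", "", "Private"),
  stringsAsFactors = FALSE
)

# Strip stray whitespace from every character column
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], trimws)

# Treat empty strings as missing (an assumption about how gaps might be encoded)
toy[is_chr] <- lapply(toy[is_chr], function(x) replace(x, x == "", NA))

# Count missing values per column, then convert text columns to factors
print(colSums(is.na(toy)))
toy[is_chr] <- lapply(toy[is_chr], factor)
```

Whether to drop, impute, or separately flag missing values is a modeling decision that should be justified in the report.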
Notes
This data set is great for practicing data science techniques. Because of this, if you search the internet you will find the original data set as well as articles that describe exactly how to apply statistical learning methods to it. There are several reasons NOT TO DO THIS. For one, it is cheating and will result in a 0. But also, it ruins the fun. This project is a great chance for you to go more in depth about something we have discussed in class and to challenge yourself to make the best predictions possible! You are welcome to search the internet for general advice regarding predictive modeling, but if you encounter census data, look away! If you have more specific questions about the data set, please ask me.
Description of Variables
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous. A weight that represents how common people with these exact age and racial demographics are in the United States.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous. Numerical representation of education level.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. (“civ” means “civilian” (not in the military) and “AF” means “Armed Forces” (in the military).)
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous. (Income from the sale of a capital asset, e.g., stocks or property.)
capital-loss: continuous. (A loss incurred when a capital asset, e.g., stocks or property, decreases in value.)
hours-per-week: continuous. Number of hours worked per week.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US (Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holland-Netherlands.
income: whether annual income from all sources is above or below $50,000.
Evaluation Guidelines
Your submission will be assessed in each of the following areas as either Outstanding, Good, Acceptable, Needs Work, or Inadequate. The questions below will be used as a guide to determine the quality of your analysis in each area.
Predictive Accuracy Evaluation
- Are the submitted predictions in the appropriate form?
- Are appropriate modeling techniques employed given the data and the task?
- Are the predictions reasonably accurate given the available data?
- Where do the predictions rank in terms of the class?
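For reference, accuracy here means the share of predictions that match the true label. A minimal sketch with made-up labels and predictions (not drawn from the census files):

```r
# Made-up labels and predictions for illustration only
truth <- c(1, 0, 0, 1, 0, 1)
pred  <- c(1, 0, 1, 1, 0, 0)

# Accuracy: share of predictions that match the true label
accuracy <- mean(pred == truth)

# Confusion table: rows are predictions, columns are true labels
conf_mat <- table(predicted = pred, actual = truth)
print(conf_mat)
print(accuracy)
```

Since the test labels are withheld, a calculation like this can only be run on labeled rows that you set aside from the training data.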
Writing Quality Evaluation
- Is the report well-organized?
- Is the modeling process described in a clear, logical, and engaging manner?
- Are the modeling choices clearly justified?
- Is the report free of spelling and grammatical errors?
- Is the report aimed at the appropriate (technical) audience?
Data Analysis Evaluation
- Are the mathematical and statistical methods described accurately?
- Are appropriate predictor variables used in the analysis?
- Are the reasons for excluding and/or transforming the data sensible and well-explained?
- How was the model chosen and validated?
- How were the parameters of the model tuned (if applicable)?
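One common way to address the validation question is a simple holdout split. The sketch below uses simulated data and a logistic regression purely as a stand-in; the 80/20 split, the 0.5 cutoff, and the model itself are assumptions for illustration, not a recommended approach for this data set.

```r
set.seed(1)  # reproducible split

# Simulated stand-in data: one numeric predictor and a 0/1 outcome
n   <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.5 * sim$x))

# Hold out 20% of the labeled rows for validation
val_idx <- sample(n, size = 0.2 * n)
fit <- glm(y ~ x, data = sim[-val_idx, ], family = binomial)

# Predicted probabilities on the held-out rows, cut at 0.5
val_prob <- predict(fit, newdata = sim[val_idx, ], type = "response")
val_pred <- as.integer(val_prob > 0.5)

# Validation accuracy: share of held-out rows predicted correctly
val_accuracy <- mean(val_pred == sim$y[val_idx])
print(val_accuracy)
```

The same pattern extends to comparing several candidate models or tuning parameter values: fit each on the training portion and compare accuracy on the held-out portion, or use cross-validation for a more stable estimate.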
Submitting Your Work
You must submit two things:
- Your report must be pushed to GitHub.
- The csv file containing your predictions must be uploaded to Canvas.