01:00
Complete all the preparation work (readings and videos) before class.
Ask questions, come to office hours and help session.
Do the homework; get started on homework early when possible.
Don’t procrastinate and don’t let a week pass by with lingering questions.
Stay up-to-date on announcements on Canvas and sent via email.
If you email me about an error please include a screenshot of the error and the code causing the error.
Raise your hand or email me.
Tom M. Mitchell (1998):
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
MNIST handwritten digits (from ISLR, James et al.)
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. According to Tom Mitchell’s definition, which of the following is the task T, experience E, and performance measure P in this setting?
01:00
Ames Housing dataset - Contains data on 881 houses in Ames, IA. We are interested in predicting sale price.
The first ten observations are shown below.
Sale_Price | Gr_Liv_Area | Garage_Type | Garage_Cars | Garage_Area | Street | Utilities | Pool_Area | Neighborhood |
---|---|---|---|---|---|---|---|---|
244000 | 2110 | Attchd | 2 | 522 | Pave | AllPub | 0 | North_Ames |
213500 | 1338 | Attchd | 2 | 582 | Pave | AllPub | 0 | Stone_Brook |
185000 | 1187 | Attchd | 2 | 420 | Pave | AllPub | 0 | Gilbert |
394432 | 1856 | Attchd | 3 | 834 | Pave | AllPub | 0 | Stone_Brook |
190000 | 1844 | Attchd | 2 | 546 | Pave | AllPub | 0 | Northwest_Ames |
149000 | NA | Attchd | 2 | 480 | Pave | AllPub | 0 | North_Ames |
149900 | NA | Attchd | 2 | 500 | Pave | AllPub | 0 | North_Ames |
127500 | 1069 | Attchd | 2 | 440 | Pave | AllPub | 0 | Northpark_Villa |
395192 | 1940 | Attchd | 3 | 606 | Pave | AllPub | 0 | Northridge_Heights |
290941 | 1544 | Attchd | 3 | 868 | Pave | AllPub | 0 | Northridge_Heights |
Default dataset - Contains credit card default data on 10,000 individuals. We are interested in predicting whether somebody will default or not.
Ten observations are shown below.
default | student | balance | income |
---|---|---|---|
No | No | 939.0985 | 45519.02 |
No | Yes | 397.5425 | 22710.87 |
Yes | No | 1511.6110 | 53506.94 |
No | No | 301.3194 | 51539.95 |
No | No | 878.4461 | 29561.78 |
Yes | No | 1673.4863 | 49310.33 |
No | No | 310.1302 | 37697.22 |
No | No | 1272.0539 | 44895.59 |
No | No | 887.2014 | 41641.45 |
No | No | 230.8689 | 32798.78 |
NA
For the Ames Housing and Default datasets:
Attchd
397.5425
Suppose you have information about 867 cancer patients on their age, tumor size, clump thickness of the tumor, uniformity of cell size, and whether the tumor is malignant or benign. Based on these data, you are interested in building a model to predict the type of tumor (malignant or benign) for future cancer patients.
Machine Learning Tasks (from Bunker and Fayez, 2017)
Supervised Learning problems can be categorized into:
US Arrests dataset - Data on arrests for 50 US states.
The first ten observations are shown below.
Murder | Assault | UrbanPop | Rape | |
---|---|---|---|---|
Alabama | 13.2 | 236 | 58 | 21.2 |
Alaska | 10.0 | 263 | 48 | 44.5 |
Arizona | 8.1 | 294 | 80 | 31.0 |
Arkansas | 8.8 | 190 | 50 | 19.5 |
California | 9.0 | 276 | 91 | 40.6 |
Colorado | 7.9 | 204 | 78 | 38.7 |
Connecticut | 3.3 | 110 | 77 | 11.1 |
Delaware | 5.9 | 238 | 72 | 15.8 |
Florida | 15.4 | 335 | 80 | 31.9 |
Georgia | 17.4 | 211 | 60 | 25.8 |
For each of the following, identify whether the problem belongs to the supervised or unsupervised learning paradigm