```r
mn_reg_model <- multinom_reg(mixture = 1, penalty = 0.005) |> # I chose this penalty arbitrarily
  set_engine("glmnet", family = "multinomial") |>
  set_mode("classification")
```
For each class \(k = 1, \ldots, K - 1\), relative to the reference class \(K\): \[\log \frac{\text{Prob. class } k}{\text{Prob. class } K} = \beta_{0k} + \beta_{1k}X_1 + \cdots + \beta_{pk}X_p\]
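A minimal usage sketch, assuming the tidymodels packages are loaded and using R's built-in `iris` data purely for illustration:

```r
library(tidymodels)

# Fit the spec above; iris has a 3-class outcome (Species)
mn_reg_fit <- mn_reg_model |>
  fit(Species ~ ., data = iris)

# Per-class predicted probabilities: one column per class, each row sums to 1
predict(mn_reg_fit, new_data = iris, type = "prob") |> head()
```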
Shuba: Micro-averaged recall is the same as accuracy… CORRECT!
The same is true of micro-averaged precision and \(F_1\)
Useful if you have a “multi-label” classification problem (not covered in this class)
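A quick numerical check, assuming the `yardstick` package and its built-in multiclass example predictions (`hpc_cv`) are available:

```r
library(yardstick)
data(hpc_cv)  # example multiclass predictions shipped with yardstick

# All four of these return the same number
accuracy(hpc_cv,  truth = obs, estimate = pred)
recall(hpc_cv,    truth = obs, estimate = pred, estimator = "micro")
precision(hpc_cv, truth = obs, estimate = pred, estimator = "micro")
f_meas(hpc_cv,    truth = obs, estimate = pred, estimator = "micro")
```

Micro-averaging pools the per-class counts (true positives, false positives, false negatives) before computing the metric, which is why everything collapses to overall accuracy in the single-label case.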
Rabin’s Question
“What does a medium sized data set mean?”
Idea behind ML: identify patterns in data and use them to make predictions
As data gets “bigger”:
Patterns become clearer
Computational complexity increases
Informal definitions:
Small data: not enough data to fully represent patterns
Big data: all the information is there, but special approaches are needed just to handle the sheer amount of data
Thinking about small data
Patterns are not fully represented in the data \(\Rightarrow\) restrict the set of possible patterns and give the model less flexibility and freedom
Logistic regression (probably with regularization; see the sketch after this list)
Support vector machines (we haven’t talked about these yet)
To a lesser extent: Decision Trees
NOT KNN
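For the first option above, a sketch of what a heavily regularized logistic regression spec could look like (the penalty value is arbitrary, just like the one in the multinomial spec earlier):

```r
# Ridge-penalized logistic regression: the penalty shrinks coefficients toward
# zero, restricting the model's flexibility -- helpful when data is small
small_data_model <- logistic_reg(penalty = 0.1, mixture = 0) |>
  set_engine("glmnet") |>
  set_mode("classification")
```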
Thinking about big data
Patterns are definitely there but size introduces computational problems
Data set can’t fit in memory
Solution 1: Use a high-memory HPC cluster node
Solution 2: Modify algorithms to use parts of the data instead of the full data set (e.g. stochastic gradient descent; see the sketch after this list)
Algorithm scales with size of data and will take too long to run/fit
Solution 1: Run in parallel if possible using HPC cluster
Solution 2: Develop faster algorithms
Implement in a compiled language like C
Develop (faster) approximate solution
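To make the stochastic gradient descent idea concrete, here is a minimal base-R sketch of mini-batch SGD for logistic regression; each update touches only a small batch of rows rather than the full data set (the learning rate, batch size, and epoch count are arbitrary illustrative choices):

```r
# Minimal mini-batch SGD for logistic regression (illustration only)
sgd_logistic <- function(X, y, lr = 0.1, batch_size = 32, epochs = 5) {
  beta <- rep(0, ncol(X))
  n <- nrow(X)
  for (epoch in seq_len(epochs)) {
    idx <- sample(n)                                # shuffle rows each epoch
    for (start in seq(1, n, by = batch_size)) {
      rows <- idx[start:min(start + batch_size - 1, n)]
      Xb <- X[rows, , drop = FALSE]
      yb <- y[rows]
      p  <- 1 / (1 + exp(-Xb %*% beta))             # predicted probabilities
      grad <- t(Xb) %*% (p - yb) / length(rows)     # gradient on this batch only
      beta <- beta - lr * as.vector(grad)           # one cheap update
    }
  }
  beta
}
```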
Curse of dimensionality
If \(p\) is too big \(\Rightarrow\) the feature space is too large and points end up far apart \(\Rightarrow\) similar impact to small data, but without the computational benefit
Note: big \(n\) vs. big \(p\) can present different issues
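A small simulation sketch of the big-\(p\) problem: as \(p\) grows, the average nearest-neighbor distance approaches the average distance between arbitrary points, so the “nearest” neighbors stop being meaningfully “near” (the sample size and dimensions below are arbitrary):

```r
set.seed(1)

nearest_vs_average <- function(p, n = 200) {
  X <- matrix(runif(n * p), nrow = n)    # n random points in the unit hypercube
  d <- as.matrix(dist(X))
  diag(d) <- NA
  c(p = p,
    nearest = mean(apply(d, 1, min, na.rm = TRUE)),  # avg nearest-neighbor distance
    average = mean(d, na.rm = TRUE))                 # avg pairwise distance
}

# The two distances converge as p increases
t(sapply(c(2, 10, 100, 1000), nearest_vs_average))
```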
Medium Data
Data that’s big enough to have (most) of the information you need but not so big that it presents computational issues
KNN sweet spot
Enough data that the “nearest neighbors” are actually “near”
Not so much data that it takes forever to make predictions
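For reference, a KNN spec in the same tidymodels style as the multinomial model above (the choice of 5 neighbors is arbitrary):

```r
# KNN classifier; works best in the "medium data" regime described above
knn_model <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")
```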