Feature selection (correlation, linear / logistic coefficients, frequent words, frequent words by class, etc.).
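As one concrete instance of correlation-based selection, caret's `findCorrelation` flags redundant numeric features. A minimal sketch on toy data (real feature names will differ):

```r
library(caret)  # assumed available, as elsewhere in the course

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)  # nearly a copy of x1
x3 <- rnorm(100)
feats <- data.frame(x1, x2, x3)

# Indices of columns whose pairwise correlation exceeds the cutoff
high <- findCorrelation(cor(feats), cutoff = 0.9)
feats <- feats[, -high, drop = FALSE]
colnames(feats)  # one of the near-duplicates is gone
```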
Practice
Practice Midterm live, on course webpage.
Exam, .qmd, Solutions, and Rubric.
We will work through it in our model groups on 3/10.

It is based on the 5 homeworks.
It is based on the prior slide.
Little to no computational linguistics
Modality Update
I will release the midterm exam Monday, 3/17, at 6 PM PT.
I expect all students to complete it by Friday, 3/21, at 10 PM PT.
It will be released digitally via GitHub Classroom.
You will have 4 hours after starting the assignment to complete it and upload your submission.
We will conduct the practice midterm over GitHub Classroom.
First Model Due 3/10
Publish
Each group should create:
An annotated .*md file, and
The .rds/.pickle/.parquet file that it generates, that
Contains only the features you want in the model.
Under version control, on GitHub.
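A minimal sketch of generating that deliverable (data frame, column names, and file name here are hypothetical):

```r
# Assume `wine` is your group's working data frame
wine <- data.frame(
  province = c("California", "Oregon"),
  points = c(91, 88),
  price = c(35, 28),
  notes = c("oaky", "bright")  # a column we decided NOT to keep
)

# Keep only the target plus the features you want in the model
features <- wine[, c("province", "points", "price")]
saveRDS(features, "group_features.rds")  # commit this alongside the annotated .*md
```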
Constraints
I will run:
The specified \(K\)NN or Naive Bayes model,
With: province ~ . (or the whole data frame in scikit)
With repeated 5-fold cross validation
With the same index for partitioning training and test sets for every group.
On whatever is turned in before class.
Bragging rights for highest Kappa
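Under those constraints, the evaluation might look roughly like this in caret. The seed, toy data, and number of repeats are my assumptions, not the official harness:

```r
library(caret)

set.seed(505)  # a fixed seed gives every group the same partition (assumed value)
# Toy stand-in for a group's submitted feature file
features <- data.frame(
  province = factor(rep(c("California", "Oregon"), each = 100)),
  points = c(rnorm(100, 90), rnorm(100, 88)),
  price  = c(rexp(100, 1/35), rexp(100, 1/28))
)

idx  <- createDataPartition(features$province, p = 0.8, list = FALSE)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
fit  <- train(province ~ ., data = features[idx, ], method = "knn", trControl = ctrl)
max(fit$results$Kappa)  # the bragging-rights number
```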
Context
The “final exam” is that during the last class you will present your model results as though you are speaking to the managers of a large winery.
I may change the target audience a bit; stay tuned.
It should be presented from a Quarto presentation on GitHub.
You must present via the in-room “teaching machine” computer, not your own physical device, to ensure that you are comfortable distributing your findings.
Group Meetings
You should have a group assignment
Meet in your groups!
Talk about your homework with your group.
Decision trees
Meme
This flow chart is also canonical
(sincere apologies but I do not think I can alt-text this)
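The confusion matrix on the next slide comes from a fitted model; a plausible fit (method, control values, and the toy stand-in data are my assumptions) would look like:

```r
library(caret)
library(rpart)  # decision-tree backend behind method = "rpart"

set.seed(505)
# Toy stand-in for the training split (the real data has more predictors)
train <- data.frame(
  province = factor(rep(c("California", "Oregon"), each = 100)),
  points = c(rnorm(100, 90), rnorm(100, 88)),
  price  = c(rexp(100, 1/35), rexp(100, 1/28))
)
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(province ~ ., data = train, method = "rpart", trControl = ctrl)
```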
pred <- predict(fit, newdata = test)
confusionMatrix(factor(pred), factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction California Oregon
California 633 176
Oregon 158 371
Accuracy : 0.7504
95% CI : (0.7263, 0.7734)
No Information Rate : 0.5912
P-Value [Acc > NIR] : <2e-16
Kappa : 0.4809
Mcnemar's Test P-Value : 0.3523
Sensitivity : 0.8003
Specificity : 0.6782
Pos Pred Value : 0.7824
Neg Pred Value : 0.7013
Prevalence : 0.5912
Detection Rate : 0.4731
Detection Prevalence : 0.6046
Balanced Accuracy : 0.7392
'Positive' Class : California
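The reported Kappa can be recomputed directly from the counts above, which makes the statistic less mysterious: it is accuracy corrected for the agreement expected by chance.

```r
# Counts from the confusion matrix above: rows = prediction, cols = reference
tab <- matrix(c(633, 158, 176, 371), nrow = 2,
              dimnames = list(c("California", "Oregon"),
                              c("California", "Oregon")))
n  <- sum(tab)                                # 1338 test wines
po <- sum(diag(tab)) / n                      # observed accuracy, ~0.7504
pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(kappa, 4)  # 0.4809, matching the output above
```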
Random Forest
fit <- train(province ~ ., data = train, method = "rf", trControl = ctrl)
fit
Random Forest
5358 samples
7 predictor
2 classes: 'California', 'Oregon'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4823, 4823, 4822, 4822, 4822, 4822, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7515905 0.4769738
4 0.7530827 0.4839936
7 0.7528962 0.4836777
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 4.
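caret chose mtry by searching a small default grid; you can control that search explicitly with `tuneGrid`. A sketch on toy stand-in data (with only two predictors, the candidate mtry values shrink accordingly):

```r
library(caret)
library(randomForest)

set.seed(505)
# Toy stand-in for the training data
train <- data.frame(
  province = factor(rep(c("California", "Oregon"), each = 100)),
  points = c(rnorm(100, 90), rnorm(100, 88)),
  price  = c(rexp(100, 1/35), rexp(100, 1/28))
)
grid <- expand.grid(mtry = c(1, 2))  # candidate predictors tried per split
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(province ~ ., data = train, method = "rf",
              trControl = ctrl, tuneGrid = grid)
fit$bestTune$mtry  # the winning value, analogous to the slide's mtry = 4
```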
Confusion Matrix
pred <- predict(fit, newdata = test)
confusionMatrix(factor(pred), factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction California Oregon
California 646 188
Oregon 145 359
Accuracy : 0.7511
95% CI : (0.727, 0.7741)
No Information Rate : 0.5912
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.4788
Mcnemar's Test P-Value : 0.02136
Sensitivity : 0.8167
Specificity : 0.6563
Pos Pred Value : 0.7746
Neg Pred Value : 0.7123
Prevalence : 0.5912
Detection Rate : 0.4828
Detection Prevalence : 0.6233
Balanced Accuracy : 0.7365
'Positive' Class : California
Pros
Easy to use and understand.
Can handle both categorical and numerical data.
Resistant to outliers, so they require little data preprocessing.
New features can be easily added.
Can be used to build larger classifiers by using ensemble methods.
Cons
Prone to overfitting.
Require some measure of how well they are performing.
Need to be careful with parameter tuning.
Can create biased learned trees if some classes dominate.
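The overfitting and tuning cautions above can be addressed concretely: for a single tree, tune the complexity parameter cp so branches that don't pay for themselves get pruned. A sketch on toy data (the cp grid is illustrative):

```r
library(caret)
library(rpart)

set.seed(505)
# Toy stand-in for the training data
train <- data.frame(
  province = factor(rep(c("California", "Oregon"), each = 100)),
  points = c(rnorm(100, 90), rnorm(100, 88)),
  price  = c(rexp(100, 1/35), rexp(100, 1/28))
)
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(province ~ ., data = train, method = "rpart",
              trControl = ctrl,
              tuneGrid = expand.grid(cp = c(0.001, 0.01, 0.1)))
fit$bestTune$cp  # larger cp => more aggressive pruning, less overfitting
```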