Feature selection (correlation, linear / logistic coefficients, frequent words, frequent words by class, etc.).
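As a quick illustration of the last item, here is a minimal sketch of "frequent words by class" (my example, assuming the wine data frame with description and province columns used later in these slides):

library(tidytext)
library(dplyr)

wine %>%
  unnest_tokens(word, description) %>%   # one row per word
  anti_join(stop_words) %>%              # drop "the", "and", ...
  count(province, word, sort = TRUE) %>% # word counts by class
  group_by(province) %>%
  slice_max(n, n = 10)                   # ten most frequent words per class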
Practice
Practice midterm out.
You will go over it on 10 Mar.
It is based on the 5 homeworks.
It also draws on the topics listed on the prior slide.
Little to no computational linguistics.
I regard tidytext as extension content, not core content.
Modality Update
A reminder for me, in case we haven’t set the modality yet.
First Model Due 3/10
Publish
Each group should create:
An annotated .*md file (.Rmd, .qmd, or .md), and
The .rds/.pickle/.parquet file that it generates, which
Contains only the features you want in the model (a minimal save sketch follows this list).
Under version control, on GitHub.
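For example, a minimal sketch of the R deliverable, assuming the wine_words() helper defined later in these slides (the filename is hypothetical):

wino <- wine_words(wine, 1000, TRUE)  # only the features you want in the model
saveRDS(wino, "model.rds")            # hypothetical filename; commit it alongside the .*md file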
Constraints
I will run:
The specified KNN or Naive Bayes model (sketched after this list),
With: province ~ . (or the whole data frame in scikit-learn),
With repeated 5-fold cross validation
With the same index for partitioning training and test sets for every group.
On whatever is turned in before class.
Bragging rights for highest Kappa
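Concretely, the run might look like the sketch below. This is my reconstruction, not the actual grading script; the seed, filename, and repeats value are hypothetical.

library(caret)
set.seed(505)                                      # hypothetical seed, shared across groups
wino  <- readRDS("model.rds")                      # your submitted feature file
index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train <- wino[index, ]
test  <- wino[-index, ]
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
fit <- train(province ~ ., data = train,
             method = "naive_bayes",               # or the specified KNN method
             trControl = control)
confusionMatrix(predict(fit, test), factor(test$province))$overall["Kappa"]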
Context
The “final exam” is that during the last class you will present your model results as though you are speaking to the managers of a large winery.
It should be presented as a Quarto presentation hosted on GitHub or, for example, RPubs.
You must present from the in-room “teaching machine” computer, not your own device, to ensure that your findings are distributable rather than tied to your own setup.
wine_words <- function(df, j, stem) {
  # Tokenize descriptions; drop stop words and domain words with no signal
  words <- df %>%
    unnest_tokens(word, description) %>%
    anti_join(stop_words) %>%
    filter(!(word %in% c("wine", "pinot", "vineyard")))
  if (stem) {
    words <- words %>% mutate(word = wordStem(word))
  }
  # One logical column per retained word: does it appear in the description?
  words %>%
    count(id, word) %>%
    group_by(id) %>%
    mutate(exists = (n > 0)) %>%
    ungroup() %>%
    group_by(word) %>%
    mutate(total = sum(n)) %>%
    filter(total > j) %>%   # keep only words used more than j times overall
    pivot_wider(id_cols = id, names_from = word, values_from = exists,
                values_fill = list(exists = 0)) %>%
    right_join(select(df, id, province)) %>%
    select(-id) %>%
    mutate(across(-province, ~ replace_na(.x, FALSE)))
}
Split It
wino <- wine_words(wine, 1000, TRUE)
wine_index <- createDataPartition(wino$province, p = 0.80, list = FALSE)
train <- wino[wine_index, ]
test  <- wino[-wine_index, ]
Quick Training
control <- trainControl(method = "cv", number = 5)
fit <- train(province ~ .,
             data = train,
             trControl = control,
             method = "multinom",
             maxit = 5) # speed it up - default is 100
# weights: 186 (150 variable)
initial value 9612.789552
final value 4293.654702
stopped after 5 iterations
(nnet prints this four-line convergence report 15 more times, once per fold and tuning value, and once for the final fit on the full training set; the repeats are omitted here)
Matrix
pred <- factor(predict(fit, newdata = test))
confusionMatrix(pred, factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 181 11 1 5 2
California 38 607 5 18 11
Casablanca_Valley 0 1 7 0 0
Marlborough 0 0 0 1 0
New_York 0 0 0 0 0
Oregon 19 172 13 21 13
Reference
Prediction Oregon
Burgundy 28
California 83
Casablanca_Valley 1
Marlborough 0
New_York 0
Oregon 435
Overall Statistics
Accuracy : 0.7358
95% CI : (0.714, 0.7568)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5831
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.7605 0.7674 0.269231
Specificity 0.9672 0.8243 0.998786
Pos Pred Value 0.7939 0.7966 0.777778
Neg Pred Value 0.9606 0.7980 0.988582
Prevalence 0.1423 0.4728 0.015541
Detection Rate 0.1082 0.3628 0.004184
Detection Prevalence 0.1363 0.4555 0.005380
Balanced Accuracy 0.8639 0.7958 0.634008
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.0222222 0.00000 0.7952
Specificity 1.0000000 1.00000 0.7886
Pos Pred Value 1.0000000 NaN 0.6464
Neg Pred Value 0.9736842 0.98446 0.8880
Prevalence 0.0268978 0.01554 0.3270
Detection Rate 0.0005977 0.00000 0.2600
Detection Prevalence 0.0005977 0.00000 0.4023
Balanced Accuracy 0.5111111 0.50000 0.7919
# weights: 186 (150 variable)
initial value 105991.531402
final value 48035.564209
stopped after 5 iterations
(the remaining 15 convergence reports from this second multinom run, whose code chunk is not shown, are omitted)
Matrix
pred <- factor(predict(fit, newdata = test))
confusionMatrix(pred, factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 221 98 1 6 6
California 0 211 0 1 2
Casablanca_Valley 2 20 18 0 5
Marlborough 2 135 0 23 5
New_York 0 16 0 1 5
Oregon 13 311 7 14 3
Reference
Prediction Oregon
Burgundy 60
California 3
Casablanca_Valley 3
Marlborough 21
New_York 2
Oregon 458
Overall Statistics
Accuracy : 0.5595
95% CI : (0.5353, 0.5834)
No Information Rate : 0.4728
P-Value [Acc > NIR] : 7.657e-13
Kappa : 0.408
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.9286 0.2668 0.69231
Specificity 0.8808 0.9932 0.98179
Pos Pred Value 0.5638 0.9724 0.37500
Neg Pred Value 0.9867 0.6016 0.99508
Prevalence 0.1423 0.4728 0.01554
Detection Rate 0.1321 0.1261 0.01076
Detection Prevalence 0.2343 0.1297 0.02869
Balanced Accuracy 0.9047 0.6300 0.83705
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.51111 0.192308 0.8373
Specificity 0.89988 0.988464 0.6909
Pos Pred Value 0.12366 0.208333 0.5682
Neg Pred Value 0.98521 0.987265 0.8973
Prevalence 0.02690 0.015541 0.3270
Detection Rate 0.01375 0.002989 0.2738
Detection Prevalence 0.11118 0.014345 0.4818
Balanced Accuracy 0.70549 0.590386 0.7641
# weights: 186 (150 variable)
initial value 29052.125562
final value 14594.434569
stopped after 5 iterations
(the remaining 15 convergence reports from this third multinom run are omitted)
Matrix
pred <- factor(predict(fit, newdata = test))
confusionMatrix(pred, factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 182 11 0 2 1
California 16 674 7 13 14
Casablanca_Valley 1 3 15 0 2
Marlborough 1 7 0 19 2
New_York 1 4 0 0 4
Oregon 37 92 4 11 3
Reference
Prediction Oregon
Burgundy 28
California 93
Casablanca_Valley 2
Marlborough 3
New_York 1
Oregon 420
Overall Statistics
Accuracy : 0.7854
95% CI : (0.765, 0.8049)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6639
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.7647 0.8521 0.576923
Specificity 0.9707 0.8379 0.995143
Pos Pred Value 0.8125 0.8250 0.652174
Neg Pred Value 0.9614 0.8633 0.993333
Prevalence 0.1423 0.4728 0.015541
Detection Rate 0.1088 0.4029 0.008966
Detection Prevalence 0.1339 0.4883 0.013748
Balanced Accuracy 0.8677 0.8450 0.786033
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.42222 0.153846 0.7678
Specificity 0.99201 0.996357 0.8694
Pos Pred Value 0.59375 0.400000 0.7407
Neg Pred Value 0.98416 0.986771 0.8852
Prevalence 0.02690 0.015541 0.3270
Detection Rate 0.01136 0.002391 0.2510
Detection Prevalence 0.01913 0.005977 0.3389
Balanced Accuracy 0.70712 0.575102 0.8186
Boosting
A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a “rule of thumb” for that set of races (such as, “Bet on the horse that has recently won the most races” or “Bet on the horse with the most favored odds”). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert’s opinion on different collections of races, the gambler is able to extract many rules of thumb.
Boosting
In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler:
First, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful?
Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule?
Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb, in a manner similar to that suggested above.
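To make the two answers concrete (reweight the data to "choose the collections," then combine the rules by a weighted vote), here is a toy AdaBoost-style sketch on synthetic data. It is my illustration of the idea, not the gbm model fit below.

set.seed(505)
n <- 200
x <- runif(n)
y <- ifelse(x > 0.5, 1, -1)        # the true (unknown) rule
flip <- sample(n, 20)
y[flip] <- -y[flip]                # add some label noise

w <- rep(1 / n, n)                 # start with uniform example weights
cuts <- numeric(5); alphas <- numeric(5)
for (m in 1:5) {
  # weak learner: the threshold "rule of thumb" with lowest weighted error
  grid <- seq(0.05, 0.95, by = 0.05)
  errs <- sapply(grid, function(t) sum(w * (ifelse(x > t, 1, -1) != y)))
  cuts[m] <- grid[which.min(errs)]
  pred <- ifelse(x > cuts[m], 1, -1)
  err  <- sum(w * (pred != y))
  alphas[m] <- 0.5 * log((1 - err) / err)  # better rules get a bigger vote
  w <- w * exp(-alphas[m] * y * pred)      # upweight the examples this rule missed
  w <- w / sum(w)
}
# the combined rule: a weighted vote of the five rules of thumb
vote <- sapply(1:5, function(m) alphas[m] * ifelse(x > cuts[m], 1, -1))
mean(sign(rowSums(vote)) == y)             # ensemble accuracy on the data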
fit <- train(province ~ .,
             data = train,
             method = "gbm",
             trControl = control,
             tuneGrid = gbm_grid)
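The tuning grid gbm_grid is used above but never defined in this section. A definition consistent with the StepSize values printed below (0.0100 and 0.1000) might look like the following; the n.trees, interaction.depth, and n.minobsinnode values are guesses:

gbm_grid <- expand.grid(
  n.trees = c(100, 500),        # guess
  interaction.depth = c(1, 2),  # guess
  shrinkage = c(0.01, 0.1),     # consistent with the printed StepSize
  n.minobsinnode = 10           # the gbm default
)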
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.7918 nan 0.0100 0.0337
2 1.7734 nan 0.0100 0.0323
3 1.7555 nan 0.0100 0.0310
4 1.7384 nan 0.0100 0.0305
5 1.7219 nan 0.0100 0.0292
6 1.7059 nan 0.0100 0.0282
7 1.6905 nan 0.0100 0.0277
8 1.6753 nan 0.0100 0.0269
9 1.6607 nan 0.0100 0.0263
10 1.6464 nan 0.0100 0.0257
(gbm prints one such table for every fold and tuning combination; the remaining tables, with StepSize 0.0100 and 0.1000, are omitted here)
Confusion Matrix
pred <- predict(fit, newdata = test)
confusionMatrix(factor(pred), factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 147 4 0 2 3
California 26 681 23 19 21
Casablanca_Valley 0 1 3 0 0
Marlborough 1 2 0 6 0
New_York 0 0 0 0 0
Oregon 64 103 0 18 2
Reference
Prediction Oregon
Burgundy 14
California 114
Casablanca_Valley 0
Marlborough 2
New_York 0
Oregon 417
Overall Statistics
Accuracy : 0.7496
95% CI : (0.7281, 0.7702)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5944
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.61765 0.8609 0.115385
Specificity 0.98397 0.7698 0.999393
Pos Pred Value 0.86471 0.7704 0.750000
Neg Pred Value 0.93945 0.8606 0.986219
Prevalence 0.14226 0.4728 0.015541
Detection Rate 0.08787 0.4071 0.001793
Detection Prevalence 0.10161 0.5284 0.002391
Balanced Accuracy 0.80081 0.8154 0.557389
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.133333 0.00000 0.7623
Specificity 0.996929 1.00000 0.8339
Pos Pred Value 0.545455 NaN 0.6904
Neg Pred Value 0.976534 0.98446 0.8784
Prevalence 0.026898 0.01554 0.3270
Detection Rate 0.003586 0.00000 0.2493
Detection Prevalence 0.006575 0.00000 0.3610
Balanced Accuracy 0.565131 0.50000 0.7981
Random Forest
To my mind, by far the most popular ensemble method is the random forest (a direct sketch follows these bullets).
A bagged tree with random feature selection at each split.
It just works.
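To see both ingredients directly, here is an equivalent-in-spirit call to the underlying randomForest package (a sketch assuming the train split from earlier is in scope). The mtry argument is the "random feature selection" knob: how many predictors are sampled at each split.

library(randomForest)
rf <- randomForest(factor(province) ~ ., data = train,
                   ntree = 500, # number of bagged trees
                   mtry = 2)    # predictors sampled at each split
rf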
fit <- train(province ~ .,
             data = train,
             trControl = control,
             method = "rf",
             maxit = 5) # left over from the multinom call; rf does not use maxit
fit
Random Forest
6707 samples
29 predictor
6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 5365, 5366, 5365, 5367, 5365
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8045275 0.6836790
15 0.7952852 0.6786598
29 0.7803768 0.6564252
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Confusion Matrix
pred <-predict(fit, newdata=test)confusionMatrix(factor(pred),factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 186 3 0 2 1
California 15 715 21 24 22
Casablanca_Valley 0 0 1 0 0
Marlborough 0 0 0 1 0
New_York 0 0 0 0 0
Oregon 37 73 4 18 3
Reference
Prediction Oregon
Burgundy 14
California 102
Casablanca_Valley 0
Marlborough 0
New_York 0
Oregon 431
Overall Statistics
Accuracy : 0.7974
95% CI : (0.7773, 0.8164)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.672
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.7815 0.9039 0.0384615
Specificity 0.9861 0.7914 1.0000000
Pos Pred Value 0.9029 0.7953 1.0000000
Neg Pred Value 0.9646 0.9018 0.9850478
Prevalence 0.1423 0.4728 0.0155409
Detection Rate 0.1112 0.4274 0.0005977
Detection Prevalence 0.1231 0.5374 0.0005977
Balanced Accuracy 0.8838 0.8477 0.5192308
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.0222222 0.00000 0.7879
Specificity 1.0000000 1.00000 0.8801
Pos Pred Value 1.0000000 NaN 0.7615
Neg Pred Value 0.9736842 0.98446 0.8952
Prevalence 0.0268978 0.01554 0.3270
Detection Rate 0.0005977 0.00000 0.2576
Detection Prevalence 0.0005977 0.00000 0.3383
Balanced Accuracy 0.5111111 0.50000 0.8340
Model Comparison
# "doParallel"cl <-makePSOCKcluster(3)registerDoParallel(cl)system.time({ tb_fit <-train(province~., data = train, method="treebag", trControl=control,maxit =5); bg_fit <-train(province~., data = train, method="gbm",trControl=control, tuneGrid = gbm_grid) rf_fit <-train(province~., data = train, method="rf", trControl=control,maxit =5);})
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.7918 nan 0.1000 0.3881
2 1.5580 nan 0.1000 0.2784
3 1.3949 nan 0.1000 0.2072
4 1.2726 nan 0.1000 0.1588
5 1.1777 nan 0.1000 0.1238
6 1.1022 nan 0.1000 0.1066
7 1.0374 nan 0.1000 0.0878
8 0.9832 nan 0.1000 0.0763
9 0.9376 nan 0.1000 0.0657
10 0.8978 nan 0.1000 0.0563
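The timings are only half the story. Assuming the three fits above succeed, caret's resamples() lines the models up on their cross-validation folds; this follow-up is standard caret usage, not code from the slide:

results <- resamples(list(treebag = tb_fit, gbm = bg_fit, rf = rf_fit))
summary(results)   # Accuracy and Kappa distributions per model
dotplot(results)   # visual comparison
stopCluster(cl)    # release the parallel workers when done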