\(K\)NN

Author

Calvin Deutschbein

Published

February 10, 2025

Abstract:

This technical blog post is available as both an HTML file and a .qmd file hosted on GitHub Pages.

0. Quarto Type-setting

  • This document is rendered with Quarto, and configured to embed any images using the embed-resources option in the header.
  • If you wish to use a similar header, here’s the format specification for this document:

1. Setup

sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(caret))
sh(library(fastDummies))
sh(library(class))
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))

2. \(K\)NN Concepts

Generally, we regard selecting an appropriate \(K\) as similar to the precision and recall trade-off discussed in an earlier post. With \(K\)NN, however, the trade-off is slightly different: it is between bias, where the model overgeneralizes and begins to lose nuance in niche cases, and variance, where the model captures noise within the data set and extrapolates it to the population. Small \(K\) values tend toward high variance, and larger \(K\) values tend toward high bias.
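To make the two extremes concrete, here is a minimal sketch using the built-in iris data rather than the wine data. The deliberately imbalanced subset and the variable names are my own assumptions for illustration. With \(K = 1\) every training point is its own nearest neighbor, so the model memorizes the training set; with \(K\) equal to the number of training rows, every prediction is a vote over the whole training set and collapses to the majority class.

```r
library(class)

# Assumption: an imbalanced subset of iris stands in for the wine data
# 50 setosa, 40 versicolor, 30 virginica
idx <- c(1:50, 51:90, 101:130)
x <- scale(iris[idx, 1:4])
y <- iris$Species[idx]

# K = 1: each point is its own nearest neighbor (high variance)
pred_k1 <- knn(train = x, test = x, cl = y, k = 1)
acc_k1 <- mean(pred_k1 == y)

# K = n: every prediction is the majority class (high bias)
pred_kn <- knn(train = x, test = x, cl = y, k = nrow(x))
acc_kn <- mean(pred_kn == y)

c(k1 = acc_k1, kn = acc_kn)
table(pred_kn)
```

Training accuracy at \(K = 1\) looks perfect only because the model is being scored on points it has memorized, which is why the post relies on cross-validation to choose \(K\).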

3. Feature Engineering

  1. Create a version of the year column that is a factor (instead of numeric).
  2. Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description.
  • Take care to handle upper and lower case characters.
  3. Create 3 new features that represent the interaction between time and the cherry, chocolate and earth indicators.
  4. Remove the description column from the data.
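wine <- wine %>%
  # Keep the numeric year and add a factor copy
  mutate(fct_year = factor(year)) %>%
  # Lowercase the text so matching ignores case
  mutate(description = tolower(description)) %>%
  # Logical indicators for each keyword
  mutate(cherry = str_detect(description, "cherry"),
         chocolate = str_detect(description, "chocolate"),
         earth = str_detect(description, "earth")) %>%
  # Interactions: the year where the keyword appears, 0 otherwise
  mutate(cherry_year = year*cherry,
         chocolate_year = year*chocolate,
         earth_year = year*earth) %>%
  select(-description)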
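As a quick sanity check of the case handling and the interaction terms, here is a small sketch on made-up reviews. The toy data frame is hypothetical and only for illustration. It shows two ways to ignore case (lowercasing first, as above, or passing a case-insensitive regex to str_detect) and what multiplying a year by a logical indicator produces.

```r
library(dplyr)
library(stringr)

# Hypothetical toy reviews standing in for the wine data
toy <- data.frame(
  year = c(2010, 2012, 2015),
  description = c("Ripe Cherry and spice",
                  "Dark CHOCOLATE finish",
                  "Clean and bright")
)

toy <- toy %>%
  mutate(
    # Option 1: normalize case first, as in the post
    cherry = str_detect(tolower(description), "cherry"),
    # Option 2: keep the text as-is and match case-insensitively
    chocolate = str_detect(description, regex("chocolate", ignore_case = TRUE)),
    # TRUE/FALSE coerce to 1/0, so this is the year or 0
    cherry_year = year * cherry
  )

toy
```

Either option avoids missing “Cherry” at the start of a sentence; lowercasing once is simpler when several keywords are checked against the same column.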

4. Preprocessing

  1. Preprocess the dataframe from the previous code block using BoxCox, centering and scaling of the numeric features
  2. Create dummy variables for the year factor column
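wine <- wine %>%
  # Estimate the transformations from the numeric columns
  preProcess(method = c("BoxCox", "center", "scale")) %>%
  # Apply them to the same data frame
  predict(wine) %>%
  # One-hot encode the year factor, dropping the most common year as the baseline
  dummy_cols(select_columns = "fct_year",
             remove_most_frequent_dummy = TRUE,
             remove_selected_columns = TRUE)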
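To see what these two steps do in isolation, here is a minimal sketch on a made-up data frame (the column values are invented for illustration). After preprocessing, each numeric column should have mean 0 and standard deviation 1, the factor column should be untouched, and dummy_cols should replace it with one indicator column per year except the most frequent one.

```r
library(caret)
library(fastDummies)

# Hypothetical toy data; 2010 is the most frequent year
toy <- data.frame(
  price = c(10, 15, 20, 35, 60, 120),
  points = c(85, 88, 90, 91, 93, 95),
  fct_year = factor(c(2010, 2010, 2010, 2011, 2012, 2012))
)

# Learn the transformations, then apply them
pp <- preProcess(toy, method = c("BoxCox", "center", "scale"))
scaled <- predict(pp, toy)

# Numeric columns are now centered and scaled
round(colMeans(scaled[, c("price", "points")]), 10)
sapply(scaled[, c("price", "points")], sd)

# Replace the factor with indicator columns
out <- dummy_cols(scaled,
                  select_columns = "fct_year",
                  remove_most_frequent_dummy = TRUE,
                  remove_selected_columns = TRUE)
names(out)
```

Scaling matters for \(K\)NN in particular because neighbors are found by distance, so an unscaled column with a large range would otherwise dominate.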

5. Running \(K\)NN

  1. Split the dataframe into an 80/20 training and test set
  2. Use Caret to run a \(K\)NN model that uses our engineered features to predict province
  • use 5-fold cross validated subsampling
  • allow Caret to try 15 different values for \(K\)
  3. Display the confusion matrix on the test data
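# Stratified 80/20 split on the outcome
split <- createDataPartition(wine$province, p = 0.8, list = FALSE)
train <- wine[split, ]
test <- wine[-split, ]
# Tune K over 15 candidate values with 5-fold cross-validation, selecting on Kappa
fit <- train(province ~ .,
             data = train,
             method = "knn",
             tuneLength = 15,
             metric = "Kappa",
             trControl = trainControl(method = "cv", number = 5))
# Evaluate on the held-out test set
confusionMatrix(predict(fit, test), factor(test$province))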
Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York
  Burgundy               104         23                 2           3        0
  California              66        674                10          15       14
  Casablanca_Valley        0          0                 0           0        0
  Marlborough              1          0                 0           0        0
  New_York                 1          1                 2           3        0
  Oregon                  66         93                12          24       12
                   Reference
Prediction          Oregon
  Burgundy              31
  California           254
  Casablanca_Valley      0
  Marlborough            0
  New_York               1
  Oregon               261

Overall Statistics
                                          
               Accuracy : 0.621           
                 95% CI : (0.5973, 0.6444)
    No Information Rate : 0.4728          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3712          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity                  0.43697            0.8521                  0.00000
Specificity                  0.95889            0.5930                  1.00000
Pos Pred Value               0.63804            0.6525                      NaN
Neg Pred Value               0.91126            0.8172                  0.98446
Prevalence                   0.14226            0.4728                  0.01554
Detection Rate               0.06216            0.4029                  0.00000
Detection Prevalence         0.09743            0.6175                  0.00000
Balanced Accuracy            0.69793            0.7225                  0.50000
                     Class: Marlborough Class: New_York Class: Oregon
Sensitivity                   0.0000000        0.000000        0.4771
Specificity                   0.9993857        0.995143        0.8162
Pos Pred Value                0.0000000        0.000000        0.5577
Neg Pred Value                0.9730861        0.984384        0.7627
Prevalence                    0.0268978        0.015541        0.3270
Detection Rate                0.0000000        0.000000        0.1560
Detection Prevalence          0.0005977        0.004782        0.2797
Balanced Accuracy             0.4996929        0.497571        0.6467
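The output above does not show which values of \(K\) were tried or which one was kept. Here is a hedged sketch of how one might inspect that, using the built-in iris data and a shorter tuneLength so it runs quickly; the seed is arbitrary. As far as I know, caret's default grid for this method is the odd values starting at 5, so a tuneLength of 5 tries 5 through 13 and the tuneLength of 15 used above would try 5 through 33.

```r
library(caret)

set.seed(505)  # arbitrary seed so the folds are reproducible
fit_iris <- train(Species ~ .,
                  data = iris,
                  method = "knn",
                  tuneLength = 5,
                  metric = "Kappa",
                  trControl = trainControl(method = "cv", number = 5))

# One row per candidate K with its cross-validated scores
fit_iris$results[, c("k", "Accuracy", "Kappa")]

# The K that caret kept for the final model
fit_iris$bestTune$k
```

Running the same two inspection lines on the wine model would show how large the selected \(K\) is relative to the smallest provinces.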

6. Kappa

How do we determine whether a Kappa value represents a good, bad or some other outcome?

In my training, I was taught to regard Kappa values as falling within five “bins”, ranging from “not good” to “suspiciously good”:

  • [0.0,0.2): Unusable
  • [0.2,0.4): Bad
  • [0.4,0.6): Okay
  • [0.6,0.8): Excellent
  • [0.8,1.0]: Suspicious, likely overfit.
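It can help to see where the Kappa value comes from. The sketch below recomputes it by hand from the confusion matrix printed earlier, using the usual definition of Cohen's Kappa: observed agreement minus the agreement expected by chance, divided by one minus the chance agreement. The variable names are my own, and the bin labels simply restate the list above.

```r
# Confusion matrix copied from the output above
# (rows = predictions, columns = reference)
classes <- c("Burgundy", "California", "Casablanca_Valley",
             "Marlborough", "New_York", "Oregon")
cm <- matrix(c(104,  23,  2,  3,  0,  31,
                66, 674, 10, 15, 14, 254,
                 0,   0,  0,  0,  0,   0,
                 1,   0,  0,  0,  0,   0,
                 1,   1,  2,  3,  0,   1,
                66,  93, 12, 24, 12, 261),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
# Observed agreement: the share of wines on the diagonal (the accuracy)
p_observed <- sum(diag(cm)) / n
# Chance agreement: product of the row and column marginals, summed over classes
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (p_observed - p_expected) / (1 - p_expected)

# Place the value in one of the five bins
bin <- cut(kappa_hat,
           breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
           labels = c("Unusable", "Bad", "Okay", "Excellent", "Suspicious"),
           right = FALSE, include.lowest = TRUE)

round(c(accuracy = p_observed, chance = p_expected, kappa = kappa_hat), 4)
bin
```

The accuracy of about 0.62 looks less impressive once roughly 0.40 of it is attributed to chance agreement, which is how the model lands in the “Bad” bin.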

7. Improvement

How can we interpret the confusion matrix, and how can we improve in our predictions?

For me, the confusion between Californian and Oregonian wines specifically both jumps out numerically and is consistent with my own understanding of the world. California and Oregon share a border on the Pacific coast of the United States, and are likely planting in similar volcanic soil in temperate climate zones. They likely even experience similar rainfall! To differentiate these two easily confused classes of wine, I think I should look into dedicated features that specifically capture the essence of the difference between Californian and Oregonian wines.

Secondly, I note that almost no wines are predicted to be from Marlborough or Casablanca Valley, which isn’t too surprising with a \(K\) getting pretty close to the number of wines from those regions as a whole! I would need either more data or more advanced numerical techniques to differentiate wines from these regions from the overwhelmingly popular California and Oregon wines.
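To put numbers on both observations, here is a small sketch that works directly from the confusion matrix printed earlier; the variable names are my own. It computes per-class recall (the “Sensitivity” row in the output), the share of Oregon wines that were labeled as California, and how often each province was predicted at all.

```r
# Confusion matrix copied from the output above
# (rows = predictions, columns = reference)
classes <- c("Burgundy", "California", "Casablanca_Valley",
             "Marlborough", "New_York", "Oregon")
cm <- matrix(c(104,  23,  2,  3,  0,  31,
                66, 674, 10, 15, 14, 254,
                 0,   0,  0,  0,  0,   0,
                 1,   0,  0,  0,  0,   0,
                 1,   1,  2,  3,  0,   1,
                66,  93, 12, 24, 12, 261),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Recall per class: correct predictions over actual wines from that province
recall <- setNames(diag(cm) / colSums(cm), classes)

# Share of actual Oregon wines that were predicted as California
oregon_as_california <- cm["California", "Oregon"] / sum(cm[, "Oregon"])

# How many times each province was predicted at all
predicted_counts <- rowSums(cm)

round(recall, 4)
round(oregon_as_california, 4)
predicted_counts
```

Nearly as many Oregon wines are labeled California as are labeled correctly, and Casablanca Valley is never predicted at all, which matches the two observations above.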