\(K\)NN

Author

Your name here!

Published

February 10, 2025

Abstract:

This is a technical blog post, published as both an HTML file and a .qmd source file hosted on GitHub Pages.

0. Quarto Type-setting

  • This document is rendered with Quarto, and configured to embed images using the embed-resources option in the header.
  • If you wish to use a similar header, here is the format specification for this document:
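
The header itself isn’t reproduced in this rendering; a minimal Quarto header along these lines would behave the same way, with the field values below reconstructed from the metadata above rather than copied from the source:

---
title: "KNN"
author: "Your name here!"
date: "2025-02-10"
format:
  html:
    embed-resources: true
---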

1. Setup

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))

2. \(K\)NN Concepts

TODO: Explain how the choice of \(K\) affects the quality of your prediction when using a \(K\) Nearest Neighbors algorithm.

3. Feature Engineering

  1. Create a version of the year column that is a factor (instead of numeric).
  2. Create dummy variables that indicate the presence of “cherry”, “chocolate”, and “earth” in the description.
  • Take care to handle upper- and lower-case characters.
  3. Create 3 new features that represent the interaction between time and the cherry, chocolate, and earth indicators.
  4. Remove the description column from the data.
# your code here
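
One way these steps might look, offered as a sketch rather than a reference solution: the dataframe name wino and the new column names (year_f, cherry_year, and so on) are arbitrary choices, and the year and description columns are assumed to exist as described above.

wino <- wine %>%
  mutate(year_f = factor(year)) %>%                            # step 1: year as a factor
  mutate(description = str_to_lower(description)) %>%          # step 2: normalize case before matching
  mutate(cherry    = as.integer(str_detect(description, "cherry")),
         chocolate = as.integer(str_detect(description, "chocolate")),
         earth     = as.integer(str_detect(description, "earth"))) %>%
  mutate(cherry_year    = cherry * year,                       # step 3: time-by-flavor interactions
         chocolate_year = chocolate * year,
         earth_year     = earth * year) %>%
  select(-description)                                         # step 4: drop the raw text column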

4. Preprocessing

  1. Preprocess the dataframe from the previous code block using BoxCox, centering and scaling of the numeric features
  2. Create dummy variables for the year factor column
# your code here
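
A sketch of one possible preprocessing pass with caret, assuming the engineered dataframe from the previous section is named wino and its factor column is named year_f:

# BoxCox, center, and scale the numeric columns (preProcess leaves factor columns alone)
numeric_prep <- preProcess(wino, method = c("BoxCox", "center", "scale"))
wino <- predict(numeric_prep, wino)

# expand the year factor into dummy columns and reattach them
year_dummies <- dummyVars(~ year_f, data = wino)
wino <- bind_cols(select(wino, -year_f), as_tibble(predict(year_dummies, newdata = wino)))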

5. Running \(K\)NN

  1. Split the dataframe into an 80/20 training and test set
  2. Use caret to run a \(K\)NN model that uses our engineered features to predict province
  • use 5-fold cross validated subsampling
  • allow caret to try 15 different values for \(K\)
  3. Display the confusion matrix on the test data
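
One possible way to wire this up, again as a sketch: it assumes the preprocessed dataframe is still named wino, the outcome column is province, and the seed value is an arbitrary choice.

wino <- wino %>% mutate(province = factor(province))   # make sure the outcome is a factor

set.seed(505)                                          # arbitrary seed for reproducibility

# 1. stratified 80/20 split on the outcome
split_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train_set <- wino[split_index, ]
test_set  <- wino[-split_index, ]

# 2. KNN with 5-fold cross-validation, letting caret try 15 values of K
fit <- train(province ~ .,
             data = train_set,
             method = "knn",
             tuneLength = 15,
             trControl = trainControl(method = "cv", number = 5))

# 3. confusion matrix on the held-out 20%
confusionMatrix(predict(fit, test_set), test_set$province)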

6. Kappa

How do we determine whether a Kappa value represents a good, bad or some other outcome?

TODO: Explain
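
For reference, the statistic itself can be read directly off the caret confusion matrix from the previous section (this assumes the fit and test_set objects from the sketch above):

cm <- confusionMatrix(predict(fit, test_set), test_set$province)
cm$overall["Kappa"]   # caret reports Kappa alongside Accuracy in the overall slot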

7. Improvement

How can we interpret the confusion matrix, and how can we improve our predictions?

TODO: Explain