\(K\)NN

Author

Your name here!

Published

February 10, 2025

Abstract:

This is a technical blog post, published as both an HTML file and a .qmd source file hosted on GitHub Pages.

0. Quarto Type-setting

  • This document is rendered with Quarto, and configured to embed images using the embed-resources option in the header.
  • If you wish to use a similar header, here is the format specification for this document:
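
The header itself isn’t reproduced in this rendering; a minimal Quarto header along these lines would behave the same way, with the field values below reconstructed from the metadata above rather than copied from the source:

---
title: "KNN"
author: "Your name here!"
date: "2025-02-10"
format:
  html:
    embed-resources: true
---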

1. Setup

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))

2. \(K\)NN Concepts

TODO: Explain how the choice of \(K\) affects the quality of your prediction when using a \(K\) Nearest Neighbors algorithm.

3. Feature Engineering

  1. Create a version of the year column that is a factor (instead of numeric).
  2. Create dummy variables that indicate the presence of “cherry”, “chocolate”, and “earth” in the description.
  • Take care to handle upper- and lower-case characters.
  3. Create 3 new features that represent the interaction between time and the cherry, chocolate, and earth indicators.
  4. Remove the description column from the data.
# your code here
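
One way these steps might look, offered as a sketch rather than a reference solution: the dataframe name wino and the new column names (year_f, cherry_year, and so on) are arbitrary choices, and the year and description columns are assumed to exist as described above.

wino <- wine %>%
  mutate(year_f = factor(year)) %>%                            # step 1: year as a factor
  mutate(description = str_to_lower(description)) %>%          # step 2: normalize case before matching
  mutate(cherry    = as.integer(str_detect(description, "cherry")),
         chocolate = as.integer(str_detect(description, "chocolate")),
         earth     = as.integer(str_detect(description, "earth"))) %>%
  mutate(cherry_year    = cherry * year,                       # step 3: time-by-flavor interactions
         chocolate_year = chocolate * year,
         earth_year     = earth * year) %>%
  select(-description)                                         # step 4: drop the raw text column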

4. Preprocessing

  1. Preprocess the dataframe from the previous code block using BoxCox, centering and scaling of the numeric features
  2. Create dummy variables for the year factor column
# your code here
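
A sketch of one possible preprocessing pass with caret, assuming the engineered dataframe from the previous section is named wino and its factor column is named year_f:

# BoxCox, center, and scale the numeric columns (preProcess leaves factor columns alone)
numeric_prep <- preProcess(wino, method = c("BoxCox", "center", "scale"))
wino <- predict(numeric_prep, wino)

# expand the year factor into dummy columns and reattach them
year_dummies <- dummyVars(~ year_f, data = wino)
wino <- bind_cols(select(wino, -year_f), as_tibble(predict(year_dummies, newdata = wino)))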

5. Running \(K\)NN

  1. Split the dataframe into an 80/20 training and test set
  2. Use caret to run a \(K\)NN model that uses our engineered features to predict province
  • use 5-fold cross validated subsampling
  • allow caret to try 15 different values for \(K\)
  3. Display the confusion matrix on the test data
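
One possible way to wire this up, again as a sketch: it assumes the preprocessed dataframe is still named wino, the outcome column is province, and the seed value is an arbitrary choice.

wino <- wino %>% mutate(province = factor(province))   # make sure the outcome is a factor

set.seed(505)                                          # arbitrary seed for reproducibility

# 1. stratified 80/20 split on the outcome
split_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train_set <- wino[split_index, ]
test_set  <- wino[-split_index, ]

# 2. KNN with 5-fold cross-validation, letting caret try 15 values of K
fit <- train(province ~ .,
             data = train_set,
             method = "knn",
             tuneLength = 15,
             trControl = trainControl(method = "cv", number = 5))

# 3. confusion matrix on the held-out 20%
confusionMatrix(predict(fit, test_set), test_set$province)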

6. Kappa

How do we determine whether a Kappa value represents a good, bad or some other outcome?

TODO: Explain
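
For reference, the statistic itself can be read directly off the caret confusion matrix from the previous section (this assumes the fit and test_set objects from the sketch above):

cm <- confusionMatrix(predict(fit, test_set), test_set$province)
cm$overall["Kappa"]   # caret reports Kappa alongside Accuracy in the overall slot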

7. Improvement

How can we interpret the confusion matrix, and how can we improve our predictions?

TODO: Explain