Abstract:
This technical blog post is available as both an HTML file and a .qmd file, hosted on GitHub Pages.
0. Quarto Type-setting
- This document is rendered with Quarto and configured to embed images using the embed-resources option in the header.
- If you wish to use a similar header, here is the format specification for this document:
format:
  html:
    embed-resources: true
1. Setup
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))
2. \(K\)NN Concepts
TODO: Explain how the choice of K affects the quality of your prediction when using a \(K\) Nearest Neighbors algorithm.
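One way to explore this question empirically is to compare training and test accuracy across several values of \(K\). The sketch below is only an illustration: it uses the built-in iris data as a stand-in for the wine data, the class package rather than caret, and a hypothetical helper named acc. Small values of \(K\) tend to follow the training data closely, while very large values tend to smooth predictions toward the most common classes.

```r
library(class)

set.seed(505)

# Stand-in data: iris replaces the wine data purely for illustration.
idx      <- sample(nrow(iris), 120)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# Hypothetical helper: accuracy of a KNN fit with k neighbors on `newdata`.
acc <- function(k, newdata) {
  pred <- knn(train_df[, 1:4], newdata[, 1:4], cl = train_df$Species, k = k)
  mean(pred == newdata$Species)
}

# Compare a very small, a few moderate, and a very large K.
ks <- c(1, 5, 15, 51, 101)
results <- data.frame(
  k     = ks,
  train = sapply(ks, acc, newdata = train_df),
  test  = sapply(ks, acc, newdata = test_df)
)
print(results)
```

Comparing the train and test columns row by row gives a rough picture of where the model starts to overfit (small \(K\)) or underfit (large \(K\)) on this particular split.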
3. Feature Engineering
- Create a version of the year column that is a factor (instead of numeric).
- Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description.
- Take care to handle upper and lower case characters.
- Create 3 new features that represent the interaction between time and the cherry, chocolate and earth indicators.
- Remove the description column from the data.
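The steps above could be sketched as follows. This is a minimal illustration on a toy tibble, not the wine data itself: the column names (province, price, points, year, description) and the new feature names (fct_year, cherry_year, and so on) are assumptions, and the interaction is taken to be the product of the numeric year and each indicator.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the wine data; column names are assumptions.
toy <- tibble::tibble(
  province    = c("Oregon", "California", "Burgundy", "Oregon"),
  price       = c(30, 45, 80, 25),
  points      = c(88, 91, 94, 86),
  year        = c(2012, 2014, 2010, 2015),
  description = c("Bright CHERRY and earth notes",
                  "Dark chocolate, ripe cherry",
                  "Earthy, mineral finish",
                  "Light and floral")
)

features <- toy %>%
  mutate(
    # Lower-case the text so "CHERRY" and "cherry" both match.
    description    = str_to_lower(description),
    # Indicator (dummy) variables for each word.
    cherry         = as.integer(str_detect(description, "cherry")),
    chocolate      = as.integer(str_detect(description, "chocolate")),
    earth          = as.integer(str_detect(description, "earth")),
    # Interactions between time (numeric year) and each indicator.
    cherry_year    = year * cherry,
    chocolate_year = year * chocolate,
    earth_year     = year * earth,
    # Factor version of year, kept alongside the numeric one.
    fct_year       = factor(year)
  ) %>%
  select(-description)

print(features)
```

Note that matching on "earth" also catches "earthy", which may or may not be what you want.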
4. Preprocessing
- Preprocess the dataframe from the previous code block using BoxCox, centering and scaling of the numeric features
- Create dummy variables for the year factor column
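A possible shape for these two steps, using only caret, is sketched below. The toy data frame and its column names are assumptions, and dummyVars is just one option for the dummy variables; other packages would work as well.

```r
library(caret)

# Toy stand-in for the engineered data; column names are assumptions.
toy <- data.frame(
  province = factor(c("Oregon", "California", "Burgundy",
                      "Oregon", "California", "Burgundy")),
  price    = c(30, 45, 80, 25, 60, 120),
  points   = c(88, 91, 94, 86, 90, 95),
  fct_year = factor(c(2012, 2014, 2010, 2012, 2014, 2010))
)

# Estimate BoxCox, centering and scaling; preProcess ignores
# the non-numeric columns for these methods.
pp     <- preProcess(toy, method = c("BoxCox", "center", "scale"))
toy_pp <- predict(pp, toy)

# One indicator column per level of the year factor.
dv           <- dummyVars(~ fct_year, data = toy_pp)
year_dummies <- predict(dv, newdata = toy_pp)

# Replace the factor column with its dummy columns.
processed <- cbind(
  toy_pp[, setdiff(names(toy_pp), "fct_year")],
  as.data.frame(year_dummies)
)
print(processed)
```

After this step the numeric features should have mean 0 and standard deviation 1, which matters for \(K\)NN because it relies on distances between rows.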
5. Running \(K\)NN
- Split the dataframe into an 80/20 training and test set
- Use caret to run a \(K\)NN model that uses our engineered features to predict province
- Use 5-fold cross-validated subsampling
- Allow caret to try 15 different values for \(K\)
- Display the confusion matrix on the test data
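The workflow above might look roughly like the following. To keep the sketch self-contained it uses the built-in iris data, with Species standing in for province; the seed and object names are arbitrary choices, not part of the assignment.

```r
library(caret)

set.seed(505)

# Stand-in data: iris replaces the wine data, Species replaces province.
# 80/20 split, stratified on the outcome.
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# KNN with 5-fold cross-validation; tuneLength = 15 asks caret
# to evaluate 15 candidate values of K.
fit <- train(
  Species ~ .,
  data       = train_df,
  method     = "knn",
  tuneLength = 15,
  trControl  = trainControl(method = "cv", number = 5)
)

# Confusion matrix on the held-out test data.
pred <- predict(fit, newdata = test_df)
cm   <- confusionMatrix(data = pred, reference = test_df$Species)
print(cm)
```

Printing fit shows the cross-validated accuracy and Kappa for each candidate \(K\), along with the value caret selected.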
6. Kappa
How do we determine whether a Kappa value represents a good, bad or some other outcome?
TODO: Explain
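As background for this explanation, it may help to see how Cohen's Kappa is computed from a confusion matrix: it compares the observed agreement with the agreement expected by chance from the row and column totals. The 2x2 matrix below is made up purely for illustration.

```r
# Made-up confusion matrix: rows are predictions, columns are the truth.
cm <- matrix(c(40, 10,
                5, 45),
             nrow = 2,
             dimnames = list(prediction = c("A", "B"),
                             reference  = c("A", "B")))

n   <- sum(cm)
# Observed agreement: share of cases on the diagonal.
p_o <- sum(diag(cm)) / n
# Chance agreement: implied by the row and column totals alone.
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa <- (p_o - p_e) / (1 - p_e)
print(kappa)
```

A Kappa of 0 means the model agrees with the truth no more often than chance would, and a Kappa of 1 means perfect agreement; values in between have to be judged against the difficulty of the problem.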
7. Improvement
How can we interpret the confusion matrix, and how can we improve our predictions?
TODO: Explain