Abstract:

This technical blog post is available as both an HTML file and a .qmd file, hosted on GitHub Pages.

0. Quarto Type-setting

This document is rendered with Quarto and configured to embed images using the embed-resources option in the header.
If you wish to use a similar header, here is the format specification for this document:

format: 
  html:
    embed-resources: true

1. Setup

Setup Code:

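# Alias used to silence package startup messages when loading libraries.
sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(caret))
# Read the Pinot dataset directly from GitHub.
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))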

2. Conditional Probability

Calculate the probability that a Pinot comes from Burgundy given it has the word ‘fruit’ in the description.

\[ P({\rm Burgundy}~|~{\rm Fruit}) \]
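By the definition of conditional probability, this quantity is the share of ‘fruit’ descriptions that belong to Burgundy wines. Writing it as a ratio of counts (here \(N(\cdot)\) is shorthand introduced for this note, meaning the number of wines satisfying a condition):

\[ P({\rm Burgundy}~|~{\rm Fruit}) = \frac{P({\rm Burgundy} \cap {\rm Fruit})}{P({\rm Fruit})} = \frac{N({\rm Burgundy} \cap {\rm Fruit})}{N({\rm Fruit})} \]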

# TODO
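One way this calculation might be sketched is shown below. It assumes the data frame has `province` and `description` columns, which this post does not confirm. The helper name `p_province_given_word` and its arguments are hypothetical. The match is a simple substring test, so "fruity" and "fruits" would also count as containing ‘fruit’.

```r
library(tidyverse)

# Hypothetical helper: P(prov | term) as a ratio of counts.
# Assumes `wine` has `province` and `description` columns.
p_province_given_word <- function(wine, prov, term) {
  wine %>%
    # Flag descriptions containing the term (case-insensitive substring match).
    mutate(has_term = str_detect(str_to_lower(description), fixed(term))) %>%
    # Count(prov and term) / Count(term)
    summarize(p = sum(province == prov & has_term) / sum(has_term)) %>%
    pull(p)
}
```

With the data loaded in the setup step, the call would be `p_province_given_word(wine, "Burgundy", "fruit")`.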

3. Naive Bayes Algorithm

We train a Naive Bayes classifier to predict a wine’s province using:

1. An 80-20 train-test split.
2. Three features engineered from the description.
3. 5-fold cross-validation.

We report Kappa after using the model to predict provinces in the holdout sample.

# TODO
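The steps above could be sketched with caret roughly as follows. This is an outline under assumptions, not the post's actual solution: it assumes `province` and `description` columns, assumes the naivebayes package (which backs caret's `"naive_bayes"` method) is installed, and uses three illustrative words ("cherry", "earth", "tannin") as features. The function name `holdout_kappa` and the seed are hypothetical.

```r
library(tidyverse)
library(caret)

# Sketch only. The three feature words are illustrative choices.
holdout_kappa <- function(wine, seed = 505) {
  set.seed(seed)

  # Three binary features engineered from the description.
  wino <- wine %>%
    mutate(
      desc = str_to_lower(description),
      cherry = str_detect(desc, "cherry"),
      earth = str_detect(desc, "earth"),
      tannin = str_detect(desc, "tannin"),
      province = factor(province)
    ) %>%
    select(province, cherry, earth, tannin)

  # 80-20 train-test split, stratified by province.
  idx <- createDataPartition(wino$province, p = 0.8, list = FALSE)[, 1]
  train_set <- wino[idx, ]
  test_set <- wino[-idx, ]

  # Naive Bayes fit with 5-fold cross-validation, selecting on Kappa.
  fit <- train(
    province ~ .,
    data = train_set,
    method = "naive_bayes",
    metric = "Kappa",
    trControl = trainControl(method = "cv", number = 5)
  )

  # Kappa on the holdout sample.
  confusionMatrix(predict(fit, test_set), test_set$province)$overall[["Kappa"]]
}
```

Calling `holdout_kappa(wine)` would return the holdout Kappa. Swapping in more discriminating words is the natural place to improve the score.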

4. Frequency Differences

We find the three words that most distinguish New York Pinots from all other Pinots.

# TODO
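One possible approach is to compare each word's relative frequency in New York descriptions against its relative frequency everywhere else, then keep the largest differences. The sketch below assumes `province` and `description` columns and assumes New York is coded as `"New_York"`; neither is confirmed by this post. The function name and the simple letters-only tokenizer are hypothetical choices.

```r
library(tidyverse)

# Hypothetical helper: words whose relative frequency in the target
# province most exceeds their relative frequency in all other provinces.
top_distinguishing_words <- function(wine, target = "New_York", k = 3) {
  wine %>%
    mutate(
      group = if_else(province == target, "in_target", "elsewhere"),
      # Crude tokenizer: lowercase runs of letters and apostrophes.
      word = str_extract_all(str_to_lower(description), "[a-z']+")
    ) %>%
    select(group, word) %>%
    unnest(word) %>%
    count(group, word) %>%
    # Relative frequency of each word within its group.
    group_by(group) %>%
    mutate(freq = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    # One row per word; words absent from a group get frequency 0.
    pivot_wider(names_from = group, values_from = freq, values_fill = 0) %>%
    mutate(diff = in_target - elsewhere) %>%
    slice_max(diff, n = k, with_ties = FALSE)
}
```

Because the score is a difference of relative frequencies, very common words such as "and" tend to cancel out without an explicit stop-word list, though removing stop words first is a reasonable alternative.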

5. Extension

Either do this as a bonus problem, or delete this section.

Calculate the variance of the logged word-frequency distributions for each province.
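A minimal sketch of this calculation follows, assuming `province` and `description` columns and reading "logged word-frequency" as the log of each word's relative frequency within a province. The function name and tokenizer are hypothetical. Under this reading the result equals the variance of the logged raw counts, since \(\log(n/N) = \log n - \log N\) and subtracting a constant does not change the variance.

```r
library(tidyverse)

# Hypothetical helper: variance of log relative word frequencies, per province.
log_freq_variance <- function(wine) {
  wine %>%
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    mutate(word = str_extract_all(str_to_lower(description), "[a-z']+")) %>%
    select(province, word) %>%
    unnest(word) %>%
    count(province, word) %>%
    group_by(province) %>%
    # n / sum(n) is each word's share of all tokens in the province.
    summarize(var_log_freq = var(log(n / sum(n))), .groups = "drop")
}
```

Calling `log_freq_variance(wine)` would return one row per province.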