Model 1

Author

Team \(i\)

Published

March 10, 2025

Setup

  • You may use any libraries, but for a feature engineering assignment tidyverse/pandas are likely sufficient.
    • The next most likely needs are dummy-column and text-processing libraries.
    • Pandas has a built-in get_dummies, and the standard Python text library is NLTK (illustrated briefly after the imports below).

.r

library(tidyverse)

.py

import numpy as np
import pandas as pd
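
  • A minimal illustration of both kinds of feature, on a toy frame rather than the assignment data (plain pandas string methods stand in for NLTK here):

.py

# Toy example only -- not the wine dataframe
toy = pd.DataFrame({
    "variety": ["Pinot Gris", "Pinot Noir", "Chardonnay"],
    "description": ["crisp and dry", "notes of oak and cherry", "buttery oak"],
})
toy = pd.get_dummies(toy, columns=["variety"])                 # dummy columns
toy["mentions_oak"] = toy["description"].str.contains("oak")   # simple text feature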

Dataframe

  • We use the model dataframe.

.r

wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/model.rds")))

.py

wine = pd.read_pickle("https://github.com/cd-public/D505/raw/master/dat/model.pickle")
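
  • A quick look at what was loaded can help before engineering features; glimpse(wine) is roughly the R equivalent of the sketch below.

.py

# Sanity check: dimensions, column names, and types
print(wine.shape)
print(wine.dtypes)
wine.head()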

Engineer Features

  • This is the reference I use, for both languages.
  • I am a contributor to that document - and you could be too!

.r

wine <- wine %>% mutate(points_per_price = points/price)

.py

wine = wine.assign(points_per_price = wine['points']/wine['price'])
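
  • Since \(K\)-NN is distance-based, scaled or log-transformed versions of the numeric columns can also be worth engineering; a sketch, not a required feature set, using only the points and price columns shown above:

.py

# Optional sketch: additional numeric features.
# Scaling matters for K-NN because it relies on distances.
wine = wine.assign(
    log_price = np.log1p(wine["price"]),
    points_z  = (wine["points"] - wine["points"].mean()) / wine["points"].std(),
)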

Save the dataframe

  • In addition to a document like this, you will also need to submit your dataframe.
    • .rds for R
    • .pickle for Python
  • Specify whether you optimized for \(K\)-NN or Naive Bayes.

.r

write_rds(wine, file="group_n_knn.rds")

.py

wine.to_pickle("group_m_naive.pickle")
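
  • Before submitting, it is worth confirming that the saved file loads back cleanly and still contains the target column; a minimal check using the file name above:

.py

# Confirm the pickle round-trips and the target column survived
check = pd.read_pickle("group_m_naive.pickle")
assert "province" in check.columns
print(check.shape)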

Submission

  • Reply to the email titled “Group \(n\)” with a link to a GitHub repository containing
    • A .rmd or .qmd file.
    • A .rds or .pickle file.
  • You may update this submission as many times as you like until class starts on 10 Mar.

Assessment

  • .rds submissions will be evaluated as follows:
    • With either method = "knn" or method = "naive_bayes"
library(caret)

wine <- readRDS("group_n_method.rds") # or url
split <- createDataPartition(wine$province, p = 0.8, list = FALSE)
train <- wine[split, ]
test <- wine[-split, ]
fit <- train(province ~ .,
             data = train, 
             method = "knn",
             tuneLength = 15,
             metric = "Kappa",
             trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(fit, test), factor(test$province))$overall['Kappa']
  • .pickle submissions will be evaluated as follows:
    • With either KNeighborsClassifier or GaussianNB
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier # or: from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import cohen_kappa_score

wine = pd.read_pickle("group_m_method.pickle") # or url
train, test = train_test_split(wine, test_size=0.2, stratify=wine['province'])

# Separate features and target variable
X_train, X_test = train.drop(columns=['province']), test.drop(columns=['province'])
y_train, y_test = train['province'], test['province']

knn = KNeighborsClassifier() # or GaussianNB
knn.fit(X_train, y_train)

kappa = cohen_kappa_score(y_test, knn.predict(X_test))
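
  • If you optimized for Naive Bayes instead, the same split is presumably scored with GaussianNB (note it lives in sklearn.naive_bayes, not sklearn.neighbors); a sketch, also showing a cross-validated Kappa via the cross_val_score imported above:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import make_scorer

# Same train/test split as above, scored with Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
print(cohen_kappa_score(y_test, nb.predict(X_test)))

# 5-fold cross-validated Kappa on the training portion
print(cross_val_score(nb, X_train, y_train, cv=5,
                      scoring=make_scorer(cohen_kappa_score)).mean())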

FAQ

  • For assignments of this type, I often field questions of the form “I wasn’t sure what you wanted”.
  • I respond as follows:
    • I regard these instructions as unambiguous.
      • If ambiguities are uncovered, I will issue corrections without sharing a full example.
    • I regard following these instructions absent e.g. an end-to-end example or lengthy prose as a component of the assignment.
    • I anticipate that outside of this class you will not be provided with markedly more guidance than I provide here.
    • I note that in this class you have been provided with the maximum possible guidance, including answer keys, on five homeworks.
    • I fed this into LLMs, and the only ambiguities they found were:
      • Related to the wine dataframe and models being underspecified, which I consider addressed by prior coursework.
      • Related to the .r and .py differences, which I regard as optional extensions.