Model 1

Author

Team \(i\)

Published

March 10, 2025

Setup

  • You may use any libraries, but for a feature engineering assignment tidyverse/pandas are likely sufficient.
    • The next most likely needs are dummy-column and text-processing libraries.
    • Pandas has a built-in get_dummies, and the standard Python text library is NLTK (illustrated briefly after the imports below).

.r

library(tidyverse)

.py

import numpy as np
import pandas as pd
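
  • A minimal illustration of both kinds of feature, on a toy frame rather than the assignment data (plain pandas string methods stand in for NLTK here):

.py

# Toy example only -- not the wine dataframe
toy = pd.DataFrame({
    "variety": ["Pinot Gris", "Pinot Noir", "Chardonnay"],
    "description": ["crisp and dry", "notes of oak and cherry", "buttery oak"],
})
toy = pd.get_dummies(toy, columns=["variety"])                 # dummy columns
toy["mentions_oak"] = toy["description"].str.contains("oak")   # simple text feature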

Dataframe

  • We use the model dataframe.

.r

wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/model.rds")))

.py

wine = pd.read_pickle("https://github.com/cd-public/D505/raw/master/dat/model.pickle")
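
  • A quick look at what was loaded can help before engineering features; glimpse(wine) is roughly the R equivalent of the sketch below.

.py

# Sanity check: dimensions, column names, and types
print(wine.shape)
print(wine.dtypes)
wine.head()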

Engineer Features

  • This is the reference I use, for both languages.
  • I am a contributor to that document - and you could be too!

.r

wine <- wine %>% mutate(points_per_price = points/price)

.py

wine = wine.assign(points_per_price = wine['points']/wine['price'])
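
  • Since \(K\)-NN is distance-based, scaled or log-transformed versions of the numeric columns can also be worth engineering; a sketch, not a required feature set, using only the points and price columns shown above:

.py

# Optional sketch: additional numeric features.
# Scaling matters for K-NN because it relies on distances.
wine = wine.assign(
    log_price = np.log1p(wine["price"]),
    points_z  = (wine["points"] - wine["points"].mean()) / wine["points"].std(),
)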

Save the dataframe

  • In addition to a document like this, you will also need to submit your dataframe.
    • .rds for R
    • .pickle for Python
  • Specify whether you optimized for \(K\)-NN or Naive Bayes.

.r

write_rds(wine, file="group_n_knn.rds")

.py

wine.to_pickle("group_m_naive.pickle")
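
  • Before submitting, it is worth confirming that the saved file loads back cleanly and still contains the target column; a minimal check using the file name above:

.py

# Confirm the pickle round-trips and the target column survived
check = pd.read_pickle("group_m_naive.pickle")
assert "province" in check.columns
print(check.shape)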

Submission

  • Reply to the email titled “Group \(n\)” with a link to a GitHub repository containing
    • A .rmd or .qmd file.
    • A .rds or .pickle file.
  • You may update this submission as many times as you like until class starts on 10 Mar.

Assessment

  • .rds submissions will be evaluated as follows:
    • With either method = "knn" or method = "naive_bayes"
library(caret)

wine <- readRDS("group_n_method.rds") # or url
split <- createDataPartition(wine$province, p = 0.8, list = FALSE)
train <- wine[split, ]
test <- wine[-split, ]
fit <- train(province ~ .,
             data = train, 
             method = "knn",
             tuneLength = 15,
             metric = "Kappa",
             trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(fit, test), factor(test$province))$overall['Kappa']
  • .pickle submissions will be evaluated as follows:
    • With either KNeighborsClassifier or GaussianNB
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier # or: from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import cohen_kappa_score

wine = pd.read_pickle("group_m_method.pickle") # or url
train, test = train_test_split(wine, test_size=0.2, stratify=wine['province'])

# Separate features and target variable
X_train, X_test = train.drop(columns=['province']), test.drop(columns=['province'])
y_train, y_test = train['province'], test['province']

knn = KNeighborsClassifier() # or GaussianNB
knn.fit(X_train, y_train)

kappa = cohen_kappa_score(y_test, knn.predict(X_test))
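
  • If you optimized for Naive Bayes instead, the same split is presumably scored with GaussianNB (note it lives in sklearn.naive_bayes, not sklearn.neighbors); a sketch, also showing a cross-validated Kappa via the cross_val_score imported above:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import make_scorer

# Same train/test split as above, scored with Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
print(cohen_kappa_score(y_test, nb.predict(X_test)))

# 5-fold cross-validated Kappa on the training portion
print(cross_val_score(nb, X_train, y_train, cv=5,
                      scoring=make_scorer(cohen_kappa_score)).mean())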

FAQ

  • For assignments of this type, I often field questions of the form “I wasn’t sure what you wanted”.
  • I respond as follows:
    • I regard these instructions as unambiguous.
      • If ambiguities are uncovered, I will issue corrections without sharing a full example.
    • I regard following these instructions absent e.g. an end-to-end example or lengthy prose as a component of the assignment.
    • I anticipate that outside of this class you will not be provided with markedly more guidance than I provide here.
    • I note that in this class you have been provided with the maximum possible guidance, including answer keys, on five homeworks.
    • I fed this into LLMs, and the only ambiguities they found were:
      • Related to the wine dataframe and models being underspecified, which I consider addressed by prior coursework.
      • Related to the .r and .py differences, which I regard as optional extensions.