Setup
- You may use any libraries, but for a feature engineering assignment `tidyverse`/`pandas` are likely sufficient.
- The next most likely needs are dummy-column and text-processing libraries.
- Pandas has a built-in `get_dummies`, and the Pythonic text library is NLTK.
.r

```r
library(tidyverse)
```

.py

```python
import numpy as np
import pandas as pd
```
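Since `get_dummies` comes up above, here is a minimal sketch of one-hot encoding a categorical column; the `variety` column and its values are made-up examples, not columns of the actual dataset:

```python
import pandas as pd

# Hypothetical categorical feature, for illustration only
df = pd.DataFrame({"variety": ["Pinot Noir", "Chardonnay", "Pinot Noir"]})

# One dummy column per category; prefix keeps the names readable
dummies = pd.get_dummies(df["variety"], prefix="variety")
print(sorted(dummies.columns))  # ['variety_Chardonnay', 'variety_Pinot Noir']
```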
Dataframe
- We use the `model` dataframe.
.r

```r
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/model.rds")))
```

.py

```python
wine = pd.read_pickle("https://github.com/cd-public/D505/raw/master/dat/model.pickle")
```
Engineer Features
- This is the reference I use, for both languages actually.
- I am actually a contributor to that document, and you could be too!
.r

```r
wine <- wine %>% mutate(points_per_price = points/price)
```

.py

```python
wine = wine.assign(points_per_price = wine['points']/wine['price'])
```
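As a quick check of the `assign` pattern above, here is a minimal sketch on a toy frame; the values are illustrative stand-ins, not the real `wine` data:

```python
import pandas as pd

# Toy stand-in for the wine dataframe; real columns come from model.pickle
wine = pd.DataFrame({"points": [90, 88], "price": [30.0, 22.0]})

# assign returns a new dataframe, so rebind the name to keep the column
wine = wine.assign(points_per_price=wine["points"] / wine["price"])
print(wine["points_per_price"].tolist())  # [3.0, 4.0]
```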
Save the dataframe
- In addition to a document like this, you will also need to submit your dataframe.
  - `.rds` for R
  - `.pickle` for Python
- Specify if you optimized for $K$-NN or Naive Bayes.
.r

```r
write_rds(wine, file="group_n_knn.rds")
```

.py

```python
wine.to_pickle("group_m_naive.pickle")
```
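One way to confirm the saved file is readable before submitting is a quick round trip; the frame here is a minimal stand-in and the filename mirrors the example above:

```python
import pandas as pd

# Minimal stand-in frame; in practice this is the engineered wine dataframe
wine = pd.DataFrame({"points": [90, 88], "price": [30.0, 22.0]})
wine.to_pickle("group_m_naive.pickle")

# Reading it back should reproduce the dataframe exactly
check = pd.read_pickle("group_m_naive.pickle")
assert check.equals(wine)
```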
Submission
- Reply to the email titled “Group $n$” with a link to a GitHub repository containing:
  - A `.rmd` or `.qmd` file.
  - A `.rds` or `.pickle` file.
- You may update this submission as many times as you like until class starts on 10 Mar.
Assessment
- `.rds` submissions will be evaluated as follows:
  - With either `method = "knn"` or `method = "naive_bayes"`

```r
library(caret) # createDataPartition, train, confusionMatrix

wine <- readRDS("group_n_method.rds") # or url
split <- createDataPartition(wine$province, p = 0.8, list = FALSE)
train <- wine[split, ]
test <- wine[-split, ]
fit <- train(province ~ .,
             data = train,
             method = "knn",
             tuneLength = 15,
             metric = "Kappa",
             trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(fit, test), factor(test$province))$overall['Kappa']
```
- `.pickle` submissions will be evaluated as follows:
  - With either `KNeighborsClassifier` or `GaussianNB`

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import cohen_kappa_score

wine = pd.read_pickle("group_m_method.pickle") # or url
train, test = train_test_split(wine, test_size=0.2, stratify=wine['province'])

# Separate features and target variable
X_train, X_test = train.drop(columns=['province']), test.drop(columns=['province'])
y_train, y_test = train['province'], test['province']

knn = KNeighborsClassifier() # or GaussianNB()
knn.fit(X_train, y_train)
kappa = cohen_kappa_score(y_test, knn.predict(X_test))
```
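The `cross_val_score` import above goes unused in the snippet; a cross-validated Kappa, analogous to the R `trControl = trainControl(method = "cv", number = 5)` setup, might look like this sketch, where the feature matrix and province labels are fabricated stand-ins for the real data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer

# Synthetic stand-in for the wine features and labels (illustrative only)
rng = np.random.default_rng(505)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + rng.normal(scale=0.5, size=200) > 0, "Oregon", "California")

# Score 5 folds with Cohen's kappa, matching caret's cv / number = 5 setup
kappa_scorer = make_scorer(cohen_kappa_score)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5, scoring=kappa_scorer)
print(scores.mean())
```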
FAQ
- For assignments of this type, I often field questions of the form “I wasn’t sure what you wanted”.
- I respond as follows:
- I regard these instructions as unambiguous.
- If ambiguities are uncovered, I will issue corrections without sharing a full example.
- I regard following these instructions absent e.g. an end-to-end example or lengthy prose as a component of the assignment.
- I anticipate that outside of this class you will not be provided with markedly more guidance than I provide here.
- I note that in this class you have been provided with the maximum possible guidance, including answer keys, on five homeworks.
- I fed this into LLMs and they only found ambiguities:
  - Related to the `wine` dataframe and models being underspecified, which I consider addressed by prior coursework.
  - Related to the `.r` and `.py` differences, which I regard as optional extensions.