Setup
- You may use any libraries, but as a feature engineering assignment, tidyverse/pandas are likely sufficient.
- The next most likely to be useful are dummy-column and text-processing libraries.
- Pandas has a built-in `get_dummies`, and the Pythonic text library is NLTK.
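For the dummy-column case, `pd.get_dummies` works as in this sketch; the toy frame and column names here are hypothetical, not the assignment data:

```python
import pandas as pd

# Hypothetical toy frame; the real assignment uses the wine dataframe.
df = pd.DataFrame({"variety": ["Pinot Noir", "Chardonnay", "Pinot Noir"],
                   "price": [20.0, 15.0, 30.0]})

# get_dummies expands a categorical column into 0/1 indicator columns,
# one per category, leaving the other columns untouched.
dummies = pd.get_dummies(df, columns=["variety"], prefix="var")
```

Here `dummies` gains `var_Chardonnay` and `var_Pinot Noir` columns while keeping `price`.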
```r
library(tidyverse)
```

```python
import numpy as np
import pandas as pd
```

Dataframe
- We use the `model` dataframe.
```r
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/model.rds")))
```

```python
wine = pd.read_pickle("https://github.com/cd-public/D505/raw/master/dat/model.pickle")
```

Engineer Features
- This is the reference I use, for both languages actually.
- I am a contributor to that document, and so could you be!
```r
wine <- wine %>% mutate(points_per_price = points/price)
```

```python
# assign returns a copy, so rebind the result to wine
wine = wine.assign(points_per_price = wine['points']/wine['price'])
```

Save the dataframe
- In addition to a document like this, you will also need to submit your dataframe.
  - `.rds` for R
  - `.pickle` for Python
- Specify if you optimized for \(K\)-NN or Naive Bayes
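Before submitting, it is worth confirming the file round-trips; a minimal sketch in pandas, using a toy frame and the hypothetical filename `group_m_naive.pickle`:

```python
import pandas as pd

# Hypothetical stand-in for the engineered wine frame.
wine = pd.DataFrame({"points_per_price": [4.5, 3.2],
                     "province": ["Oregon", "Burgundy"]})
wine.to_pickle("group_m_naive.pickle")

# Re-read the file to confirm it round-trips before submitting.
check = pd.read_pickle("group_m_naive.pickle")
assert check.equals(wine)
```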
```r
write_rds(wine, file="group_n_knn.rds")
```

```python
wine.to_pickle("group_m_naive.pickle")
```

Submission
- Reply to the email titled “Group \(n\)” with a link to a GitHub repository containing:
  - A `.rmd` or `.qmd` file.
  - A `.rds` or `.pickle` file.
- You may update this submission as many times as you like until class starts on 10 Mar.
Assessment
- `.rds` submissions will be evaluated as follows:
  - With either `method = "knn"` or `method = "naive_bayes"`
```r
library(caret)  # createDataPartition, train, confusionMatrix

wine <- readRDS("group_n_method.rds") # or url
split <- createDataPartition(wine$province, p = 0.8, list = FALSE)
train <- wine[split, ]
test <- wine[-split, ]
fit <- train(province ~ .,
             data = train,
             method = "knn",
             tuneLength = 15,
             metric = "Kappa",
             trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(fit, test), factor(test$province))$overall['Kappa']
```
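Both pipelines score on Cohen's \(\kappa = (p_o - p_e)/(1 - p_e)\), observed agreement corrected for chance agreement. A small sketch comparing a by-hand computation against scikit-learn's:

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# Observed agreement: fraction of matching predictions.
p_o = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Chance agreement: product of marginal label frequencies, summed over labels.
labels = set(y_true) | set(y_pred)
p_e = sum((y_true.count(c) / len(y_true)) * (y_pred.count(c) / len(y_pred))
          for c in labels)
kappa = (p_o - p_e) / (1 - p_e)

assert abs(kappa - cohen_kappa_score(y_true, y_pred)) < 1e-9
```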
- `.pickle` submissions will be evaluated as follows:
  - With either `KNeighborsClassifier` or `GaussianNB`
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB  # if you optimized for Naive Bayes
from sklearn.metrics import cohen_kappa_score

wine = pd.read_pickle("group_m_method.pickle") # or url
train, test = train_test_split(wine, test_size=0.2, stratify=wine['province'])
# Separate features and target variable
X_train, X_test = train.drop(columns=['province']), test.drop(columns=['province'])
y_train, y_test = train['province'], test['province']
knn = KNeighborsClassifier() # or GaussianNB()
knn.fit(X_train, y_train)
kappa = cohen_kappa_score(y_test, knn.predict(X_test))
```

FAQ
- For assignments of this type, I often field questions of the form “I wasn’t sure what you wanted.”
- I respond as follows:
- I regard these instructions as unambiguous.
- If ambiguities are uncovered, I will issue corrections without sharing a full example.
- I regard following these instructions absent e.g. an end-to-end example or lengthy prose as a component of the assignment.
- I anticipate that outside of this class you will often be provided with markedly less guidance than I provide here.
- I note that in this class you have been provided with the maximum possible guidance, including answer keys, on five homeworks.
- I fed this into LLMs, and the only ambiguities they found were:
  - Related to the `wine` dataframe and models being underspecified, which I consider addressed by prior coursework.
  - Related to the `.r` and `.py` differences, which I regard as optional extensions.