Abstract:

This is a technical blog post, published as both an HTML file and a .qmd file hosted on GitHub Pages.

Setup

Change the author of this .qmd file to be yourself and delete this line.
If necessary, modify the code below so that you can successfully load wine.rds, then delete this line.
In the space provided after the R chunk, explain what the code is doing (line by line), then delete this line.
Get your GitHub Pages ready.

Setup Code:

library(tidyverse) # change r to {r} to run this block, then remove this comment

wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/wine.rds"))) %>%
  filter(province=="Oregon" | province=="California" | province=="New York") %>% 
  mutate(cherry=as.integer(str_detect(description,"[Cc]herry"))) %>% 
  mutate(lprice=log(price)) %>% 
  select(lprice, points, cherry, province)

Explanation:

TODO: write your line-by-line explanation of the code here

Multiple Regression

Linear Models

First run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables).

# TODO: hint: m1 <- lm(lprice ~ points + cherry, data = wine)

Explanation:

TODO: write your line-by-line explanation of the code here

TODO: report and explain the RMSE
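
As a reference for the RMSE step, here is a minimal sketch on simulated data. The data frame `toy`, its coefficients, and the helper `rmse()` are all hypothetical stand-ins, not the wine data or its results; only the column names mirror the setup code.

```r
# Toy data standing in for the wine data (simulated, not real results)
set.seed(505)
n <- 200
toy <- data.frame(
  points = round(runif(n, 80, 100)),
  cherry = rbinom(n, 1, 0.3)
)
toy$lprice <- 0.1 * toy$points + 0.2 * toy$cherry - 5.5 + rnorm(n, sd = 0.4)

# Fit log price on points and cherry (additive model)
m_toy <- lm(lprice ~ points + cherry, data = toy)

# Hypothetical helper: root mean squared error of the model's residuals
rmse <- function(model) sqrt(mean(residuals(model)^2))

rmse(m_toy)
```

Because the outcome is log price, this RMSE is in log units and is computed in-sample; a held-out RMSE would usually be somewhat larger.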

Interaction Models

Add an interaction between ‘points’ and ‘cherry’.

# TODO: hint: Check the slides.

TODO: write your line-by-line explanation of the code here

TODO: report and explain the RMSE
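
One way to express an interaction in R's formula syntax is the `*` operator, sketched below on simulated data. The data frame `toy` and the helper `rmse()` are hypothetical; the slides may use a different but equivalent form.

```r
# Toy data standing in for the wine data (simulated, not real results)
set.seed(505)
n <- 200
toy <- data.frame(
  points = round(runif(n, 80, 100)),
  cherry = rbinom(n, 1, 0.3)
)
toy$lprice <- 0.1 * toy$points + 0.2 * toy$cherry - 5.5 + rnorm(n, sd = 0.4)

# points * cherry expands to points + cherry + points:cherry
m_add <- lm(lprice ~ points + cherry, data = toy)
m_int <- lm(lprice ~ points * cherry, data = toy)

# Hypothetical helper: root mean squared error of the model's residuals
rmse <- function(model) sqrt(mean(residuals(model)^2))

names(coef(m_int))          # includes the "points:cherry" term
c(additive = rmse(m_add), interaction = rmse(m_int))
```

The in-sample RMSE of the interaction model can never exceed that of the additive model, since the additive model is a special case of it, so a small drop alone does not show that the interaction matters.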

The Interaction Variable

TODO: interpret the coefficient on the interaction variable.
Explain as you would to a non-technical manager.
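
A possible starting point for the interpretation is to write the interaction model out. The symbols $\beta_0, \dots, \beta_3$ and $\varepsilon$ are notation assumed here, not taken from the assignment.

```latex
\begin{align*}
\text{lprice} &= \beta_0 + \beta_1\,\text{points} + \beta_2\,\text{cherry}
               + \beta_3\,(\text{points} \times \text{cherry}) + \varepsilon \\
\frac{\partial\,\text{lprice}}{\partial\,\text{points}} &= \beta_1 + \beta_3\,\text{cherry}
\end{align*}
```

Under this notation, the slope on points is $\beta_1$ for wines without "cherry" in the description and $\beta_1 + \beta_3$ for wines with it. Because the outcome is log price, one additional point multiplies the predicted price by roughly $e^{\beta_1}$ in the first case and $e^{\beta_1 + \beta_3}$ in the second.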

Applications

Determine in which province (Oregon, California, or New York) the ‘cherry’ feature in the data affects price most.

# TODO:

TODO: write your line-by-line explanation of the code here, and explain your answer.
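
One possible approach, sketched on simulated data, is to fit the same model separately within each group and compare the coefficient on `cherry`. The data frame `toy`, the group labels "A", "B", "C", and the name `cherry_effect` are hypothetical; a single model with a `cherry * province` interaction is a reasonable alternative.

```r
# Toy data with three made-up groups (simulated, not real results)
set.seed(505)
n <- 300
toy <- data.frame(
  province = sample(c("A", "B", "C"), n, replace = TRUE),
  points   = round(runif(n, 80, 100)),
  cherry   = rbinom(n, 1, 0.3)
)
toy$lprice <- 0.1 * toy$points + 0.2 * toy$cherry - 5.5 + rnorm(n, sd = 0.4)

# Fit one model per group and pull out the coefficient on cherry
cherry_effect <- sapply(split(toy, toy$province), function(d) {
  coef(lm(lprice ~ points + cherry, data = d))[["cherry"]]
})

cherry_effect                             # named vector, one value per group
names(which.max(abs(cherry_effect)))      # group with the largest effect
```

Comparing point estimates alone ignores their uncertainty, so it may be worth also looking at the standard errors from `summary()` before declaring a winner.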

Scenarios

On Accuracy

Imagine a model to distinguish New York wines from those in California and Oregon. After a few days of work, you take some measurements and note: “I’ve achieved 91% accuracy on my model!”

Should you be impressed? Why or why not?

# TODO: Use simple descriptive statistics from the data to justify your answer.

TODO: describe your reasoning here
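
To show the kind of descriptive statistic that helps here, the sketch below computes the accuracy of a trivial "always predict the majority class" rule. The class counts are made up for illustration and are not the counts in the wine data.

```r
# Hypothetical class labels; the counts are invented for illustration
province <- c(rep("New York", 10), rep("California", 70), rep("Oregon", 20))
is_ny <- province == "New York"

# Share of each class
prop.table(table(is_ny))

# Accuracy of a "model" that always predicts "not New York"
baseline_acc <- mean(!is_ny)
baseline_acc
```

With these invented counts the trivial rule is already 90% accurate, so a reported accuracy only means something relative to the corresponding baseline in the real data.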

On Ethics

Why is understanding this vignette important for using machine learning in an ethical manner?

TODO: describe your reasoning here

Ignorance is no excuse

Imagine you are working on a model to predict the likelihood that an individual loses their job as a result of changing federal policy under new presidential administrations. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income, or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?

TODO: describe your reasoning here
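
One consideration that may be relevant is that other features can act as proxies for a dropped one. The simulation below is a minimal sketch of that idea; the variables `sensitive` and `proxy`, and the strength of their relationship, are entirely hypothetical.

```r
# Simulated example: a sensitive attribute and a correlated remaining feature
set.seed(505)
n <- 1000
sensitive <- rbinom(n, 1, 0.5)              # hypothetical dropped attribute
proxy     <- sensitive + rnorm(n, sd = 0.5) # hypothetical correlated feature

# Try to recover the dropped attribute from the remaining feature alone
recover <- glm(sensitive ~ proxy, family = binomial)
pred <- as.integer(fitted(recover) > 0.5)

# Share of individuals whose dropped attribute is recovered correctly
mean(pred == sensitive)
```

If the remaining features can recover the dropped attribute well above chance, a model trained on them can still behave differently across those groups even though the attribute itself was removed.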