Applied Machine Learning
local({
  r <- getOption("repos")
  r["CRAN"] <- "https://cran.r-project.org"
  options(repos = r)
})
# Old
# install.packages("tidyverse")
# install.packages("caret")
# New?
install.packages("fastDummies")
# Just for the slides
# install.packages("thematic")
We work with the wine dataset, which is enormous.
wine <- readRDS(gzcon(url("https://cd-public.github.io/D505/dat/wine.rds")))
# performance concession
# wall <- wine
# wine <- wall[sample(nrow(wall), 100), ]
summary(wine)
       id            country          description        designation
 Min.   :     1   Length:89556       Length:89556       Length:89556
 1st Qu.: 32742   Class :character   Class :character   Class :character
 Median : 65613   Mode  :character   Mode  :character   Mode  :character
 Mean   : 65192
 3rd Qu.: 97738
 Max.   :129970
     points           price           province           region_1
 Min.   : 80.00   Min.   :   4.00   Length:89556       Length:89556
 1st Qu.: 87.00   1st Qu.:  17.00   Class :character   Class :character
 Median : 89.00   Median :  25.00   Mode  :character   Mode  :character
 Mean   : 88.65   Mean   :  35.56
 3rd Qu.: 91.00   3rd Qu.:  42.00
 Max.   :100.00   Max.   :3300.00
   region_2         taster_name        taster_twitter_handle    title
 Length:89556       Length:89556       Length:89556            Length:89556
 Class :character   Class :character   Class :character        Class :character
 Mode  :character   Mode  :character   Mode  :character        Mode  :character
   variety             winery               year
 Length:89556       Length:89556       Min.   :1995
 Class :character   Class :character   1st Qu.:2010
 Mode  :character   Mode  :character   Median :2012
                                       Mean   :2011
                                       3rd Qu.:2014
                                       Max.   :2015
wine %>%
  group_by(winery, year) %>%                              # one row per winery vintage
  summarize(avg_score = mean(points), num_reviews = n_distinct(id)) %>%
  select(year, winery, num_reviews, avg_score) %>%
  arrange(winery, year) %>%                               # order vintages within each winery
  mutate(score_change = avg_score - lag(avg_score)) %>%   # year-over-year change in average score
  drop_na(score_change) %>%                               # a winery's first vintage has no lag
  summarize(mean(score_change))                           # mean change per winery
# A tibble: 8,409 × 2
winery `mean(score_change)`
<chr> <dbl>
1 100 Percent Wine -1.5
2 12 Linajes -0.333
3 12C Wines 1.25
4 14 Hands -0.111
5 2 Lads -2.25
6 2 Up 0
7 21 Grams 1
8 29 & Oak Wines -2
9 2Hawk 0.75
10 2Plank -1.25
# ℹ 8,399 more rows
dummy_cols() adds dummy columns efficiently.
# A tibble: 6 × 4
taster_name `taster_name_Alexander Peartree` taster_name_Anna Lee C. …¹
<chr> <int> <int>
1 Roger Voss 0 0
2 Paul Gregutt 0 0
3 Alexander Peartree 1 0
4 Paul Gregutt 0 0
5 Michael Schachner 0 0
6 Kerin O’Keefe 0 0
# ℹ abbreviated name: ¹`taster_name_Anna Lee C. Iijima`
# ℹ 1 more variable: `taster_name_Anne Krebiehl MW` <int>
# A tibble: 6 × 6
variety variety_Cabernet Sauvigno…¹ variety_Chardonnay `variety_Pinot Noir`
<fct> <int> <int> <int>
1 Other 0 0 0
2 Other 0 0 0
3 Other 0 0 0
4 Pinot Noir 0 0 1
5 Other 0 0 0
6 Other 0 0 0
# ℹ abbreviated name: ¹`variety_Cabernet Sauvignon`
# ℹ 2 more variables: `variety_Red Blend` <int>, variety_Other <int>
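A sketch of the kind of call that could produce the tables above, assuming fastDummies is attached (the exact column selections in the original slides may differ):
library(fastDummies)
# the variety table is built the same way, after fct_lump(variety, 4)
wine %>%
  dummy_cols(select_columns = "taster_name") %>%
  select(starts_with("taster_name")) %>%
  head()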
list(standardized = ~(scale(.) %>% as.vector))

scale(.)
: Standardizes the “points” column.

%>% as.vector
: Converts back to a vector.

# A tibble: 89,556 × 16
id country description designation points price province region_1 region_2
<dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1 Portug… This is ri… Avidagos 87 15 Douro <NA> <NA>
2 2 US Tart and s… <NA> 87 14 Oregon Willame… Willame…
3 3 US Pineapple … Reserve La… 87 13 Michigan Lake Mi… <NA>
4 4 US Much like … Vintner's … 87 65 Oregon Willame… Willame…
5 5 Spain Blackberry… Ars In Vit… 87 15 Norther… Navarra <NA>
6 6 Italy Here's a b… Belsito 87 16 Sicily … Vittoria <NA>
7 7 France This dry a… <NA> 87 24 Alsace Alsace <NA>
8 8 Germany Savory dri… Shine 87 12 Rheinhe… <NA> <NA>
9 9 France This has g… Les Natures 87 27 Alsace Alsace <NA>
10 10 US Soft, supp… Mountain C… 87 19 Califor… Napa Va… Napa
# ℹ 89,546 more rows
# ℹ 7 more variables: taster_name <chr>, taster_twitter_handle <chr>,
# title <chr>, variety <chr>, winery <chr>, year <dbl>, standardized <dbl>
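One plausible full call behind this output; the column name standardized is taken from the output above, the rest is an assumption:
wine %>%
  mutate(standardized = scale(points) %>% as.vector) # z-score the points column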
This chapter has a good overview of interactions.
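In an R formula, * crosses terms (main effects plus their interaction), while : gives the interaction alone. A minimal sketch on the raw data (the model name m_int is hypothetical):
# lets the slope on points differ by country
m_int <- lm(log(price) ~ points * country, data = wine)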
wine_index <- createDataPartition(wino$lprice, p = 0.8, list = FALSE)
wino_tr <- wino[wine_index, ]  # 80% training split
wino_te <- wino[-wine_index, ] # 20% held-out test split
summary(wino_tr)
lprice points fr cab
Min. :1.386 Min. : 80.00 Mode :logical Mode :logical
1st Qu.:2.833 1st Qu.: 87.00 FALSE:59490 FALSE:65223
Median :3.219 Median : 89.00 TRUE :12113 TRUE :6380
Mean :3.315 Mean : 88.64
3rd Qu.:3.738 3rd Qu.: 91.00
Max. :8.102 Max. :100.00
Use train to cross validate:
m1 <- train(lprice ~ .,
  data = wino_tr,
  method = "lm",
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3)
)
m1
Linear Regression
71603 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 57283, 57281, 57283, 57283, 57282, 57282, ...
Resampling results:
RMSE Rsquared MAE
0.5112143 0.3949745 0.4028797
Tuning parameter 'intercept' was held constant at a value of TRUE
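The per-fold breakdown below is stored on the fitted object as its resample element:
m1$resample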
RMSE Rsquared MAE Resample
1 0.5131905 0.4031985 0.4044921 Fold1.Rep1
2 0.5067739 0.4013975 0.3999158 Fold2.Rep1
3 0.5127722 0.3842549 0.4043832 Fold3.Rep1
4 0.5127571 0.3893579 0.4043396 Fold4.Rep1
5 0.5106289 0.3967107 0.4012980 Fold5.Rep1
6 0.5083489 0.3990065 0.4015508 Fold1.Rep2
7 0.5125498 0.3962390 0.4035383 Fold2.Rep2
8 0.5096002 0.3950261 0.4012386 Fold3.Rep2
9 0.5163545 0.3829817 0.4054555 Fold4.Rep2
10 0.5091791 0.4015038 0.4025819 Fold5.Rep2
11 0.5156619 0.3931615 0.4061765 Fold1.Rep3
12 0.5117820 0.3997736 0.4022512 Fold2.Rep3
13 0.5103424 0.3979847 0.4016429 Fold3.Rep3
14 0.5110664 0.4012680 0.4024770 Fold4.Rep3
15 0.5072064 0.3827522 0.4018533 Fold5.Rep3
Harrell (2015) provides a comprehensive indictment of stepwise selection that can be encapsulated by the statement:
“… if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing.”
Reference: Harrell, F. 2015. Regression Modeling Strategies. Springer.
wino <- wine %>%
  mutate(country = fct_lump(country, 4)) %>% # keep the top 4 countries, lump the rest into "Other"
  mutate(variety = fct_lump(variety, 4)) %>% # keep the top 4 varieties, lump the rest into "Other"
  mutate(lprice = log(price)) %>%            # log-transform the right-skewed price
  select(lprice, points, country, variety) %>%
  drop_na(.)
head(wino)
# A tibble: 6 × 4
lprice points country variety
<dbl> <dbl> <fct> <fct>
1 2.71 87 Other Other
2 2.64 87 US Other
3 2.56 87 US Other
4 4.17 87 US Pinot Noir
5 2.71 87 Spain Other
6 2.77 87 Italy Other
Standardize the column names of wino:
renamer <- function(s) {
  s %>% tolower() %>% str_replace_all("-| ", "_") # lowercase; hyphens and spaces become underscores
}
wino <- wino %>%
  dummy_cols(remove_selected_columns = TRUE) %>% # one-hot encode the factor columns
  rename_with(.fn = renamer) %>%                 # clean the generated column names
  select(-ends_with("other"))                    # drop the "Other" dummies as the reference level
head(wino)
# A tibble: 6 × 10
lprice points country_france country_italy country_spain country_us
<dbl> <dbl> <int> <int> <int> <int>
1 2.71 87 0 0 0 0
2 2.64 87 0 0 0 1
3 2.56 87 0 0 0 1
4 4.17 87 0 0 0 1
5 2.71 87 0 0 1 0
6 2.77 87 0 1 0 0
# ℹ 4 more variables: variety_cabernet_sauvignon <int>,
# variety_chardonnay <int>, variety_pinot_noir <int>, variety_red_blend <int>
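Before refitting, the train/test split is presumably rebuilt from the encoded wino (an assumption; this mirrors the earlier createDataPartition call):
wine_index <- createDataPartition(wino$lprice, p = 0.8, list = FALSE)
wino_tr <- wino[wine_index, ]
wino_te <- wino[-wine_index, ]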
mx <- train(lprice ~ .,
  data = wino_tr,
  method = "lm",
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3)
)
mx
Linear Regression
71603 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 57282, 57283, 57284, 57281, 57282, 57284, ...
Resampling results:
RMSE Rsquared MAE
0.4882111 0.4479155 0.3797278
Tuning parameter 'intercept' was held constant at a value of TRUE
R could handle 90k wine samples.
# rfFuncs scores feature subsets with a random forest; 2-fold CV keeps the cost manageable
control <- rfeControl(functions = rfFuncs, method = "cv", number = 2)
# run the RFE algorithm
results <- rfe(select(wino_tr, -lprice), wino_tr$lprice, sizes = c(1:3), rfeControl = control)
# summarize the results
print(results)
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold)
Resampling performance over subset size:
Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
1 0.5004 0.4298 0.3921 0.01955 0.02585 0.010454
2 0.5167 0.4196 0.4099 0.02148 0.04639 0.016532
3 0.5217 0.4423 0.4143 0.01124 0.02951 0.006746
9 0.4800 0.4793 0.3738 0.01580 0.01576 0.010595 *
The top 5 variables (out of 9):
points, country_us, country_italy, variety_cabernet_sauvignon, variety_pinot_noir
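The winning variable set can be pulled programmatically with caret's predictors():
# variables used by the final RFE model
predictors(results)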
Linear regressions have a well-developed statistical theory.
This brings perks like confidence intervals on predictions.
It also has “costs” in that assumptions need to be satisfied.
The dependent variable is a linear combination of the features.
This is less of a big deal than it might seem!
If y is actually quadratic in x, then y is linear in x^2!
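For example, engineering a squared copy of a feature lets a linear model capture curvature (a sketch; the points_sq name is hypothetical):
wino %>% mutate(points_sq = points^2)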
Homoskedasticity (or homoscedasticity):
The variance of the errors does not depend on the values of the features.
The model shouldn't make bigger prediction errors for some values of x than for others.
The errors should be independent and normally distributed.
A scatter plot of the target value against the residual (model error) should look like white noise.
No predictor should be a perfect linear combination of the others.
This can happen if you over-engineer features.
Model errors should be independent of the values of the features.
In particular, errors should have mean zero.
It’s always good to look at a histogram of your residuals (see also normality).
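A minimal sketch of that check, assuming the caret fit m1 from above (finalModel is the underlying lm object):
# residuals should look roughly normal and centered at zero
data.frame(residual = resid(m1$finalModel)) %>%
  ggplot(aes(residual)) +
  geom_histogram(bins = 50)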