Goal
This is a generalized ML assignment
- Predict the profit of future products developed at CravenSpeed
- Using any model (or ensemble) you’d like
- Evaluated using RMSE on a holdout sample (which you won’t see)
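As a reminder of what the evaluation metric measures, here is a minimal sketch of RMSE in base R. The helper name `rmse` and the toy numbers are assumptions made for illustration; they are not taken from the assignment data.

```r
# Root mean squared error: square root of the mean squared residual.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy values, invented for illustration only.
actual    <- c(100, 250, 400)
predicted <- c(110, 240, 430)
rmse(actual, predicted)
```

Because residuals are squared before averaging, one badly mispredicted product moves the score more than several small misses.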
Important
Predict the profit of “future products developed at CravenSpeed”
Information leaks
- Recall the following slides: Overfitting… or a leak?
Important
Predict the profit of “future products developed at CravenSpeed”
- I find it unreasonable to conclude that any of the following could be known about a product before it is developed:
- These are indices 2:29,38,41:
[1] "First Order (from sales data)" "src_www"
[3] "src_iphone" "src_android"
[5] "src_ipad" "src_manual"
[7] "src_facebookshop" "src_external"
[9] "src_Amazon FBM" "January"
[11] "February" "March"
[13] "April" "May"
[15] "June" "July"
[17] "August" "September"
[19] "October" "November"
[21] "December" "pct_Direct Sales"
[23] "pct_Orders for Stock" "pct_Drop Shipments"
[25] "pct_Platypus" "pct_Moss Motors Only"
[27] "pct_R&D Club" "Units Sold"
[29] "Revenue 2019 to present" "Sales Channel"
- That is, it is not meaningful to predict profit from metrics that can only be gathered if profit is known.
- I find it reasonable to consider the following:
[1] "lookupId" "Base Product Sku"
[3] "Listing Type" "Unit Weight (pounds)"
[5] "retailPrice" "make"
[7] "model" "yearMin"
[9] "yearMax" "Number of Components"
[11] "BOM Cost" "Product Type"
[13] "Designer" "Main Component Material"
[15] "Main Component MFG Process" "Main Component Finishing Process"
- For this reason, I have included only these fields in the secret hold-out data.
- Profit, precomputed, is also present (of course).
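One way to make leak avoidance easy to verify is to select predictors by name rather than by hard-coded index. The sketch below is only an illustration under stated assumptions: `keep_known` is a hypothetical helper, and the toy data frame and its values are invented. The field names are copied from the list above.

```r
# Fields judged knowable before a product is developed (copied from the list above).
known_at_development <- c(
  "lookupId", "Base Product Sku", "Listing Type", "Unit Weight (pounds)",
  "retailPrice", "make", "model", "yearMin", "yearMax",
  "Number of Components", "BOM Cost", "Product Type", "Designer",
  "Main Component Material", "Main Component MFG Process",
  "Main Component Finishing Process"
)

# Hypothetical helper: keep only the permitted predictors plus the target.
keep_known <- function(df, target = "Profit") {
  keep <- intersect(names(df), c(known_at_development, target))
  df[, keep, drop = FALSE]
}

# Toy data frame with one leaky column ("Units Sold"); values are invented.
toy <- data.frame(
  retailPrice  = c(20, 35),
  `BOM Cost`   = c(5, 9),
  `Units Sold` = c(120, 40),
  Profit       = c(900, 600),
  check.names  = FALSE
)
names(keep_known(toy))
```

Selecting by name makes the intent readable in the code itself, so a reviewer does not have to map indices back to column names to check for a leak.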
Assessment
- Assessments will be evaluated via RMSE over the secret data as follows:
library(dplyr)  # select()
library(caret)  # train(), trainControl()

# engineer() is the feature-engineering function under assessment.
fast <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/refs/heads/master/dat/secret.rds")))
profit <- fast["Profit"]
fast <- fast |> select(-Profit) |> engineer() |> select(1:10)
fast["Profit"] = profit
train(Profit ~ .,
      data = fast,
      method = "lm",
      trControl = trainControl(method = "cv", number = 5))$results$RMSE
- The sample engineering function achieved an RMSE of 2074.117 by selecting the first 10 features and naively modeling.
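To show what a 5-fold cross-validated RMSE for a linear model amounts to, here is a rough base R sketch on synthetic data. Everything in it (the seed, the variables `x1` and `x2`, the coefficients) is an assumption made for illustration. It does not use the hold-out data, and caret chooses its own folds, so it will not reproduce the figure above.

```r
set.seed(505)  # arbitrary seed so the toy example is reproducible

# Synthetic stand-in for the engineered data; not the real hold-out set.
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Profit <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, score RMSE on the held-out fold.
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(Profit ~ ., data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  sqrt(mean((toy$Profit[fold == i] - pred)^2))
})

# Average the per-fold scores into a single cross-validated RMSE.
cv_rmse <- mean(fold_rmse)
cv_rmse
```

Each row is scored by a model that never saw it during fitting, which is why this estimate is only meaningful if the predictors themselves carry no leaked information about profit.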
Special Cases
- In a few cases, students avoided information leaks as far as I can tell, but it was hard to be sure as code contained hard-coded indices and I may have made a mistake.
- I allowed two plausible disagreements over what constituted an information leak, both derived from sales data but plausibly known at development time:
- Seasonality, and
- Sales strategy of direct vs. stock sales, based on percentages.
- Ultimately, this led to too few eligible submissions to preserve anonymity, and I have closed the rankings.
My Apologies
- The degree of information leakage on the final project may have arisen from an error in my instruction.
- I apologize.
- Please review these slides and avoid information leakages in future work.
- Overfitting… or a leak?