Goal

This is a generalized ML assignment

Predict the profit of future products developed at CravenSpeed
Using any model (or ensemble) you’d like
Evaluated using RMSE on a holdout sample (that you won’t see)
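
As a reminder of what the metric measures, here is a minimal sketch of RMSE in base R. The helper name `rmse` and the toy numbers are made up for illustration; they are not part of the assignment code or the CravenSpeed data.

```r
# Root mean squared error: square root of the mean squared residual.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy values, invented for illustration only.
actual    <- c(100, 250, -40)
predicted <- c(110, 240, -10)
rmse(actual, predicted)
```

Because errors are squared before averaging, a single badly mispredicted product moves RMSE more than several small misses do.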

Important

Predict the profit of “future products developed at CravenSpeed”

Information leaks

Recall the following slides: Overfitting… or a leak?

I find it unreasonable to assume that any of the following could be known about a product before it is developed:
- These are column indices 2:29, 38, and 41.

 [1] "First Order (from sales data)" "src_www"
 [3] "src_iphone"                    "src_android"
 [5] "src_ipad"                      "src_manual"
 [7] "src_facebookshop"              "src_external"
 [9] "src_Amazon FBM"                "January"
[11] "February"                      "March"
[13] "April"                         "May"
[15] "June"                          "July"
[17] "August"                        "September"
[19] "October"                       "November"
[21] "December"                      "pct_Direct Sales"
[23] "pct_Orders for Stock"          "pct_Drop Shipments"
[25] "pct_Platypus"                  "pct_Moss Motors Only"
[27] "pct_R&D Club"                  "Units Sold"
[29] "Revenue 2019 to present"       "Sales Channel"

That is, it is not meaningful to predict profit from metrics that can only be gathered if profit is known.
I find it reasonable to consider the following:

 [1] "lookupId"                         "Base Product Sku"
 [3] "Listing Type"                     "Unit Weight (pounds)"
 [5] "retailPrice"                      "make"
 [7] "model"                            "yearMin"
 [9] "yearMax"                          "Number of Components"
[11] "BOM Cost"                         "Product Type"
[13] "Designer"                         "Main Component Material"
[15] "Main Component MFG Process"       "Main Component Finishing Process"

For this reason, I have included only these fields in the secret hold-out data.
- Profit, precomputed, is also present (of course).
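
One way to apply this split without hard-coded indices is to keep columns by name. The sketch below uses base R. The helper `drop_leaks` and the toy data frame are hypothetical, and the field names are copied from the list above on the assumption that they match the data exactly.

```r
# Fields listed above as knowable before a product is developed.
allowed <- c(
  "lookupId", "Base Product Sku", "Listing Type", "Unit Weight (pounds)",
  "retailPrice", "make", "model", "yearMin", "yearMax",
  "Number of Components", "BOM Cost", "Product Type", "Designer",
  "Main Component Material", "Main Component MFG Process",
  "Main Component Finishing Process"
)

# Hypothetical helper: keep only allowed predictors plus the target column.
drop_leaks <- function(df, target = "Profit") {
  keep <- intersect(names(df), c(allowed, target))
  df[, keep, drop = FALSE]
}

# Toy data frame, invented for illustration; "Units Sold" is a leak.
toy <- data.frame(
  retailPrice  = c(20, 35),
  `Units Sold` = c(500, 120),
  `BOM Cost`   = c(4.5, 9),
  Profit       = c(3000, 1500),
  check.names  = FALSE
)
names(drop_leaks(toy))
```

Selecting by name keeps the filter readable and still works if the column order changes, which is harder to guarantee with positions such as 2:29.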

Assessment

Assessments will be evaluated via RMSE over the secret data as follows:

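library(tidyverse)  # select() and the pipe-friendly verbs
library(caret)      # train() and trainControl()

# Load the secret hold-out data.
fast <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/refs/heads/master/dat/secret.rds")))

# Set the target aside, apply the submitted engineer(), and keep the first 10 features.
profit <- fast["Profit"]
fast <- fast |> select(-Profit) |> engineer() |> select(1:10)
fast["Profit"] <- profit

# Report the 5-fold cross-validated RMSE of a linear model.
train(Profit ~ .,
      data = fast,
      method = "lm",
      trControl = trainControl(method = "cv", number = 5))$results$RMSE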

The sample engineering function achieved an RMSE of 2074.117 by selecting the first 10 features and naively modeling.
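
For illustration only, here is a sketch of what a leak-free `engineer()` might look like. This is not the sample function that produced the RMSE above. The derived features are hypothetical, and the sketch assumes that `retailPrice`, `BOM Cost`, `yearMin`, `yearMax`, and `Number of Components` are numeric columns.

```r
# Hypothetical engineer(): uses only fields knowable before development.
engineer <- function(df) {
  derived <- data.frame(
    gross_margin = df[["retailPrice"]] - df[["BOM Cost"]],
    year_span    = df[["yearMax"]] - df[["yearMin"]],
    n_components = df[["Number of Components"]]
  )
  # Put derived columns first, since the assessment keeps only the first 10.
  cbind(derived, df)
}

# Toy rows, invented for illustration only.
toy <- data.frame(
  retailPrice = c(20, 35),
  `BOM Cost`  = c(4.5, 9),
  yearMin     = c(2014, 2007),
  yearMax     = c(2019, 2013),
  `Number of Components` = c(3, 7),
  check.names = FALSE
)
engineer(toy)[, 1:3]
```

Because the assessment applies `select(1:10)` after `engineer()`, column order matters: any feature placed beyond the tenth position is ignored.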

Special Cases

In a few cases, students avoided information leaks as far as I can tell, but it was hard to be sure because the code contained hard-coded indices, and I may have made a mistake.
I allowed two plausible disagreements over what constituted an information leak, each derived from sales data but plausibly known at development time:
- Seasonality, and
- Sales strategy of direct vs. stock sales, based on percentages.
Ultimately, this left too few eligible submissions to preserve anonymity, and I have closed the rankings.

My Apologies

The degree of information leakage on the final project may have arisen from an error in my instruction.
I apologize.
Please review these slides and avoid information leakages in future work.
Overfitting… or a leak?