Link: https://www.kaggle.com/c/mlb-player-digital-engagement-forecasting

Problem Type: Time Series

Input: Game statistics for the current day

Output: a prediction of how engaged fans are with a specific player after a game

  • there are 4 target columns, each representing the engagement level fans have with that player
  • we are predicting the engagement levels 1 day after the game (i.e. 1 day after the x_train data)
    • however, during the evaluation period, “The test data arrives in a data frame identical in format to train.csv, except it does not contain the target values.”
    • so even though the training data includes those engagement levels, you can’t use today’s engagement level to predict tomorrow’s, since at test time you don’t know what today’s engagement level is!

Eval Metric: mean column-wise mean absolute error (MCMAE)
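
As a quick reference, MCMAE can be sketched as the MAE of each target column, averaged over the columns. The function name `mcmae` and the toy arrays below are made up for illustration; this is a minimal sketch assuming NumPy, not the competition's official scoring code.

```python
import numpy as np

def mcmae(y_true, y_pred):
    """Mean column-wise MAE: MAE per target column, then the mean of those."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_column_mae = np.abs(y_true - y_pred).mean(axis=0)  # one MAE per column
    return per_column_mae.mean()

# Toy example: 3 rows, 4 target columns (values are invented).
y_true = np.array([[1.0, 2.0, 3.0, 4.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [5.0, 5.0, 5.0, 5.0]])
y_pred = np.array([[2.0, 2.0, 3.0, 4.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [5.0, 5.0, 2.0, 5.0]])
score = mcmae(y_true, y_pred)
print(score)
```

Because every column has the same number of rows, this equals the plain MAE over all cells; the column-wise form just makes it easy to see which target contributes most.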

Summary

  • Note:
    • Binary columns contain nulls as well as zeroes. A zero means the player had an opportunity to do something but did not; a null means the player never had the opportunity.
      • e.g. a player who does not pitch on a given day cannot possibly pitch a shutout
      • how did they solve this?
  • Data issues:
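
For the null-versus-zero note above, one common way to keep the distinction is to add an indicator column before filling the nulls. This is only a sketch of a generic approach, not necessarily what the top solutions did; the column names (`playerId`, `shutout`) and the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative, not the competition's.
df = pd.DataFrame({
    "playerId": [1, 2, 3],
    "shutout": [1.0, 0.0, np.nan],  # NaN: the player did not pitch that day
})

# Record whether the value was present, then fill the nulls with 0.
df["shutout_present"] = df["shutout"].notna().astype(int)
df["shutout_filled"] = df["shutout"].fillna(0).astype(int)
print(df)
```

With both columns, a model can tell "pitched but no shutout" (present = 1, filled = 0) apart from "did not pitch" (present = 0, filled = 0).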

Important notebooks/discussions

Solutions

Takeaways

  • Use an is-present bit (a 0/1 flag) if you expect a variable such as the target to be available only for part of the time
  • You need a lot of feature engineering with time series. Don’t be afraid to make 1000 features!
    • but dataset length probably matters, because this dataset was fairly big (8 GB)
  • Lagged features were very important (e.g. num pitches today minus num pitches 30 days ago)
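
The lagged-difference idea above can be sketched with pandas as follows. The column names (`playerId`, `date`, `numPitches`), the helper `add_lag_diff`, and the data are all made up for illustration, and the toy example uses a lag of 2 rows instead of 30 to stay small. It assumes exactly one row per player per day, so shifting by `lag` rows means looking back `lag` days.

```python
import pandas as pd

# Hypothetical toy data: one row per player per day; names are illustrative.
df = pd.DataFrame({
    "playerId": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": pd.to_datetime(
        ["2021-04-01", "2021-04-02", "2021-04-03", "2021-04-04"] * 2
    ),
    "numPitches": [10, 20, 15, 30, 5, 0, 8, 12],
})

def add_lag_diff(frame, col, lag):
    """Add `col` minus its value `lag` rows earlier for the same player."""
    frame = frame.sort_values(["playerId", "date"]).copy()
    lagged = frame.groupby("playerId")[col].shift(lag)  # never crosses players
    frame[f"{col}_diff_{lag}"] = frame[col] - lagged
    return frame

out = add_lag_diff(df, "numPitches", lag=2)
print(out)
```

The first `lag` rows of each player have no earlier value, so the new column is NaN there. If days can be missing for a player, the frame would need to be reindexed to a full daily calendar first, otherwise a row shift no longer corresponds to a fixed number of days.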