Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting, 1912.09363.pdf (arxiv.org)

  • the main idea is that this model predicts over more than one time horizon
    • rather than predicting only the next timestep, it predicts several steps ahead at once
  • it uses GRN layers (Gated Residual Network) to perform nonlinear operations
    • it’s just a small layer that is used everywhere to sprinkle in nonlinearity (see the GRN sketch after this list)
  • it automatically selects which features to use via the Variable Selection Networks
    • it removes unimportant / noisy features so the model can predict easier
    • Here’s how the Variable Selection Network / Layer works:
      • all of the features go into individual GRN layers (left side)
      • but all of the features are ALSO concatenated into one giant feature vector and go into another GRN (the right one, with the softmax)
      • finally, the softmax weights are used to take a weighted sum of the per-feature GRN outputs
        • so the GRN of the concatenated features decides how much of each feature gets propagated (kinda like the gates in an LSTM; see the VSN sketch after this list)
  • after the features have been selected, each of these outputs is fed into an LSTM layer so the model can pick up time-dependent features
  • static covariate encoder
    • As an example, taking ζ to be the output of the static variable selection network, the context for temporal variable selection would be encoded according to c_s = GRN_{c_s}(ζ) (see the context-encoder sketch after this list)
  • The attention mechanism is used to “learn long-term relationships across different time steps”
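
Sketches

A minimal PyTorch sketch of the GRN as I understand it from the paper, not the authors’ implementation: a small feed-forward block with an ELU nonlinearity, a GLU-style gate, a residual connection, and layer norm. Names like GatedResidualNetwork, d_model, and d_context are mine, and it assumes the input and output dimensions match (the paper adds a projection on the skip connection when they don’t).

```python
import torch
import torch.nn as nn

class GatedResidualNetwork(nn.Module):
    """Minimal GRN sketch: ELU nonlinearity + GLU gate + residual + LayerNorm."""
    def __init__(self, d_model: int, d_context: int | None = None):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)
        self.ctx = nn.Linear(d_context, d_model, bias=False) if d_context else None
        self.fc2 = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 2 * d_model)   # produces the GLU value and gate
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a, c=None):
        x = self.fc1(a)
        if self.ctx is not None and c is not None:
            x = x + self.ctx(c)                       # optional static context input
        x = torch.nn.functional.elu(x)
        x = self.fc2(x)
        value, gate = self.gate(x).chunk(2, dim=-1)
        x = value * torch.sigmoid(gate)               # GLU: gate decides how much passes through
        return self.norm(a + x)                       # residual + layer norm
```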
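
A sketch of the Variable Selection Network built on that GRN, matching the description above: each feature goes through its own GRN, the concatenated feature vector goes through another GRN plus softmax, and the softmax weights form a weighted sum of the per-feature outputs. Names are again mine; one simplification is the small linear layer (weight_proj) that maps the flattened GRN output to one weight per variable, whereas the paper sizes that GRN’s output to the number of variables directly.

```python
class VariableSelectionNetwork(nn.Module):
    """Minimal VSN sketch: per-variable GRNs (left branch) + softmax weights
    computed from the concatenated features (right branch)."""
    def __init__(self, n_vars: int, d_model: int, d_context: int | None = None):
        super().__init__()
        self.var_grns = nn.ModuleList([GatedResidualNetwork(d_model) for _ in range(n_vars)])
        self.flat_grn = GatedResidualNetwork(n_vars * d_model, d_context)
        self.weight_proj = nn.Linear(n_vars * d_model, n_vars)

    def forward(self, variables, c=None):
        # variables: list of n_vars tensors, each of shape (batch, d_model)
        processed = torch.stack([grn(v) for grn, v in zip(self.var_grns, variables)], dim=1)
        flat = torch.cat(variables, dim=-1)
        weights = torch.softmax(self.weight_proj(self.flat_grn(flat, c)), dim=-1)  # (batch, n_vars)
        # weighted sum: the softmax decides how much of each feature propagates
        selected = (weights.unsqueeze(-1) * processed).sum(dim=1)                  # (batch, d_model)
        return selected, weights
```

The selected output is what then gets fed into the LSTM at each timestep.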
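
And a sketch of the static covariate encoder: four separate GRNs applied to the static embedding ζ, each producing one context vector (c_s for temporal variable selection as in the equation above, c_e for static enrichment, c_c and c_h for the LSTM’s initial cell and hidden states). StaticCovariateEncoder and zeta are assumed names.

```python
class StaticCovariateEncoder(nn.Module):
    """Four GRNs turning the static embedding ζ into context vectors c_s, c_e, c_c, c_h."""
    def __init__(self, d_model: int):
        super().__init__()
        self.grns = nn.ModuleList([GatedResidualNetwork(d_model) for _ in range(4)])

    def forward(self, zeta):
        # c_s: temporal variable selection, c_e: static enrichment,
        # c_c / c_h: initial cell and hidden state of the sequence-to-sequence LSTM
        c_s, c_e, c_c, c_h = (grn(zeta) for grn in self.grns)
        return c_s, c_e, c_c, c_h
```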

Takeaways

  • When designing this model, they placed the attention mechanism later in the network: the inputs are first “preprocessed” with many different layers before attention is applied