- Forecasting multiple time steps ahead is hard
- Many methods lack explainability
- Complex interactions between known and unknown (observed-only) inputs
- Traditional explainability methods often do not account for temporal ordering
- Forecasts are very important (\$)
A data set has \(I\) unique entities (e.g. stores in a chain), each observed at time steps \(t \in [0, T_i]\)
\[\hat{y}_i(q,t,\tau) = f_q\left(\tau, y_{i, t-k:t}, \mathbf{z}_{i,t-k:t}, \mathbf{x}_{i,t-k:(t+\tau)},\mathbf{s}_i\right)\]
where \(q\) is a quantile, \(\tau\) the forecast horizon, \(y_{i,t-k:t}\) the past targets over a lookback window of length \(k\), \(\mathbf{z}_{i,t-k:t}\) the observed inputs (known only up to \(t\)), \(\mathbf{x}_{i,t-k:(t+\tau)}\) the known inputs (available into the future), and \(\mathbf{s}_i\) the static covariates
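A shape-level sketch of these inputs for a batch of entities; all sizes and variable names are hypothetical, not from the paper:

```python
import torch

batch, k, tau_max = 32, 168, 24          # hypothetical sizes
d_z, d_x, d_s = 4, 3, 2                  # feature dimensions (assumed)

y_past  = torch.randn(batch, k, 1)               # y_{i,t-k:t}: past targets
z_past  = torch.randn(batch, k, d_z)             # z_{i,t-k:t}: observed, past-only inputs
x_known = torch.randn(batch, k + tau_max, d_x)   # x_{i,t-k:(t+tau)}: known past and future
s_stat  = torch.randn(batch, d_s)                # s_i: static covariates
# f_q maps these to forecasts of shape (batch, tau_max, len(quantiles)).
```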
A static covariate encoder produces four context vectors; they feed into variable selection, the sequence-to-sequence layer's initial states, and static enrichment
Gating mechanisms (gated residual networks, GRNs) provide adaptive depth/complexity
Without context, \(c = 0\). Each unit's input and output have the same dimension \(d_{\text{model}}\), which is shared across the whole TFT
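A minimal PyTorch sketch of a GRN under these conventions; the class name and the fixed \(d_{\text{model}}\) input/output size are my own choices, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualNetwork(nn.Module):
    """GRN(a, c) = LayerNorm(a + GLU(W1 ELU(W2 a + W3 c + b2) + b1))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w2 = nn.Linear(d_model, d_model)
        self.w3 = nn.Linear(d_model, d_model, bias=False)  # optional context projection
        self.w1 = nn.Linear(d_model, d_model)
        self.glu = nn.Linear(d_model, 2 * d_model)         # gate and candidate in one matrix
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a, c=None):
        eta = self.w2(a) if c is None else self.w2(a) + self.w3(c)  # c = 0 when absent
        eta = self.w1(F.elu(eta))
        gate, cand = self.glu(eta).chunk(2, dim=-1)
        return self.norm(a + torch.sigmoid(gate) * cand)   # gated residual connection
```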
Variable selection networks weight each input variable's embedding via a GRN and softmax, giving interpretable selection weights \(\nu\)
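A compact sketch reusing the GatedResidualNetwork class from above; the real network also conditions the selection weights on static context, and derives the logits with a GRN rather than the plain linear layer used here for brevity:

```python
class VariableSelectionNetwork(nn.Module):
    """Returns a weighted-sum embedding plus the softmax selection weights nu."""
    def __init__(self, d_model: int, n_vars: int):
        super().__init__()
        self.selector = nn.Linear(n_vars * d_model, n_vars)  # selection logits (simplified)
        self.var_grns = nn.ModuleList(
            GatedResidualNetwork(d_model) for _ in range(n_vars))

    def forward(self, xi):
        # xi: (batch, n_vars, d_model), one embedding per input variable
        nu = torch.softmax(self.selector(xi.flatten(1)), dim=-1)       # (batch, n_vars)
        processed = torch.stack(
            [grn(x) for grn, x in zip(self.var_grns, xi.unbind(1))], dim=1)
        return (nu.unsqueeze(-1) * processed).sum(dim=1), nu           # embedding, weights
```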
Local temporal patterns are learned by a sequence-to-sequence (LSTM) encoder–decoder
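A sketch of that encoder–decoder in the same style; in the TFT two of the static context vectors initialise the LSTM states (the names here are assumed):

```python
class LocalProcessing(nn.Module):
    """LSTM encoder over past inputs; decoder continues over known future inputs."""
    def __init__(self, d_model: int):
        super().__init__()
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, past, future, c_h, c_c):
        # c_h, c_c: static context vectors used as the initial hidden/cell states
        state = (c_h.unsqueeze(0), c_c.unsqueeze(0))   # (1, batch, d_model)
        enc_out, state = self.encoder(past, state)
        dec_out, _ = self.decoder(future, state)       # decoder starts from encoder state
        return torch.cat([enc_out, dec_out], dim=1)    # features over past and future steps
```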
Attention introduces additional dimensions \(d_{\text{attn}}\) and \(d_{\text{value}}\)
\[\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = A\left(\mathbf{Q},\mathbf{K}\right)\mathbf{V}, \qquad A(\mathbf{Q},\mathbf{K}) = \text{Softmax}\!\left(\mathbf{Q}\mathbf{K}^\top/\sqrt{d_{\text{attn}}}\right)\]
Multi-head attention is less interpretable, so the TFT shares the value matrix across heads and averages the per-head attention weights
\[\tilde{H} = \left(\frac{1}{m_H} \sum_{h=1}^{m_H} A\left(\mathbf{Q}\mathbf{W}_Q^{(h)},\mathbf{K}\mathbf{W}_K^{(h)}\right)\right) \mathbf{V}\mathbf{W}_{V}\]
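A PyTorch sketch of this interpretable multi-head attention; class and attribute names are assumptions, and the decoder's causal mask is omitted for brevity:

```python
import math

class InterpretableMultiHeadAttention(nn.Module):
    """Per-head Q/K projections, one shared value projection, averaged attention."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        d_attn = d_model // n_heads
        self.wq = nn.ModuleList(nn.Linear(d_model, d_attn) for _ in range(n_heads))
        self.wk = nn.ModuleList(nn.Linear(d_model, d_attn) for _ in range(n_heads))
        self.wv = nn.Linear(d_model, d_attn)   # shared value matrix W_V
        self.wh = nn.Linear(d_attn, d_model)   # final output projection
        self.scale = math.sqrt(d_attn)

    def forward(self, q, k, v):
        # Average the per-head attention matrices A(Q W_Q^(h), K W_K^(h))
        attn = torch.stack([
            torch.softmax(wq(q) @ wk(k).transpose(-2, -1) / self.scale, dim=-1)
            for wq, wk in zip(self.wq, self.wk)
        ]).mean(dim=0)
        return self.wh(attn @ self.wv(v)), attn   # output and interpretable weights
```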
The last step is a position-wise feed-forward layer (a GRN) with weights shared across time steps
\[L\left(\Omega, \mathbf{W}\right) = \sum_{y_t \in \Omega} \sum_{q \in \mathcal{Q}}\sum_{\tau=1}^{\tau_{\max}} \frac{QL\left(y_t, \hat{y}(q,t-\tau, \tau),q\right)}{M\tau_{\max}}\]
\[QL(y,\hat{y},q) = q\left(y-\hat{y}\right)_++(1-q)\left(\hat{y}-y\right)_+\]
where \(\Omega\) is the training data containing \(M\) samples and \(\mathcal{Q}\) the set of quantiles
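A direct sketch of both formulas; the tensor layout is assumed:

```python
import torch

def quantile_loss(y, y_hat, q):
    """QL(y, y_hat, q) = q (y - y_hat)_+ + (1 - q) (y_hat - y)_+"""
    return q * torch.clamp(y - y_hat, min=0.0) + (1 - q) * torch.clamp(y_hat - y, min=0.0)

def training_loss(y, y_hat, quantiles=(0.1, 0.5, 0.9)):
    # y: (M, tau_max) targets; y_hat: (M, tau_max, len(quantiles)) forecasts.
    # .mean() over samples and horizons supplies the 1 / (M * tau_max) factor.
    return sum(quantile_loss(y, y_hat[..., i], q).mean()
               for i, q in enumerate(quantiles))
```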
The TFT provides three interpretability use cases:

- Individual variable importance
- Persistent temporal patterns
- Identifying interesting regime changes
Variable importance is extracted from the variable selection step by sampling the weights \(\nu_i\) across the test set and recording their quantiles
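For example, with the selection weights collected into an array (layout assumed; with the VariableSelectionNetwork sketch above these are its second return value):

```python
import numpy as np

def variable_importance(nu_samples, qs=(0.1, 0.5, 0.9)):
    # nu_samples: recorded selection weights, shape (n_samples, n_vars)
    return np.quantile(nu_samples, qs, axis=0)   # (len(qs), n_vars) importance summary
```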
Persistent temporal patterns are established by sampling and recording the quantiles of the attention weights
This is done for both one-step-ahead and multi-horizon forecasts
Calculate the average attention per temporal position \(n\) and forecast horizon \(\tau\): \[\bar{\alpha}(n,\tau) = \frac{1}{T}\sum_{t=1}^T \alpha(t,n,\tau)\] These form a distribution over the positions because \(\sum_n \bar{\alpha}(n,\tau) = 1\).
Calculate the "distance" to the long term attention average, averaged over the forecast window
\[dist(t) = \frac{1}{\tau_{max}}\sum_{\tau=1}^{\tau_{max}}\kappa(\bar{\mathbf{\alpha}}(\tau),\alpha(t,\tau))\]
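A numpy sketch of both computations, assuming the attention weights are stored as an array indexed \(\alpha(t, n, \tau)\) and using a Bhattacharyya-coefficient-based \(\kappa\):

```python
import numpy as np

def regime_distance(alpha):
    """dist(t) for attention weights alpha of shape (T, n_positions, tau_max)."""
    alpha_bar = alpha.mean(axis=0)                     # long-term average pattern
    # kappa(p, q) = sqrt(1 - sum_n sqrt(p_n q_n)): distance on distributions
    bc = np.sqrt(alpha * alpha_bar[None]).sum(axis=1)  # (T, tau_max) coefficients
    return np.sqrt(np.clip(1.0 - bc, 0.0, None)).mean(axis=1)  # average over horizons
```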