Temporal dependence characterizes time series data: observations close in time tend to be similar, unlike the independent rows of cross-sectional data.
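A quick way to see this: the lag-1 autocorrelation of an ordered series is high, but vanishes once the ordering is destroyed. A small NumPy sketch (toy data, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))   # random walk: strongly temporally dependent
x_shuffled = rng.permutation(x)       # same values, temporal ordering destroyed

def lag1_autocorr(v: np.ndarray) -> float:
    v = (v - v.mean()) / v.std()
    return float(np.mean(v[:-1] * v[1:]))

print(lag1_autocorr(x))            # close to 1
print(lag1_autocorr(x_shuffled))   # close to 0
```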
Missing mechanisms (Rubin, 1976):
- MCAR (missing completely at random)
- MAR (missing at random)
- MNAR (missing not at random)
Missing patterns:
- point missing
- subsequence missing
- block missing
Imputation / handling methods (a short sketch of a few of these follows the list):
- deletion
- constant imputation
- LOCF - last observation carried forward
- NOCB - next observation carried backward
- mean/median/mode
- rolling statistics
- linear interpolation
- spline interpolation
- KNN
- regression
- seasonal-trend decomposition using LOESS (STL)
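A minimal pandas sketch of a few of the simple methods above (toy series, illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, np.nan, 7.0])

locf   = s.ffill()                        # last observation carried forward
nocb   = s.bfill()                        # next observation carried backward
mean_  = s.fillna(s.mean())               # mean imputation
linear = s.interpolate(method="linear")   # linear interpolation
roll   = s.fillna(s.rolling(3, min_periods=1).mean())  # rolling-statistic fill
```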
TimesNet ---> Github - TSLib
TSLib/TimesNet only supports the point-missing pattern: time points are randomly masked at ratios of {12.5%, 25%, 37.5%, 50%}.
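A minimal sketch of this kind of random point masking (my own illustration of the idea, not TSLib's exact code; shapes and names assumed):

```python
import torch

def point_mask(batch_x: torch.Tensor, mask_rate: float = 0.25):
    """Randomly hide individual time points at the given ratio.

    batch_x: (batch, seq_len, n_features). Returns the masked input and a
    binary mask (1 = observed, 0 = masked), which the imputation model
    typically receives alongside the input.
    """
    mask = (torch.rand_like(batch_x) > mask_rate).float()
    return batch_x * mask, mask
```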
Results for Autoformer (Weather dataset)
| Mask Rate | MSE | MAE |
|---|---|---|
| 12.5% | 0.3128 | 0.4111 |
| 25% | 0.3024 | 0.3879 |
| 37.5% | 0.1488 | 0.2562 |
| 50% | 0.1428 | 0.2470 |

More masking → the model sees less observed data but is forced to learn deeper temporal dependencies and structural patterns.
This leads to more robust representations, like how dropout improves generalization by preventing over-reliance on specific inputs.

Results for TimesNet (Weather dataset)
| Masking Ratio | MAE | MSE |
|---|---|---|
| 0.125 | 0.04593 | 0.02517 |
| 0.25 | 0.05506 | 0.02932 |
| 0.375 | 0.05704 | 0.03088 |
| 0.5 | 0.06148 | 0.03413 |

TimesNet performance decreases as the masking ratio increases, whereas Autoformer's (above) improves.
That said, TimesNet performs much better on the TS imputation task overall; it also tops the TSLib leaderboard for this task.
Extension of this paper - Deep TS Models
This paper uses two datasets:
- PhysioNet 2012 (clinical dataset) - since the dataset has no ground truth, 10/50/90% of the observed values in the test data are held out as ground truth, and the corresponding inputs are masked using a Bernoulli distribution.
- Beijing Air Quality - uses the block-missing pattern. Around 13% of the data is already missing. For each missing data point, the same data point from the succeeding month is taken as the ground truth; for example, if 24th Feb is missing, its ground truth is 24th March (sketch below).
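A rough pandas sketch of that next-month ground-truth rule (hypothetical helper, assumes a series with a DatetimeIndex):

```python
import pandas as pd

def ground_truth_from_next_month(series: pd.Series) -> pd.Series:
    """For each missing timestamp, use the value exactly one month later
    (if it exists and is observed) as the evaluation ground truth."""
    gt = {}
    for ts in series[series.isna()].index:
        candidate = ts + pd.DateOffset(months=1)
        if candidate in series.index and pd.notna(series.loc[candidate]):
            gt[ts] = series.loc[candidate]
    return pd.Series(gt)
```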
TSI-Bench ---> Github - Awesome Imputation
TSI-Bench supports all three missing patterns: point, subsequence, and block.

- Transformers in TS - IJCAI
- DL for TSC - MLP, CNN, RNN/ESN, FCN, ResNet, Encoder, MCNN, t-LeNet, MCDCNN, Time-CNN
Voice2Series - ICML - achieves SOTA on 19 datasets ($x_t' = \mathrm{Pad}(x_t) + \delta$) - GitHub
Padding reprogramming, where the padded portion is replaced by a trainable additive vector $\delta = M \odot \theta$ - Aeon library
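A hedged PyTorch sketch of padding reprogramming (not the official Voice2Series code; lengths and names are assumptions): the short input is zero-padded to the target length, and a trainable perturbation $\theta$ is added only on the padded region through a fixed binary mask $M$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PaddingReprogram(nn.Module):
    def __init__(self, source_len: int, target_len: int):
        super().__init__()
        self.source_len = source_len
        self.target_len = target_len
        self.theta = nn.Parameter(torch.zeros(target_len))  # trainable θ
        mask = torch.ones(target_len)
        mask[:source_len] = 0.0                             # M = 1 only on padded positions
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, source_len) -> x' = Pad(x) + δ, with δ = M ⊙ θ
        x_pad = F.pad(x, (0, self.target_len - self.source_len))
        return x_pad + self.mask * self.theta
```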
A discriminative region is the subsequence of a time series that contains the most informative features for classifying the time series into the correct class.
| Method | How it finds the discriminative region |
|---|---|
| Shapelets | Learns short subsequences that best separate classes (e.g., slant) |
| Saliency/Grad-CAM | Highlights time points where gradients w.r.t. output are strongest |
| Attention models | Learn to focus on regions (middle slant) with highest class-relevance |
| Class activation maps | Show which part of input most influences the predicted class |
| Manual inspection | Plotting and observing differences (used in early literature) |
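Of these, the gradient-based view is the easiest to sketch; a hedged example of a simple saliency map for a differentiable time-series classifier (`model` and shapes are assumptions, not tied to a specific library):

```python
import torch

def saliency(model: torch.nn.Module, x: torch.Tensor, target_class: int) -> torch.Tensor:
    """Gradient magnitude of the target-class logit w.r.t. each time point.

    x: (batch, seq_len). Time points with the largest values are candidate
    discriminative regions.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)                        # (batch, n_classes)
    logits[:, target_class].sum().backward()
    return x.grad.abs()
```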
The anomaly detection training loop often has no labels.
The model is trained to re-construct normal (non-anomalous) data.
For normal samples, reconstruction error should be low.
For anomalous samples, reconstruction error should be high (the model has only been trained on normal data, so it reconstructs what a normal sample would look like while the original contains the anomaly).
- Calculate the reconstruction error (MSE) between the original input and the model's reconstruction: `score = torch.mean(self.anomaly_criterion(batch_x, outputs), dim=-1)`
- Concatenate all errors into a single array
- Find the threshold percentile (any test sample with a reconstruction error above this threshold, i.e. in the top `anomaly_ratio`% of errors, is flagged as an anomaly): `threshold = np.percentile(combined_energy, 100 - self.args.anomaly_ratio)`
- Flag every test sample whose error exceeds the threshold: `pred = (test_energy > threshold).astype(int)`
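Putting these steps together as a self-contained sketch (NumPy only; `train_errors` / `test_errors` are assumed arrays of per-sample reconstruction errors):

```python
import numpy as np

def flag_anomalies(train_errors: np.ndarray, test_errors: np.ndarray,
                   anomaly_ratio: float = 1.0) -> np.ndarray:
    """Threshold reconstruction errors: the top `anomaly_ratio` percent of all
    observed errors are treated as anomalies (1), the rest as normal (0)."""
    combined_energy = np.concatenate([train_errors, test_errors])
    threshold = np.percentile(combined_energy, 100 - anomaly_ratio)
    return (test_errors > threshold).astype(int)
```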
- Point anomalies (point-based) refer to data points that deviate remarkably from the rest of the data.
- Contextual anomalies (point-based) refer to data points within the expected range of the distribution (in contrast to point anomalies) but deviate from the expected data distribution, given a specific context (e.g., a window).
- Collective anomalies (sequence-based) refer to sequences of points that do not repeat a typical (previously observed) pattern.
- Dive into TS AD - describes many methods
- AnomalyBERT - ICLR - GitHub - processes the time series in patches (small groups of points). Unlike the original Transformer or ViT, it does not use sinusoidal positional encodings or absolute position embeddings; instead, a 1D relative position bias is added to each attention matrix to capture the relative positions between features within a window.
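A minimal sketch of a 1D relative position bias (my own illustration, not AnomalyBERT's implementation): a learnable table indexed by the offset $i - j$ is added to the attention scores of each head.

```python
import torch
import torch.nn as nn

class RelPosBias1D(nn.Module):
    def __init__(self, window_len: int, num_heads: int):
        super().__init__()
        # one learnable bias per possible offset in [-(L-1), L-1] and per head
        self.bias_table = nn.Parameter(torch.zeros(2 * window_len - 1, num_heads))
        idx = torch.arange(window_len)
        rel = idx[None, :] - idx[:, None] + window_len - 1   # (L, L), values in [0, 2L-2]
        self.register_buffer("rel_idx", rel)

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, heads, L, L), pre-softmax
        bias = self.bias_table[self.rel_idx]                  # (L, L, heads)
        return attn_scores + bias.permute(2, 0, 1).unsqueeze(0)
```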
Training

| Aspect | Short-Term | Long-Term | Difference |
|---|---|---|---|
| Time features | ❌ Not used | ✅ Uses `batch_x_mark` and `batch_y_mark` | Short-term often uses the raw series only |
| Model call | `self.model(batch_x, None, dec_inp, None)` | `self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark)` | Long-term uses full context |
| Loss calculation | `criterion(batch_x, freq_map, outputs, batch_y, batch_y_mark)` + optional sharpness loss | `criterion(outputs, batch_y)` | Short-term loss may include frequency/temporal sharpness terms |
| Sharpness regularization | ✅ `MSE(output diffs, target diffs)`, optional | ❌ Not applied | Unique to the short-term variant |
| Use of frequency map | ✅ Passed to the loss (frequency-aware loss function) | ❌ Not used in long-term training | Short-term focuses on frequency |
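For context, the long-term branch's `dec_inp` is commonly built by concatenating the known label segment with zeros as placeholders for the horizon; a hedged sketch of that TSLib-style pattern (argument names assumed):

```python
import torch

def build_dec_inp(batch_y: torch.Tensor, label_len: int, pred_len: int) -> torch.Tensor:
    """Decoder input = last `label_len` known steps followed by zeros for the
    `pred_len` steps to be forecast. batch_y: (batch, label_len + pred_len, n_features)."""
    zeros = torch.zeros_like(batch_y[:, -pred_len:, :])
    return torch.cat([batch_y[:, :label_len, :], zeros], dim=1)
```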
Validation

| Feature | Short-Term | Long-Term |
|---|---|---|
| # of test samples | 1 (last training slice) | Many (rolling across the test set) |
| Loop over batches | ❌ | ✅ |
| Decoder input | Single sample | Reconstructed per batch |
| Time marks used | ❌ | ✅ |
| Inverse scaling | Optional, less common | Common in scaled datasets |
| Evaluation metrics | Often skipped | Full set + DTW |
Statistical methods for forecasting - Paper
- Simple Exponential Smoothing
- Holt's method (Double Exponential Smoothing)
- Holt-Winters' method (Triple Exponential Smoothing)
- Holt-Winters' method with multiplicative seasonality
- Holt-Winters' method with additive seasonality (see the statsmodels sketch below)
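A small statsmodels sketch of these variants (toy positive series; `seasonal_periods=24` is an assumed seasonality):

```python
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

t = np.arange(240)
y = 10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 24)   # positive series with trend + seasonality

ses  = SimpleExpSmoothing(y).fit()                     # simple exponential smoothing
holt = ExponentialSmoothing(y, trend="add").fit()      # Holt's method (double)
hw_a = ExponentialSmoothing(y, trend="add", seasonal="add",
                            seasonal_periods=24).fit() # Holt-Winters', additive seasonality
hw_m = ExponentialSmoothing(y, trend="add", seasonal="mul",
                            seasonal_periods=24).fit() # Holt-Winters', multiplicative seasonality

print(hw_a.forecast(12))   # forecast the next 12 steps
```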





