Data Augmentation for High Dimensional Multivariate Time-Series Data Using Generative Adversarial Networks (GANs)
This work covers the generation and evaluation of synthetic human activity data generated by GANs. The overall aim is to generate realistic synthetic data that can be used to improve classification performance by extending the original dataset.
The generation pipeline takes real-world data as input and produces ten times as much synthetic data for each activity. This process is depicted in more detail in the following image:
The evaluation of the generated data is done in four ways:
- Visualize how well the distributions of each activity resemble the original ones using PCA and t-SNE
- Apply Maximum Mean Discrepancy (MMD) as a sample-based metric to analyze the similarity of the distributions
- Use TSTR/TRTS (Train on Synthetic, Test on Real / Train on Real, Test on Synthetic) to evaluate whether the synthetic data can be used as a substitute for real-world data
- Mix real and synthetic data with the aim to improve classification performance
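The MMD comparison from the list above can be illustrated with a minimal sketch. It assumes an RBF kernel (the acknowledgments point to an RBF-based reference implementation), the biased estimator of the squared MMD, and windows that have already been flattened to feature vectors. The function names, the `gamma` value, and the random stand-in data are illustrative and not taken from this repository.

```python
import numpy as np


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    # Squared Euclidean distances between every row of x and every row of y.
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd_rbf(real, synth, gamma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    k_rr = rbf_kernel(real, real, gamma)
    k_ss = rbf_kernel(synth, synth, gamma)
    k_rs = rbf_kernel(real, synth, gamma)
    return k_rr.mean() + k_ss.mean() - 2.0 * k_rs.mean()


# Random stand-in data (assumption): 200 samples with 8 features each.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 8))
close = rng.normal(0.0, 1.0, size=(200, 8))    # same distribution as `real`
shifted = rng.normal(2.0, 1.0, size=(200, 8))  # clearly different distribution

print("same distribution:   ", mmd_rbf(real, close, gamma=0.1))
print("shifted distribution:", mmd_rbf(real, shifted, gamma=0.1))
```

A value close to zero indicates that the two sample sets are hard to tell apart under the chosen kernel; larger values indicate a larger distribution mismatch. The result depends on `gamma`, so values are only comparable when the same kernel width is used throughout.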
Three different datasets are used as benchmarks, two of which were recorded in the course of this work.
- PAMAP2 contains simple activities of daily living
- SONAR/SONAR-LAB contain a variety of complex nursing activities with a high number of sensor channels
- Link to datasets will be added once published
The pipelines can be extended to include further HAR datasets, provided that they can be integrated into the Recording structure.
All requirements are listed in requirements.txt. Use the following command to install all dependencies automatically:
pip install -r requirements.txt
The src folder contains the following directories / files:
runner.py
- Entry point to program execution. See Executing program.
datatypes/
- Contains the basic data types Recording and Window, which are used to handle different datasets consistently.
evaluation/
- Metrics and utility functions for evaluation.
execute/
- Contains the actual pipelines that are executed by runner.py.
- Each dataset has a pipeline for generating synthetic data and one for evaluation.
loader/
- Functions used to read datasets and to fit them into the Recording structure.
- Preprocessing functions
models/
- Contains TensorFlow models.
scripts/ and visualization/
- Scripts to visualize and analyze the datasets.
TimeGAN/
- Contains a modified TimeGAN framework which is used to generate synthetic data. See Acknowledgments.
utils/
- Utility functions for reading, windowing and processing the data
- settings.py stores dataset-specific constants
labels.json (and similar)
- Contain all activities performed in SONAR/SONAR-LAB
Run runner.py with the following options:
- --dataset {pamap2,sonar,sonar_lab}: Dataset to use
- --mode {gen,eval}: Pipeline to use (generation or evaluation)
- --data_path DATA_PATH: Path to the dataset directory
- --synth_data_path SYNTH_DATA_PATH: Path to directory where the generated data is stored (used for evaluation only)
- --random_data_path RANDOM_DATA_PATH: Path to random data file (used for evaluation only)
- --window_size WINDOW_SIZE: Window size
- --stride_size STRIDE_SIZE: Stride size
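For orientation, the options above could be declared with Python's standard argparse module roughly as follows. This is a sketch, not the actual contents of runner.py: which options are required, the integer types for the window and stride sizes, and the absence of default values are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(prog="runner.py")
    parser.add_argument("--dataset", choices=["pamap2", "sonar", "sonar_lab"],
                        required=True, help="Dataset to use")
    parser.add_argument("--mode", choices=["gen", "eval"], required=True,
                        help="Pipeline to use (generation or evaluation)")
    parser.add_argument("--data_path", required=True,
                        help="Path to the dataset directory")
    # The next two options are documented as evaluation-only, so they are
    # left optional here (assumption).
    parser.add_argument("--synth_data_path",
                        help="Path to directory where the generated data is stored")
    parser.add_argument("--random_data_path",
                        help="Path to random data file")
    # Integer types are an assumption; no defaults are set.
    parser.add_argument("--window_size", type=int, help="Window size")
    parser.add_argument("--stride_size", type=int, help="Stride size")
    return parser


args = build_parser().parse_args([
    "--dataset", "sonar_lab",
    "--mode", "eval",
    "--data_path", "PATH_TO_DATASET",
    "--window_size", "300",
    "--stride_size", "300",
])
print(args.dataset, args.mode, args.window_size, args.stride_size)
```

Using `choices` makes argparse reject unknown dataset or mode names before any pipeline code runs.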
Example command:
$ python3 runner.py --dataset sonar_lab --mode eval --data_path PATH_TO_DATASET \
    --synth_data_path PATH_TO_SYNTHETIC_DATA --random_data_path PATH_TO_RANDOM_DATA_FILE \
    --window_size 300 --stride_size 300 > output.txt
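To illustrate how a window size and a stride size interact, the following sketch cuts a recording into fixed-length windows. It assumes a recording stored as a NumPy array of shape (time_steps, channels) and drops any incomplete window at the end; the function name `sliding_windows` is hypothetical, and the repository's own windowing code may behave differently.

```python
import numpy as np


def sliding_windows(recording, window_size, stride_size):
    """Cut a (time_steps, channels) array into fixed-length windows."""
    n_steps = recording.shape[0]
    # Start indices of all windows that fit completely into the recording.
    starts = range(0, n_steps - window_size + 1, stride_size)
    windows = [recording[s:s + window_size] for s in starts]
    if not windows:
        # Recording is shorter than one window: return an empty batch.
        return np.empty((0, window_size, recording.shape[1]))
    return np.stack(windows)


# Stand-in recording (assumption): 1000 time steps, 3 sensor channels.
recording = np.arange(1000 * 3, dtype=float).reshape(1000, 3)

non_overlapping = sliding_windows(recording, window_size=300, stride_size=300)
overlapping = sliding_windows(recording, window_size=300, stride_size=150)

print(non_overlapping.shape)  # (3, 300, 3)
print(overlapping.shape)      # (5, 300, 3)
```

When the stride equals the window size, as in the example command, the windows do not overlap. A smaller stride yields overlapping windows and therefore more samples from the same recording.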
Note: To run only some of the evaluations, the flags in the corresponding evaluation pipelines have to be set manually.
Figures: PCA visualization | t-SNE visualization
- TimeGAN: https://github.com/jsyoon0823/TimeGAN
- RBF calculation: https://github.com/jindongwang/transferlearning/blob/master/code/distance/mmd_numpy_sklearn.py
- DeepConvLSTM implementation: https://github.com/AniMahajan20/DeepConvLSTM-NNFL/blob/master/DeepConvLSTM.ipynb
- Project's repository: https://github.com/Sensors-in-Paradise