A key challenge when training machine learning models is collecting a large, diverse dataset that sufficiently captures the variability observed in the real world. Due to the cost of collecting and labeling datasets, data augmentation has emerged as a promising alternative.
The central idea in data augmentation is to transform examples in the dataset in order to generate additional augmented examples that can then be added to the data. These augmented examples typically increase the diversity of the data seen by the model and provide additional supervision. The foundations of data augmentation originate in tangent propagation, where model invariances were expressed by adding constraints on the derivatives of the learned model.
Early successes in augmentation such as AlexNet focused on inducing invariances in an image classifier by generating examples that encouraged translational or rotational invariance. These successes made augmentation a de-facto part of pipelines for a wide range of tasks such as image, speech, and text classification, as well as machine translation.
The choice of transformations used in augmentation is an important consideration, since it dictates the behavior and invariances learned by the model. While heuristic augmentations remain popular, controlling and programming the augmentation pipeline carefully has become increasingly important. TANDA initiated a study of the problem of programming augmentation pipelines by composing a selection of data transformations. This area has seen rapid growth in recent years, with both deeper theoretical understanding and practical implementations such as AutoAugment. A nascent line of work leverages conditional generative models to learn these transformations rather than specify them, further extending this programming paradigm.
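One way to read the tangent-propagation idea concretely (the notation below is a sketch, not drawn from any single paper): writing t(x_i) for the tangent vector of a transformation at input x_i and J_f(x_i) for the Jacobian of the model f, the invariance is encouraged by penalizing the directional derivative of f along t:

```latex
\mathcal{L}(f) = \sum_i \ell\big(f(x_i), y_i\big)
  + \lambda \sum_i \big\lVert J_f(x_i)\, t(x_i) \big\rVert^{2}
```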
This document provides a detailed breakdown of resources in data augmentation.
Augmentation has been instrumental to achieving high-performing models since the original AlexNet paper on ILSVRC, which used random crops, translation & reflection of images for training, and test-time augmentation for prediction.
Since then, augmentation has become a de-facto part of image training pipelines and an integral part of text applications such as machine translation.
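As a rough illustration of the test-time augmentation idea mentioned above, here is a minimal sketch in PyTorch; the model and the list of augmentation callables are placeholders:

```python
import torch

def predict_with_tta(model, image, augmentations):
    """Average a model's predictions over several augmented copies of one input."""
    with torch.no_grad():
        preds = [model(aug(image)) for aug in augmentations]
    return torch.stack(preds).mean(dim=0)
```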
- Tangent Propagation expresses desired model invariances induced by a data augmentation as tangent constraints on the directional derivatives of the learned model.
- Kernel Theory of Data Augmentation connects the tangent propagation view of data augmentation to kernel-based methods.
- A Group-Theoretic Framework for Data Augmentation develops a theoretical framework to study data augmentation, showing how it can reduce variance and improve generalization.
- On the Generalization Effects of Linear Transformations in Data Augmentation analyzes an over-parameterized linear regression setting and studies the generalization effect of applying a family of linear transformations in this setting.
A large body of work utilizes hand-crafted data augmentation primitives in order to improve model performance. These hand-crafted primitives are designed based on domain knowledge about data properties, e.g. rotating an image preserves its content and should typically not change the class label.
The next few sections provide a sampling of work across several different modalities (images, text, audio) that take this approach.
Heuristic transformations are commonly used in image augmentations, such as rotations, flips or crops (e.g. AlexNet, Inception).
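A typical training pipeline along these lines, sketched with torchvision; the particular transforms and parameter values are illustrative rather than prescriptive:

```python
import torchvision.transforms as T

# Common heuristic image augmentations: random crop, horizontal flip, small rotation.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomRotation(degrees=15),
    T.ToTensor(),
])
```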
Recent work has proposed more sophisticated hand-crafted primitives:
- Cutout randomly masks patches of the input image during training.
- Mixup augments a training dataset with convex combinations of training examples (a minimal sketch appears after this list). There is substantial empirical evidence that Mixup can improve generalization and adversarial robustness. A recent theoretical analysis helps explain these gains, showing that the Mixup loss can be approximated by the standard ERM loss plus regularization terms.
- CutMix combines the two approaches above: instead of summing two input images (like Mixup), CutMix pastes a random patch from one image onto the other and updates the label to be a weighted sum of the two image labels, proportional to the area each image contributes.
- MixMatch and ReMixMatch extend the utility of these techniques to semi-supervised settings.
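A minimal sketch of Mixup on a batch, assuming labels are one-hot (or soft) vectors; CutMix replaces the convex combination of pixels with patch pasting but keeps the same label mixing:

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mixup: blend each example with a randomly paired one, inputs and labels alike."""
    lam = np.random.beta(alpha, alpha)        # mixing coefficient in [0, 1]
    perm = torch.randperm(x.size(0))          # random pairing within the batch
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]   # requires one-hot / soft labels
    return x_mixed, y_mixed
```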
While these primitives have yielded compelling performance gains, they can often produce unnatural images and distort image semantics. However, data augmentation techniques such as AugMix can mix together various unnatural augmentations and lead to images that appear more natural.
Heuristic transformations for text typically involve paraphrasing text in order to produce more diverse samples.
- On a token level, synonym substitution methods replace words with their synonyms (a minimal sketch appears after this list). Synonyms might be chosen based on
- a knowledge base such as a thesaurus: e.g. Character-level Convolutional Networks for Text Classification and An Analysis of Simple Data Augmentation for Named Entity Recognition
- neighbors in a word embedding space: e.g. That’s So Annoying!!!
- probable words according to a language model that takes the sentence context into account: e.g. Model-Portability Experiments for Textual Temporal Analysis, Data Augmentation for Low-Resource Neural Machine Translation and Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
- Sentence parts can be reordered by manipulating the syntax tree of a sentence: e.g. Data augmentation via dependency tree morphing for low-resource languages
- The whole sentence can be modified via Backtranslation, where a round-trip translation from a source language to a target language and back is used to generate a paraphrase. Examples of use include QANet and Unsupervised Data Augmentation for Consistency Training.
- Vocal Tract Length Warping approaches, such as Audio Augmentation for Speech Recognition and Vocal Tract Length Perturbation (VTLP), improve speech recognition
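A minimal sketch of thesaurus-based synonym substitution using WordNet via NLTK; it assumes the WordNet corpus has already been downloaded, and real systems typically add part-of-speech and stop-word filtering:

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") beforehand

def synonym_substitute(tokens, p=0.1):
    """Replace each token with a random WordNet synonym with probability p."""
    augmented = []
    for tok in tokens:
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(tok)
                    for lemma in synset.lemmas()} - {tok}
        if synonyms and random.random() < p:
            augmented.append(random.choice(sorted(synonyms)))
        else:
            augmented.append(tok)
    return augmented
```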
- Stochastic Feature Mapping approaches, such as in Data Augmentation for Deep Neural Network Acoustic Modeling and Continuous Probabilistic Transform for Voice Conversion
An interesting idea is to learn the augmentation pipeline itself, a line of work initiated by TANDA. This area has seen rapid growth in recent years, with both deeper theoretical understanding and practical implementations like AutoAugment.
The idea is to determine the right subset of augmentation primitives, and the order in which they should be applied. These pipelines are primarily built on top of a fixed set of generic transformations. Methods vary by the learning algorithm used, which can be
- reinforcement learning approaches, pioneered by the TANDA work and extended by AutoAugment;
- computationally efficient algorithms for learning augmentation policies, such as Population-Based Augmentation, Fast AutoAugment, and Faster AutoAugment;
- random sampling, as in RandAugment (a minimal sketch follows this list), and uncertainty-based random sampling schemes, as in Dauphin.
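As one concrete example, a RandAugment-style policy ships with recent torchvision releases; the two knobs are how many operations to apply per image and how strongly (the values below are illustrative):

```python
import torchvision.transforms as T

# RandAugment samples num_ops transformations per image from a fixed pool and
# applies them at a shared magnitude, sidestepping an expensive policy search.
augment = T.Compose([
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
])
```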
There is substantial prior work on learning transformations that produce semantic, rather than superficial, changes to an input.
One paradigm is to learn a semantically meaningful data representation, and manipulate embeddings in this representation to produce a desired transformation.
- several methods express these transformations as vector operations over embeddings, such as in Deep Visual Analogy Making, Deep feature interpolation for image content changes
- other methods look towards manifold traversal techniques such as Deep Manifold Traversal: Changing Labels with Convolutional Features, Learning to disentangle factors of variation with manifold interaction
- other methods, such as DeepAugment, simply use existing image-to-image models and manipulate embeddings randomly to produce diverse image outputs
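A bare-bones sketch of the embedding-manipulation paradigm above, assuming hypothetical pretrained encoder and decoder callables:

```python
def latent_interpolation_augment(x_a, x_b, encoder, decoder, alpha=0.5):
    """Blend two examples in a learned embedding space and decode the mixture
    back to input space to obtain a semantically interpolated sample."""
    z = alpha * encoder(x_a) + (1 - alpha) * encoder(x_b)
    return decoder(z)
```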
Another class of approaches relies on training conditional generative models that learn a mapping between two or more data distributions. A prominent use case focuses on imbalanced datasets, where learned augmentations are used to generate examples for underrepresented classes or domains. Examples of these approaches include BaGAN, DAGAN, TransferringGAN, Synthetic Examples Improve Generalization for Rare Classes, Learning Data Manipulation for Augmentation and Weighting, Generative Models For Deep Learning with Very Scarce Data, Adversarial Learning of General Transformations for Data Augmentation, DADA, and A Bayesian Data Augmentation Approach for Learning Deep Models.
Recent approaches use a combination of learned domain translation models with consistency training to further improve performance e.g. Model Patching.
Several open questions remain in data augmentation and synthetic data generation.
- While augmentation has been found to have a strong positive effect on performance, what kinds of augmentations maximize model robustness? How should such augmentations be specified or learned?
- Augmentation adds several sources of noise to training. The inputs are transformed or corrupted, and may no longer be likely to occur under the data distribution. The common assumption that augmentation leaves the label unmodified is often violated in discrete data such as text, where small changes can have a large impact on the label. What is the effect of the noise added by data augmentation? Can we tolerate larger amounts of noise to improve performance further?
- The "Automating the Art of Data Augmentation" series of blog posts by Sharon Li provide an overview of data augmentation.