Data Augmentation

A key challenge when training machine learning models is collecting a large, diverse dataset that sufficiently captures the variability observed in the real world. Due to the cost of collecting and labeling datasets, data augmentation has emerged as a promising alternative.

The central idea in data augmentation is to transform examples in the dataset in order to generate additional augmented examples that can then be added to the data. These additional examples typically increase the diversity of the data seen by the model and provide additional supervision. The foundations of data augmentation originate in tangent propagation, where model invariances were expressed by adding constraints on the derivatives of the learned model.

Early successes in augmentation such as AlexNet focused on inducing invariances in an image classifier by generating additional examples, e.g. translated or rotated copies of training images. These successes made augmentation a de-facto part of pipelines for a wide range of tasks such as image, speech, and text classification, machine translation, and more.

The choice of transformations used in augmentation is an important consideration, since it dictates the behavior and invariances learned by the model. While heuristic augmentations have remained popular, there is a growing need to control and program the augmentation pipeline more carefully. TANDA initiated a study of the problem of programming augmentation pipelines by composing a selection of data transformations. This area has seen rapid growth in recent years, with both deeper theoretical understanding and practical implementations such as AutoAugment. A nascent line of work leverages conditional generative models to learn, rather than specify, these transformations, further extending this programming paradigm.

This document provides a detailed breakdown of resources in data augmentation.

History

Augmentation has been instrumental to achieving high-performing models since the original AlexNet paper on ILSVRC, which used random crops, translation & reflection of images for training, and test-time augmentation for prediction.

Since then, augmentation has become a de-facto part of image training pipelines and an integral part of text applications such as machine translation.

Theoretical Foundations

Augmentation Primitives

Hand-Crafted Primitives

A large body of work utilizes hand-crafted data augmentation primitives in order to improve model performance. These hand-crafted primitives are designed based on domain knowledge about data properties, e.g. rotating an image preserves its content and should typically not change the class label.

The next few sections provide a sampling of work across several different modalities (images, text, audio) that take this approach.

Images

Heuristic transformations such as rotations, flips, or crops are commonly used for image augmentation (e.g. AlexNet, Inception).
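
Heuristic image augmentation of this kind is easy to assemble with standard libraries. The snippet below is a minimal sketch using torchvision; the particular transforms, parameters, and crop size are illustrative choices, not a recommended recipe.

```python
from torchvision import transforms

# A typical heuristic pipeline: random crop, horizontal flip, small rotation.
# The specific parameters here are illustrative, not tuned values.
heuristic_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # random crops (as in AlexNet-style training)
    transforms.RandomHorizontalFlip(p=0.5),  # reflection
    transforms.RandomRotation(degrees=10),   # small rotations
    transforms.ToTensor(),
])
```

The pipeline is applied independently to each training image, so every epoch the model sees a slightly different version of the dataset.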

Recent work has proposed more sophisticated hand-crafted primitives:

  • Cutout randomly masks patches of the input image during training.
  • Mixup augments a training dataset with convex combinations of training examples (a minimal sketch appears after this list). There is substantial empirical evidence that Mixup can improve generalization and adversarial robustness. A recent theoretical analysis helps explain these gains, showing that the Mixup loss can be approximated by the standard ERM loss plus regularization terms.
  • CutMix combines the two approaches above: instead of summing two input images (like Mixup), CutMix pastes a random patch from one image onto the other and sets the label to a weighted sum of the two image labels, proportional to the area of the patch.
  • MixMatch and ReMixMatch extend the utility of these techniques to semi-supervised settings.
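
As referenced above, here is a minimal sketch of Mixup on a batch of examples. It assumes one-hot labels and PyTorch tensor batches; the function name and the Beta parameter `alpha` are illustrative, not a library API.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Form convex combinations of examples and their (one-hot) labels.

    x: input batch of shape (B, ...); y: labels of shape (B, C).
    A single mixing coefficient lam ~ Beta(alpha, alpha) is drawn per batch,
    which is one common variant (some implementations draw one per example).
    """
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix
```

Training then proceeds on the mixed batch with the usual loss, which is where the regularization effect analyzed in the theoretical work enters.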

While these primitives have produced compelling performance gains, they can often produce unnatural images and distort image semantics. However, data augmentation techniques such as AugMix can mix together various unnatural augmentations and produce images that appear more natural.

Text

Heuristic transformations for text typically involve paraphrasing text in order to produce more diverse samples.
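
One simple paraphrasing heuristic is synonym replacement. The sketch below uses a toy, hand-written synonym table purely for illustration; in practice the table would come from a thesaurus (e.g. WordNet) or a learned paraphrase model.

```python
import random

# Toy synonym table; contents are illustrative only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
    "movie": ["film"],
}

def synonym_replace(sentence, p=0.1):
    """Randomly swap words for synonyms to produce a paraphrased sample."""
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in SYNONYMS and random.random() < p:
            out.append(random.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)
```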

Audio

Assembled Pipelines

An interesting idea is to learn augmentation pipelines, a study initiated by TANDA. This area has seen rapid growth in recent years with both deeper theoretical understanding and practical implementations like AutoAugment.

The idea is to determine the right subset of augmentation primitives, and the order in which they should be applied. These pipelines are primarily built on top of a fixed set of generic transformations, and methods differ mainly in the learning algorithm used to search over this space.
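
The sketch below illustrates the search space being optimized: a pipeline is a choice of primitives and an order in which to apply them. Here the choice is made by uniform random sampling; a learned policy (as in TANDA or AutoAugment) would instead choose the subset, order, and magnitudes to maximize downstream performance. The primitive pool is an illustrative selection, not the actual AutoAugment search space.

```python
import random
from torchvision import transforms

# A fixed pool of generic primitives (illustrative choice).
PRIMITIVES = [
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.3),
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.GaussianBlur(kernel_size=3),
]

def sample_pipeline(num_ops=2):
    """Sample a subset of primitives and an order in which to apply them."""
    ops = random.sample(PRIMITIVES, k=num_ops)
    return transforms.Compose(ops)
```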

Learned Primitives

There is substantial prior work on learning transformations that produce semantic, rather than superficial, changes to an input.

One paradigm is to learn a semantically meaningful data representation, and manipulate embeddings in this representation to produce a desired transformation.
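
A minimal sketch of this paradigm, assuming a pretrained autoencoder with hypothetical `encoder` and `decoder` modules (not a specific library API): perturb the embedding rather than the raw input, so the change is semantic rather than pixel-level.

```python
import torch

def latent_augment(x, encoder, decoder, noise_scale=0.1):
    """Augment by perturbing an example in a learned latent space.

    `encoder` and `decoder` are assumed to come from a pretrained
    autoencoder (or similar representation learner); this is a sketch of
    the idea, not a specific method from the literature cited here.
    """
    with torch.no_grad():
        z = encoder(x)                                  # map input to its embedding
        z_aug = z + noise_scale * torch.randn_like(z)   # small semantic perturbation
        return decoder(z_aug)                           # decode back to input space
```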

Another class of approaches relies on training conditional generative models that learn a mapping between two or more data distributions. A prominent use case focuses on imbalanced datasets, where learned augmentations are used to generate examples for underrepresented classes or domains. Examples of these approaches include BaGAN, DAGAN, TransferringGAN, Synthetic Examples Improve Generalization for Rare Classes, Learning Data Manipulation for Augmentation and Weighting, Generative Models For Deep Learning with Very Scarce Data, Adversarial Learning of General Transformations for Data Augmentation, DADA, and A Bayesian Data Augmentation Approach for Learning Deep Models.
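
The typical usage pattern for the class-imbalance case is sketched below: sample from a pretrained class-conditional generator for the rare class and add the synthetic examples to the training set. The `generator(z, y)` interface is an assumption for illustration, not the API of any of the methods listed above.

```python
import torch

def oversample_rare_class(generator, rare_label, num_samples, latent_dim=128):
    """Synthesize extra examples for an underrepresented class using a
    (hypothetical) pretrained class-conditional generator."""
    z = torch.randn(num_samples, latent_dim)                      # latent noise
    y = torch.full((num_samples,), rare_label, dtype=torch.long)  # target class
    with torch.no_grad():
        x_synth = generator(z, y)   # synthetic inputs for the rare class
    return x_synth, y
```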

Recent approaches use a combination of learned domain translation models with consistency training to further improve performance e.g. Model Patching.
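
The consistency term can be written down generically: penalize divergence between the model's predictions on an example and on its (heuristically or generatively) augmented counterpart. The sketch below is one common instantiation using a KL term; it is not the exact Model Patching objective.

```python
import torch.nn.functional as F

def consistency_loss(model, x, x_aug):
    """KL divergence between predictions on an example and its augmentation."""
    log_p = F.log_softmax(model(x), dim=-1)       # predictions on the original
    q = F.softmax(model(x_aug), dim=-1).detach()  # predictions on the augmented copy
    return F.kl_div(log_p, q, reduction="batchmean")
```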

Future Directions

Several open questions remain in data augmentation and synthetic data generation.

  • Augmentation has been found to have a strong positive effect on performance, but what kinds of augmentations maximize model robustness? How should such augmentations be specified or learned?
  • Augmentation adds several sources of noise to training. The inputs are transformed or corrupted, and may no longer be likely under the data distribution. The common assumption that augmentation leaves the label unmodified is often violated for discrete data such as text, where small changes can have a large impact on the label. What is the effect of the noise added by data augmentation? Can we tolerate larger amounts of noise to improve performance further?

Further Reading