Revision


Introduction

Data augmentation is a family of methods for synthetically increasing the quantity of training data.

Goals

Improve model’s performance overall or on certain classes,
Generalize better,
Enforce certain behaviors.

Methods

Simple label-preserving transformation,
Perturbation,
Data synthesis,
Generative adversarial network (GAN).

Label-preserving transformations

Label-preserving transformations slightly modify a data point without changing its label. They are widely used in computer vision and NLP.

In Computer Vision
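Common label-preserving transformations for images include flipping, cropping, and rotating: a mirrored picture of a dog is still a picture of a dog. A minimal sketch, assuming an image stored as a nested list of pixel rows. The helper names `horizontal_flip` and `random_crop` are illustrative and do not come from any library; in practice an image library would do this work.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right; the label is unchanged."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a random crop_h x crop_w window out of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

# A 3x4 toy "image" with label DOG (hypothetical data).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
label = "DOG"

# Every augmented copy keeps the original label.
augmented = [(horizontal_flip(image), label),
             (random_crop(image, 2, 2), label)]
```

Each augmented copy is added to the training set alongside the original, so the dataset grows without any new labeling effort.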

In NLP
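For text, a common label-preserving transformation is to replace a word with a similar one, which usually keeps the meaning or sentiment of the sentence. A minimal sketch, assuming a tiny hand-written synonym table. `SYNONYMS` and `synonym_replace` are hypothetical names; in practice the candidates might come from a thesaurus or from nearby words in an embedding space.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"happy": ["glad", "pleased"], "movie": ["film"]}

def synonym_replace(sentence, synonyms, rng=random):
    """Swap every word that has known synonyms for a randomly chosen one."""
    words = []
    for word in sentence.split():
        candidates = synonyms.get(word.lower())
        words.append(rng.choice(candidates) if candidates else word)
    return " ".join(words)

original = ("I am so happy to see this movie", "POSITIVE")
# The text changes, the label does not.
augmented = (synonym_replace(original[0], SYNONYMS), original[1])
```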

Perturbations

Small perturbations, such as noise, can be added to data to make classification harder. Training on perturbed examples can help the model become more robust to such inputs.
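The simplest form of perturbation is random noise. A minimal sketch, assuming a flattened image with pixel intensities in [0, 1]. The function name `add_gaussian_noise` and the value of `sigma` are arbitrary choices for illustration. Adversarial methods differ in that they search for the perturbation rather than sampling it at random.

```python
import random

def add_gaussian_noise(pixels, sigma=0.05, rng=random):
    """Add zero-mean Gaussian noise to each pixel, clipped back to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

pixels = [0.0, 0.25, 0.5, 0.75, 1.0]    # toy flattened image
noisy = add_gaussian_noise(pixels, sigma=0.05)
# The noisy copy keeps the label of the original image.
```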

In Computer Vision

From “DeepFool: a simple and accurate method to fool deep neural networks” by S. M. Moosavi-Dezfooli et al.

In NLP

BERT was pretrained on perturbed text: some tokens in each input are masked or replaced, and the model is trained to predict the original word at each perturbed position.
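The masking scheme can be sketched as follows. In the BERT paper, about 15% of tokens are selected; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are assumptions for illustration (BERT uses WordPiece tokens), and selecting each token independently is a simplification.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (perturbed, targets); targets[i] is None where nothing is predicted."""
    perturbed, targets = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            perturbed.append(token)     # token not selected: keep it, no target
            targets.append(None)
            continue
        targets.append(token)           # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            perturbed.append(MASK)                  # 80%: mask token
        elif roll < 0.9:
            perturbed.append(rng.choice(vocab))     # 10%: random token
        else:
            perturbed.append(token)                 # 10%: left unchanged
    return perturbed, targets

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
perturbed, targets = mask_tokens(tokens, vocab)
```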

Data synthesis

Since collecting data is expensive and slow with many potential privacy concerns, it’d be a dream if we could sidestep it altogether and train our models with synthesized data. Even though we’re still far from being able to synthesize all training data, it’s possible to synthesize some training data to boost a model’s performance.

In Computer Vision

In computer vision, a straightforward way to synthesize new data is to combine existing examples with discrete labels to generate continuous labels. Consider a task of classifying images with two possible labels: DOG (encoded as 0) and CAT (encoded as 1). From example \(x_1\) of label DOG and example \(x_2\) of label CAT, you can generate \(x'\) such that:

\[x' = \gamma x_1 + (1 - \gamma) x_2\]

The label of \(x'\) is the same combination of the labels of \(x_1\) and \(x_2\): \(\gamma \times 0 + (1 - \gamma) \times 1\). This method is called mixup. Its authors showed that mixup improves models’ generalization, reduces their memorization of corrupt labels, increases their robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

Mixup is useful for:

Incentivizing models to learn linear relationships,
Improving generalization on speech and tabular data,
Stabilizing the training of GANs.
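The mixup formula above can be sketched in a few lines. This is a minimal illustration, assuming examples stored as flat lists of floats. Drawing \(\gamma\) from a Beta(\(\alpha\), \(\alpha\)) distribution follows the mixup paper; the helper name `mixup`, the toy vectors, and the default `alpha=0.2` are assumptions made for this example.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their labels with weight gamma ~ Beta(alpha, alpha)."""
    gamma = rng.betavariate(alpha, alpha)
    x_mixed = [gamma * a + (1 - gamma) * b for a, b in zip(x1, x2)]
    y_mixed = gamma * y1 + (1 - gamma) * y2
    return x_mixed, y_mixed, gamma

x_dog, y_dog = [0.2, 0.4, 0.6], 0.0    # DOG encoded as 0
x_cat, y_cat = [0.8, 0.6, 0.0], 1.0    # CAT encoded as 1

x_new, y_new, gamma = mixup(x_dog, y_dog, x_cat, y_cat)
# y_new is a continuous label in [0, 1]; here it equals 1 - gamma.
```

A small `alpha` concentrates \(\gamma\) near 0 or 1, so most synthesized examples stay close to one of the two originals.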

In NLP

In NLP, templates can be a cheap way to bootstrap your model. It is possible to bootstrap training data for a conversational AI (chatbot). A template might look like: “Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION].” With lists of all possible cuisines, reasonable numbers (you would probably never want to search for restaurants beyond 1000 miles), and locations (home, office, landmarks, exact addresses) for each city, you can generate thousands of training queries from a template.
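Template expansion is a Cartesian product over the slot values. A minimal sketch, assuming short hypothetical lists for each slot; the variable names and slot values are illustrative only.

```python
from itertools import product

TEMPLATE = "Find me a {cuisine} restaurant within {number} miles of {location}."

# Hypothetical slot values; real lists would be much longer.
cuisines = ["Thai", "Mexican", "Vietnamese"]
numbers = [2, 5, 10]
locations = ["home", "my office", "the train station"]

# One query per combination of slot values.
queries = [
    TEMPLATE.format(cuisine=c, number=n, location=loc)
    for c, n, loc in product(cuisines, numbers, locations)
]
```

Three values per slot already give 27 queries, and the count grows multiplicatively with each list, which is why a single template can yield thousands of training examples.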

Generative adversarial network

A GAN pairs two neural networks: a generator, which produces synthetic data, and a discriminator, a classification network that tries to tell real data from generated data. The generator is trained to fool the discriminator so that it cannot recognize whether a sample is real or generated.

Both networks (discriminator and generator) are trained at the same time. As the discriminator gets better at distinguishing real data from generated data, the generator is pushed to improve, which in turn pushes the discriminator to improve again, and so on.
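This two-player game is usually written as a minimax objective. The formulation below is the one from the original GAN paper. Its symbols are not defined elsewhere on this page: \(G\) is the generator (the network that produces synthetic data), \(D\) is the discriminator, which outputs the probability that its input is real, \(z\) is random noise drawn from \(p_z\), and \(p_{\text{data}}\) is the distribution of real data.

\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]

The discriminator maximizes this quantity by assigning high probability to real samples and low probability to generated ones, while the generator minimizes it by producing samples that the discriminator scores as real.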

Data augmentation using a GAN. Results from a model trained on GAN-augmented data vs. expert and standard training. From “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks” by V. Sandfort et al.

Resources

See:

CS329, lecture 4.