Revision

Back to ML System Design


Introduction

In general, more data and better-quality data lead to better predictions, which is why how training data is sampled and labeled matters so much.


Data Bias

Data is full of potential biases, which can creep in at every step: collecting, sampling, and labeling.


Sampling

Sampling is an integral part of the ML workflow. Sampling happens in many steps of an ML project lifecycle, such as sampling from all possible real-world data to create training data, sampling from a given dataset to create splits for training, validation, and testing, or sampling from all possible events that happen within your ML system for monitoring purposes.

See Cross validation for some sampling methods.

Here is a list of sampling methods:

Non-probability sampling: convenience sampling, snowball sampling, judgment sampling, quota sampling. Convenient, but the selected samples are not representative of the real-world data, so they bake in selection bias.

Probability sampling: simple random sampling, stratified sampling, weighted sampling, reservoir sampling, importance sampling. Each sample has a known probability of being selected, which makes the resulting dataset representative. Reservoir sampling is especially useful for streaming data (see the sketch below).
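
As a concrete example, here is a minimal sketch of reservoir sampling (Algorithm R): it keeps a uniform random sample of k items from a stream whose length is unknown in advance, which is exactly the situation when sampling events flowing through a production ML system for monitoring.

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream
        of unknown length (Algorithm R)."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                # Fill the reservoir with the first k items.
                reservoir.append(item)
            else:
                # Replace a reservoir slot with probability k/(i+1).
                j = random.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = item
        return reservoir

    # e.g. sample 10 events from a (possibly unbounded) event stream
    sample = reservoir_sample(range(1_000_000), k=10)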


Labeling

Obtaining labels, and accurate ones, is a prerequisite for training a supervised ML model.


Label multiplicity

Different annotators often disagree on the label for the same sample, especially when a high level of domain expertise is required.

Solutions:

Have a clear problem definition and annotation guidelines so that annotators apply the same rules.
Resolve remaining disagreements explicitly (e.g., majority vote or adjudication by an expert).
Track data lineage so that label problems can be traced back to their source.


Handling the Lack of Hand Labels

Because of the challenges in acquiring sufficient high-quality labels, several families of techniques have been developed to work around the lack of hand labels:

Weak supervision: leverage (often noisy) heuristics to generate labels
Semi-supervision: leverage structural assumptions to generate labels from a small set of initial labels
Transfer learning: reuse a model pretrained on another task
Active learning: label only the samples that are most useful to the model

Here is a decision tree showing how to get more labeled data:


Weak Supervision: Programmatic labeling

The insight behind weak supervision is that people rely on heuristics, which can be developed with subject matter expertise, to label data. One of the most popular open-source tools for weak supervision is Snorkel.


Compared to hand labeling, programmatic labeling is cheaper and faster: labeling functions can be applied to large amounts of data, and they can be versioned, shared, and reused.


Programmatic labeling is performed through labeling functions (LFs), which encode heuristics. Common types of heuristics include:

Keyword heuristic: the sample contains a particular word
Regular expressions: the sample matches or fails to match a pattern
Database lookup: the sample contains an entry from a known list
Outputs of other models: an existing system's prediction for the sample


However, labeling functions are noisy: they can be inaccurate, they can conflict with each other, and they can be correlated. This is why tools like Snorkel combine, denoise, and reweight the votes of all LFs to produce a single probabilistic label per sample (see the sketch below).
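
A minimal sketch of what this looks like with Snorkel's labeling API, assuming a made-up spam-vs-ham comment task (the data and the specific LFs here are illustrative, not from these notes):

    import pandas as pd
    from snorkel.labeling import PandasLFApplier, labeling_function
    from snorkel.labeling.model import LabelModel

    ABSTAIN, HAM, SPAM = -1, 0, 1

    @labeling_function()
    def lf_contains_link(x):
        # Keyword heuristic: comments with links are often spam.
        return SPAM if "http" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_short_message(x):
        # Another heuristic: very short comments are usually legitimate.
        return HAM if len(x.text.split()) < 5 else ABSTAIN

    df_train = pd.DataFrame({"text": [
        "check out http://example.com for free stuff",
        "great video, thanks!",
    ]})

    # Apply all LFs: one (possibly abstaining) vote per LF per sample.
    applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_message])
    L_train = applier.apply(df=df_train)

    # Combine and denoise the noisy votes into probabilistic labels.
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train=L_train, n_epochs=100, seed=123)
    probs = label_model.predict_proba(L=L_train)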


Semi-supervision

If weak supervision leverages heuristics to obtain noisy labels, semi-supervision leverages structural assumptions to generate new labels based on a small set of initial labels. Unlike weak supervision, semi-supervision requires an initial set of labels.


Examples of semi-supervision methods

Self-training: train a model on the labeled set, predict labels for unlabeled samples, add the high-confidence predictions to the training set, and repeat (see the sketch after this list).

Structural assumption: samples that share similar characteristics share the same label, e.g. discover clusters in the data and propagate each cluster's known labels to its unlabeled members.

Perturbation-based methods: small perturbations to a sample should not change its label, so perturbed copies of labeled samples (e.g. with added noise) can be used as new labeled samples.
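
A minimal self-training sketch using scikit-learn's SelfTrainingClassifier; the 80% masking rate and the 0.9 confidence threshold are arbitrary choices for illustration:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    X, y = load_iris(return_X_y=True)

    # Simulate a small initial labeled set: scikit-learn marks
    # unlabeled samples with -1.
    rng = np.random.default_rng(0)
    y_semi = y.copy()
    y_semi[rng.random(len(y)) < 0.8] = -1

    # Repeatedly: fit on the labeled samples, pseudo-label unlabeled
    # samples predicted with probability above the threshold, refit.
    clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                 threshold=0.9)
    clf.fit(X, y_semi)
    print(clf.score(X, y))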


Transfer Learning

Transfer learning refers to the family of methods where a model developed for one task is reused as the starting point for a model on a second task. Fine-tuning a pretrained model is the most common form.

It is widely used in:

Computer vision: backbones pretrained on ImageNet are fine-tuned for downstream vision tasks (see the sketch below)
NLP: pretrained language models such as BERT or GPT are fine-tuned on downstream tasks
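
A minimal PyTorch/torchvision sketch of the vision case; the 10-class downstream task is hypothetical:

    import torch.nn as nn
    from torchvision import models

    NUM_CLASSES = 10  # hypothetical downstream task

    # Start from a backbone pretrained on ImageNet.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the pretrained weights...
    for param in model.parameters():
        param.requires_grad = False

    # ...and replace the classification head with a fresh, trainable
    # layer sized for the new task. Train this head first; optionally
    # unfreeze (parts of) the backbone later for full fine-tuning.
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)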


Active learning

Active learning is a method for improving the efficiency of data labels. The hope here is that ML models can achieve greater accuracy with fewer training labels if they can choose which data samples to learn from. Active learning is sometimes called query learning — though this term is getting increasingly unpopular — because a model (active learner) sends back queries in the form of unlabeled samples to be labeled by annotators (usually humans).
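
A minimal sketch of the most common query strategy, uncertainty (least-confidence) sampling; the dataset, pool sizes, and query budget are made up for illustration:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, random_state=0)
    labeled = np.zeros(len(y), dtype=bool)
    labeled[:20] = True  # small initial labeled pool

    for _ in range(5):  # five labeling rounds
        model = LogisticRegression(max_iter=1000)
        model.fit(X[labeled], y[labeled])

        # Least-confidence score: 1 - max predicted class probability.
        pool = np.flatnonzero(~labeled)
        proba = model.predict_proba(X[pool])
        uncertainty = 1.0 - proba.max(axis=1)

        # Query the 10 most uncertain samples; here the "annotator" is
        # simulated by revealing the ground-truth labels.
        query = pool[np.argsort(uncertainty)[-10:]]
        labeled[query] = True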




Resources

See: Chip Huyen, Designing Machine Learning Systems (O'Reilly, 2022), Chapter 4: Training Data.