Revision


Introduction

A failure happens when one or more of the system's expectations are violated.

Two types of expectations:

  1. Operational expectations: the system is up and returns predictions within acceptable latency.
  2. ML performance expectations: the model's predictions meet the expected quality metrics (e.g., accuracy).

Causes of operational failures


Causes of ML failures

  1. Production data differing from training data
  2. Edge cases
  3. Degenerate feedback loops


Production data differing from training data


Edge cases

Edge cases are data samples so extreme that they cause the model to make catastrophic mistakes. Even though edge cases generally refer to data samples drawn from the same distribution, a sudden increase in the number of data samples on which your model performs poorly can be an indication that the underlying data distribution has shifted.

Autonomous vehicles are often used to illustrate how edge cases can prevent an ML system from being deployed. But this is also true for any safety-critical application, such as medical diagnosis, traffic control, eDiscovery, etc. It can also be true for non-safety-critical applications. Imagine a customer service chatbot that gives reasonable responses to most requests but sometimes spits out outrageously racist or sexist content. Such a chatbot is a brand risk for any company that wants to use it, rendering it unusable.

An ML model that performs well on most cases but fails on a small number of cases might not be usable if these failures cause catastrophic consequences. For this reason, major self-driving car companies are focusing on making their systems work on edge cases.


Edge cases and outliers


Feedback loops

A feedback loop is a way to get new labels from users' responses to the model's predictions.

Natural labels

Natural labels:

Delayed labels



Degenerate feedback loops



Detect Degenerate feedback loops



Mitigate Degenerate feedback loops

  1. Randomization:



  2. Positional features (see the sketch below)
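
A minimal sketch of the positional-feature idea, using a toy click-prediction setup (the column names and the logistic-regression model are illustrative assumptions, not from the source): encode the display position as a feature at training time, then fix it to a constant at inference so predictions are not biased by where an item was shown.

```python
# Sketch of the positional-feature mitigation (toy data, assumed column names):
# add a feature for whether an item was shown in the first slot, then
# neutralize it at inference so ranking isn't biased by position.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training logs: the slot each item was shown in and whether
# the user clicked it.
logs = pd.DataFrame({
    "relevance_score": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3],
    "position":        [1,   2,   1,   3,   2,   1],
    "clicked":         [1,   0,   1,   0,   0,   1],
})
logs["is_first_position"] = (logs["position"] == 1).astype(int)

features = ["relevance_score", "is_first_position"]
model = LogisticRegression().fit(logs[features], logs["clicked"])

# At inference, set the positional feature to the same constant for every
# candidate so the predicted click probability depends only on relevance.
candidates = pd.DataFrame({
    "relevance_score": [0.6, 0.4, 0.95],
    "is_first_position": [0, 0, 0],
})
scores = model.predict_proba(candidates[features])[:, 1]
print(scores)  # rank candidates by these position-debiased scores
```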



Data Distribution shift



Covariate shift

Covariates in ML are the input features \(X\).

Mathematically, covariate shift is when \(P(X)\) changes, but \(P(Y \vert X)\) remains the same, which means that the distribution of the input changes, but the conditional probability of a label given an input remains the same.


Example

Consider the task of detecting breast cancer. You know that the risk of breast cancer is higher for women over the age of 40, so you have a variable ‘age’ as your input. You might have more women over the age of 40 in your training data than in your inference data, so the input distributions differ between training and inference. In this case \(P(Y)\) is higher in the training data than in the inference data because the distributions of \(X\) differ. However, for a given age, the probability \(P(Y \vert X)\) of having breast cancer for a woman of that age remains the same in both the training and the inference data.
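
A toy simulation of this scenario (the probabilities are made up for illustration): the marginal distribution of age \(P(X)\) differs between the training and inference samples while the age-conditional cancer probability \(P(Y \vert X)\) is held fixed, so the marginal \(P(Y)\) ends up differing too.

```python
# Toy simulation (made-up probabilities) of covariate shift: P(age group)
# differs between training and inference, but P(cancer | age group) is fixed.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, p_over_40):
    over_40 = rng.random(n) < p_over_40         # P(X) differs per dataset
    p_cancer = np.where(over_40, 0.04, 0.005)   # P(Y | X) is the same
    cancer = rng.random(n) < p_cancer
    return over_40, cancer

train_x, train_y = sample(100_000, p_over_40=0.7)   # more women over 40
infer_x, infer_y = sample(100_000, p_over_40=0.3)   # fewer women over 40

print("P(Y) train:", train_y.mean(), "| inference:", infer_y.mean())
print("P(Y | over 40) train:", train_y[train_x].mean(),
      "| inference:", infer_y[infer_x].mean())
```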


Label shift

Label shift, also known as prior shift, prior probability shift or target shift, is when \(P(Y)\) changes but \(P(X \vert Y)\) remains the same. You can think of this as the case when the output distribution changes but for a given output, the input distribution stays the same.


Example

Assume the spread and/or lethality of a disease (e.g., COVID) decreases for the whole population, so \(P(Y)\) changes. However, for a given person who died from this disease, the probability \(P(X \vert Y)\) that this person was older than 60 at the time of death remains the same.


Concept Drift

Concept drift, also known as posterior shift, is when the input distribution remains the same but the conditional distribution of the output given an input changes: \(P(X)\) stays the same while \(P(Y \vert X)\) changes. You can think of this as “same input, different output”.


Example

Consider that you’re in charge of a model that predicts the price of a home based on its features. Before COVID-19, a 3-bedroom apartment in San Francisco could cost $2,000,000. However, at the beginning of COVID-19, many people left San Francisco, so the same apartment would cost only $1,500,000. So even though the distribution of home features remains the same, the conditional distribution of the price of a home given its features has changed.


Other data changes


Detecting Data Distribution shift

  1. Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, … (see the sketch after this list)
  2. Two-sample hypothesis test:
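
A sketch of the first approach, assuming a single feature's values from a reference (training) window and a current production window are available as pandas Series:

```python
# Sketch (assumed data) of comparing summary statistics between a reference
# window and the current production window to spot a distribution shift.
import pandas as pd

def summarize(values: pd.Series) -> pd.Series:
    return pd.Series({
        "mean": values.mean(),
        "median": values.median(),
        "variance": values.var(),
        "q05": values.quantile(0.05),
        "q95": values.quantile(0.95),
        "skewness": values.skew(),
        "kurtosis": values.kurtosis(),
    })

reference = pd.Series([12.1, 13.4, 11.8, 12.9, 13.1, 12.5])   # training window
production = pd.Series([14.2, 15.1, 13.9, 14.8, 15.5, 14.4])  # serving window
print(pd.concat({"reference": summarize(reference),
                 "production": summarize(production)}, axis=1))
```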


Two-sample test: KS test (Kolmogorov–Smirnov)

See the Kolmogorov–Smirnov two-sample test.
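
A hedged sketch of running the test with SciPy's ks_2samp; the synthetic data, window sizes, and p-value threshold are assumptions for illustration.

```python
# Sketch of a two-sample Kolmogorov–Smirnov test with SciPy: a small p-value
# suggests the two samples were not drawn from the same distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)    # training-time values
production = rng.normal(loc=0.3, scale=1.0, size=5_000)   # serving-time values

statistic, p_value = ks_2samp(reference, production)
if p_value < 0.01:
    print(f"possible shift: KS={statistic:.3f}, p={p_value:.3g}")
else:
    print(f"no shift detected: KS={statistic:.3f}, p={p_value:.3g}")
```

Note that the KS test only applies to one-dimensional data, so it is typically run per feature or on the model's prediction scores.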

Pros:


Types of shifts

Temporal shifts: time window scale matters


Addressing Data Distribution shift

  1. Train model using a massive dataset,
  2. Retrain model with new data from new distribution:


Monitoring vs. observability

As the industry realized that many things can go wrong with an ML system, many companies started investing in monitoring and observability for their ML systems in production. Monitoring and observability are sometimes used interchangeably, but they are different.

Setting up our system:


Monitoring


Operational metrics


ML metrics



  1. Accuracy-related metrics,
  2. Predictions,
  3. Features.



Predictions


Features


Feature monitoring problems:


Monitoring Toolbox


Monitoring -> Continual Learning


Resources

See: