Revision


Introduction

Data leakage occurs when some form of the label “leaks” into the features used for training, while this same information is not available during inference.


Causes of Data Leakage

  1. Splitting time-correlated data randomly instead of by time.
  2. Processing data before splitting: global statistics (e.g., mean, variance) are computed over the whole dataset, including the validation and test splits.
  3. Poor handling of duplicates before splitting: the same example ends up in both the train set and the test set.
  4. Group leakage: a group of examples with strongly correlated labels is divided across different splits.
  5. Leakage from the data generation and collection process.
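
The first four causes can be avoided with careful splitting and preprocessing. The following is a minimal sketch in plain Python, not a definitive implementation. The row layout (dicts with `t`, `user`, and `x` keys), the helper names, and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

def deduplicate(rows, keys):
    """Drop exact duplicates (compared on `keys`) before any splitting."""
    seen, unique = set(), []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique

def time_split(rows, time_key, train_frac=0.8):
    """Split by time: the earliest rows go to train, the latest to test."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def group_split(rows, group_key, test_frac=0.2, seed=0):
    """Keep all rows of a group in the same split."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

def fit_scaler(train_values):
    """Compute scaling statistics from the train split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero variance
    return mu, sigma

def scale(values, mu, sigma):
    """Apply statistics learned on train to any split."""
    return [(v - mu) / sigma for v in values]

# Hypothetical toy data: 10 rows plus one exact duplicate of the first row.
rows = [{"t": t, "user": "u%d" % (t % 4), "x": float(t)} for t in range(10)]
rows.append(dict(rows[0]))

unique = deduplicate(rows, keys=("t", "user", "x"))     # cause 3
train, test = time_split(unique, time_key="t")          # cause 1
mu, sigma = fit_scaler([r["x"] for r in train])         # cause 2
test_scaled = scale([r["x"] for r in test], mu, sigma)  # reuse train statistics
g_train, g_test = group_split(unique, group_key="user")  # cause 4
```

The order matters: deduplicate first, split second, and only then fit any statistics, using the train split alone.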


How to detect Data Leakage?

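  1. Measure correlation of a feature with labels: if a feature has an unusually high correlation with the label, investigate how that feature is generated.
  2. Feature ablation study: if removing a feature causes a significant drop in performance, investigate why that feature is so important.
  3. Monitor model performance as more features are added: if a new feature improves performance significantly, either it is a very good feature or it contains leaked label information.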
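
The three checks above can be sketched in plain Python as follows. This is an illustrative sketch only: the toy dataset, the feature names `noise_feature` and `leaky_feature`, the 0.95 threshold, and the leave-one-out 1-nearest-neighbour scorer are assumptions chosen to keep the example self-contained. In practice the scorer would be your own training and validation routine.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def suspicious_features(columns, labels, threshold=0.95):
    """Check 1: flag features whose |correlation| with the label is very high."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]

def loo_1nn_accuracy(columns, labels, feature_names):
    """Toy scorer: leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    n, correct = len(labels), 0
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if j == i:
                continue
            d = sum((columns[f][i] - columns[f][j]) ** 2 for f in feature_names)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / n

def ablation_drops(feature_names, score_fn):
    """Check 2: score lost when each feature is removed on its own."""
    full = score_fn(feature_names)
    return {f: full - score_fn([g for g in feature_names if g != f])
            for f in feature_names}

def score_curve(feature_names, score_fn):
    """Check 3: score after adding each feature in order."""
    return [(f, score_fn(feature_names[:k + 1]))
            for k, f in enumerate(feature_names)]

# Hypothetical toy data: `leaky_feature` is a copy of the label.
labels = [0, 1, 0, 1, 1, 0, 1, 0]
columns = {
    "noise_feature": [i / 8 for i in range(8)],
    "leaky_feature": [float(y) for y in labels],
}
features = list(columns)
score = lambda names: loo_1nn_accuracy(columns, labels, names)

suspicious = suspicious_features(columns, labels)
drops = ablation_drops(features, score)
curve = score_curve(features, score)
```

On this toy data, `suspicious` contains only `leaky_feature`, removing it costs far more accuracy than removing `noise_feature`, and `curve` shows a sharp jump at the step where it is added. A flagged feature is not proof of leakage; it is a prompt to investigate how the feature is produced.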

