Revision


Data leakage: some form of the label “leaks” into the features, and this same information is not available during inference.

Common causes:

Splitting time-correlated data randomly instead of by time
- Solution: split by time, training on earlier data and evaluating on later data.
Data processing before splitting: using the whole dataset (including valid/test) to generate global statistics/info
- Solution: split your data before scaling or filling in missing values; ideally split even before any EDA so that you stay blind to the test set.
Poor handling of data duplication before splitting: the test set includes data from the train set
- Solution: deduplicate data before splitting; if you oversample, do it after splitting.
Group leakage: a group of examples has strongly correlated labels but is divided into different splits
- Solution: understand your data and keep track of its metadata, so that examples from the same group stay in the same split.
Leakage from the data generation & collection process
- Solution: data normalization across sources + subject matter expertise about how the data was generated and collected.
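
The split-related solutions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the row layout (`ts`, `user`, `x`, `y`), the helper names, and the hash-based group assignment are all hypothetical, not a prescribed implementation. The order is the point: deduplicate, then split, then compute scaling statistics from the train split only.

```python
import hashlib
from statistics import mean, pstdev


def deduplicate(rows, key_fields):
    """Drop exact repeats (by key_fields) before any splitting happens."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_by_time(rows, cutoff):
    """Train on rows strictly before `cutoff`; evaluate on the rest."""
    train = [r for r in rows if r["ts"] < cutoff]
    test = [r for r in rows if r["ts"] >= cutoff]
    return train, test


def split_by_group(rows, group_field, test_fraction=0.5):
    """Send every row of a group to the same split via a stable hash."""
    train, test = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[group_field]).encode()).digest()
        bucket = digest[0] / 256  # deterministic value in [0, 1)
        (test if bucket < test_fraction else train).append(row)
    return train, test


def fit_scaler(values):
    """Compute scaling statistics; call this on train values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for constant features
    return mu, sigma


def scale(values, mu, sigma):
    """Apply already-fitted statistics to any split."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: `ts` is a timestamp, `user` is the group id.
rows = [
    {"ts": 1, "user": "a", "x": 10.0, "y": 0},
    {"ts": 1, "user": "a", "x": 10.0, "y": 0},  # exact duplicate
    {"ts": 2, "user": "b", "x": 12.0, "y": 0},
    {"ts": 3, "user": "a", "x": 14.0, "y": 1},
    {"ts": 4, "user": "c", "x": 100.0, "y": 1},
]

rows = deduplicate(rows, key_fields=("ts", "user", "x", "y"))
train, test = split_by_time(rows, cutoff=4)

# Statistics come from train only, then get reused on test.
mu, sigma = fit_scaler([r["x"] for r in train])
train_x = scale([r["x"] for r in train], mu, sigma)
test_x = scale([r["x"] for r in test], mu, sigma)

# Alternative split when rows of one user must not straddle splits.
group_train, group_test = split_by_group(rows, group_field="user")
```

Note that the test value `100.0` never influences `mu` or `sigma`; had the scaler been fitted on all rows, the train features would already carry information about the test set. In a real project a library splitter and pipeline would usually replace these hand-written helpers.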

To detect leakage:

Measure the correlation of each feature with the labels:
- A feature alone might not cause leakage, but two features together might, so check sets of features as well.
Feature ablation study:
- If removing a feature causes the model performance to decrease significantly, figure out why that feature matters so much.
Monitor model performance as more features are added:
- Sudden increase: either a very good feature or leakage.
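
The first two checks above can be sketched as follows. This is a minimal sketch with assumptions: the feature names, the `0.95` threshold, and the `toy_train_and_score` stand-in are hypothetical, and in practice the scorer would retrain a real model on the given features and return a validation metric. Pearson correlation is used here as one simple measure; it only looks at single features, so it would not catch leakage that appears only when two features are combined.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(vx * vy)


def flag_suspicious_features(features, labels, threshold=0.95):
    """Names of features whose |correlation| with the label is suspiciously high."""
    return sorted(
        name
        for name, values in features.items()
        if abs(pearson(values, labels)) >= threshold
    )


def ablation_deltas(feature_names, train_and_score):
    """Score drop caused by removing each feature, one at a time."""
    baseline = train_and_score(list(feature_names))
    deltas = {}
    for name in feature_names:
        kept = [f for f in feature_names if f != name]
        deltas[name] = baseline - train_and_score(kept)
    return deltas


# Hypothetical toy data: `leaky_flag` is a copy of the label.
labels = [0, 0, 1, 1, 0, 1]
features = {
    "age": [30, 45, 35, 50, 28, 41],
    "leaky_flag": [0, 0, 1, 1, 0, 1],
}


def toy_train_and_score(feature_names):
    """Stand-in for 'retrain and evaluate'; returns made-up scores."""
    return 0.99 if "leaky_flag" in feature_names else 0.70


suspicious = flag_suspicious_features(features, labels)
deltas = ablation_deltas(list(features), toy_train_and_score)

print(suspicious)
print(deltas)
```

A flagged feature or a large ablation delta is a prompt to investigate, not proof of leakage: it may simply be a very good feature.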

See: