Introduction
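Data leakage occurs when some form of the label “leaks” into the features used for training.
This same information is not available during inference, so evaluation metrics overstate how well the model will perform in production.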
Causes of Data Leakage
- Splitting time-correlated data randomly instead of by time.
  - Solution: split by time, training on earlier data and evaluating on later data.
- Processing data before splitting: using the whole dataset (including the validation and test sets) to compute global statistics.
  - Solution: split your data before scaling or filling in missing values. Ideally, split even before any EDA so that you stay blind to the test set.
- Poor handling of duplicates before splitting: the test set ends up containing examples from the train set.
  - Solution: deduplicate data before splitting, and oversample only after splitting.
- Group leakage: a group of examples has strongly correlated labels but is divided across different splits.
  - Solution: understand your data and keep track of its metadata so that each group stays within a single split.
- Leakage from the data generation and collection process.
  - Solution: normalize the data and rely on subject matter expertise.
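Several of the solutions above can be combined in one small pipeline: deduplicate first, split by time, and only then compute scaling statistics from the training split. The following is a minimal sketch using only the Python standard library. The row layout (`timestamp`, `value`, `label`), the toy data, and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["timestamp"], row["value"], row["label"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def time_split(rows, train_fraction=0.8):
    """Split by time: the earliest rows go to train, the latest to test."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def fit_scaler(train_rows):
    """Compute mean and standard deviation from the training split only."""
    values = [r["value"] for r in train_rows]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant feature
    return mu, sigma


def apply_scaler(rows, mu, sigma):
    """Scale any split using statistics that were fitted on train."""
    return [{**r, "value": (r["value"] - mu) / sigma} for r in rows]


# Hypothetical toy data; the third row duplicates the second.
rows = [
    {"timestamp": 1, "value": 10.0, "label": 0},
    {"timestamp": 2, "value": 12.0, "label": 0},
    {"timestamp": 2, "value": 12.0, "label": 0},
    {"timestamp": 3, "value": 11.0, "label": 1},
    {"timestamp": 4, "value": 13.0, "label": 0},
    {"timestamp": 5, "value": 50.0, "label": 1},
    {"timestamp": 6, "value": 14.0, "label": 1},
]

unique = deduplicate(rows)                   # 1. deduplicate before splitting
train_rows, test_rows = time_split(unique)   # 2. split by time, not randomly
mu, sigma = fit_scaler(train_rows)           # 3. fit statistics on train only
train_scaled = apply_scaler(train_rows, mu, sigma)
test_scaled = apply_scaler(test_rows, mu, sigma)
```

Because `fit_scaler` never sees `test_rows`, the outlier at timestamp 5 cannot influence the mean or standard deviation used during training. Computing the statistics on `unique` instead would be the "processing before splitting" mistake described above.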
How to detect Data Leakage?
- Measure the correlation of each feature with the labels.
  - A feature alone might not cause leakage, but two features together might, so check combinations of features as well.
- Run a feature ablation study.
  - If removing a feature causes model performance to drop significantly, figure out why that feature matters so much.
- Monitor model performance as more features are added.
  - A sudden increase means either a very good feature or leakage.
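The correlation check and the ablation study can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the feature names `weak` and `leaky`, the toy data, and the nearest-centroid classifier, which stands in for whatever model is actually being trained.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def centroid_accuracy(train, test, features):
    """Test accuracy of a nearest-centroid classifier restricted to `features`."""
    centroids = {}
    for label in {row["label"] for row in train}:
        members = [row for row in train if row["label"] == label]
        centroids[label] = [mean(r[f] for r in members) for f in features]

    def squared_distance(point, centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))

    correct = 0
    for row in test:
        point = [row[f] for f in features]
        predicted = min(centroids, key=lambda lab: squared_distance(point, centroids[lab]))
        correct += predicted == row["label"]
    return correct / len(test)


def ablation(train, test, features):
    """Return baseline accuracy and the accuracy drop when each feature is removed."""
    baseline = centroid_accuracy(train, test, features)
    drops = {}
    for f in features:
        remaining = [g for g in features if g != f]
        drops[f] = baseline - centroid_accuracy(train, test, remaining)
    return baseline, drops


# Hypothetical toy data: `leaky` is a copy of the label.
train = [
    {"weak": 0.25, "leaky": 0.0, "label": 0},
    {"weak": 0.5, "leaky": 0.0, "label": 0},
    {"weak": 0.5, "leaky": 1.0, "label": 1},
    {"weak": 0.75, "leaky": 1.0, "label": 1},
]
test = [
    {"weak": 0.75, "leaky": 0.0, "label": 0},
    {"weak": 0.25, "leaky": 1.0, "label": 1},
    {"weak": 0.25, "leaky": 0.0, "label": 0},
    {"weak": 0.75, "leaky": 1.0, "label": 1},
]

features = ["weak", "leaky"]
labels = [row["label"] for row in train]
correlations = {f: pearson([row[f] for row in train], labels) for f in features}
baseline, drops = ablation(train, test, features)
```

In this toy setup `leaky` correlates perfectly with the label, and removing it is the only ablation that lowers accuracy. Neither signal proves leakage on its own, since a legitimately strong feature looks the same, so the next step is to check whether the feature would really be available at inference time.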