Baseline
- Random baseline (see the sketch after this list):
- Predict at random:
- Uniform,
- Following label distribution,
- Zero rule baseline:
- Always predict the most common class,
- Simple heuristics:
- E.g.: classify tweets based on whether they contain links to unreliable sources,
- Human baseline:
- What’s human-level performance?
- Existing solutions.
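A minimal sketch of the random and zero-rule baselines above, using scikit-learn's DummyClassifier; the synthetic imbalanced dataset and the metric are only for illustration:

```python
# Random (uniform), label-distribution, and zero-rule baselines with
# scikit-learn's DummyClassifier; the data here is illustrative only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baselines = {
    "random (uniform)": DummyClassifier(strategy="uniform", random_state=0),
    "random (label distribution)": DummyClassifier(strategy="stratified", random_state=0),
    "zero rule (most frequent)": DummyClassifier(strategy="most_frequent"),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)  # dummy baselines ignore X and only look at y_train
    score = f1_score(y_test, clf.predict(X_test), average="macro")
    print(f"{name}: macro F1 = {score:.3f}")
```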
Evaluation methods
- Perturbation Tests
- Invariance Tests
- Directional Expectation Tests
- Model Calibration
- Confidence Measurement
- Slice-based Evaluation
Perturbation Tests
Problem: users' inputs might contain noise, making them different from test data
Examples:
- Speech recognition: background noise
- Object detection: different lighting
- Text inputs: typos, intentional misspelling (e.g. looooooooong)
Model does well on test set, but fails in production.
- Idea: randomly add small noise to test data to see how much the outputs change (see the sketch below)
The more sensitive the model is to noise:
- The harder it is to maintain
- The more vulnerable the model is to adversarial attacks
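A minimal perturbation-test sketch, assuming numeric features and a generic scikit-learn classifier (both illustrative): add small Gaussian noise to the test inputs and measure how often predictions flip.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fit any model on clean data (the model and data here are illustrative).
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Perturb the test set with small Gaussian noise (10% of each feature's std).
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.1 * X_test.std(axis=0), size=X_test.shape)

clean_preds = model.predict(X_test)
noisy_preds = model.predict(X_test + noise)
print(f"Predictions changed by the perturbation: {np.mean(clean_preds != noisy_preds):.1%}")
```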
Solutions
If small changes cause the model's performance to fluctuate, you might want to make the model more robust:
- Add noise to training data,
- Add more training data,
- Choose another model.
Invariance Tests
Motivation: some input changes shouldn’t lead to changes in outputs:
- Changing race/gender info shouldn’t change predicted approval outcome,
- Changing name shouldn’t affect resume screening results.
Idea:
- Keep all other features the same, but change the values of sensitive features; the predictions should not change (see the sketch below).
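A minimal invariance-test sketch, assuming a fitted model with a `.predict` method and a pandas test set containing a sensitive column such as "gender" (the model and column names are hypothetical):

```python
import numpy as np
import pandas as pd

def invariance_test(model, X_test: pd.DataFrame, sensitive_col: str) -> float:
    """Fraction of rows whose prediction changes when only the sensitive
    column is altered (ideally 0.0)."""
    perturbed = X_test.copy()
    # Shuffle the sensitive column so each row gets a different (but valid) value.
    perturbed[sensitive_col] = np.random.default_rng(0).permutation(
        perturbed[sensitive_col].to_numpy()
    )
    changed = model.predict(X_test) != model.predict(perturbed)
    return float(np.mean(changed))

# Usage (hypothetical): invariance_test(loan_model, applicants_df, "gender")
```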
Directional Expectation Tests
Motivation: some changes to inputs should cause predictable changes in outputs
Example: when predicting housing prices (see the sketch after these bullets):
- Increasing lot size shouldn’t decrease the predicted price,
- Decreasing square footage shouldn’t increase the predicted price.
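A minimal directional-expectation sketch for the housing example, assuming a fitted regression model and a "lot_size" column (the names and the delta value are hypothetical):

```python
import numpy as np
import pandas as pd

def directional_test(model, X_test: pd.DataFrame, col: str, delta: float) -> float:
    """Fraction of rows where increasing `col` by `delta` DECREASES the
    predicted value (ideally 0.0)."""
    increased = X_test.copy()
    increased[col] = increased[col] + delta
    before = model.predict(X_test)
    after = model.predict(increased)
    return float(np.mean(after < before))

# Usage (hypothetical):
# violation_rate = directional_test(price_model, houses_df, "lot_size", delta=500)
```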
Model Calibration
If you predict that team A wins an A vs. B match with 60% probability:
- Over 100 A vs. B matches, A should win about 60% of the time!
Among all samples predicted POSITIVE with probability 80%, about 80% of them should actually be POSITIVE (see the sketch below)!
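A minimal calibration check with scikit-learn's `calibration_curve`, plus recalibration with `CalibratedClassifierCV`; the Naive Bayes model and synthetic data are only for illustration:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
prob_pos = model.predict_proba(X_test)[:, 1]

# For a well-calibrated model, predicted probability ~= observed frequency,
# e.g. among samples predicted ~0.8, about 80% should actually be positive.
observed, predicted = calibration_curve(y_test, prob_pos, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")

# If the model is over/under-confident, recalibrate (isotonic or sigmoid scaling).
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)
```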
Confidence Measurement
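The notes above give no details for this method; one common reading is per-sample confidence, e.g. the model's maximum predicted class probability, with a usefulness threshold below which predictions are flagged for review. A minimal sketch under that assumption (the threshold value is arbitrary):

```python
import numpy as np

def low_confidence_mask(model, X, threshold: float = 0.6) -> np.ndarray:
    """True for samples whose top predicted class probability is below `threshold`."""
    proba = model.predict_proba(X)   # shape: (n_samples, n_classes)
    confidence = proba.max(axis=1)   # per-sample confidence score
    return confidence < threshold

# Usage (hypothetical): send low-confidence predictions to human review.
# review_queue = X_test[low_confidence_mask(clf, X_test)]
```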
Slice-based Evaluation
- Classes
- Might perform worse on minority classes
- Subgroups
- Gender
- Location
- Time of using the app
- etc.
- Evaluate your model on different slices (see the sketch after this list)
- Check for consistency over time
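A minimal slice-based evaluation sketch with pandas: compute the same metric per subgroup and compare it against the overall number (the column names and model are hypothetical):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def evaluate_by_slice(df: pd.DataFrame, y_true: str, y_pred: str, slice_col: str) -> pd.DataFrame:
    """Per-slice accuracy and sample count for one slicing column."""
    rows = []
    for value, group in df.groupby(slice_col):
        rows.append({
            slice_col: value,
            "n": len(group),
            "accuracy": accuracy_score(group[y_true], group[y_pred]),
        })
    return pd.DataFrame(rows).sort_values("accuracy")

# Usage (hypothetical):
# df["pred"] = model.predict(df[feature_cols])
# print(evaluate_by_slice(df, y_true="label", y_pred="pred", slice_col="gender"))
# print(evaluate_by_slice(df, y_true="label", y_pred="pred", slice_col="location"))
```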
Pros
- Improve model’s performance both overall and on critical data
- Help avoid biases
- Even when you don't think slices matter, slicing can reveal problems hidden in aggregate metrics
Resources
See: