Baseline
- Random baseline (see the sketch after this list):
- Predict at random:
- Uniform,
- Following label distribution,
- Zero rule baseline:
- Always predict the most common class,
- Simple heuristics:
- E.g.: classify tweets based on whether they contain links to unreliable sources,
- Human baseline:
- What’s human-level performance?
- Existing solutions.
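A minimal sketch of the random and zero-rule baselines above, using scikit-learn's DummyClassifier; the synthetic imbalanced dataset and the metric are only for illustration:

```python
# Random (uniform), label-distribution, and zero-rule baselines with
# scikit-learn's DummyClassifier; the data here is illustrative only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baselines = {
    "random (uniform)": DummyClassifier(strategy="uniform", random_state=0),
    "random (label distribution)": DummyClassifier(strategy="stratified", random_state=0),
    "zero rule (most frequent)": DummyClassifier(strategy="most_frequent"),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)  # dummy baselines ignore X and only look at y_train
    score = f1_score(y_test, clf.predict(X_test), average="macro")
    print(f"{name}: macro F1 = {score:.3f}")
```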
Evaluation methods
- Perturbation Tests
- Invariance Tests
- Directional Expectation Tests
- Model Calibration
- Confidence Measurement
- Slice-based Evaluation
Perturbation Tests
Problem: users' inputs might contain noise, making them different from test data
Examples:
- Speech recognition: background noise
- Object detection: different lighting
- Text inputs: typos, intentional misspelling (e.g. looooooooong)
Model does well on test set, but fails in production.
- Idea: randomly add small noise to test data to see how much the outputs change (see the sketch below)
The more sensitive the model is to noise:
- The harder it is to maintain
- The more vulnerable the model is to adversarial attacks
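A minimal perturbation-test sketch, assuming numeric features and a generic scikit-learn classifier (both illustrative): add small Gaussian noise to the test inputs and measure how often predictions flip.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fit any model on clean data (the model and data here are illustrative).
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Perturb the test set with small Gaussian noise (10% of each feature's std).
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.1 * X_test.std(axis=0), size=X_test.shape)

clean_preds = model.predict(X_test)
noisy_preds = model.predict(X_test + noise)
print(f"Predictions changed by the perturbation: {np.mean(clean_preds != noisy_preds):.1%}")
```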
Solutions
If small changes cause the model's performance to fluctuate, you might want to make the model more robust:
- Add noise to training data,
- Add more training data,
- Choose another model.
Invariance Tests
Motivation: some input changes shouldn’t lead to changes in outputs:
- Changing race/gender info shouldn’t change predicted approval outcome,
- Changing name shouldn’t affect resume screening results.
Idea:
- Keep all other features the same, but change the values of sensitive features; the predictions should not change (see the sketch below).
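A minimal invariance-test sketch, assuming a fitted model with a `.predict` method and a pandas test set containing a sensitive column such as "gender" (the model and column names are hypothetical):

```python
import numpy as np
import pandas as pd

def invariance_test(model, X_test: pd.DataFrame, sensitive_col: str) -> float:
    """Fraction of rows whose prediction changes when only the sensitive
    column is altered (ideally 0.0)."""
    perturbed = X_test.copy()
    # Shuffle the sensitive column so each row gets a different (but valid) value.
    perturbed[sensitive_col] = np.random.default_rng(0).permutation(
        perturbed[sensitive_col].to_numpy()
    )
    changed = model.predict(X_test) != model.predict(perturbed)
    return float(np.mean(changed))

# Usage (hypothetical): invariance_test(loan_model, applicants_df, "gender")
```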
Directional Expectation Tests
Motivation: some changes to inputs should cause predictable changes in outputs
Example: when predicting housing prices (see the sketch after these bullets):
- Increasing lot size shouldn’t decrease the predicted price,
- Decreasing square footage shouldn’t increase the predicted price.
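A minimal directional-expectation sketch for the housing example, assuming a fitted regression model and a "lot_size" column (the names and the delta value are hypothetical):

```python
import numpy as np
import pandas as pd

def directional_test(model, X_test: pd.DataFrame, col: str, delta: float) -> float:
    """Fraction of rows where increasing `col` by `delta` DECREASES the
    predicted value (ideally 0.0)."""
    increased = X_test.copy()
    increased[col] = increased[col] + delta
    before = model.predict(X_test)
    after = model.predict(increased)
    return float(np.mean(after < before))

# Usage (hypothetical):
# violation_rate = directional_test(price_model, houses_df, "lot_size", delta=500)
```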
Model Calibration
If you predict that team A wins an A vs. B match with 60% probability:
- Over 100 A vs. B matches, A should win about 60% of the time!
Among all samples predicted POSITIVE with probability 80%, about 80% of them should actually be POSITIVE (see the sketch below)!
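A minimal calibration check with scikit-learn's `calibration_curve`, plus recalibration with `CalibratedClassifierCV`; the Naive Bayes model and synthetic data are only for illustration:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
prob_pos = model.predict_proba(X_test)[:, 1]

# For a well-calibrated model, predicted probability ~= observed frequency,
# e.g. among samples predicted ~0.8, about 80% should actually be positive.
observed, predicted = calibration_curve(y_test, prob_pos, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")

# If the model is over/under-confident, recalibrate (isotonic or sigmoid scaling).
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)
```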
Confidence Measurement
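The notes above give no details for this method; one common reading is per-sample confidence, e.g. the model's maximum predicted class probability, with a usefulness threshold below which predictions are flagged for review. A minimal sketch under that assumption (the threshold value is arbitrary):

```python
import numpy as np

def low_confidence_mask(model, X, threshold: float = 0.6) -> np.ndarray:
    """True for samples whose top predicted class probability is below `threshold`."""
    proba = model.predict_proba(X)   # shape: (n_samples, n_classes)
    confidence = proba.max(axis=1)   # per-sample confidence score
    return confidence < threshold

# Usage (hypothetical): send low-confidence predictions to human review.
# review_queue = X_test[low_confidence_mask(clf, X_test)]
```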
Slice-based Evaluation
- Classes
- Might perform worse on minority classes
- Subgroups
- Gender
- Location
- Time of using the app
- etc.
- Evaluate your model on different slices (see the sketch after this list)
- Check for consistency over time
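A minimal slice-based evaluation sketch with pandas: compute the same metric per subgroup and compare it against the overall number (the column names and model are hypothetical):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def evaluate_by_slice(df: pd.DataFrame, y_true: str, y_pred: str, slice_col: str) -> pd.DataFrame:
    """Per-slice accuracy and sample count for one slicing column."""
    rows = []
    for value, group in df.groupby(slice_col):
        rows.append({
            slice_col: value,
            "n": len(group),
            "accuracy": accuracy_score(group[y_true], group[y_pred]),
        })
    return pd.DataFrame(rows).sort_values("accuracy")

# Usage (hypothetical):
# df["pred"] = model.predict(df[feature_cols])
# print(evaluate_by_slice(df, y_true="label", y_pred="pred", slice_col="gender"))
# print(evaluate_by_slice(df, y_true="label", y_pred="pred", slice_col="location"))
```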
Pros
- Improve model’s performance both overall and on critical data
- Help avoid biases
- Even when you don't think slices matter, slicing can reveal problems hidden in aggregate metrics
Resources
See: