Revision



Baseline
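
Before trusting a model's metrics, compare it against simple baselines: a random baseline, the majority-class (zero rule) baseline, a simple heuristic, and, where available, human performance.

A minimal sketch of the zero-rule baseline using scikit-learn's `DummyClassifier`; `X_train`, `y_train`, `X_test`, `y_test` are assumed to exist:

```python
from sklearn.dummy import DummyClassifier

# Majority-class ("zero rule") baseline: any useful model should beat this.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```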


Evaluation methods

  1. Perturbation Tests
  2. Invariance Tests
  3. Directional Expectation Tests
  4. Model Calibration
  5. Confidence Measurement
  6. Slice-based Evaluation


Perturbation Tests

Problem: user inputs might contain noise, making them different from the test data

Example: the model does well on the test set, but fails in production on noisy real-world inputs.

The more sensitive the model is to noise, the harder it will be to maintain in production.


Solutions

If small input changes cause the model's performance to fluctuate, you might want to make the model more robust, e.g., by adding noisy samples to the training data. A sketch of such a test follows.
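
A minimal sketch of a perturbation test, assuming a trained scikit-learn-style classifier `model` and numeric arrays `X_test`, `y_test` (all hypothetical names): add small Gaussian noise to the inputs and compare accuracy before and after.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def perturbation_test(model, X_test, y_test, noise_scale=0.01, seed=0):
    """Compare accuracy on clean vs. slightly noised inputs."""
    rng = np.random.default_rng(seed)
    X_noisy = X_test + rng.normal(0.0, noise_scale, size=X_test.shape)

    clean_acc = accuracy_score(y_test, model.predict(X_test))
    noisy_acc = accuracy_score(y_test, model.predict(X_noisy))

    # A large gap between the two suggests the model is noise-sensitive.
    return clean_acc, noisy_acc
```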


Invariance Tests

Motivation: some input changes shouldn't lead to changes in outputs, e.g., changing a sensitive attribute such as gender should not change a resume screener's prediction

Idea: perturb only the inputs that should not matter, keep everything else fixed, and check that the outputs stay the same.
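
A sketch of an invariance test, assuming a tabular `pd.DataFrame` of features and a hypothetical model with a `predict` method: swap the value of a column that should not matter and count how often the prediction changes.

```python
import numpy as np
import pandas as pd

def invariance_violation_rate(model, X: pd.DataFrame, column: str, values) -> float:
    """Fraction of rows whose prediction changes when `column` is set to
    each of `values`; 0.0 means the model is fully invariant to it."""
    base = model.predict(X)
    changed = np.zeros(len(X), dtype=bool)
    for v in values:
        X_mod = X.copy()
        X_mod[column] = v
        changed |= model.predict(X_mod) != base
    return changed.mean()

# Hypothetical usage with a resume-screening model:
# rate = invariance_violation_rate(model, X_test, "gender", ["male", "female"])
# assert rate == 0.0, f"{rate:.1%} of predictions depend on gender"
```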


Directional Expectation Tests

Motivation: some changes to inputs should cause predictable changes in outputs

Example: when predicting housing prices, increasing the lot size should not decrease the predicted price, and decreasing the square footage should not increase it.
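
A sketch for this housing example, assuming a regression `model` and a feature frame with a `lot_size` column (both hypothetical): bump the feature up while holding everything else fixed, and check that predictions never move the wrong way.

```python
def directional_violation_rate(model, X, column, delta):
    """Fraction of rows where increasing `column` by `delta`
    *decreases* the prediction; 0.0 means the expectation holds."""
    X_up = X.copy()
    X_up[column] = X_up[column] + delta
    return (model.predict(X_up) < model.predict(X)).mean()

# e.g. directional_violation_rate(price_model, X_test, "lot_size", 100.0)
```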


Model Calibration

If you predict team A wins an A vs. B match with 60% probability, then across all matches with that prediction, team A should actually win about 60% of the time.

Among all samples predicted POSITIVE with probability 80%, about 80% of them should actually be POSITIVE!
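
One way to check this with scikit-learn's `calibration_curve`, assuming binary labels `y_test` and predicted positive-class probabilities `y_prob` (hypothetical variables):

```python
from sklearn.calibration import calibration_curve

# Bin predictions by predicted probability and compare each bin's
# mean prediction against its observed positive rate.
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

for observed, predicted in zip(prob_true, prob_pred):
    # Calibrated model: ~0.80 observed positives where ~0.80 was predicted.
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

If the two diverge, `sklearn.calibration.CalibratedClassifierCV` can recalibrate the model (Platt scaling or isotonic regression).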


Confidence Measurement
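
Idea: set a usefulness threshold for each individual prediction and only surface predictions the model is sufficiently confident about.

A minimal sketch, assuming a classifier exposing `predict_proba` (scikit-learn convention) and an illustrative `threshold`:

```python
import numpy as np

def confident_predictions(model, X, threshold=0.9):
    """Return (row indices, predicted labels) only for samples whose
    top-class probability clears the confidence threshold."""
    proba = model.predict_proba(X)
    confidence = proba.max(axis=1)       # top-class probability per sample
    keep = confidence >= threshold       # abstain on everything else
    return np.flatnonzero(keep), proba[keep].argmax(axis=1)
```

Note that raw predicted probabilities are only a meaningful confidence signal if the model is calibrated (see above).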


Slice-based Evaluation
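
Idea: report metrics separately on critical subsets (slices) of the data, e.g., by user group, device, or region, instead of only one aggregate number.

A sketch assuming a results `DataFrame` with hypothetical `label` and `pred` columns plus a slice column:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def slice_report(results: pd.DataFrame, slice_col: str) -> pd.Series:
    """Accuracy computed separately for each value of `slice_col`."""
    return results.groupby(slice_col).apply(
        lambda g: accuracy_score(g["label"], g["pred"])
    )

# e.g. slice_report(results, "device_type") can reveal that 0.92 overall
# accuracy hides much worse accuracy on one slice (say, mobile users).
```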


Pros
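
Surfaces performance gaps and potential biases on critical subgroups that a single aggregate metric hides: a model can look good overall while failing badly on one slice.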

