Steps
1. Clarify Requirements
- Constraints: latency, resources (compute power, development time, number of users), etc.
- Use cases,
- Is ML required? Does the problem have patterns the system can learn, and is it too complex for simple rules?
2. Define metrics (Offline and Online)
- ML Metrics: classification, asymmetric classes/costs or not (accuracy, F1-score, ROC curve, AUC, …); regression (MSE, MAE, …), see the sketch below,
- Business metrics: ROI, click-through rate, etc.
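A minimal sketch of computing the classification metrics above with scikit-learn (the labels and probabilities are made up for illustration):
```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 1]                      # ground-truth labels
y_prob = [0.10, 0.40, 0.35, 0.80, 0.90]       # predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]      # decision threshold can be tuned

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))       # more informative when classes are asymmetric
print("ROC AUC:", roc_auc_score(y_true, y_prob))   # threshold-independent
```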
3. Design Architecture (Offline and Online):
a) Data
- Data: How and where to get data (features and labels)?
- Data Sources: internal databases, APIs, message brokers, …
- Data Storage: SQL or NoSQL?
- Predictions publication: offline (batch) or online? Database, API, message brokers?
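A minimal sketch of online prediction publication through an HTTP API, assuming FastAPI and a hypothetical pre-trained model artifact `model.joblib`:
```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical model artifact

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    # Synchronous, single-row prediction; offline/batch publication would
    # instead write predictions to a database or a message broker.
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```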
b) Labelling
- Labelling: hand labelling, programmatic labelling (sketched below),
- If labels are scarce: weak supervision, semi-supervised learning, transfer learning, active learning.
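A minimal sketch of programmatic labelling with hand-written heuristics combined by majority vote (a weak-supervision style approach; the rules and the spam example are made up):
```python
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_free(text: str) -> int:
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_long_message(text: str) -> int:
    return HAM if len(text.split()) > 20 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_long_message]

def weak_label(text: str) -> int:
    """Majority vote over labelling functions, ignoring abstentions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("Claim your FREE prize now"))  # -> 1 (SPAM)
```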
c) Data Preprocessing
- Feature Engineering: missing values, scaling, class imbalance, categorical features, data augmentation, etc.
- Feature Importance: model-specific (RF, XGBoost, …); model-agnostic (SHAP values, …),
- Data leakage! Fit preprocessing on training data only; see the pipeline sketch below.
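A minimal sketch of a preprocessing pipeline that avoids leakage by fitting imputation, scaling and encoding on the training split only (column names and values are made up):
```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0],
    "country": ["FR", "US", "FR", np.nan],
    "label": [0, 1, 1, 0],
})
X, y = df[["age", "country"]], df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["country"]),
])

X_train_t = preprocess.fit_transform(X_train)  # statistics learned on train only
X_test_t = preprocess.transform(X_test)        # reused on test: no leakage
```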
d) Model
- Baseline: simple model, expert rules, current solution (see the baseline sketch below),
- Selection: explainable or black-box? Ensemble? AutoML?
- Training: distributed training (data parallelism and/or model parallelism),
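A minimal sketch of checking a candidate model against a trivial baseline (synthetic data; in practice the baseline may be expert rules or the current solution):
```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline F1:", f1_score(y_test, baseline.predict(X_test)))
print("model F1:   ", f1_score(y_test, model.predict(X_test)))
```
The candidate model is only worth deploying if it clearly beats the baseline on the metric chosen in step 2.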
4. Offline Evaluation
- ML Metric,
- Tests: Perturbation Tests, Invariance Tests, Directional Expectation Tests, Model Calibration, Confidence Measurement, Slice-based Evaluation, …
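A minimal sketch of a perturbation test: small input noise should not flip too many predictions (the model, noise scale and threshold are illustrative assumptions):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.default_rng(0)
X_perturbed = X + rng.normal(scale=0.01, size=X.shape)  # tiny random noise

flip_rate = np.mean(model.predict(X) != model.predict(X_perturbed))
assert flip_rate < 0.05, f"model too sensitive to small perturbations: {flip_rate:.2%}"
```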
5. Deployment
- Cloud computing / Edge computing (optimization: quantization, knowledge distillation, pruning, etc.).
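A minimal sketch of post-training dynamic quantization for edge deployment, assuming PyTorch and a made-up architecture (other frameworks offer similar tooling):
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Store Linear weights as int8: smaller artifact, faster CPU inference,
# usually with a small accuracy cost that should be measured offline.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```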
6. Monitoring and Online training:
- System failure monitoring:
- Distribution shift (covariate shift, label shift, concept drift), see the drift-detection sketch at the end of this step,
- Degenerate feedback loops,
- Online evaluation:
- ML Metric,
- Canary Testing, A/B Testing, Interleaved Experiments, Shadow Testing,
- Model retraining.
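A minimal sketch of covariate-shift detection with a two-sample Kolmogorov-Smirnov test on one feature (the reference and live values are simulated; a low p-value flags drift):
```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # feature values at training time
live = rng.normal(loc=0.5, scale=1.0, size=5000)       # same feature in production

statistic, p_value = ks_2samp(reference, live)
if p_value < 0.01:
    print(f"possible covariate shift detected (KS statistic={statistic:.3f})")
```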