

9 best practices for feature engineering

  1. Split data by time instead of doing it randomly,

  2. If you oversample your data, do it after splitting,

  3. Use statistics/info from the train split, instead of the entire dataset, for feature engineering: scaling, normalizing, handling missing values, creating n-gram counts, item encoding, etc.,

  4. Understand how your data is generated, collected, and processed. Involve domain experts if necessary,

  5. Keep track of data lineage,

  6. Understand feature importance to your model,

  7. Measure correlation between features and labels,

  8. Use features that generalize well,

  9. Remove stale features from your models.
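Practices 1–3 above can be sketched together: split chronologically first, then fit any feature statistics on the train split only. A minimal sketch, assuming toy event dicts with a `ts` timestamp and an `amount` feature (all names and data are illustrative):

```python
# Sketch: time-based split, then min/max scaling fitted on train only.
def time_split(rows, train_frac=0.8):
    """Sort events by timestamp and cut chronologically, so the model
    is never trained on events that happen after the test period."""
    rows = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def fit_scaler(train_rows, key):
    """Min/max computed from the train split only, to avoid leakage."""
    vals = [r[key] for r in train_rows]
    lo, hi = min(vals), max(vals)
    return lambda x: (x - lo) / (hi - lo) if hi > lo else 0.0

events = [{"ts": t, "amount": float(t * 3 % 7)} for t in range(10)]
train, test = time_split(events)
scale = fit_scaler(train, "amount")          # stats from train only
scaled_test = [scale(r["amount"]) for r in test]
```

The key point is that `fit_scaler` never sees the test rows; in production the same fitted statistics would be saved and reused at inference time.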


Feature Engineering steps

  1. Handling missing values

  2. Scaling

  3. Discretization

  4. Categorical features

  5. Feature crossing

  6. Positional embeddings


Handling missing values

Not all missing values are equal: a value can be missing not at random (MNAR), missing at random (MAR), or missing completely at random (MCAR), and the right treatment depends on which.

Two families of solutions: deletion and imputation.


Deletion

Column deletion: remove columns with too many missing entries.

Row deletion: remove rows (samples) with missing values; safest when the values are missing completely at random and only a small fraction of rows is affected.


Imputation

Fill missing fields with certain values: a default (e.g. an empty string or zero), or the mean, median, or mode of the column.
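A minimal sketch of median imputation plus a missingness indicator, on an illustrative `ages` column (data and names are made up):

```python
import statistics

ages = [22, None, 35, 28, None, 41]

# Median imputation for a numeric column.
known = [a for a in ages if a is not None]
median_age = statistics.median(known)
imputed = [a if a is not None else median_age for a in ages]

# Indicator column so the model can still see "was missing",
# which matters when values are not missing at random.
was_missing = [int(a is None) for a in ages]
```

In a real pipeline the median would be computed on the train split only (practice 3 above) and reused for validation, test, and serving.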


Scaling


Computing some statistics (mean, std, min, max) generally requires knowing the values for a sample of the population. This can cause data leakage if, for example, the statistics are computed using future values (when temporality matters).


Log-scaling: helps with skewed data and often gives a performance gain:
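A quick sketch of log-scaling a heavily skewed feature (the values are illustrative); `log1p` is used so zeros are handled safely:

```python
import math

# A skewed feature spanning several orders of magnitude,
# e.g. purchase amounts.
amounts = [0, 1, 10, 100, 10000]

# log1p(x) = log(1 + x), defined at x = 0.
log_amounts = [math.log1p(x) for x in amounts]
# The range shrinks from 5 orders of magnitude to roughly 0-9.2.
```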


Discretization
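Discretization (also called binning or quantization) turns a continuous feature into a discrete one by grouping values into buckets. A minimal sketch with hand-picked boundaries (the income values and boundaries are assumptions for illustration):

```python
import bisect

# Bucket boundaries for an "income" feature: low / middle / high.
boundaries = [35_000, 100_000]

def bucketize(value, boundaries):
    """Return the index of the bucket the value falls into.
    bisect_right sends a value equal to a boundary into the
    higher bucket."""
    return bisect.bisect_right(boundaries, value)

incomes = [20_000, 35_000, 80_000, 150_000]
buckets = [bucketize(x, boundaries) for x in incomes]
```

The bucket index can then be treated as a categorical feature; the downside is that boundaries introduce discontinuities that must be chosen carefully.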


Categorical features

Different solutions exist:

  1. One-hot encoding
  2. Represent each category with its attributes: e.g., instead of a raw brand ID, use properties of the brand such as its average product price
  3. Hashing trick

The hashing trick is widely used in industry and in machine learning frameworks, and is especially useful for continual learning in production, where new categories keep appearing.


Feature crossing
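Feature crossing combines two or more features into one, letting models (especially linear ones) capture interactions between them. A minimal sketch crossing two categorical features (feature names and values are made up):

```python
from itertools import product

def cross(a: str, b: str) -> str:
    """Combine two categorical values into a single crossed category."""
    return f"{a}_x_{b}"

days = ["weekday", "weekend"]
periods = ["morning", "evening"]
# The crossed vocabulary is the Cartesian product: 2 x 2 = 4 categories.
vocabulary = [cross(d, p) for d, p in product(days, periods)]

row = {"day": "weekend", "period": "morning"}
crossed = cross(row["day"], row["period"])
```

Note that crossing blows up the feature space multiplicatively, which is one reason it is often combined with the hashing trick above.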


Positional embeddings

See Positional embeddings in Transformers.
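A sketch of the fixed sinusoidal positional embeddings from the original Transformer paper ("Attention Is All You Need"): even dimensions use sine, odd dimensions use cosine, at wavelengths that grow geometrically with the dimension index:

```python
import math

def positional_embedding(pos: int, d_model: int) -> list:
    """Fixed sinusoidal embedding for position `pos` with
    dimensionality `d_model`: sin on even dims, cos on odd dims."""
    emb = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb

pe0 = positional_embedding(0, 8)  # position 0: sin(0)=0, cos(0)=1
```

Learned positional embeddings (a trainable lookup table indexed by position) are the common alternative.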


Resources

See: