Training and Inference pipeline example
Different architectures exist for exchanging data: using APIs, databases, or brokers.
Online prediction
To perform online prediction, an ML system requires:
- A (near) real-time data pipeline,
- High model inference speed,
- Compiling the model for the target hardware (see the Compiling section below).
Stream processing
Stream processing lets us build ML systems (pipelines) that can respond in real time or near real time.
Example of ride-sharing (Uber) service
A ride-sharing application is composed of 3 parts that need to exchange data:
Different architectures exist for exchanging this data: using APIs, databases, or brokers.
Real-Time Transport: Service Broker
A Service Broker implements native, in-database asynchronous message processing. It monitors the completion of tasks (usually command messages) between two different applications in the database engine, and it is responsible for the safe delivery of messages from one end to the other.
When two applications (within or outside of SQL Server) communicate, neither can access the technical details at the opposite end. It is the job of Service Broker to protect sensitive messages and reliably deliver them to the designated location. Service Broker is highly integrated and provides a simple Transact-SQL interface for sending and receiving messages, combined with a set of strong guarantees for message delivery and processing.
A broker is a real-time transport solution:
- Any service can publish to a stream [producer],
- Any service can subscribe to a stream to get the information it needs [consumer].
Request-driven vs. event-driven
- Request-driven: a service sends a request to another service and waits for a response,
- Event-driven: a service broadcasts events (information) to a stream; other services consume what they need.
Code example using Kafka
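The original snippet is missing, so below is a minimal producer/consumer sketch using the kafka-python client; the broker address (localhost:9092) and the ride_requests topic are assumptions for illustration:

```python
# A minimal producer/consumer sketch with kafka-python.
# Assumes a Kafka broker at localhost:9092 and a hypothetical "ride_requests" topic.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: a service publishes events to the stream.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("ride_requests", {"user_id": 42, "pickup": [48.85, 2.35]})
producer.flush()

# Consumer: another service (e.g. the price predictor) subscribes to the stream.
consumer = KafkaConsumer(
    "ride_requests",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    features = message.value               # near real-time features
    # prediction = model.predict(features) # hypothetical model call
    print(features)
    break
```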
Examples of real-time transport solutions include Apache Kafka and Amazon Kinesis.
Resources
See:
Batch processing vs. stream processing
Batch prediction vs. online prediction
Here is a comparison of both methods:
- Batch prediction (asynchronous): predictions are generated periodically or when triggered, stored (e.g. in a database or data warehouse), and retrieved when needed; optimized for high throughput.
- Online prediction (synchronous): predictions are generated on demand, as soon as requests arrive, and returned to the requester; optimized for low latency.
Online prediction can be served in different ways (a minimal HTTP sketch follows this list):
- HTTP protocol: the client sends a request and waits for the prediction in the response,
- Streaming: features and predictions flow through a real-time transport (broker),
- Unified: the batch and streaming pipelines are unified so the same code path serves both.
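To illustrate the HTTP (request/response) variant, here is a minimal online-prediction endpoint sketch using Flask; the route, port, and model call are hypothetical placeholders:

```python
# A minimal online-prediction endpoint (HTTP request/response) using Flask.
# The model loading and feature format are hypothetical placeholders.
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None  # e.g. model = load_model("model.pt"), loaded once at startup

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()             # features sent by the client
    # prediction = model.predict([features])  # hypothetical model call
    prediction = 0.0                           # placeholder so the sketch runs
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=8000)
```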
Cloud computing vs. Edge computing
Cloud computing means a large chunk of computation is done on the cloud, either public clouds or private clouds.
Edge computing means a large chunk of computation is done on consumer (edge) devices.
Edge computing
Benefits:
- Can work without (Internet) connections or with unreliable connections:
- Many companies have strict no-Internet policy,
- Caveat: devices are capable of doing computations but apps need external information:
- e.g. ETA needs external real-time traffic information to work well,
- Don’t have to worry about network latency:
- Network latency might be a bigger problem than inference latency,
- Many use cases are impossible if network latency is too high,
- Fewer concerns about privacy:
- Don’t have to send user data over networks (which can be intercepted),
- Cloud database breaches can affect many people,
- Easier to comply with regulations (e.g. GDPR),
- Caveat: edge computing might make it easier to steal user data by just taking the device,
- Cheaper:
- The more computations we can push to the edge, the less we have to pay for servers.
Challenges of ML on the edge:
- Device not powerful enough to run models:
- Energy constraint,
- Computational power constraint,
- Memory constraint.
Solutions:
- Hardware: Make hardware more powerful,
- Model compression: Make models smaller,
- Model optimization: Make models faster.
Hybrid
- Common predictions are precomputed and stored on device,
- Local data centers: e.g. each warehouse has its own server rack,
- Predictions are generated on cloud and cached on device.
Model optimization
Solutions to speed up model inference:
- Quantization
- Knowledge distillation
- Pruning
- Low-rank factorization
Quantization
Reduces the size of a model by using fewer bits to represent parameter values:
- E.g. half-precision (16-bit) or integer (8-bit) instead of full-precision (32-bit)
- 1-bit representation: BinaryConnect, XNOR-Net
Quantization in PyTorch:
Post-training quantization:
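The original snippet is not included here, so below is a minimal sketch of post-training dynamic quantization using PyTorch's torch.quantization.quantize_dynamic; the model is a hypothetical stand-in for a trained full-precision network:

```python
# A minimal sketch of post-training dynamic quantization in PyTorch.
# Only nn.Linear layers are quantized in this example.
import torch
import torch.nn as nn

model = nn.Sequential(     # stand-in for a trained full-precision (fp32) model
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()

# Dynamic quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x))
```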
Knowledge distillation
Train a small model (“student”) to mimic the outputs of a larger model (“teacher”); a minimal loss sketch follows the pros and cons below:
- Teacher & student can be trained at the same time,
- E.g. DistilBERT reduces the size of BERT by 40% and increases inference speed by 60%, while retaining 97% of its language understanding capabilities.
Pros:
- Fast to train student network if teacher is pre-trained,
- Teacher and student can be completely different architectures.
Cons:
- If teacher is not pre-trained, may need more data & time to first train teacher,
- Sensitive to applications and model architectures.
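Below is a minimal sketch of a distillation loss (not from the original notes), assuming hypothetical teacher and student classifiers; it mixes a softened-output KL term with the usual hard-label cross-entropy:

```python
# A minimal knowledge-distillation loss sketch; teacher and student are
# hypothetical classifiers producing logits over the same classes.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix a soft loss (match the teacher's softened outputs) with the
    usual hard-label cross-entropy loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop (teacher is frozen):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
```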
Pruning
- Originally used for decision trees to remove sections that are not critical to the prediction,
- For neural networks, pruning reduces over-parameterization in two ways:
  - Remove entire nodes (changes the architecture and reduces the number of parameters),
  - Find the least useful parameters and set them to 0 (a minimal sketch follows this list):
    - The number of parameters remains the same,
    - The number of non-zero parameters is reduced,
    - Makes models more sparse:
      - Lower memory footprint,
      - Increased inference speed.
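Below is a minimal sketch of magnitude-based pruning using torch.nn.utils.prune; the layer size and the 30% sparsity level are illustrative assumptions:

```python
# A minimal magnitude-pruning sketch: zero out the 30% of weights
# with the smallest L1 magnitude in a hypothetical linear layer.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# Zero the 30% of weights with the lowest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# The parameter count is unchanged; a mask just sets entries to 0.
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")

# Make the pruning permanent (removes the mask and re-parametrization).
prune.remove(layer, "weight")
```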
Low-rank factorization
The key idea is to replace large, high-dimensional weight tensors with products of smaller, lower-rank tensors, reducing the number of parameters and speeding up computation.
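As an illustration (not part of the original notes), here is a minimal sketch that factorizes a single nn.Linear layer into two smaller ones via a truncated SVD; the layer size and rank are arbitrary assumptions:

```python
# A minimal low-rank factorization sketch: approximate one large linear layer
# with two smaller ones using a truncated SVD of its weight matrix.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data                    # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Keep only the top-`rank` singular values/vectors.
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()
    second.weight.data = U[:, :rank].contiguous()
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

big = nn.Linear(1024, 1024)
small = factorize_linear(big, rank=64)
n_big = sum(p.numel() for p in big.parameters())
n_small = sum(p.numel() for p in small.parameters())
print(n_big, n_small)  # the factorized version has far fewer parameters
```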
Compiling
Framework developers tend to focus on providing support to only a handful of server-class hardware, and hardware vendors tend to offer their own kernel libraries for a narrow range of frameworks. Deploying ML models to new hardware requires significant manual effort.
Instead of targeting new compilers and libraries for every new hardware backend, what if we create a middle man to bridge frameworks and platforms? Framework developers no longer have to support every type of hardware; they only need to translate their framework code into this middle man. Hardware vendors can then support one middle man instead of multiple frameworks.
This type of “middle man” is called an intermediate representation (IR). IRs lie at the core of how compilers work. This process is also called “lowering”, as in you “lower” your high-level framework code into low-level hardware-native code.
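As a concrete illustration (not from the original notes), PyTorch's TorchScript can lower a model into an intermediate representation that can be saved and executed without the original Python code; the model below is a hypothetical stand-in:

```python
# A minimal sketch of lowering a model into an intermediate representation
# with TorchScript; the two-layer network is a hypothetical example.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
example_input = torch.randn(1, 16)

# Trace the model: record the operations executed on the example input
# and turn them into a TorchScript graph (an IR).
traced = torch.jit.trace(model, example_input)
print(traced.graph)       # inspect the intermediate representation
traced.save("model.pt")   # the IR can be loaded and run without the Python source
```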
Example: compute primitives of CPUs, GPUs, and TPUs
The compute primitive of CPUs used to be a number (scalar), the compute primitive of GPUs used to be a one-dimensional vector, whereas the compute primitive of TPUs is a two-dimensional vector (tensor). Performing a convolution operator will be very different with 1-dimensional vectors compared to 2-dimensional vectors. You’d need to take this into account to use them efficiently.
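As a rough illustration (not from the original notes), the same dot product can be expressed as a loop of scalar multiply-adds or as a single vectorized primitive; hardware with wider compute primitives benefits from the latter formulation:

```python
# Scalar vs. vectorized computation of the same dot product (illustrative only).
import numpy as np

a = np.random.rand(10_000)
b = np.random.rand(10_000)

# Scalar view: one multiply-add at a time (what a plain CPU loop does).
total = 0.0
for x, y in zip(a, b):
    total += x * y

# Vector view: the whole operation expressed as a single primitive,
# which maps well onto SIMD units, GPUs, or TPU matrix units.
total_vec = a @ b

print(np.isclose(total, total_vec))  # True
```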
ML in browsers
It is possible to generate code that can run on just any hardware backends by running that code in browsers. If you can run your model in a browser, you can run your model on any device that supports browsers: Macbooks, Chromebooks, iPhones, Android phones, and more. You wouldn’t need to care what chips those devices use. If Apple decides to switch from Intel chips to ARM chips, it’s not your problem.
JavaScript:
- Tools exist to help you compile your models into JavaScript, such as TensorFlow.js, Synaptic, and brain.js,
- JavaScript is slow, and its capacity as a programming language is limited for complex logic such as extracting features from data.
WebAssembly (WASM):
- Open standard that allows running executable programs in browsers,
- Performant, easy to use, and has a growing ecosystem,
- Supported by 93% of devices worldwide,
- Still slower than running code natively on devices (but faster than JavaScript).