Inference Runtimes
ONNX Runtime
A cross-platform inference engine from Microsoft that abstracts hardware behind pluggable Execution Providers (EPs) such as OpenVINO or XNNPACK.
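A minimal sketch of selecting EPs through the Python API; the model path and input shape are placeholders, and providers that are not installed fall back to the next one in the list:

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in priority order; if the OpenVINO EP is not
# available, the session falls back to the default CPU provider.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to any ONNX model
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
outputs = session.run(None, {input_name: x})
```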
OpenVINO
Intel's toolkit for deploying models on Intel CPUs, GPUs, and NPUs, with an emphasis on INT8 quantization and operator fusion.
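A sketch of the OpenVINO Python API, assuming a model already converted to OpenVINO IR (`model.xml` plus its weights file); the device name and input shape are placeholders:

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")         # placeholder IR file
compiled = core.compile_model(model, "CPU")  # or "GPU", "AUTO", etc.

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed shape
results = compiled([x])  # a CompiledModel is directly callable
```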
TensorFlow Lite
Google's framework for mobile and edge devices; its FlatBuffers model format permits memory-mapped loading with minimal overhead, and the XNNPACK library supplies optimized CPU kernels.
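A sketch of the interpreter workflow in Python; the `.tflite` file is a placeholder, and recent builds route supported float operators through XNNPACK by default:

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder
interpreter.allocate_tensors()  # one-time arena allocation for all tensors

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

x = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
y = interpreter.get_tensor(output_details[0]["index"])
```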
Core ML
Apple's on-device framework designed for heterogeneous execution across the CPU, GPU, and Apple Neural Engine (ANE).
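Core ML models are typically produced with coremltools; the sketch below converts a toy PyTorch module and uses `compute_units` to declare which processors Core ML may schedule onto (the module and shapes are illustrative):

```python
import coremltools as ct
import torch

class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) * 2.0

traced = torch.jit.trace(Tiny().eval(), torch.rand(1, 3, 224, 224))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,  # CPU, GPU, and ANE; CPU_ONLY etc. also exist
)
mlmodel.save("Tiny.mlpackage")
```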
ML Compilers
Compilers such as Apache TVM and XLA generate optimized machine code directly from a model graph, reducing runtime overhead through graph-level optimizations such as constant folding and kernel fusion.
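These optimizations can be observed with JAX, whose `jax.jit` lowers a function to XLA; in the sketch below the constant subexpression is evaluated at compile time, and the remaining multiply and add are typically fused into a single kernel:

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # (2.0 * 3.0) is folded to 6.0 during compilation; the multiply
    # and add over x are candidates for fusion into one kernel.
    return x * (2.0 * 3.0) + 1.0

print(f(jnp.arange(4.0)))  # [ 1.  7. 13. 19.]
```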
Niche & Historical Runtimes
- Apache MXNet: An early, highly scalable deep learning framework notable for memory-efficient execution and symbolic graph optimization; retired to the Apache Attic in 2023.
- Paddle Lite: Baidu's high-performance inference engine for mobile, embedded, and IoT devices.
- SNPE (Snapdragon Neural Processing Engine): Qualcomm's SDK for executing networks on Hexagon DSPs and Adreno GPUs, an early example of mobile-first AI acceleration.
- Tengine: An open-source, lightweight inference engine targeting ARM-based embedded and IoT devices.