High-Fidelity Runtimes: BF16 on CPU

Ziroh Labs and Kompact AI

Ziroh Labs develops inference runtimes that run on CPUs instead of requiring GPUs. Its Kompact AI project explores inference without the fidelity loss typical of low-bit quantization.

Full-Precision (BF16) Execution

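The technical objective is to run models in BF16 on CPUs, without quantizing weights to lower bit widths, by optimizing memory access and computation scheduling. BF16 (bfloat16) is a 16-bit format that keeps the 8-bit exponent of FP32 but only 7 explicit mantissa bits, so "full precision" here means unquantized rather than 32-bit. This approach applies in domains where the rounding error introduced by 4-bit quantization is not tolerable.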
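To make the precision gap concrete, here is a minimal sketch comparing the rounding error of BF16 with a simple symmetric 4-bit scheme. The weight values are made up, the helper names `to_bf16` and `quantize_int4` are hypothetical, and the per-tensor int4 scheme is only one of many quantization methods. It is not a description of how Kompact AI handles weights.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias, then keep only the top 16 bits
    # (1 sign bit, 8 exponent bits, 7 mantissa bits).
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def quantize_int4(values):
    """Symmetric per-tensor 4-bit quantization: integers in [-7, 7] times one scale."""
    scale = max(abs(v) for v in values) / 7.0
    return [max(-7, min(7, round(v / scale))) * scale for v in values]

# Made-up weights, for illustration only.
weights = [0.0123, -0.4567, 0.8910, -0.0042, 0.3333, -0.7071]

bf16 = [to_bf16(w) for w in weights]
int4 = quantize_int4(weights)

def max_abs_err(approx):
    return max(abs(a - w) for a, w in zip(approx, weights))

print("max abs error, BF16:", max_abs_err(bf16))
print("max abs error, int4:", max_abs_err(int4))
```

BF16 keeps a relative error bounded by its mantissa width for every value, whereas a single 4-bit scale spreads only 15 levels across the whole tensor, so small weights lose most of their information.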

Semantic Caching Layers

An internal semantic caching layer, designated "Elephant", aims to detect inputs that are similar to ones already served and return the stored result instead of running inference again. The design targets repetitive enterprise workloads.
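The general idea of a semantic cache can be sketched as follows. This is an illustrative toy, not the "Elephant" implementation: the class `SemanticCache`, the bag-of-words `embed` function, and the 0.8 threshold are all assumptions made for the example, and a real system would more likely use a learned sentence encoder and an approximate nearest-neighbor index.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (hypothetical stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, prompt: str):
        """Return the stored response of the most similar prompt, or None on a miss."""
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

def answer(prompt, cache, run_model):
    """Serve from the cache when a similar prompt was seen; otherwise run the model."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached
    response = run_model(prompt)
    cache.store(prompt, response)
    return response

# Usage with a fake model that records how often it is actually invoked.
calls = []
def fake_model(prompt):
    calls.append(prompt)
    return f"response #{len(calls)}"

cache = SemanticCache(threshold=0.8)
first = answer("what is the refund policy for orders", cache, fake_model)
second = answer("what is the refund policy for my orders", cache, fake_model)
third = answer("reset my account password", cache, fake_model)
print(first, second, third, "model calls:", len(calls))
```

The threshold is the main design choice: set too low, the cache returns answers to questions that were not asked; set too high, it rarely hits.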

Performance Analysis: Reported benchmarks indicate throughput of 164 tokens/sec on standard CPUs. The stated target is efficiency comparable to discrete accelerators at specific batch sizes.
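To see why batch size matters for a figure like this, a rough bandwidth-bound estimate is helpful. The sketch below assumes decoding is limited purely by streaming the weights from memory once per forward pass and ignores KV-cache traffic and compute limits. The function name and every number in the usage lines are hypothetical; they are not the model or hardware behind the 164 tokens/sec figure.

```python
def decode_tokens_per_sec_upper_bound(params_billion: float,
                                      bytes_per_param: float,
                                      mem_bandwidth_gb_s: float,
                                      batch_size: int = 1) -> float:
    """Upper bound on decode throughput when weight reads dominate.

    Each forward pass streams all weights once and yields one token per
    sequence in the batch, so the bound scales linearly with batch size.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    passes_per_sec = mem_bandwidth_gb_s * 1e9 / weight_bytes
    return passes_per_sec * batch_size

# Hypothetical configuration: 3B parameters, 200 GB/s of memory bandwidth.
bf16_single = decode_tokens_per_sec_upper_bound(3, 2.0, 200)
bf16_batched = decode_tokens_per_sec_upper_bound(3, 2.0, 200, batch_size=8)
int4_single = decode_tokens_per_sec_upper_bound(3, 0.5, 200)

print(f"BF16, batch 1: {bf16_single:.1f} tokens/sec (upper bound)")
print(f"BF16, batch 8: {bf16_batched:.1f} tokens/sec (upper bound)")
print(f"int4, batch 1: {int4_single:.1f} tokens/sec (upper bound)")
```

Under this simplified model, BF16 pays a 4x bandwidth cost relative to 4-bit weights at batch size 1, and batching is one way to recover aggregate throughput because the weights are read once per pass regardless of how many sequences share it.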

Infrastructure and Independence

These runtimes make it possible to build high-performance AI infrastructure on commodity hardware, reducing dependency on proprietary accelerators.

Foundational Math Libraries

BLIS (BLAS-like Library Instantiation Software): A framework for instantiating high-performance BLAS-like libraries, giving fine-grained control over the micro-kernels that perform the innermost computation.
OpenBLAS: An optimized BLAS library derived from GotoBLAS2, long used for single- and double-precision scientific computing on CPUs.
Intel MKL (now oneMKL): Intel's math kernel library for x86, commonly used as the performance baseline against which other CPU math libraries are compared.
AMD AOCL (AMD Optimizing CPU Libraries): AMD's suite of numerical libraries tuned for its processors, including EPYC; its BLAS component is built on BLIS.
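The micro-kernel idea behind these libraries can be illustrated with a small sketch. It is loosely modeled on the structure BLIS describes, with outer loops that walk over tiles of the output and a small kernel that updates one tile through a series of rank-1 updates. The tile sizes and function names are assumptions for the example, and the sketch omits the operand packing, cache-level blocking, and SIMD instructions that real kernels depend on.

```python
MR, NR = 4, 4  # micro-tile size; real kernels choose these to fit the register file

def micro_kernel(A, B, C, i0, j0, k_len, m_r, n_r):
    """Update the m_r x n_r tile of C at (i0, j0) as a sum of k_len rank-1 updates."""
    acc = [[0.0] * n_r for _ in range(m_r)]  # stands in for vector registers
    for p in range(k_len):
        for i in range(m_r):
            a = A[i0 + i][p]
            for j in range(n_r):
                acc[i][j] += a * B[p][j0 + j]
    # Write the accumulated tile back to C once, after the k loop finishes.
    for i in range(m_r):
        for j in range(n_r):
            C[i0 + i][j0 + j] += acc[i][j]

def gemm(A, B):
    """Compute C = A @ B for lists of lists by tiling C into MR x NR blocks."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, NR):
        for i0 in range(0, m, MR):
            # Edge tiles may be smaller than MR x NR.
            micro_kernel(A, B, C, i0, j0, k, min(MR, m - i0), min(NR, n - j0))
    return C

A = [[float(i + p) for p in range(3)] for i in range(5)]      # 5 x 3
B = [[float(p * 2 - j) for j in range(6)] for p in range(3)]  # 3 x 6
C = gemm(A, B)
print(C[0])
```

Keeping the accumulator tile local for the whole k loop is the point of the design: each loaded element of A and B is reused across a full row or column of the tile before C is touched again.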