The 1.58-bit Paradigm ({-1, 0, 1})
Microsoft's BitNet b1.58 represents a shift in LLM architecture: weights are constrained to the ternary values -1, 0, or 1, so each parameter carries at most log2(3) ≈ 1.58 bits of information.
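To make the arithmetic concrete, the sketch below (NumPy, illustrative rather than the official BitNet code) ternarizes a weight matrix with an absmean-style scale, in the spirit of the b1.58 recipe, and shows where the 1.58-bit figure comes from.

```python
import numpy as np

# Each weight takes one of three values, so its information content is
# log2(3) ≈ 1.585 bits -- hence the "1.58-bit" name.
bits_per_weight = np.log2(3)

def ternarize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Round-to-nearest ternarization with an absmean scale.
    Illustrative sketch only, not the reference BitNet implementation."""
    gamma = np.abs(w).mean() + 1e-8            # per-tensor scale
    w_q = np.clip(np.round(w / gamma), -1, 1)  # entries in {-1, 0, 1}
    return w_q.astype(np.int8), float(gamma)

w = np.random.randn(4, 8).astype(np.float32)
w_q, gamma = ternarize(w)
print(bits_per_weight)  # ~1.585
print(w_q)              # ternary matrix; gamma is the only FP value kept
```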
Computational Efficiency
In standard Transformer architectures, general matrix multiplication (GEMM) is the primary bottleneck. Because every weight is -1, 0, or 1, BitNet replaces the multiplications in these products with integer additions and subtractions, skipping zero weights entirely. This optimization is particularly effective on CPU architectures, which can execute such operations with high throughput; see the matrix-vector sketch below.
- 6.25x speedup over FP16
- 80% energy reduction
- ~4x memory reduction
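The following NumPy sketch shows why the multiplications disappear: with ternary weights, a matrix-vector product reduces to summing the activations where the weight is +1, subtracting them where it is -1, and skipping zeros, followed by a single rescale. Production kernels (e.g. in bitnet.cpp) pack the weights and vectorize this; the explicit loop here is only for clarity.

```python
import numpy as np

def ternary_matvec(w_q: np.ndarray, x: np.ndarray, gamma: float) -> np.ndarray:
    """Matrix-vector product with weights in {-1, 0, 1}: no multiplications
    inside the loop, only additions and subtractions. Zero weights are skipped."""
    out = np.zeros(w_q.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_q):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return gamma * out  # one rescale per output using the stored scale

w_q = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 1.5], dtype=np.float32)
print(ternary_matvec(w_q, x, gamma=1.0))  # [-1.0, -0.5]
```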
Technical Implementation
- BitLinear Layers: Utilize a quantization-aware ternary strategy to maintain performance parity with full-precision models (see the sketch after this list).
- Benchmarking: BitNet b1.58 achieves results competitive with FP16 baselines on benchmarks such as MMLU and GSM8K.
- bitnet.cpp: Provides optimized kernels for x86 and ARM architectures, facilitating low-latency inference on standard CPUs.
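As a rough illustration of the BitLinear idea referenced above, the PyTorch sketch below ternarizes the latent full-precision weights on the fly with an absmean scale and uses a straight-through estimator so those weights remain trainable. This is a simplified sketch under stated assumptions: the published implementation also quantizes activations (8-bit absmax) and adds normalization, both omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a quantization-aware ternary linear layer (not the official code)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        gamma = w.abs().mean().clamp(min=1e-8)          # absmean scale
        w_q = (w / gamma).round().clamp(-1, 1) * gamma  # ternary values * scale
        w_ste = w + (w_q - w).detach()                  # straight-through estimator
        return F.linear(x, w_ste, self.bias)

layer = BitLinear(16, 4)
y = layer(torch.randn(2, 16))
print(y.shape)  # torch.Size([2, 4])
```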
Historical Low-Bit Initiatives
- BinaryNet (2016): One of the first successful attempts to train neural networks with binary weights and activations.
- XNOR-Net: A milestone research paper demonstrating that convolution operations can be performed using XNOR and bit-counting (popcount) operations.
- DoReFa-Net: A framework for training convolutional neural networks with arbitrary-bitwidth weights, activations, and gradients.
- Ternary Weight Networks (TWN): Early research into discretizing weights to {-1, 0, 1}, which paved the way for modern ternary LLMs.
Objective
The 1.58-bit paradigm enables the deployment of models with large parameter counts on low-power devices and legacy hardware by minimizing the memory footprint and removing the dependence on floating-point matrix multiplication.