llama.cpp and the GGML/GGUF Format
Developed by Georgi Gerganov, llama.cpp is an implementation of Transformer inference written in plain C/C++ on top of the GGML tensor library. The objective is to run large models on commodity CPUs by eliminating the runtime overhead of high-level frameworks. Models are distributed in the GGUF file format (the successor to the original GGML format): a single binary file that packs the quantized tensors together with key-value metadata such as the architecture, tokenizer, and hyperparameters.
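To make the container concrete, here is a minimal sketch that reads the fixed fields at the start of a GGUF file: a 4-byte magic ("GGUF"), a uint32 format version, and two uint64 counts, all little-endian. The program structure and error handling are illustrative; the full specification defines many more fields (metadata key-value pairs, tensor descriptors) after this header.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative GGUF header reader, assuming a little-endian host. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version;
    uint64_t n_tensors, n_kv;

    /* The file begins with the magic bytes "GGUF". */
    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "not a GGUF file\n"); fclose(f); return 1;
    }
    /* Then: format version, tensor count, metadata key-value count. */
    if (fread(&version,   sizeof version,   1, f) != 1 ||
        fread(&n_tensors, sizeof n_tensors, 1, f) != 1 ||
        fread(&n_kv,      sizeof n_kv,      1, f) != 1) {
        fprintf(stderr, "truncated header\n"); fclose(f); return 1;
    }
    fclose(f);

    printf("GGUF v%u: %llu tensors, %llu metadata entries\n",
           version, (unsigned long long)n_tensors, (unsigned long long)n_kv);
    return 0;
}
```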
Aggressive Quantization (k-quants)
The project's quantization schemes map 16-bit floating-point weights to 4- and 5-bit integers, stored in small blocks that each carry a higher-precision scale factor. The savings follow directly from the arithmetic: a 7B-parameter model at 16 bits needs roughly 14GB of memory (7B weights × 2 bytes), which typically forces it onto a large-VRAM GPU; at ~4.5 effective bits per weight the same tensors shrink to roughly 4GB, so the model operates within ~5GB of ordinary system RAM.
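The sketch below shows the core idea as a simplified symmetric block format in the spirit of llama.cpp's Q4_0 (the block size of 32 matches, but the real format stores the scale as a 16-bit float, and the k-quant variants add super-block scales and minimums on top):

```c
#include <math.h>
#include <stdint.h>

#define QBLOCK 32  /* weights per block; Q4_0 also uses 32 */

/* One quantized block: a scale plus 32 4-bit values packed into 16 bytes.
 * (A float scale is used here for simplicity; real formats use fp16.) */
typedef struct {
    float   d;                /* per-block scale */
    uint8_t qs[QBLOCK / 2];   /* two 4-bit values per byte */
} block_q4;

/* Quantize 32 floats: scale so the largest-magnitude weight maps to -8,
 * then shift each signed 4-bit value into [0, 15] for packing. */
static void quantize_block_q4(const float *x, block_q4 *out) {
    float amax = 0.0f, maxv = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); maxv = x[i]; }
    }
    const float d  = maxv / -8.0f;
    const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
    out->d = d;
    for (int i = 0; i < QBLOCK / 2; i++) {
        int q0 = (int)(x[2*i + 0] * id + 8.5f);  /* round, shift to [0,15] */
        int q1 = (int)(x[2*i + 1] * id + 8.5f);
        if (q0 < 0) q0 = 0; if (q0 > 15) q0 = 15;
        if (q1 < 0) q1 = 0; if (q1 > 15) q1 = 15;
        out->qs[i] = (uint8_t)(q0 | (q1 << 4));
    }
}

/* Dequantize: each nibble q in [0,15] reconstructs to (q - 8) * d. */
static void dequantize_block_q4(const block_q4 *b, float *x) {
    for (int i = 0; i < QBLOCK / 2; i++) {
        x[2*i + 0] = (float)((b->qs[i] & 0x0F) - 8) * b->d;
        x[2*i + 1] = (float)((b->qs[i] >>   4) - 8) * b->d;
    }
}
```

With a 2-byte fp16 scale per 32-weight block, the cost is (32 × 4 + 16) / 32 = 4.5 bits per weight, which is where the ~4GB figure for a 7B model comes from.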
Hardware-Specific SIMD Optimization
The hot loops, chiefly the quantized matrix-vector products, are hand-written with SIMD intrinsics: AVX2/AVX-512 on x86 and NEON on ARM. On Apple Silicon, the project additionally exploits the chip's high-bandwidth unified memory (alongside a Metal GPU backend), which lets a laptop-class part approach the throughput of entry-level discrete accelerators.
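To give a flavor of these kernels, here is a small AVX2/FMA dot product over 32-bit floats. It is not llama.cpp's actual code (the real kernels operate directly on quantized blocks and handle length tails); this sketch assumes n is a multiple of 8 and a compiler invoked with -mavx2 -mfma.

```c
#include <immintrin.h>
#include <stddef.h>

/* Dot product of two f32 vectors, 8 lanes per iteration via fused
 * multiply-add, followed by a horizontal reduction of the accumulator. */
static float dot_f32_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb */
    }
    /* Reduce the 8 partial sums to a scalar. */
    __m128 lo   = _mm256_castps256_ps128(acc);
    __m128 hi   = _mm256_extractf128_ps(acc, 1);
    __m128 sum4 = _mm_add_ps(lo, hi);
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 0x55));
    return _mm_cvtss_f32(sum1);
}
```

The same structure carries over to NEON (vfmaq_f32 plus a pairwise reduction); what changes per target is the register width and how many quantized nibbles can be unpacked per instruction.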