4-Bit Efficiency, 16-Bit Accuracy
Microsoft researchers show that heavily quantized versions of Llama can perform as well as near-full-precision counterparts

Diagram of FP4 training scheme showing BF16 tensor quantization and FP4 tensor core processing for efficient computation.

Using an 8-bit number format like FP8 during training saves computation compared to 16- or 32-bit formats, but it can yield less-accurate results. Researchers trained models using 4-bit numbers without sacrificing accuracy.

What’s new: Ruizhe Wang and colleagues at Microsoft and the University of Science and Technology of China trained large language models (LLMs) using FP4 for matrix multiplications and achieved accuracy comparable to that of LLMs trained using the popular BF16 format. Since matrix multiplications account for 95 percent of computation in LLM training, FP4 could significantly accelerate computation and reduce memory costs.

Key insight: Quantization functions, which accelerate computation by reducing the precision of model weights and layer outputs, make standard gradient-based training impossible because they’re not differentiable. A common workaround, the straight-through estimator, passes the derivative through as though quantization hadn’t occurred, but this degrades the resulting model’s accuracy. A differentiable approximation of the quantization function lets quantization reduce training computation while maintaining the accuracy of the trained model.
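
To make the contrast concrete, here is a minimal PyTorch sketch, not the authors’ implementation, that compares the straight-through estimator with a backward pass that uses the derivative of a smooth approximation of rounding. The uniform 16-level grid and the particular smooth staircase f(x) = x − (step/2π)·sin(2πx/step) are illustrative assumptions, not the paper’s exact quantizer or gradient estimator.

```python
import math
import torch

LEVELS = 16                # stand-in for a 4-bit grid
STEP = 2.0 / (LEVELS - 1)  # uniform step over [-1, 1]

def fake_quantize(x):
    """Round x onto a coarse uniform grid in [-1, 1] (a stand-in for FP4)."""
    return torch.clamp(torch.round(x / STEP) * STEP, -1.0, 1.0)

class QuantizeSTE(torch.autograd.Function):
    """Straight-through estimator: the backward pass pretends rounding never happened."""
    @staticmethod
    def forward(ctx, x):
        return fake_quantize(x)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # gradient passed through unchanged

class QuantizeSmooth(torch.autograd.Function):
    """Backward pass uses the derivative of a smooth approximation of rounding:
    f(x) = x - (STEP / 2π) * sin(2πx / STEP), so f'(x) = 1 - cos(2πx / STEP)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return fake_quantize(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        inside = x.abs() <= 1.0  # no gradient where the clamp saturates
        surrogate = 1.0 - torch.cos(2 * math.pi * x / STEP)
        return grad_out * surrogate * inside

x = torch.randn(5, requires_grad=True)
QuantizeSTE.apply(x).sum().backward()
print("straight-through gradient:", x.grad)

x.grad = None
QuantizeSmooth.apply(x).sum().backward()
print("smooth-surrogate gradient:", x.grad)
```

The point of the second class is only that the backward pass sees a nonzero gradient shaped by a smooth stand-in for the quantizer, rather than pretending the rounding step is the identity.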

How it works: The authors pretrained Llama 2 13B on 100 billion tokens of text scraped from the web. They used FP4 for matrix multiplications and FP8, BF16, or FP16 for the other operations such as optimizer updates.

  • To quantize the model weights to FP4 (whose values range from -6 to 6), the authors scaled the values in each weight matrix relative to its maximum absolute value (see the sketch after this list). They computed weight updates on a higher-precision copy of the weights, so they re-quantized the weights during the forward pass at each training step.
  • Although the weights had been quantized to 4 bits, the matrix multiplication between the weights and the previous layer’s outputs could produce values outside the FP4 range. So, in each layer, the authors clamped any input value that exceeded the 99th percentile of the layer’s input values to that percentile value before converting the layer’s inputs to FP4. Clamping outliers kept extreme values from distorting the scaling during FP4 conversion.
  • Clamping outliers introduced some error, so the authors computed a correction matrix for the result of the matrix multiplication. They computed this correction in FP16 via sparse matrix multiplication between the weights and the clamped outliers.
  • During backpropagation, the authors computed the gradients through a differentiable function that approximated the quantization function.
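
A rough simulation of the forward pass described in the list above might look like the following sketch (again, not the authors’ code). The function names, the E2M1 value grid, the use of absolute values for the 99th-percentile cutoff, and the dense correction matmul are assumptions for illustration; on FP4-capable hardware, the low-precision matrix multiplication would run on the tensor cores, and the correction would use a sparse FP16 multiplication.

```python
import torch

# Representable nonnegative FP4 (E2M1) magnitudes; the maximum value is 6.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize(x):
    """Scale x so its largest magnitude maps to 6, snap to the nearest FP4
    value, then scale back (simulated here in full precision)."""
    scale = 6.0 / x.abs().max().clamp(min=1e-12)
    xs = x * scale
    idx = (xs.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return torch.sign(xs) * FP4_GRID[idx] / scale

def fp4_linear(x, w):
    """Forward pass of one layer: clamp activation outliers, quantize both
    operands, then add a higher-precision correction for the clamped part."""
    cutoff = torch.quantile(x.abs(), 0.99).item()  # 99th-percentile threshold
    x_clamped = x.clamp(-cutoff, cutoff)
    residual = x - x_clamped        # nonzero only for the few outliers

    w_q = fp4_quantize(w)           # re-quantized from the high-precision copy
    x_q = fp4_quantize(x_clamped)   # clamping keeps outliers from stretching the scale

    main = x_q @ w_q                # would run on FP4 tensor cores
    correction = residual @ w       # the paper uses a sparse FP16 matmul here
    return main + correction

x = torch.randn(8, 64)   # layer inputs (activations)
w = torch.randn(64, 32)  # high-precision master weights
print(fp4_linear(x, w).shape)  # torch.Size([8, 32])
```

Clamping before computing the scale is the design point that matters: a single extreme activation would otherwise stretch the scaling factor so far that most values collapse onto a few FP4 levels.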

Results: The authors simulated FP4 hardware on Nvidia H100 GPUs, which don’t directly support that number format. FP4 achieved accuracy similar to that of BF16 during training and across a wide variety of tasks at inference.

  • On question-answering tasks, FP4 approached or outperformed BF16. Averaged across nine benchmarks including BoolQ (answering yes-no questions), HellaSwag (completing an incomplete narrative), and ARC-C (answering multiple-choice questions that involve reasoning), FP4 achieved 54.95 percent accuracy, while BF16 achieved 54.44 percent.
  • Specifically, on HellaSwag, FP4 training achieved 54.12 percent accuracy, while BF16 achieved 53.56 percent.
  • On BoolQ, FP4 achieved 55.90 percent accuracy, while BF16 achieved 57.40 percent.

Why it matters: Training LLMs at FP4 precision ought to reduce computation dramatically on hardware that supports FP4 matrix multiplications.

We’re thinking: FP4-ready hardware became available in the cloud only early this year, so the authors weren’t able to measure the actual acceleration. As FP4-capable hardware becomes more widely available, FP4 promises faster, more energy-efficient training.
