Falcon-H1R-FP8: Accelerating Inference with Quantized Precision
Introducing Falcon-H1R-7B-FP8, a fully quantized version of the Falcon-H1R-7B model that packs both weights and activations into FP8 format. Using NVIDIA Model Optimizer's post-training quantization (PTQ) workflow, the FP8 model preserves the original BF16 quality while delivering a 1.2×–1.5× throughput boost and halving the GPU memory footprint (a minimal PTQ sketch follows the evaluation results below).

Evaluations

The FP8 variant retains essentially the same accuracy as BF16 across all three major reasoning benchmarks: AIME25 drops only 0.8 points (83.1% → 82.3%), LCB-v6 falls by 1.0 point (68.6% → 67.6%), and GPQA-D shows a negligible 0.1-point difference (61.3% → 61.2%). These results confirm that FP8 PTQ preserves benchmark performance while delivering substantial memory and throughput gains. ...
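For readers who want to reproduce a similar quantization pass, here is a minimal sketch of FP8 PTQ using NVIDIA Model Optimizer's `modelopt.torch.quantization` API. The model ID, calibration prompts, and calibration-set size are illustrative assumptions, not the exact recipe used to produce Falcon-H1R-7B-FP8.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq

# Hypothetical Hugging Face repo id for the BF16 base model.
MODEL_ID = "tiiuae/Falcon-H1R-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)

# A tiny illustrative calibration set; real PTQ typically uses
# a few hundred representative samples.
calib_prompts = [
    "Explain the advantages of FP8 inference in one sentence.",
    "Solve: what is the derivative of x**3 + 2*x?",
]

def forward_loop(m):
    # ModelOpt calls this to observe activations and fit FP8 scales.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# Quantize both weights and activations to FP8 with the default FP8 recipe.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
mtq.print_quant_summary(model)  # inspect which layers were quantized
```

Because PTQ only calibrates scaling factors rather than retraining weights, a pass like this runs in minutes on a single GPU, which is what makes the near-lossless accuracy numbers above practical to obtain.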
