Introducing Falcon H1R 7B FP8, a fully quantized version of the Falcon H1R 7B model that packs both weights and activations into FP8 format. Using the NVIDIA Model Optimizer post-training quantization (PTQ) workflow, the FP8 model preserves the original BF16 quality while delivering a 1.2×–1.5× throughput boost and halving the GPU memory footprint.
Evaluations
The FP8 variant retains essentially the same accuracy as BF16 across all three major reasoning benchmarks: AIME25 drops only 0.8 points (83.1 % → 82.3 %), LCB‑v6 falls by 1 point (68.6 % → 67.6 %), and GPQA‑D shows a negligible 0.1‑point difference (61.3 % → 61.2 %). These results confirm that FP8 PTQ preserves benchmark performance while delivering substantial memory and throughput gains.
Under the DeepConf test‑time filtering regime (5 repetitions, 128 rollouts per prompt), Falcon H1R 7B FP8 attains 89.3 % accuracy on AIME 2025 and 94.0 % on AIME 2024, compared to 91.3 % and 95.3 % for the BF16 baseline. The modest 1–2 point drop confirms that FP8 quantization preserves the model’s reasoning performance while still benefiting from DeepConf’s efficient trace pruning.
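For intuition, here is a simplified illustration of confidence‑filtered voting in the spirit of DeepConf. It is not the exact algorithm or the hyperparameters behind the numbers above; the rollouts input (final answer plus mean token log‑probability per sampled trace) is an assumption for the sketch.
from collections import Counter

# Simplified sketch of confidence-filtered voting in the spirit of DeepConf.
# `rollouts` is assumed to be a list of (final_answer, mean_token_logprob) pairs
# produced by sampling many reasoning traces for a single prompt.
def confident_vote(rollouts, keep_frac=0.5):
    # Keep the most confident traces (highest mean token log-probability)...
    kept = sorted(rollouts, key=lambda r: r[1], reverse=True)[: max(1, int(len(rollouts) * keep_frac))]
    # ...then majority-vote over their final answers.
    return Counter(answer for answer, _ in kept).most_common(1)[0][0]

# Toy example with 4 rollouts for one prompt.
print(confident_vote([("42", -0.21), ("41", -0.90), ("42", -0.35), ("7", -1.40)]))  # -> 42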
Inference profiling
Memory
Falcon H1R 7B FP8 cuts the weight memory footprint from 14.2 GB to just 7.9 GB, a reduction of roughly 44 % that enables deployment on GPUs with lower VRAM while preserving the model’s performance.
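A quick back‑of‑the‑envelope check of these numbers, assuming roughly 7.1 billion parameters (the exact parameter count is an assumption; the measured FP8 footprint is slightly larger than the ideal 1 byte per parameter because of scaling factors and any layers kept in higher precision):
n_params = 7.1e9               # assumed parameter count for a ~7B-class model
bf16_gb = n_params * 2 / 1e9   # 2 bytes per parameter in BF16 -> ~14.2 GB
fp8_gb  = n_params * 1 / 1e9   # 1 byte per parameter in FP8   -> ~7.1 GB
print(f"BF16: {bf16_gb:.1f} GB, FP8: {fp8_gb:.1f} GB, ideal saving: {1 - fp8_gb / bf16_gb:.0%}")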
Offline inference benchmarking
Inference was benchmarked using offline vLLM on a single NVIDIA H100 GPU.
Throughput
The plot below shows that Falcon H1R 7B FP8 consistently outperforms its BF16 counterpart across all batch sizes.
- For the Input = 512 / Output = 8k workload, FP8 yields a 20–22 % speed‑up (≈1.2×) over BF16, reaching 2682 tokens/s/GPU at a batch size of 32.
- For the Input = 8k / Output = 8k workload, the improvement grows to 24–31 % (≈1.3×), achieving 2220 tokens/s/GPU at a batch size of 32.
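As a rough illustration of this setup, the sketch below measures offline generation throughput with vLLM's Python API; the checkpoint path, prompt, batch size, and sampling settings are placeholders, not the exact benchmark harness used for the numbers above.
import time
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint (quantization is typically picked up from the checkpoint
# config); the path is a placeholder.
llm = LLM(model="/path/to/fp8_quantized_model/", kv_cache_dtype="fp8", trust_remote_code=True)

prompts = ["Summarize the benefits of FP8 quantization."] * 32   # batch size 32
params = SamplingParams(max_tokens=8192, temperature=0.6)        # ~8k output tokens

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/s on this GPU")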
Online inference benchmarking
To conduct the online serving performance analysis, we utilized NVIDIA AIPerf. AIPerf is a client-side generative AI benchmarking tool that supports any inference service conforming to the OpenAI API specification. It is designed to capture critical performance metrics including Time to First Token (TTFT), Inter-Token Latency (ITL), and overall throughput.
Models were served using online vLLM with 1k input tokens and 1k output tokens across various concurrency levels. All performance numbers are measured on a single NVIDIA H200 GPU.
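For reference, TTFT and ITL can be approximated against any OpenAI‑compatible endpoint (such as a vLLM server) with a simple streaming client. The sketch below only illustrates the metrics AIPerf reports, it is not AIPerf itself, and the endpoint URL and served‑model name are placeholders.
import time
from openai import OpenAI

# Placeholder endpoint and model name for a locally served vLLM instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
chunk_times = []
stream = client.chat.completions.create(
    model="falcon-h1r-7b-fp8",    # placeholder served-model name
    messages=[{"role": "user", "content": "Explain FP8 quantization in one paragraph."}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start                                 # Time to First Token
itl = [b - a for a, b in zip(chunk_times, chunk_times[1:])]   # Inter-Token Latency samples
print(f"TTFT: {ttft * 1000:.1f} ms, mean ITL: {sum(itl) / len(itl) * 1000:.2f} ms")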
GPU Efficiency vs User Experience
This chart illustrates the trade-off between total output throughput (tokens/sec) and per-user throughput (tokens/sec/user) across different concurrency levels. As concurrency increases, total throughput grows but per-user experience degrades. FP8 maintains higher throughput on both axes, demonstrating superior efficiency.
Key Performance Improvements:
- Output Throughput: FP8 achieves up to 12.7% higher output token throughput at concurrency 64 (4543.5 vs 4030.9 tokens/sec)
- Per-User Throughput: FP8 delivers 20% better tokens/sec per user at low concurrency (188.7 vs 156.7 at concurrency 1)
P50 Inter-Token Latency
This chart shows the median (P50) time between consecutive tokens at different concurrency levels. Lower latency means faster token generation. FP8 maintains consistently lower P50 latency across all concurrency levels, with an ~11% improvement at concurrency 64 (13ms vs 14.6ms).
Time to First Token (TTFT)
This chart measures the average time until the first token is generated after receiving a request. Lower TTFT is critical for perceived responsiveness. FP8 reduces TTFT by up to 28% at high concurrency (686.2ms vs 954.3ms at concurrency 64), significantly improving user experience.
FP8 Post-Training Quantization Process with NVIDIA Model Optimizer
This section details how we optimized Falcon-H1R-7B’s inference performance through FP8 Post-Training Quantization (PTQ). To achieve this, we used NVIDIA Model Optimizer (ModelOpt), an open-source library designed to accelerate AI inference by compressing models with state-of-the-art optimization techniques. ModelOpt produces checkpoints ready for efficient deployment on NVIDIA hardware and is compatible with frameworks such as vLLM, TensorRT-LLM, SGLang, and Dynamo.
While ModelOpt supports various optimization strategies such as structured pruning and knowledge distillation, we focused specifically on PTQ, which offers the fastest path to model optimization by compressing weights from higher precisions (such as FP16 or BF16) down to FP8 using a small calibration dataset. This conversion significantly reduces the model’s size and computational requirements, enabling higher inference throughput and lower latency.
Per‑Tensor FP8 Post‑Training Quantization
ModelOpt supports FP8 recipes at different granularities, balancing inference performance against accuracy preservation. For this implementation we employed Per-Tensor FP8 quantization, a technique where a single scaling factor is computed for an entire tensor to map high-precision values into the 8-bit format. To generate these scales, ModelOpt runs a calibration step over either a public or a custom dataset; to quantize Falcon-H1R-7B, we used the default calibration dataset, comprising cnn_dailymail and nemotron-post-training-dataset-v2. During calibration, ModelOpt analyzes the dynamic range of activations and weights to determine static scaling factors that minimize accuracy degradation. The final output is a quantized checkpoint containing the FP8 weights and the scaling factors for weights and activations, ready to be built into an inference engine for efficient deployment.
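Conceptually, the calibration and quantization flow looks like the sketch below, written against ModelOpt’s Python API (modelopt.torch.quantization). It is a simplified outline rather than the exact pipeline we ran: the calib_texts iterable and the paths are placeholders, and the export call may differ slightly across ModelOpt versions.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths matching the hf_ptq.py command shown later in this post.
model_path = "/path/to/your/hf/checkpoint/"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="bfloat16", trust_remote_code=True
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibration forward loop: run a few hundred samples through the model so ModelOpt
# can observe activation ranges and compute the static per-tensor scales.
def forward_loop(m):
    for text in calib_texts:   # calib_texts: an assumed iterable of ~512 calibration samples
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(m.device)
        m(**inputs)

# Per-tensor FP8 quantization of weights and activations.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style checkpoint containing FP8 weights and scaling factors.
export_hf_checkpoint(model, export_dir="/path/to/save/fp8_quantized_model/")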
Quantization Steps
We started from our pre-trained checkpoint and applied per-tensor FP8 quantization using ModelOpt’s LLM quantization pipeline. We also quantized the KV cache to FP8 to further reduce the memory footprint.
Environment Setup
# Start an interactive shell in a vLLM-enabled Docker image
docker run --gpus all -it --entrypoint /bin/bash vllm/vllm-openai:latest
# Clone NVIDIA Model Optimizer
git clone https://github.com/NVIDIA/Model-Optimizer.git
cd Model-Optimizer
# Check out the most recent stable release
git checkout 0.39.0
# Install Model Optimizer with development dependencies
pip install -e ".[dev]"
# Navigate to the LLM PTQ example directory
cd examples/llm_ptq
Applying FP8 Post-Training Quantization
We can directly apply FP8 quantization to the Falcon-H1R-7B model using:
python3 hf_ptq.py \
--pyt_ckpt_path=/path/to/your/hf/checkpoint/ \
--export_path=/path/to/save/fp8_quantized_model/ \
--qformat=fp8 \
--kv_cache_qformat=fp8 \
--calib_size=512 \
--batch_size=0 \
--inference_tensor_parallel=1 \
--inference_pipeline_parallel=1 \
--export_fmt=hf \
--trust_remote_code
Key flags of the FP8 quantization process:
- --qformat=fp8: applies per-tensor FP8 quantization to all model weights
- --kv_cache_qformat=fp8: quantizes the KV cache to FP8
- --calib_size=512: number of samples used to calibrate the scales; 512 is usually enough
- --batch_size=0: automatically finds the maximum batch size
- --inference_tensor_parallel and --inference_pipeline_parallel: can be increased if the model does not fit on a single GPU during calibration
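Once exported, the FP8 checkpoint can be loaded directly for inference. Below is a minimal sketch using vLLM’s offline API (serving it with vllm serve works analogously); the export path matches the placeholder used above, and the prompt and sampling settings are purely illustrative.
from vllm import LLM, SamplingParams

# Load the exported FP8 checkpoint; quantization settings are typically read from
# the checkpoint's quantization config. The path is the placeholder used above.
llm = LLM(
    model="/path/to/save/fp8_quantized_model/",
    kv_cache_dtype="fp8",        # match the FP8-quantized KV cache
    trust_remote_code=True,
)
outputs = llm.generate(["Solve: what is 17 * 23?"], SamplingParams(max_tokens=256, temperature=0.6))
print(outputs[0].outputs[0].text)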
Conclusion
Falcon H1R 7B FP8 preserves BF16 accuracy while cutting weight memory from 14.2 GB to 7.9 GB (a ≈44 % reduction) and boosting throughput by ~20–30 % on H100/H200 GPUs. It therefore enables high‑scale, low‑VRAM inference without sacrificing performance.
Citation
@article{falcon-h1r-fp8,
title={Falcon H1R 7B FP8},
author={TII and NVIDIA},
url={https://falcon-lm.github.io/blog/falcon-h1r-7b-fp8},
year={2026}
}
Contributors - TII
Puneesh Khanna, Slim Frikha, Iheb Chaabane, Mohamed El Amine Seddik, Saarah Abdulla, Hakim Hacid
Contributors - NVIDIA
Sergio Perez, Mireille Fares, Liana Mikaelyan, Amit Kushwaha