Check out the Arabic version translated by Falcon-Arabic
We are excited to introduce Falcon-Arabic, a 7B parameter Language Model that sets a new benchmark for Arabic NLP. Built on the Falcon 3 architecture, Falcon-Arabic is a multilingual model that supports Arabic, English, and several other languages. It excels in general knowledge, Arabic grammar, mathematical reasoning, complex problem solving, and understanding the rich diversity of Arabic dialects. Falcon-Arabic supports a context length of 32,000 tokens, allowing it to handle long documents and enabling advanced applications like retrieval-augmented generation (RAG), in-depth content creation, and knowledge-intensive tasks.
Falcon-Arabic redefines the boundaries of what is possible for Arabic Language Models. It significantly outperforms other Arabic LLMs in its size category and even models up to four times larger across both Arabic-native models and those adapted from other languages. This makes Falcon-Arabic not only a state-of-the-art model in terms of performance, but also a uniquely efficient and accessible solution for developers and researchers working with the Arabic language.
🚀 Introducing Falcon-Arabic: Advancing LLMs for the Arabic-Speaking World
In recent years, Large Language Models (LLMs) have transformed Artificial Intelligence, powering tools for translation, content creation, virtual assistance, and more. Yet much of this progress has focused on highly represented languages like English, leaving languages such as Arabic underrepresented. Arabic presents unique challenges: it is morphologically rich, diglossic (spanning both Modern Standard Arabic (MSA) and diverse regional dialects), and used across a vast and culturally varied population. Developing robust Arabic LLMs is essential to ensure Arabic-speaking communities are fully included in the AI revolution.
With this goal in mind, we’re introducing Falcon-Arabic, a specialized adaptation of the Falcon 3 model family, developed by the Technology Innovation Institute (TII) in the UAE. The Falcon models have earned global recognition for their multilingual strength and open-source approach. Falcon-Arabic builds on this legacy, bringing advanced language understanding and generation to Arabic. By training the model to handle both Modern Standard Arabic and key dialects, Falcon-Arabic fills a critical gap in language technology, enabling more natural, intelligent, and inclusive Arabic AI across the Gulf, Middle East, and North Africa.

🦅 Falcon-Arabic Has Landed - Here’s the Training Recipe 🧪
Building Falcon-Arabic started with a strategic decision: rather than training a model from scratch, we chose to adapt a strong multilingual foundation. In the Arabic LLM landscape, three main approaches exist: training from scratch (e.g., Jais-native), adapting multilingual models (like Allam or Fanar), or using models that natively support Arabic alongside other languages (such as Qwen or LLaMA). Results on the Open Arabic LLM Leaderboard made it clear that adapted and multilingual models consistently outperform the rest in both efficiency and capability. To build on that momentum, we selected Falcon 3-7B, a model that strikes a practical balance between performance and resource efficiency within the Falcon 3 family developed by the Technology Innovation Institute (TII).
The core challenge was adapting Falcon 3-7B, which originally lacked Arabic support at the tokenizer and embedding level. We addressed this by extending the tokenizer’s vocabulary with 32,000 Arabic-specific tokens and applying a novel embedding initialization strategy based on textual similarity. This technique mapped new Arabic tokens to semantically related embeddings from the existing vocabulary, allowing the model to inherit prior knowledge and accelerate learning, particularly around sentiment, abstract concepts, and reasoning patterns. This gave Falcon-Arabic a head start in understanding and generating high-quality Arabic text.
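For readers who want a concrete picture, here is a minimal sketch of this kind of vocabulary extension with transformers. The checkpoint id and the tiny token list are placeholders, and the subword-averaging initialization shown below is a common approximation of similarity-based seeding, not necessarily our exact strategy.

```python
# Minimal sketch: extend a tokenizer with new Arabic tokens and seed their
# embeddings from related existing tokens. Checkpoint id and token list are
# placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "tiiuae/Falcon3-7B-Base"  # assumed base checkpoint

# Keep a frozen copy of the original tokenizer so new tokens can still be
# decomposed into old-vocabulary pieces after the vocabulary is extended.
old_tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

new_arabic_tokens = ["مرحبا", "اللغة", "العربية"]  # stand-in for the ~32k added tokens
tokenizer.add_tokens(new_arabic_tokens)
model.resize_token_embeddings(len(tokenizer))

# Seed each new token's embedding with the mean of the embeddings of the
# old-vocabulary pieces its surface form breaks into (a simple proxy for
# "semantically related" tokens).
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for token in new_arabic_tokens:
        new_id = tokenizer.convert_tokens_to_ids(token)
        piece_ids = old_tokenizer.encode(token, add_special_tokens=False)
        if piece_ids:
            embeddings[new_id] = embeddings[piece_ids].mean(dim=0)
```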
With the tokenizer and embeddings in place, we began continuous pretraining on high-quality, 100% native Arabic datasets, avoiding machine-translated content to minimize cultural bias and preserve linguistic authenticity. Training followed a multi-stage curriculum: early stages focused on general knowledge and dialect-rich Arabic content to stabilize the model and reinforce logical capabilities, while later phases emphasized math, code, and reasoning. The result is a model that not only speaks Arabic fluently across dialects, but also retains Falcon’s multilingual and reasoning strengths, pushing the boundaries of Arabic-first AI.
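To illustrate what a staged curriculum can look like in practice, here is a purely illustrative configuration sketch; the stage names and mixture weights are assumptions, not the actual Falcon-Arabic data recipe.

```python
# Purely illustrative staged-curriculum sketch; names and weights are
# assumptions, not the actual Falcon-Arabic recipe.
curriculum = [
    {"name": "stage_1_general_and_dialects",
     "mixture": {"arabic_web": 0.60, "arabic_dialects": 0.25, "english_multilingual": 0.15}},
    {"name": "stage_2_math_code_reasoning",
     "mixture": {"arabic_web": 0.40, "math": 0.25, "code": 0.20, "reasoning": 0.15}},
]

for stage in curriculum:
    assert abs(sum(stage["mixture"].values()) - 1.0) < 1e-6  # mixtures must sum to 1
    # pseudo-step: continue pretraining on data sampled according to this mixture
    # train(model, sample(stage["mixture"]), steps=...)
```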
Average Performance of Pretrained Models
📊 Falcon-Arabic: Raising the Bar in Arabic LLMs
We evaluated Falcon-Arabic on OALL v2, the leading benchmark suite for Arabic Language Models. It comprises six multiple-choice tasks, Arabic MMLU (native and translated), Arabic Exams, ALGhafa, MadinahQA, and AraTrust, plus one generative benchmark, ALRAGE. Falcon-Arabic outperforms all existing Arabic LLMs in its size range and even surpasses models up to 4× larger. It leads on key benchmarks such as Arabic MMLU, Exams, MadinahQA, and AraTrust, setting a new standard for Arabic-first Language Models.
Comparison Table of Pretrained Models
| Model | Average | ALGhafa | ArabicMMLU | Exams | MadinahQA | AraTrust | ALRAGE | ArbMMLU-HT |
|---|---|---|---|---|---|---|---|---|
| AceGPT-v2-32B | 61.74 | 54.93 | 63.15 | 48.60 | 59.71 | 83.96 | 68.96 | 52.87 |
| Qwen2.5-14B | 54.26 | 69.32 | 46.37 | 37.43 | 30.38 | 70.46 | 74.03 | 51.84 |
| AceGPT-13B | 47.21 | 48.23 | 41.38 | 36.87 | 35.37 | 56.51 | 79.96 | 32.12 |
| gemma-2-9b-it | 46.64 | 47.92 | 36.28 | 27.93 | 31.70 | 77.15 | 80.47 | 25.01 |
| Llama-3.1-8B | 51.64 | 64.34 | 52.28 | 40.04 | 43.08 | 71.98 | 47.08 | 42.67 |
| Qwen2.5-7B | 41.97 | 31.72 | 37.36 | 37.99 | 27.11 | 53.66 | 62.68 | 43.30 |
| Falcon-Arabic-7B-Base | 62.57 | 67.17 | 64.85 | 52.89 | 48.79 | 85.36 | 63.71 | 55.25 |
The evaluation details (log probabilities, predictions, and LLM-as-judge metrics) for Falcon-Arabic-7B-Base are available at https://huggingface.co/datasets/tiiuae/Falcon-Arabic-7B-Base-details
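The published details can be browsed directly with the datasets library; the config layout below is an assumption, so check the dataset card for the exact structure.

```python
# Quick way to browse the published evaluation details; the config layout is
# an assumption — see the dataset card for the exact structure.
from datasets import get_dataset_config_names, load_dataset

repo = "tiiuae/Falcon-Arabic-7B-Base-details"
configs = get_dataset_config_names(repo)   # typically one config per task/run
print(configs)

details = load_dataset(repo, configs[0])   # load the first one and inspect it
print(details)
```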
🗣️ From Pretraining to Instruct: Aligning Falcon-Arabic for Conversations
After finalizing the base model training, we performed a post-training alignment phase to fine-tune Falcon-Arabic’s responses according to human preferences. This phase began with supervised fine-tuning (SFT) using a combination of high-quality public datasets and internally collected native Arabic instruction data, covering a range of tasks and conversational scenarios.
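A minimal SFT sketch with TRL is shown below; the dataset id, checkpoint id, and hyperparameters are placeholders rather than our exact setup, and TRL's API details vary across versions.

```python
# Minimal SFT sketch with TRL; dataset id and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical instruction dataset with a "messages" (chat) or "text" column.
dataset = load_dataset("your-org/arabic-instructions", split="train")

trainer = SFTTrainer(
    model="tiiuae/Falcon-Arabic-7B-Base",        # assumed base checkpoint id
    train_dataset=dataset,
    args=SFTConfig(output_dir="falcon-arabic-sft"),
)
trainer.train()
```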
To further enhance alignment, we applied Direct Preference Optimization (DPO), which tunes the model directly on preference pairs so that it favors outputs humans rate as more helpful, safe, and relevant, without requiring an explicit reward model. This two-step process ensures that Falcon-Arabic Instruct not only understands Arabic well but also responds in a way that aligns with real user expectations.
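And a corresponding DPO sketch, again with a placeholder preference dataset (prompt/chosen/rejected columns) and hyperparameters:

```python
# Minimal DPO sketch with TRL on top of the SFT checkpoint from the previous
# step; the preference dataset and hyperparameters are hypothetical.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "falcon-arabic-sft"             # output of the SFT stage above
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

prefs = load_dataset("your-org/arabic-preferences", split="train")  # hypothetical

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="falcon-arabic-dpo", beta=0.1),
    train_dataset=prefs,
    processing_class=tokenizer,                  # older TRL versions use `tokenizer=`
)
trainer.train()
```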
Average Performance of Instruct Models
As the results plots show, Falcon-Arabic Instruct leads the pack, outperforming all other instruction-tuned Arabic LLMs in its size class, and even significantly larger models, across multiple benchmarks. The model demonstrates strong performance in both instruction following and open-ended dialogue, setting a new standard for Arabic conversational AI.
Performance of Instruct Models by Benchmark
Comparison Table of Instruct Models
| Model | Average | ALGhafa | ALRAGE | AraTrust | ArabicMMLU | ArbMMLU-HT | Exams | MadinahQA |
|---|---|---|---|---|---|---|---|---|
| aya-expanse-32b | 67.17 | 77.61 | 79.64 | 89.00 | 60.63 | 58.86 | 51.02 | 53.45 |
| c4ai-command-r7b-arabic-02-2025 | 67.07 | 74.84 | 75.90 | 80.47 | 59.34 | 50.14 | 64.99 | 63.84 |
| ALLaM-7B-Instruct-preview | 65.25 | 69.49 | 76.81 | 86.93 | 64.90 | 52.81 | 51.58 | 54.24 |
| Yehia-7B-preview | 65.68 | 70.81 | 76.64 | 87.49 | 64.90 | 53.40 | 52.14 | 54.37 |
| Qwen2-7B-Instruct | 63.61 | 73.24 | 71.13 | 82.77 | 60.01 | 51.30 | 47.30 | 59.50 |
| Falcon-Arabic-7B-Instruct | 68.03 | 72.40 | 71.77 | 82.54 | 68.23 | 55.37 | 53.25 | 72.95 |
The evaluation details (log probabilities, predictions, and LLM-as-judge metrics) for Falcon-Arabic-7B-Instruct are available at https://huggingface.co/datasets/tiiuae/Falcon-Arabic-7B-Instruct-details
🔓 Unlocking the Potential of Arabic AI
Falcon-Arabic sets a new benchmark for Arabic Language Models. With only 7B parameters, it delivers state-of-the-art performance, outperforming models of similar size and even those several times larger on key benchmarks like Arabic MMLU, MadinahQA, and AraTrust. It combines fluency in Modern Standard Arabic, strong understanding of regional dialects, and robust reasoning and multilingual capabilities, making it ideal for a wide range of applications: from Arabic-first chatbots and educational tools to content generation, code assistance, and document understanding.
To give you a hands-on feel for what Falcon-Arabic can do, we built a simple demo that showcases its capabilities in machine translation, even though the model hasn’t been fine-tuned specifically for that task. The tool runs purely on Falcon-Arabic-7B-Instruct, and the results are surprisingly strong across various translation directions. You can try it yourself through the demo linked just below. In fact, we used the same setup to translate this blog post into Arabic for our Arabic-speaking audience. Check it out here 🚀. And if you’re curious to explore more, we also provide access to a live playground where you can interact with Falcon-Arabic Instruct and experience its performance across different tasks ✨.
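If you prefer to run the model locally, the snippet below shows a minimal way to prompt it for translation with transformers; the Hub model id is assumed from the evaluation-details links above.

```python
# Minimal sketch of prompting Falcon-Arabic for translation; the Hub model id
# is assumed from the evaluation-details links above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-Arabic-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": "Translate to Arabic: Falcon-Arabic sets a new benchmark for Arabic language models.",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```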
⚠️ Limitations
Like all Large Language Models, Falcon-Arabic inherits some common limitations. These include occasional hallucinations (producing plausible but incorrect outputs), sensitivity to how prompts are phrased, and varying performance across very long contexts. While Falcon-Arabic is designed to reduce these issues, especially for Arabic tasks, users should still apply critical thinking when interpreting results, particularly in high-stakes or fact-sensitive use cases.
Citation
If you find this work helpful for your research or projects, please consider citing it.
```bibtex
@misc{falcon-arabic,
  title  = {Falcon-Arabic: A Breakthrough in Arabic Language Models},
  author = {Falcon-LLM Team},
  month  = {May},
  year   = {2025},
  url    = {https://falcon-lm.github.io/blog/falcon-arabic}
}
```
Core Contributors
Basma El Amel Boussaha
Mohammed Alyafeai
Ahmed Alzubaidi
Leen AlQadi
Younes Belkada
Mikhail Lubinets
Hakim Hacid