Стас Юрасов AI Eng 6 April 2025, 10:48

Meta introduces a new generation of open AI models: Llama 4. Here's everything we know about them

Key Features of Llama 4

Llama 4 Scout
- A model with 17 billion active parameters and 16 experts (total 109 billion parameters).
- According to Meta, the best multimodal model in its class, surpassing models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1.
- Its main features are a record context window of 10 million tokens and the ability to run on a single H100 GPU with Int4 quantization.
Llama 4 Maverick
- A powerful model with 17 billion active parameters and 128 experts (400 billion parameters in total).
- According to Meta, this model outperforms GPT-4o and Gemini 2.0 Flash on numerous benchmarks. It also achieves results comparable to DeepSeek v3 on reasoning and coding tasks with less than half the active parameters.
- The experimental chat version reached ELO 1417 on LMArena.
Llama 4 Behemoth
- A teacher model with 288 billion active parameters and 16 experts (almost 2 trillion parameters in total).
- According to Meta, it outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks.
- This model is still in training and has not yet been released publicly.
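
The parameter counts above can be turned into rough arithmetic. The sketch below uses only the figures quoted in the list (17 billion active parameters, 109 and 400 billion in total) and assumes an 80 GB H100 and exactly 4 bits per weight. It counts weight storage only, so the KV cache, activations, and quantization metadata are ignored.

```python
# Figures quoted in the list above, in billions of parameters.
MODELS = {
    "Scout": {"active_b": 17, "total_b": 109},
    "Maverick": {"active_b": 17, "total_b": 400},
}

BYTES_PER_PARAM_INT4 = 0.5   # 4 bits per weight
H100_MEMORY_GB = 80          # assumption: the 80 GB H100 variant


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


def int4_weight_gb(total_b: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) at 4 bits per parameter."""
    return total_b * BYTES_PER_PARAM_INT4


for name, m in MODELS.items():
    frac = active_fraction(m["active_b"], m["total_b"])
    gb = int4_weight_gb(m["total_b"])
    fits = gb < H100_MEMORY_GB
    print(f"{name}: {frac:.1%} active, ~{gb:.1f} GB of Int4 weights, "
          f"fits on one H100: {fits}")
```

Under these assumptions Scout's weights take about 54.5 GB, which is consistent with the single-GPU claim, while Maverick's roughly 200 GB of weights would need several GPUs.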

Technical innovations

Mixture of Experts (MoE) architecture

One of the new features of Llama 4 is the Mixture of Experts (MoE) architecture, in which only a subset of the model's parameters is activated for each token. This reduces computational cost and latency while maintaining high performance. In Llama 4 Maverick, each token is processed by a shared expert and by one of 128 routed experts.
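
The routing idea can be illustrated with a toy example. This is a minimal sketch, not Meta's implementation: the experts are simple scaling functions, the router weights are random, and the sigmoid gate, the hidden size, and the count of 8 routed experts (instead of 128) are all assumptions chosen to keep the example small.

```python
import math
import random

random.seed(0)

D_MODEL = 4     # toy hidden size (assumption)
N_ROUTED = 8    # toy count; the article cites 128 routed experts for Maverick


def make_expert(scale):
    """A stand-in expert: an element-wise scaling of the input vector."""
    return lambda x: [scale * v for v in x]


shared_expert = make_expert(1.0)
routed_experts = [make_expert(0.1 * (i + 1)) for i in range(N_ROUTED)]

# Router: one weight vector per routed expert (random here, learned in practice).
router_weights = [
    [random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(N_ROUTED)
]


def route(token):
    """Return (index, gate) of the single highest-scoring routed expert."""
    scores = [sum(w * v for w, v in zip(ws, token)) for ws in router_weights]
    best = max(range(N_ROUTED), key=lambda i: scores[i])
    gate = 1.0 / (1.0 + math.exp(-scores[best]))  # sigmoid gate (assumption)
    return best, gate


def moe_layer(token):
    """Shared expert output plus the gated output of one routed expert."""
    idx, gate = route(token)
    shared_out = shared_expert(token)
    routed_out = routed_experts[idx](token)
    return [s + gate * r for s, r in zip(shared_out, routed_out)], idx


token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_layer(token)
print(f"routed to expert {chosen}; "
      f"{N_ROUTED - 1} routed experts were never evaluated for this token")
```

Only two experts run per token regardless of how many routed experts exist, which is why the active parameter count stays far below the total.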

Native multimodality

Llama 4 models integrate text and visual tokens into a single model backbone through early fusion, allowing joint pre-training on large amounts of text, image, and video data. An improved vision encoder based on MetaCLIP is better adapted to the language model.

Extremely long context

The Llama 4 Scout model supports a context of 10 million tokens thanks to its iRoPE architecture, which interleaves attention layers that use no positional embeddings with layers that use rotary position embeddings (RoPE). This allows the model to work efficiently with very large bodies of text.
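
The interleaving can be pictured as a per-layer schedule. In Meta's description, the interleaved layers omit positional embeddings while the remaining layers use rotary position embeddings (RoPE). The sketch below is illustrative only: the layer count, the interval of 4, and the function names are hypothetical, and `rope_rotate` is the standard textbook RoPE rotation rather than anything specific to Llama 4.

```python
import math

N_LAYERS = 8        # toy depth (assumption)
NOPE_INTERVAL = 4   # hypothetical: every 4th layer skips positional embeddings


def layer_uses_rope(layer_idx: int) -> bool:
    """True for layers that apply RoPE, False for the interleaved layers."""
    return (layer_idx + 1) % NOPE_INTERVAL != 0


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


def position_encode(vec, position, layer_idx):
    """Apply RoPE only in layers scheduled to use it; pass through otherwise."""
    if layer_uses_rope(layer_idx):
        return rope_rotate(vec, position)
    return list(vec)


schedule = ["RoPE" if layer_uses_rope(i) else "NoPE" for i in range(N_LAYERS)]
print(schedule)
```

In this toy schedule, layers 3 and 7 receive queries and keys with no positional rotation applied, while all other layers see rotated vectors.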

New training methods

MetaP: a technique for reliably setting critical model hyperparameters, such as the per-layer learning rate.
FP8 precision: training in 8-bit floating point, which keeps training throughput high without sacrificing quality.
Co-distillation: using Llama 4 Behemoth as a teacher to train the smaller models.
Fully asynchronous online reinforcement learning: a new infrastructure for large-scale training that, according to Meta, improves training efficiency roughly tenfold.
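
To make the teacher-student idea behind co-distillation concrete, here is a minimal sketch of a classic knowledge-distillation loss that mixes a soft target (the teacher's output distribution) with a hard target (the true label). It is not Meta's loss function: the fixed `alpha` weight, the temperature, and the toy logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      alpha=0.5, temperature=2.0):
    """alpha * soft-target loss + (1 - alpha) * hard-target cross-entropy."""
    # Hard target: cross-entropy against the ground-truth label.
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[true_label])

    # Soft target: KL(teacher || student) on temperature-softened distributions.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    soft = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)

    return alpha * soft + (1 - alpha) * hard


teacher = [0.5, 1.0, 4.0]   # toy teacher logits (assumption)
student = [1.0, 1.5, 2.0]   # toy student logits (assumption)
loss = distillation_loss(student, teacher, true_label=2)
print(f"combined loss: {loss:.4f}")
```

The soft term pulls the student toward the teacher's full distribution, and the hard term keeps it anchored to the correct answer. A schedule that varies `alpha` during training would be a natural extension of this sketch.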

Benchmark results (Llama 4 Maverick)

Cost: $0.19-$0.49 per 1 million tokens (depending on settings), compared to $4.38 per 1 million tokens for GPT-4o.
Image processing:
- MMMU: 73.4 (vs. 71.7 in Gemini 2.0 Flash and 69.1 in GPT-4o).
- MathVista: 73.7 (vs. 73.1 in Gemini and 63.8 in GPT-4o).
- ChartQA: 90.0 (vs. 88.3 in Gemini and 85.7 in GPT-4o).
- DocVQA: 94.4 (vs. 92.8 in GPT-4o).
Coding:
- LiveCodeBench: 43.4 (the leader is DeepSeek v3.1 with 45.8/49.2).
Understanding and knowledge:
- MMLU Pro: 80.5 (vs. 77.6 for Gemini, DeepSeek leads with 81.2).
- GPQA Diamond: 69.8 (vs. 60.1 for Gemini, 68.4 for DeepSeek, and 53.6 for GPT-4o).
Multilingualism:
- Multilingual MMLU: 84.6 (vs. 81.5 in GPT-4o).
Long context:
- MTOB (full book): 50.8/46.7 (vs. 45.5/39.6 in Gemini).
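
The per-token prices quoted at the start of this section translate into a budget as follows. This is a back-of-the-envelope sketch: the prices come from the article, while the monthly workload of 250 million tokens is a hypothetical figure.

```python
LLAMA4_COST_PER_M = (0.19, 0.49)   # USD per 1M tokens, range quoted above
GPT4O_COST_PER_M = 4.38            # USD per 1M tokens, quoted above


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Total cost in USD for a given number of tokens."""
    return tokens / 1_000_000 * price_per_million


workload = 250_000_000  # hypothetical monthly token volume

low, high = (cost_usd(workload, p) for p in LLAMA4_COST_PER_M)
gpt4o = cost_usd(workload, GPT4O_COST_PER_M)

print(f"Llama 4: ${low:,.2f} to ${high:,.2f}")
print(f"GPT-4o:  ${gpt4o:,.2f}")
print(f"GPT-4o costs {GPT4O_COST_PER_M / LLAMA4_COST_PER_M[1]:.1f}x to "
      f"{GPT4O_COST_PER_M / LLAMA4_COST_PER_M[0]:.1f}x as much")
```

The ratio does not depend on the workload size, so the same multiple applies at any volume as long as the quoted prices hold.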

Behemoth vs. competitors

According to Meta, the Llama 4 Behemoth model outperforms other companies' flagship models on several benchmarks:

LiveCodeBench: 49.4 (vs. 36.0 in Gemini 2.0 Pro).
MATH-500: 95.0 (vs. 82.2 in Claude Sonnet 3.7 and 91.8 in Gemini 2.0 Pro).
MMLU Pro: 82.2 (vs. 79.1 in Gemini 2.0 Pro).
GPQA Diamond: 73.7 (vs. 71.4 in GPT-4.5, 68.0 in Claude, and 64.7 in Gemini).

Availability and application

The Llama 4 Scout and Llama 4 Maverick models are available for download now at llama.com and Hugging Face. They are also integrated into Meta AI for use in WhatsApp, Messenger, Instagram Direct, and the Meta.AI website. For developers, enterprises, and researchers, these models offer a great balance between high performance and affordability.

Safety and ethics

Meta has placed significant emphasis on safety and on reducing bias in the new models. It offers a number of safety tools, including Llama Guard, Prompt Guard, and CyberSecEval. In addition, the refusal rate for questions on controversial political and social topics has been significantly reduced (from 7% in Llama 3.3 to less than 2%).