Meta has introduced Llama 4, a new generation of open-weight multimodal AI models that understand images, video, and text within a single architecture. These are Meta's first natively multimodal models with open weights, and developers and enterprises can use them for a wide range of tasks.
Key Features of Llama 4
- Llama 4 Scout
  - A model with 17 billion active parameters and 16 experts (109 billion parameters in total).
  - According to Meta, the best multimodal model in its class, surpassing Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1.
  - Its headline feature is a record context window of 10 million tokens; with Int4 quantization it fits on a single H100 GPU.
- Llama 4 Maverick
  - A more powerful model with 17 billion active parameters and 128 experts (400 billion parameters in total).
  - According to Meta, it outperforms GPT-4o and Gemini 2.0 Flash on numerous benchmarks and achieves results comparable to DeepSeek v3 on reasoning and coding tasks with less than half the active parameters.
  - The experimental chat version reached an Elo score of 1417 on LMArena.
- Llama 4 Behemoth
  - A teacher model with 288 billion active parameters and 16 experts (almost 2 trillion parameters in total).
  - According to Meta, it beats GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks.
  - The model is still in training and has not yet been released publicly.
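The parameter counts above allow two quick back-of-the-envelope checks: what share of the weights is active per token, and how much memory the weights need at a given precision. The sketch below uses only the figures quoted in the list. It counts weights alone (no KV cache or activations), rounds Behemoth's total up to 2 trillion, and assumes an 80 GB H100, a capacity the text does not state.

```python
# Back-of-the-envelope arithmetic from the parameter counts quoted above.
# Assumptions: weights only (no KV cache, activations, or runtime overhead),
# and an 80 GB H100, which the article does not specify.

MODELS = {
    # name: (active parameters, total parameters)
    "Llama 4 Scout": (17e9, 109e9),
    "Llama 4 Maverick": (17e9, 400e9),
    "Llama 4 Behemoth": (288e9, 2e12),  # "almost 2 trillion", rounded up
}

H100_MEMORY_GB = 80  # assumed capacity


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for name, (active, total) in MODELS.items():
        share = active_fraction(active, total)
        int4 = weight_memory_gb(total, 4)
        print(f"{name}: {share:.1%} active, {int4:.1f} GB of weights at Int4")
```

Under these assumptions, Scout's Int4 weights come to about 54.5 GB, which is below the assumed 80 GB, while the same weights at 16 bits (about 218 GB) would not fit on one such GPU.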
Technical innovations
Mixture of Experts (MoE) architecture
One of the new features of Llama 4 is the Mixture of Experts (MoE) architecture, in which only a subset of the model's parameters is activated for each token. Compared with a dense model of the same total size, this reduces computational cost and latency while maintaining high quality. In Llama 4 Maverick, each token is processed by a shared expert and by one of 128 routed experts.
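The routing described above can be illustrated with a toy layer: every token passes through one shared expert, and a router picks a single routed expert for it. This is a minimal sketch, not Meta's implementation. The class name `ToyMoELayer`, the single-matrix "experts", and the softmax gate are all assumptions made for illustration.

```python
import math
import random


def make_expert(dim, rng):
    """A toy 'expert': one random dim x dim linear map (real experts are MLPs)."""
    weights = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return expert


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMoELayer:
    """Hypothetical layer: one shared expert plus top-1 routing over many experts."""

    def __init__(self, dim, num_routed_experts, seed=0):
        rng = random.Random(seed)
        self.shared = make_expert(dim, rng)
        self.routed = [make_expert(dim, rng) for _ in range(num_routed_experts)]
        # One router row per routed expert; the dot product with the token is its score.
        self.router = [[rng.gauss(0, 1.0) for _ in range(dim)]
                       for _ in range(num_routed_experts)]

    def route(self, x):
        """Return (index, gate weight) of the single best-scoring routed expert."""
        scores = [sum(r * v for r, v in zip(row, x)) for row in self.router]
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs[best]

    def forward(self, x):
        """Run the shared expert and exactly one routed expert for this token."""
        index, gate = self.route(x)
        shared_out = self.shared(x)
        routed_out = self.routed[index](x)
        return [s + gate * r for s, r in zip(shared_out, routed_out)], index


if __name__ == "__main__":
    layer = ToyMoELayer(dim=4, num_routed_experts=128)
    output, chosen = layer.forward([1.0, -0.5, 0.25, 2.0])
    print(f"token handled by the shared expert and routed expert {chosen}")
```

Only two of the 129 expert functions run for a given token, which is why the active parameter count stays far below the total.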
Native multimodality
Llama 4 models use early fusion to integrate text and visual tokens into a single model backbone, which allows joint pre-training on large amounts of text, image, and video data. An improved vision encoder based on MetaCLIP is better adapted to the language model.
Extremely long context
The Llama 4 Scout model supports a context of 10 million tokens thanks to its iRoPE architecture, which interleaves attention layers that use no positional embeddings with layers that use rotary position embeddings (RoPE). This allows the model to work efficiently with very large bodies of text.
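The idea of interleaving position-aware and position-free attention layers can be sketched as a simple layer schedule plus a standard RoPE rotation. This is an illustration only: the function names are hypothetical, and the interval of one position-free layer in every four is an assumed value, not one given in the text.

```python
import math


def layer_schedule(num_layers, nope_every):
    """Label each layer 'rope' or 'nope'; every `nope_every`-th layer skips positions."""
    return ["nope" if (i + 1) % nope_every == 0 else "rope"
            for i in range(num_layers)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE)."""
    if len(vec) % 2 != 0:
        raise ValueError("RoPE needs an even-length vector")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def position_encode(vec, position, layer_kind):
    """Apply RoPE in 'rope' layers; leave the vector untouched in 'nope' layers."""
    return apply_rope(vec, position) if layer_kind == "rope" else list(vec)


if __name__ == "__main__":
    # Assumed interval: one position-free layer in every four.
    print(layer_schedule(num_layers=8, nope_every=4))
```

In the position-free layers the query and key vectors carry no positional signal at all, so nothing in those layers is tied to the sequence lengths seen during training.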
New training methods
- MetaP: a technique for reliably setting critical model hyperparameters, such as per-layer learning rates.
- FP8 precision: training in 8-bit floating point, which keeps throughput high without sacrificing quality.
- Co-distillation: using Llama 4 Behemoth as a teacher to train the smaller models.
- Fully asynchronous online reinforcement learning: a new infrastructure for large-scale training that, according to Meta, improves training efficiency roughly tenfold.
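To make the teacher-student idea behind distillation concrete, here is a minimal sketch of a generic distillation loss that mixes a hard-label term with a soft term matching the teacher's output distribution. It is the textbook formulation, not Meta's loss function, which the text above does not describe. The parameter names `soft_weight` and `temperature` and their defaults are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Hard-target loss: negative log-probability of the correct class."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      soft_weight=0.5, temperature=2.0):
    """Weighted sum of a hard-label loss and a teacher-matching (soft) loss."""
    hard = cross_entropy(student_logits, label)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaling by T^2 keeps the soft term's gradients comparable across temperatures.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    return (1 - soft_weight) * hard + soft_weight * soft


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [0.1, 0.2, 0.3]
    print(distillation_loss(student, teacher, label=0))
```

The soft term is zero when the student reproduces the teacher's distribution exactly, so the student is pulled toward the teacher's behavior in addition to the ground-truth labels.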
Benchmark results
The figures below are those Meta reported for Llama 4 Maverick.
- Cost: $0.19-$0.49 per 1 million tokens (depending on settings), compared to $4.38 per 1 million tokens for GPT-4o.
- Image processing:
  - MMMU: 73.4 (vs. 71.7 for Gemini 2.0 Flash and 69.1 for GPT-4o).
  - MathVista: 73.7 (vs. 73.1 for Gemini and 63.8 for GPT-4o).
  - ChartQA: 90.0 (vs. 88.3 for Gemini and 85.7 for GPT-4o).
  - DocVQA: 94.4 (vs. 92.8 for GPT-4o).
- Coding:
  - LiveCodeBench: 43.4 (DeepSeek v3.1 leads with 45.8/49.2).
- Understanding and knowledge:
  - MMLU Pro: 80.5 (vs. 77.6 for Gemini; DeepSeek leads with 81.2).
  - GPQA Diamond: 69.8 (vs. 60.1 for Gemini, 68.4 for DeepSeek, and 53.6 for GPT-4o).
- Multilingualism:
  - Multilingual MMLU: 84.6 (vs. 81.5 for GPT-4o).
- Long context:
  - MTOB (full book): 50.8/46.7 (vs. 45.5/39.6 for Gemini).
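The per-token prices quoted above translate directly into workload costs. The sketch below does that arithmetic with the prices from the list; the 50-million-token workload is a hypothetical example, and the calculation ignores any difference between input and output token pricing.

```python
# Prices in USD per 1 million tokens, as quoted in the list above.
LLAMA4_PRICE_RANGE = (0.19, 0.49)
GPT4O_PRICE = 4.38


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of processing `tokens` tokens at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def savings_factor(baseline_price: float, price: float) -> float:
    """How many times cheaper `price` is than `baseline_price`."""
    return baseline_price / price


if __name__ == "__main__":
    workload = 50_000_000  # hypothetical monthly token volume
    low, high = LLAMA4_PRICE_RANGE
    print(f"Llama 4: ${cost_usd(workload, low):.2f} to ${cost_usd(workload, high):.2f}")
    print(f"GPT-4o:  ${cost_usd(workload, GPT4O_PRICE):.2f}")
    print(f"Factor:  {savings_factor(GPT4O_PRICE, high):.1f}x "
          f"to {savings_factor(GPT4O_PRICE, low):.1f}x")
```

At the quoted prices the gap works out to roughly 9 to 23 times, depending on which end of the Llama 4 range applies.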
Behemoth vs. competitors
According to Meta, the Llama 4 Behemoth model outperforms other companies' flagship models on several benchmarks:
- LiveCodeBench: 49.4 (vs. 36.0 for Gemini 2.0 Pro).
- MATH-500: 95.0 (vs. 82.2 for Claude Sonnet 3.7 and 91.8 for Gemini 2.0 Pro).
- MMLU Pro: 82.2 (vs. 79.1 for Gemini 2.0 Pro).
- GPQA Diamond: 73.7 (vs. 71.4 for GPT-4.5, 68.0 for Claude, and 64.7 for Gemini).
Availability and application
The Llama 4 Scout and Llama 4 Maverick models are available for download now at llama.com and Hugging Face. They are also integrated into Meta AI for use in WhatsApp, Messenger, Instagram Direct, and the Meta.AI website. For developers, enterprises, and researchers, these models offer a great balance between high performance and affordability.
Safety and ethics
Meta has placed significant emphasis on safety and on reducing bias in the new models. It provides a number of safety tools, such as Llama Guard, Prompt Guard, and CyberSecEval. In addition, the refusal rate on questions about controversial political and social topics has dropped significantly (from 7% in Llama 3.3 to less than 2%).