
Gemma 3 Technical Report


Overview

Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by adding native multimodal vision capabilities, extending context windows to 128K tokens, and substantially improving multilingual support -- all while maintaining efficiency for deployment on consumer hardware. The core architectural innovation is an interleaved local/global attention mechanism with a 5:1 ratio that reduces KV-cache memory by approximately 5x at 128K context length compared to full global attention.

The efficiency gains are striking: Gemma 3 4B-IT matches Gemma 2 27B-IT performance across benchmarks, and the 27B variant achieves results comparable to Gemini 1.5 Pro despite being far smaller. This compression is achieved through a combination of knowledge distillation from larger teacher models and novel post-training techniques including reinforcement learning from human feedback. The model family ranked in the top 10 on Chatbot Arena with lower computational requirements than competing models.

Vision capabilities are added through a SigLIP-based vision encoder that converts images into soft token sequences, enabling the model to process interleaved text and image inputs natively. A Pan-and-Scan strategy crops high-resolution images into multiple sub-images for detailed understanding. Gemma 3 also demonstrates dramatically reduced memorization compared to Gemma 2: the 1B model shows approximately 0.0001% exact memorization versus 0.03% for Gemma 2 2B -- orders of magnitude improvement through architectural changes and data filtering.

Key Contributions

  • Interleaved local/global attention: A 5:1 ratio of local sliding-window attention layers to global attention layers reduces KV-cache memory by ~5x at 128K context while maintaining quality, making long-context inference practical on consumer GPUs.
  • Efficient multimodal integration: SigLIP vision encoder with Pan-and-Scan image processing adds vision capabilities without architectural bloat, enabling text-image interleaved understanding.
  • Knowledge distillation at scale: Distillation from larger teacher models enables the 4B model to match the prior-generation 27B model, demonstrating that model compression can yield generation-over-generation efficiency gains.
  • 128K context via RoPE scaling: Increasing the RoPE base frequency from 10K to 1M on global attention layers extends context to 128K tokens while preserving short-context quality.
  • Dramatic memorization reduction: Orders-of-magnitude reduction in exact memorization (~0.0001% vs. 0.03%) through combined architectural and data-filtering techniques.
  • Multilingual expansion: Broad multilingual support across dozens of languages, extending the model's utility beyond English-centric benchmarks.

Architecture / Method

                          Gemma 3 Architecture

   Image Input                        Text Input
        │                                 │
        ▼                                 ▼
  ┌──────────────┐                  ┌───────────┐
  │    SigLIP    │                  │ Tokenizer │
  │    Vision    │                  └─────┬─────┘
  │    Encoder   │                        │
  └──────┬───────┘                        │
         │                                │
         ▼                                │
   Pan-and-Scan                           │
   (full + crops)                         │
         │                                │
         └──► [soft visual tokens] + [text tokens] ◄──┘
                             │
                             ▼
         ┌─────────────────────────────────────────┐
         │        Decoder-Only Transformer         │
         │      (5:1 Local/Global Attention)       │
         │                                         │
         │  Layer 1: Local (sliding window, 1024)  │
         │  Layer 2: Local                         │
         │  Layer 3: Local                         │
         │  Layer 4: Local                         │
         │  Layer 5: Local                         │
         │  Layer 6: Global (full attn, RoPE 1M)   │
         │  ...pattern repeats...                  │
         │                                         │
         │  ──► ~5x KV-cache savings at 128K ctx   │
         └────────────────────┬────────────────────┘
                              │
                              ▼
                        Output Tokens

   Sizes: 1B (text-only) | 4B | 12B | 27B (all multimodal)

Gemma 3 architecture overview

Gemma 3 uses a decoder-only Transformer architecture with several key modifications:

Interleaved Local/Global Attention

The defining architectural choice is the 5:1 interleaving of local sliding-window attention and global full attention layers. In a typical configuration:

  • Local layers (5 out of every 6): Use sliding-window attention over a 1,024-token window, requiring only a fixed-size KV cache regardless of sequence length.
  • Global layers (1 out of every 6): Use full causal attention across the entire sequence, with RoPE base frequency increased from 10K to 1M to handle long contexts.
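
A minimal configuration sketch of this pattern is shown below. The layer count, window size, and RoPE bases follow the description above and should be read as illustrative defaults, not the released checkpoints' exact configuration.

```python
# Illustrative per-layer configuration for the 5:1 local/global pattern.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LayerConfig:
    attention: str          # "local" (sliding window) or "global" (full causal)
    window: Optional[int]   # receptive field for local layers, None for global
    rope_base: float        # RoPE base frequency

def build_layer_pattern(num_layers: int, local_per_global: int = 5,
                        window: int = 1024) -> list:
    """Every (local_per_global + 1)-th layer is global; the rest are local."""
    layers = []
    for i in range(num_layers):
        if (i + 1) % (local_per_global + 1) == 0:
            # Global layers see the full context and use the raised 1M RoPE base.
            layers.append(LayerConfig("global", None, rope_base=1_000_000.0))
        else:
            # Local layers keep the original 10K base and a short window.
            layers.append(LayerConfig("local", window, rope_base=10_000.0))
    return layers
```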

This design reduces KV-cache memory by approximately 5x at 128K context compared to full global attention, making long-context inference feasible on hardware with limited memory.
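To see where the savings come from, the sketch below estimates KV-cache size for a full-global stack versus the 5:1 interleaved stack. The model dimensions are placeholder values for illustration, not Gemma 3's published configuration.

```python
# Rough KV-cache size estimate: full-global vs. 5:1 local/global interleaving.
# All model dimensions below are illustrative placeholders, not Gemma 3's real config.

def kv_cache_bytes(num_layers: int, seq_len: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes for keys + values across all layers for one sequence."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

def interleaved_kv_cache_bytes(num_layers: int, seq_len: int, num_kv_heads: int,
                               head_dim: int, window: int = 1024,
                               local_per_global: int = 5,
                               bytes_per_elem: int = 2) -> int:
    """5:1 pattern: local layers cache at most `window` positions; global layers cache all."""
    period = local_per_global + 1
    num_global = num_layers // period
    num_local = num_layers - num_global
    local = 2 * num_local * min(seq_len, window) * num_kv_heads * head_dim * bytes_per_elem
    global_ = 2 * num_global * seq_len * num_kv_heads * head_dim * bytes_per_elem
    return local + global_

if __name__ == "__main__":
    cfg = dict(num_layers=48, num_kv_heads=16, head_dim=128)  # placeholder sizes
    seq = 128_000
    full = kv_cache_bytes(seq_len=seq, **cfg)
    mixed = interleaved_kv_cache_bytes(seq_len=seq, **cfg)
    print(f"full-global : {full / 2**30:.1f} GiB")
    print(f"5:1 pattern : {mixed / 2**30:.1f} GiB  (~{full / mixed:.1f}x smaller)")
```

Because the local layers' cache stops growing once the window is full, the relative savings increase with context length.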

Model Configurations

Model            1B          4B             12B            27B
Parameters       1B          4B             12B            27B
Context length   32K         128K           128K           128K
Multimodal       Text-only   Text + Vision  Text + Vision  Text + Vision

Vision Encoder

The 4B, 12B, and 27B variants incorporate a SigLIP-based vision encoder that processes images into soft token sequences fed into the language model. The Pan-and-Scan strategy handles high-resolution images by:

  1. Taking the full image as one view for global context
  2. Cropping the image into multiple sub-images to capture fine-grained details
  3. Encoding each crop independently through SigLIP
  4. Concatenating the resulting soft tokens into the text token stream

This approach enables detailed visual understanding without requiring extremely large vision encoders or massive token counts.
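
A simplified sketch of these steps, assuming a fixed encoder resolution and a regular crop grid, is shown below. Here `encode_image` is a runnable stand-in for the SigLIP encoder, and the 896-pixel resolution, 2x2 grid, and 256-token output are assumptions; the actual crop-selection logic in the report is more adaptive.

```python
# Illustrative Pan-and-Scan: encode the full image plus a grid of crops.
from PIL import Image

ENCODER_RES = 896  # assumed fixed input resolution for the vision encoder

def encode_image(img: Image.Image) -> list:
    """Stand-in for a SigLIP-style encoder: returns placeholder soft tokens."""
    return [[0.0] * 16 for _ in range(256)]  # assumed 256 tokens per image view

def pan_and_scan(img: Image.Image, grid: tuple = (2, 2)) -> list:
    """Return soft tokens for the full image followed by tokens for each crop."""
    views = [img.resize((ENCODER_RES, ENCODER_RES))]   # 1. full image for global context
    w, h = img.size
    cols, rows = grid
    for r in range(rows):                               # 2. regular grid of crops for detail
        for c in range(cols):
            box = (c * w // cols, r * h // rows, (c + 1) * w // cols, (r + 1) * h // rows)
            views.append(img.crop(box).resize((ENCODER_RES, ENCODER_RES)))
    tokens = []
    for view in views:                                  # 3.-4. encode each view, concatenate
        tokens.extend(encode_image(view))
    return tokens
```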

Training

Performance across model sizes

  • Pre-training: Trained on trillions of tokens of multilingual text and image-text data.
  • Knowledge distillation: Smaller models are distilled from larger teacher models during pre-training, which is the primary mechanism behind the 4B matching 27B-class performance (see the loss sketch after this list).
  • Post-training: Multi-stage pipeline including supervised fine-tuning (SFT) on high-quality instruction data and reinforcement learning from human feedback (RLHF) for alignment and helpfulness.
  • Quantization-aware training: Models are trained with quantization awareness to enable efficient deployment at reduced precision without significant quality loss.
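
As a rough illustration of the distillation objective mentioned above, the sketch below trains the student to match the teacher's next-token distribution with a KL-divergence loss. The temperature, weighting, and teacher-sampling details are assumptions; the report does not publish the exact recipe.

```python
# Minimal logit-distillation sketch: KL(teacher || student) over the vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Both logit tensors have shape [batch, seq_len, vocab_size]."""
    vocab = student_logits.size(-1)
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    # kl_div with log_target=True computes KL(target || input) per token.
    kl = F.kl_div(s_log_probs, t_log_probs, log_target=True, reduction="batchmean")
    return kl * temperature ** 2
```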

Results

Benchmark results

Key Performance Highlights

Model            MMLU                          GSM8K              MATH      HumanEval  Chatbot Arena
Gemma 3 27B-IT   Comparable to Gemini 1.5 Pro  Strong             Strong    Strong     Top 10
Gemma 3 4B-IT    Matches Gemma 2 27B-IT        Matches 27B-class  Improved  Improved   Competitive
Gemma 2 27B-IT   Baseline                      Baseline           Baseline  Baseline   -

The most notable result is the efficiency gain: Gemma 3 4B achieves parity with the previous generation's 27B model across mathematics, reasoning, and chat benchmarks, representing a roughly 7x parameter reduction for equivalent capability.

Memorization

Memorization analysis

Model        Exact Memorization Rate
Gemma 2 2B   ~0.03%
Gemma 3 1B   ~0.0001%

This ~300x reduction in memorization is achieved through a combination of architectural changes (the local/global attention pattern limits the model's ability to memorize long verbatim sequences) and aggressive data deduplication and filtering.
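
For context, exact memorization is typically measured along the lines sketched below: the model is prompted with a prefix drawn from its training data and its greedy continuation is compared verbatim against the true continuation. The 50-token prefix/continuation split and the `generate` interface here are assumptions, not the report's exact protocol.

```python
# Rough sketch of an exact-memorization check over a sample of training documents.
# The model/tokenizer interface and the 50/50 token split are assumptions for illustration.

def is_exactly_memorized(model, tokenizer, document: str,
                         prefix_len: int = 50, cont_len: int = 50) -> bool:
    ids = tokenizer.encode(document)
    if len(ids) < prefix_len + cont_len:
        return False
    prefix, true_cont = ids[:prefix_len], ids[prefix_len:prefix_len + cont_len]
    # Greedy decoding; the last cont_len tokens are the model's continuation.
    generated = model.generate(prefix, max_new_tokens=cont_len, do_sample=False)
    return list(generated[-cont_len:]) == true_cont

def memorization_rate(model, tokenizer, documents: list) -> float:
    hits = sum(is_exactly_memorized(model, tokenizer, d) for d in documents)
    return hits / max(len(documents), 1)
```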

Limitations & Open Questions

  • Vision capabilities are add-on, not native: The SigLIP encoder is a separate module grafted onto the language model, rather than a truly unified multimodal architecture. Whether this limits deep vision-language reasoning compared to natively multimodal models remains unclear.
  • Distillation ceiling: Knowledge distillation compresses teacher capability but cannot exceed it. The quality ceiling is ultimately set by the (unreleased) teacher models.
  • Closed training data: While the model weights are open, the training data composition and distillation teacher details are not fully disclosed, limiting reproducibility.
  • 128K context quality: While the architecture supports 128K tokens, the quality of retrieval and reasoning at very long contexts relative to models specifically optimized for long-context (like Gemini 1.5 Pro) is not exhaustively benchmarked.
  • Sparse vs. dense scaling: Gemma 3 remains a dense model family. Whether sparse MoE variants (like Mixtral) would be more efficient at these scales is an open question.

Connections

Related papers in the wiki: