Gemma 3 Technical Report
Overview
Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by adding native multimodal vision capabilities, extending context windows to 128K tokens, and substantially improving multilingual support -- all while maintaining efficiency for deployment on consumer hardware. The core architectural innovation is an interleaved local/global attention mechanism with a 5:1 ratio that reduces KV-cache memory by approximately 5x at 128K context length compared to full global attention.
The efficiency gains are striking: Gemma 3 4B-IT matches Gemma 2 27B-IT performance across benchmarks, and the 27B variant achieves results comparable to Gemini 1.5 Pro despite being far smaller. This compression comes from knowledge distillation from larger teacher models combined with an improved post-training pipeline that includes reinforcement learning from human feedback. The instruction-tuned 27B model ranked in the top 10 on Chatbot Arena while requiring far less compute than competing models.
Vision capabilities are added through a SigLIP-based vision encoder that converts images into soft token sequences, enabling the model to process interleaved text and image inputs natively. A Pan-and-Scan strategy crops high-resolution images into multiple sub-images for detailed understanding. Gemma 3 also demonstrates dramatically reduced memorization compared to Gemma 2: the 1B model shows approximately 0.0001% exact memorization versus 0.03% for Gemma 2 2B -- orders of magnitude improvement through architectural changes and data filtering.
Key Contributions
- Interleaved local/global attention: A 5:1 ratio of local sliding-window attention layers to global attention layers reduces KV-cache memory by ~5x at 128K context while maintaining quality, making long-context inference practical on consumer GPUs.
- Efficient multimodal integration: SigLIP vision encoder with Pan-and-Scan image processing adds vision capabilities without architectural bloat, enabling text-image interleaved understanding.
- Knowledge distillation at scale: Distillation from larger teacher models enables the 4B model to match the prior-generation 27B model, demonstrating that model compression can yield generation-over-generation efficiency gains.
- 128K context via RoPE scaling: Increasing the RoPE base frequency from 10K to 1M on global attention layers extends context to 128K tokens while preserving short-context quality.
- Dramatic memorization reduction: Orders-of-magnitude reduction in exact memorization (~0.0001% vs. 0.03%) through combined architectural and data-filtering techniques.
- Multilingual expansion: Broad multilingual support across dozens of languages, extending the model's utility beyond English-centric benchmarks.
Architecture / Method
┌──────────────────────────────────────────────────────────────┐
│                     Gemma 3 Architecture                     │
│                                                              │
│   Image Input                                 Text Input     │
│       │                                            │         │
│       ▼                                            ▼         │
│  ┌──────────┐                                ┌───────────┐   │
│  │  SigLIP  │                                │ Tokenizer │   │
│  │  Vision  │                                └─────┬─────┘   │
│  │ Encoder  │                                      │         │
│  └────┬─────┘                                      │         │
│       │                                            │         │
│       ▼                                            │         │
│  Pan-and-Scan                                      │         │
│  (full + crops)                                    │         │
│       │                                            │         │
│       └──► [soft visual tokens] + [text tokens] ◄──┘         │
│                              │                               │
│                              ▼                               │
│         ┌────────────────────────────────────────┐           │
│         │        Decoder-Only Transformer        │           │
│         │      (5:1 Local/Global Attention)      │           │
│         │                                        │           │
│         │  Layer 1: Local (sliding window 1024)  │           │
│         │  Layer 2: Local                        │           │
│         │  Layer 3: Local                        │           │
│         │  Layer 4: Local                        │           │
│         │  Layer 5: Local                        │           │
│         │  Layer 6: Global (full attn, RoPE 1M)  │           │
│         │  Layer 7: Local                        │           │
│         │  ...pattern repeats...                 │           │
│         │                                        │           │
│         │  ──► ~5x KV-cache savings at 128K      │           │
│         └────────────────────┬───────────────────┘           │
│                              ▼                               │
│                        Output Tokens                         │
│                                                              │
│  Sizes: 1B (text-only) | 4B | 12B | 27B (all multimodal)     │
└──────────────────────────────────────────────────────────────┘

Gemma 3 uses a decoder-only Transformer architecture with several key modifications:
Interleaved Local/Global Attention
The defining architectural choice is the 5:1 interleaving of local sliding-window attention and global full attention layers. In a typical configuration:
- Local layers (5 out of every 6): Use sliding-window attention with a limited receptive field, requiring only a fixed-size KV cache regardless of sequence length.
- Global layers (1 out of every 6): Use full causal attention across the entire sequence, with RoPE base frequency increased from 10K to 1M to handle long contexts.
This design reduces KV-cache memory by approximately 5x at 128K context compared to full global attention, making long-context inference feasible on hardware with limited memory.
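A back-of-the-envelope estimate makes the savings concrete. The following is a minimal sketch in Python, assuming a hypothetical 48-layer stack, 16 KV heads of dimension 128, and 2-byte (bf16) cache entries; these are illustrative values, not the published configurations:

```python
def kv_cache_bytes(n_global, n_local, seq_len, window=1024,
                   n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size: each cached position stores one key and one
    value vector (the factor of 2) per KV head, per layer."""
    per_pos = 2 * n_kv_heads * head_dim * bytes_per_elem
    return (n_global * seq_len * per_pos                  # global: full history
            + n_local * min(seq_len, window) * per_pos)   # local: capped at window

seq_len, n_layers = 128_000, 48                           # hypothetical depth
full_global = kv_cache_bytes(n_layers, 0, seq_len)
interleaved = kv_cache_bytes(n_layers // 6, n_layers - n_layers // 6, seq_len)
print(f"full global    : {full_global / 2**30:.1f} GiB")
print(f"5:1 interleaved: {interleaved / 2**30:.1f} GiB "
      f"({full_global / interleaved:.1f}x smaller)")
```

Under these assumptions the interleaved design comes out roughly 5-6x smaller at 128K context, in line with the reported ~5x.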
Model Configurations
| Model size | 1B | 4B | 12B | 27B |
|---|---|---|---|---|
| Context length | 32K | 128K | 128K | 128K |
| Modality | Text-only | Text + vision | Text + vision | Text + vision |
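The 128K rows above rest on the RoPE base-frequency increase on the global layers. A minimal sketch of why the base matters, assuming the standard RoPE formulation and an illustrative head dimension of 128:

```python
import math

def rope_inv_freqs(head_dim=128, base=10_000.0):
    """Inverse frequencies theta_i = base^(-2i / head_dim) for each rotary
    dimension pair; a larger base means slower rotation and longer
    positional wavelengths."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

# Wavelength (in tokens) of the slowest-rotating dimension pair:
for base in (10_000.0, 1_000_000.0):
    slowest = rope_inv_freqs(base=base)[-1]
    print(f"base {base:>11,.0f}: max wavelength ~ {2 * math.pi / slowest:,.0f} tokens")
```

At base 10K the slowest-rotating dimension wraps around after roughly 50K tokens, well short of 128K; base 1M stretches the longest wavelength into the millions, leaving ample headroom.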
Vision Encoder
The 4B, 12B, and 27B variants incorporate a SigLIP-based vision encoder that processes images into soft token sequences fed into the language model. The Pan-and-Scan strategy handles high-resolution images by:
- Taking the full image as one view for global context
- Cropping the image into multiple sub-images to capture fine-grained details
- Encoding each crop independently through SigLIP
- Concatenating the resulting soft tokens into the text token stream
This approach enables detailed visual understanding without requiring extremely large vision encoders or massive token counts.
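A minimal sketch of the cropping geometry, assuming square crops near the encoder's working resolution; the published crop-selection heuristics are more involved, so treat this as an illustration:

```python
def pan_and_scan_boxes(width, height, crop_size=896, max_crops=4):
    """Return the full image plus a grid of crop boxes
    (left, top, right, bottom). Each box is resized to the vision
    encoder's input resolution and encoded by SigLIP independently."""
    boxes = [(0, 0, width, height)]            # full view for global context
    cols = min(max_crops, max(1, round(width / crop_size)))
    rows = min(max_crops, max(1, round(height / crop_size)))
    if cols * rows == 1:                       # small image: full view suffices
        return boxes
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return boxes

# A wide 1792x896 image yields the full view plus two side-by-side crops:
for box in pan_and_scan_boxes(1792, 896):
    print(box)
```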
Training

- Pre-training: Trained on trillions of tokens of multilingual text and image-text data, with token budgets scaling from roughly 2T for the 1B model up to 14T for the 27B.
- Knowledge distillation: Smaller models are distilled from larger teacher models during pre-training, which is the primary mechanism behind the 4B matching 27B-class performance (see the sketch after this list).
- Post-training: Multi-stage pipeline including supervised fine-tuning (SFT) on high-quality instruction data and reinforcement learning from human feedback (RLHF) for alignment and helpfulness.
- Quantization-aware training: Models are trained with quantization awareness to enable efficient deployment at reduced precision without significant quality loss.
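A minimal sketch of the distillation objective referenced above, assuming standard soft-target KL distillation in PyTorch; the report does not publish the exact recipe, so the temperature and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-target distillation: the student matches the teacher's full
    next-token distribution via KL divergence, a denser training signal
    than the one-hot target alone."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # batchmean KL; the t**2 factor keeps gradients comparable across temperatures
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t**2

# Toy shapes: (batch, sequence, vocab)
student = torch.randn(2, 8, 256)
teacher = torch.randn(2, 8, 256)
print(distillation_loss(student, teacher).item())
```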
Results

Key Performance Highlights
| Model | Benchmarks (MMLU, GSM8K, MATH, HumanEval) | Chatbot Arena |
|---|---|---|
| Gemma 3 27B-IT | Comparable to Gemini 1.5 Pro | Top 10 (Elo ~1338) |
| Gemma 3 4B-IT | Matches Gemma 2 27B-IT across math, reasoning, and code | Competitive |
| Gemma 2 27B-IT | Prior-generation baseline | — |
The most notable result is the efficiency gain: Gemma 3 4B achieves parity with the previous generation's 27B model across mathematics, reasoning, and chat benchmarks, representing a roughly 7x parameter reduction for equivalent capability.
Memorization

| Model | Exact Memorization Rate |
|---|---|
| Gemma 2 2B | ~0.03% |
| Gemma 3 1B | ~0.0001% |
This ~300x reduction in memorization is achieved through a combination of architectural changes (the local/global attention pattern limits the model's ability to memorize long verbatim sequences) and aggressive data deduplication and filtering.
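Exact memorization in this style of evaluation is typically measured by prompting the model with a prefix drawn from a training document and checking whether greedy decoding reproduces the true continuation verbatim. A minimal sketch of that protocol, where generate_greedy is a hypothetical stand-in for the model's decoder and the 50-token split is an assumption:

```python
def exact_memorization_rate(samples, generate_greedy,
                            prefix_len=50, cont_len=50):
    """Fraction of training samples whose true continuation is reproduced
    token-for-token from its prefix under greedy decoding."""
    hits = 0
    for tokens in samples:                      # token-id lists from training data
        prefix = tokens[:prefix_len]
        target = tokens[prefix_len:prefix_len + cont_len]
        if generate_greedy(prefix, max_new_tokens=cont_len) == target:
            hits += 1
    return hits / len(samples)

# Toy check with a fake model that "memorized" only the first document:
def fake_generate(prefix, max_new_tokens):
    if prefix[0] == 0:                          # memorized doc: keeps counting up
        return list(range(prefix[-1] + 1, prefix[-1] + 1 + max_new_tokens))
    return [0] * max_new_tokens                 # anything else: unrelated output

data = [list(range(100)), list(range(1000, 1100))]
print(exact_memorization_rate(data, fake_generate))  # 0.5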
Limitations & Open Questions
- Vision capabilities are add-on, not native: The SigLIP encoder is a separate module grafted onto the language model, rather than a truly unified multimodal architecture. Whether this limits deep vision-language reasoning compared to natively multimodal models remains unclear.
- Distillation ceiling: Knowledge distillation compresses teacher capability but cannot exceed it. The quality ceiling is ultimately set by the (unreleased) teacher models.
- Closed training data: While the model weights are open, the training data composition and distillation teacher details are not fully disclosed, limiting reproducibility.
- 128K context quality: While the architecture supports 128K tokens, the quality of retrieval and reasoning at very long contexts relative to models specifically optimized for long-context (like Gemini 1.5 Pro) is not exhaustively benchmarked.
- Sparse vs. dense scaling: Gemma 3 remains a dense model family. Whether sparse MoE variants (like Mixtral) would be more efficient at these scales is an open question.
Connections
Related papers in the wiki:
- Attention Is All You Need — the Transformer architecture Gemma 3 builds on
- Palm Scaling Language Modeling With Pathways — PaLM, Google's earlier large-scale LLM that popularized multi-query attention and SwiGLU; Gemma 3 continues the same dense decoder-only lineage
- Language Models Are Few Shot Learners — GPT-3, the model that established few-shot LLM capabilities; Gemma 3 achieves comparable quality at far smaller scale via distillation
- Training Compute Optimal Large Language Models — Chinchilla scaling laws; Gemma 3's distillation approach is complementary, compressing capability beyond what data-optimal training alone achieves
- Mixtral Of Experts — Mixtral sparse MoE; an alternative scaling strategy to Gemma 3's dense + distillation approach
- Visual Instruction Tuning — LLaVA's vision-language architecture; Gemma 3 uses a similar vision encoder + language model pattern but with SigLIP instead of CLIP
- Learning Transferable Visual Models From Natural Language Supervision — CLIP, the contrastive vision-language paradigm that SigLIP (Gemma 3's vision encoder) builds upon
- Llama 2 Open Foundation And Fine Tuned Chat Models — Llama 2, the primary open-weight competitor family; Gemma 3 targets similar deployment scenarios
- Pi0 A Vision Language Action Flow Model For General Robot Control — pi0 uses PaliGemma (Gemma's vision-language variant) as its VLM backbone; Gemma 3 advances the underlying model family
- Openvla An Open Source Vision Language Action Model — OpenVLA uses SigLIP (Gemma 3's vision encoder) as part of its dual encoder; Gemma 3 improves the SigLIP integration
- Foundation Models — broader context on foundation models, scaling, and the pretrain-then-adapt paradigm
- Machine Learning — transformer era and scaling laws context