S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation

📄 Read on arXiv

Overview

S4-Driver is a self-supervised framework that adapts Multimodal Large Language Models (MLLMs) for autonomous vehicle motion planning. The system processes multi-view camera imagery to predict future trajectories directly, addressing two fundamental challenges: MLLMs' reliance on 2D representations when operating in 3D space, and the limited scale of annotated training data in public driving datasets.

The core insight is that annotation-free training can match or exceed supervised approaches when combined with architectural innovations that transform MLLMs' 2D visual representations into effective 3D spatio-temporal reasoning. Built on PaLI3-5B, S4-Driver introduces a hierarchical coarse-to-fine planning strategy using "meta-decisions" (high-level acceleration behavior descriptions generated through heuristics), a novel sparse 3D volume representation, and multi-frame temporal fusion -- all without requiring human annotations for perception or prediction tasks.

S4-Driver represents a paradigm shift toward annotation-free autonomous driving that can leverage vast unlabeled driving data. By eliminating dependence on expensive human labeling for perception and prediction tasks, the framework dramatically reduces the cost and time required to develop capable autonomous driving systems, demonstrating clear scaling benefits on datasets 100x larger than nuScenes.

Key Contributions

  • Self-supervised MLLM for driving: Achieves state-of-the-art motion planning without any human annotation for perception or prediction, using only raw driving data and self-supervised objectives
  • 3D spatio-temporal visual representation: Transforms 2D MLLM features into 3D volumes through dense volume projection, sparse volume representation, and local feature aggregation with relative position bias
  • Hierarchical meta-decision planning: Introduces coarse-to-fine planning with heuristic-generated meta-decisions describing future acceleration behavior, eliminating annotation requirements
  • Multi-decoding strategy: Generates K=16 plausible trajectories and averages them to counteract the conservative prediction problem inherent in imitation learning (sketched after this list)
  • Scalability validation: Demonstrates consistent improvement with data scale on WOMD (100x larger than nuScenes), validating the self-supervised approach at scale
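
A minimal sketch of that multi-decoding step, assuming a `sample_trajectory` callable that wraps one stochastic decode of the planner (the name and interface are hypothetical, not the paper's API):

```python
import numpy as np

def multi_decode_average(sample_trajectory, k=16):
    """Average k independently decoded candidate trajectories.

    `sample_trajectory` is a hypothetical callable wrapping one
    stochastic decode of the planner, returning a (T, 2) array of
    future (x, y) waypoints. Averaging the k samples damps the
    overly conservative output of a single decode.
    """
    samples = np.stack([sample_trajectory() for _ in range(k)])  # (k, T, 2)
    return samples.mean(axis=0)                                  # (T, 2)
```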

Architecture / Method

                        S4-Driver Pipeline

┌────────────┐      ┌────────────┐   ┌──────────────────┐
│ Multi-View │      │ Historical │   │ Behavior Command │
│  Cameras   │      │ Ego States │   │ (meta-decision)  │
└─────┬──────┘      └─────┬──────┘   └────────┬─────────┘
      │                   │                   │
      ▼                   │                   │
┌────────────────┐        │                   │
│   PaLI3-5B     │        │                   │
│ Vision Encoder │        │                   │
└───────┬────────┘        │                   │
        │                 │                   │
        ▼                 │                   │
┌────────────────────────────────────┐        │
│ 3D Spatio-Temporal Visual Module   │        │
│  • Dense Volume Projection         │        │
│  • Sparse Volume Representation    │        │
│  • Local Feature Aggregation       │        │
│    (3D relative position bias)     │        │
│  • Multi-Frame Temporal Fusion     │        │
└────────────────┬───────────────────┘        │
                 │        │                   │
                 ▼        ▼                   ▼
┌────────────────────────────────────────────────────────────────┐
│                   PaLI3-5B Language Decoder                    │
│   Coarse: Meta-decision ("accelerating forward")               │
│   Fine:   Trajectory Waypoints (K=16 decoded)                  │
└────────────────────────────┬───────────────────────────────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │ Averaged Planned │
                    │    Trajectory    │
                    └──────────────────┘

S4-Driver architecture overview

The architecture processes multi-view camera images, historical ego-vehicle states, and behavior commands through PaLI3-5B as a unified text interface. The key innovation is the 3D Spatio-Temporal Visual Representation module:

Dense Volume Projection projects 2D image features from the MLLM's vision encoder into 3D feature volumes centered on the ego-vehicle, creating a volumetric representation of the surrounding scene.
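
A minimal sketch of this projection, assuming pinhole cameras with known intrinsics and ego-to-camera extrinsics; the tensor names, shapes, and the average-over-visible-cameras rule are illustrative, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def dense_volume_projection(feats, intrinsics, extrinsics, voxel_centers):
    """Lift 2D image features into an ego-centric 3D volume.

    feats:         (N_cam, C, H, W) vision-encoder feature maps
    intrinsics:    (N_cam, 3, 3) camera intrinsic matrices
    extrinsics:    (N_cam, 4, 4) ego-to-camera transforms
    voxel_centers: (V, 3) voxel centers in the ego frame
    Returns (V, C) features averaged over cameras that see each voxel.
    """
    n_cam, c, h, w = feats.shape
    homo = torch.cat([voxel_centers, torch.ones(len(voxel_centers), 1)], dim=-1)
    acc = torch.zeros(len(voxel_centers), c)
    hits = torch.zeros(len(voxel_centers), 1)
    for i in range(n_cam):
        cam_pts = (extrinsics[i] @ homo.T).T[:, :3]      # ego -> camera frame
        depth = cam_pts[:, 2:3]
        pix = (intrinsics[i] @ cam_pts.T).T              # pinhole projection
        pix = pix[:, :2] / depth.clamp(min=1e-5)
        # normalize pixel coordinates to [-1, 1] for grid_sample
        grid = torch.stack([pix[:, 0] / w * 2 - 1, pix[:, 1] / h * 2 - 1], dim=-1)
        sampled = F.grid_sample(
            feats[i:i + 1], grid.view(1, -1, 1, 2), align_corners=False
        ).squeeze(0).squeeze(-1).T                       # (V, C)
        visible = ((depth > 0) & (grid.abs() <= 1).all(dim=-1, keepdim=True)).float()
        acc += sampled * visible
        hits += visible
    return acc / hits.clamp(min=1)
```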

Sparse Volume Representation learns to retain only relevant voxels through adaptive sparse features. Ablation studies show the sparse volumes concentrate on semantically important regions (front area, road center), dramatically reducing computational cost while preserving planning-relevant information.
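
A toy version of the sparsification, assuming a learned per-voxel scorer and top-k gating (a stand-in for the paper's learned sparse selection, which may differ in detail):

```python
import torch

def sparsify_volume(voxel_feats, scorer, keep_ratio=0.1):
    """Keep only the most planning-relevant voxels.

    voxel_feats: (V, C) dense volume features from the projection step
    scorer:      learned module mapping (V, C) -> (V, 1) relevance logits
    Returns the retained features and their voxel indices.
    """
    scores = scorer(voxel_feats).squeeze(-1)            # (V,)
    k = max(1, int(keep_ratio * voxel_feats.shape[0]))
    topk = torch.topk(scores, k)
    # Scaling by the sigmoid score keeps the selection differentiable.
    kept = voxel_feats[topk.indices] * torch.sigmoid(topk.values).unsqueeze(-1)
    return kept, topk.indices
```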

Local Feature Aggregation injects relative 3D position bias into attention mechanisms, enabling the MLLM to perform genuine 3D spatial reasoning rather than relying on implicit 2D heuristics.
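
In code, this amounts to adding a learned bias, computed from pairwise 3D voxel offsets, to the attention logits; a minimal single-head sketch with illustrative names:

```python
import torch

def attention_with_3d_position_bias(q, k, v, rel_pos, bias_mlp):
    """Scaled dot-product attention with a relative 3D position bias.

    q, k, v:  (V, C) sparse voxel features (self-attention over voxels)
    rel_pos:  (V, V, 3) pairwise voxel-center offsets in the ego frame
    bias_mlp: learned module mapping (..., 3) -> (..., 1) bias logits
    The bias makes attention weights depend on real 3D geometry rather
    than 2D token order.
    """
    c = q.shape[-1]
    attn = (q @ k.T) / c ** 0.5                  # (V, V) content term
    attn = attn + bias_mlp(rel_pos).squeeze(-1)  # add geometric bias
    return torch.softmax(attn, dim=-1) @ v       # (V, C)
```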

Multi-Frame Temporal Fusion incorporates temporal information by ego-motion compensating historical frames and fusing multi-frame features, giving the model awareness of scene dynamics.
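
A rough sketch of the compensation step on a regular voxel grid: current-frame voxel centers are mapped into each past ego frame via the relative pose, past features are gathered at those locations, and the volumes are fused (a plain mean here, as a placeholder for the paper's fusion operator):

```python
import torch

def temporal_fuse(cur_vol, hist_vols, hist_poses, cur_pose, voxel_size, origin):
    """Ego-motion-compensate past volumes onto the current grid, then fuse.

    cur_vol:   (X, Y, Z, C) current-frame feature volume
    hist_vols: list of (X, Y, Z, C) past volumes
    hist_poses / cur_pose: (4, 4) world-from-ego transforms
    voxel_size: float grid resolution (m); origin: (3,) grid origin (m)
    """
    X, Y, Z, C = cur_vol.shape
    # current-frame voxel centres in ego coordinates
    idx = torch.stack(torch.meshgrid(
        torch.arange(X), torch.arange(Y), torch.arange(Z), indexing="ij"), dim=-1)
    centers = idx.float() * voxel_size + origin                  # (X, Y, Z, 3)
    homo = torch.cat([centers, torch.ones(X, Y, Z, 1)], dim=-1)  # (X, Y, Z, 4)
    fused = [cur_vol]
    for vol, pose in zip(hist_vols, hist_poses):
        rel = torch.linalg.inv(pose) @ cur_pose        # maps current ego -> past ego
        past = (homo.reshape(-1, 4) @ rel.T)[:, :3]    # centres in the past frame
        j = ((past - origin) / voxel_size).round().long()
        valid = ((j >= 0) & (j < torch.tensor([X, Y, Z]))).all(-1)
        warped = torch.zeros_like(cur_vol).reshape(-1, C)
        warped[valid] = vol.reshape(-1, C)[
            (j[valid, 0] * Y + j[valid, 1]) * Z + j[valid, 2]]
        fused.append(warped.reshape(X, Y, Z, C))
    return torch.stack(fused).mean(0)  # simple mean fusion as a placeholder
```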

The hierarchical planning approach first generates coarse meta-decisions (e.g., "accelerating forward") then refines these into precise trajectory waypoints, decomposing the planning problem into manageable subtasks.
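
Because the meta-decisions come from hand-designed heuristics over future ego states rather than human labels, a toy generator might simply threshold the speed profile; the label set, thresholds, and time step below are illustrative:

```python
def meta_decision(speeds, dt=0.5, accel_thresh=0.5, stop_thresh=0.2):
    """Heuristic meta-decision from a future ego speed profile (m/s).

    Illustrative stand-in for the paper's hand-designed rules: label
    the horizon by net acceleration, with a special case for stopping.
    """
    if speeds[-1] < stop_thresh:
        return "coming to a stop"
    accel = (speeds[-1] - speeds[0]) / (dt * (len(speeds) - 1))
    if accel > accel_thresh:
        return "accelerating forward"
    if accel < -accel_thresh:
        return "decelerating"
    return "keeping speed"
```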

Results

| Benchmark | Metric        | S4-Driver | S4-Driver* (pretrained) | Comparison                                  |
|-----------|---------------|-----------|-------------------------|---------------------------------------------|
| nuScenes  | Avg L2 (1-3s) | 0.38m     | 0.31m                   | OmniDrive: 0.33m, VAD: 0.37m, UniAD: 0.66m  |
| WOMD      | ADE@5s        | 0.693     | 0.655                   | Vanilla PaLI: 0.798, MotionLM: 0.697        |
| WOMD      | bADE@5s       | 0.928     | 0.830                   | Vanilla PaLI: 1.069                         |

On nuScenes, the base S4-Driver achieves 0.38m average L2 error; S4-Driver* (pretrained on a large internal dataset) reaches 0.31m, outperforming the supervised baselines OmniDrive (0.33m) and VAD (0.37m). On the proprietary WOMD-Planning-ADE benchmark (100x larger than nuScenes), the base S4-Driver edges out MotionLM (0.693 vs. 0.697 ADE@5s) despite using only raw camera inputs rather than MotionLM's richer inputs.
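
For reference, the headline metrics are variants of the mean pointwise L2 distance between planned and logged waypoints; a minimal implementation (the two benchmarks differ in horizon and in which timesteps are averaged):

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error: mean L2 distance over waypoints.

    pred, gt: (T, 2) arrays of planned vs. logged (x, y) positions.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```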

Meta-decision accuracy ranges from 67% to 85% across driving behaviors. Ablation studies confirm that each component (sparse volumes, temporal fusion, multi-decoding) contributes measurably, and that stronger MLLM backbones yield significant gains on large datasets.

Limitations & Open Questions

  • Proprietary large-scale data: The strongest results (S4-Driver*) rely on a large internal dataset not publicly available, making full reproduction difficult
  • Meta-decision heuristics: The coarse planning stage relies on hand-designed heuristics for meta-decision generation; learning these end-to-end could improve performance
  • Closed-loop evaluation absent: All evaluation is open-loop; the gap between open-loop metrics and closed-loop driving competence remains untested

Connections