World Models:
How AI Learns to Simulate Reality
Language models predict the next word. World models predict the next moment. They are AI systems that learn how reality works - not by reading about it, but by building an internal simulation of the world that can be run forward, explored, and acted upon.
Close your eyes and imagine pushing a ball off a table. You can see it falling, hear the bounce, predict where it rolls. You did not need to actually push the ball - your brain ran a simulation. This internal simulator is what cognitive scientists call a world model.
Now AI is learning to do the same thing. Below is a tiny world - a grid with a ball that obeys gravity. Click Step to advance time. The orange overlay shows what a simple predictive model thinks will happen next.
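The toy demo reduces to a few lines of code. This is a minimal sketch; all names here (step, predict, FLOOR) are illustrative, not taken from any real demo implementation.

```python
# A minimal sketch of the toy world: a ball on a grid that obeys gravity.
# Names (step, predict, FLOOR) are illustrative only.

GRID_HEIGHT = 8
FLOOR = GRID_HEIGHT - 1

def step(ball_row):
    """Ground-truth dynamics: the ball falls one cell per tick until it rests."""
    return min(ball_row + 1, FLOOR)

def predict(ball_row):
    """The 'world model' for this world is trivial: it has learned the same rule."""
    return min(ball_row + 1, FLOOR)

# Roll the world forward and check the model's one-step predictions
# (the orange overlay in the demo).
row = 0
for _ in range(10):
    guess = predict(row)   # what the model expects to see next
    row = step(row)        # what actually happens
    assert guess == row    # in a deterministic toy world, prediction is exact

print(row)  # → 7 (the ball has come to rest on the floor)
```

The interesting part is not this code but what replaces it: in a learned world model, predict is not hand-written - it is trained from observations.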
In this toy example, prediction is trivial - gravity is a simple rule. But what about a world with hundreds of objects, occlusion, friction, other agents making decisions? The prediction problem becomes enormously complex. This is the challenge that world models tackle.
The Core Idea
A world model is a learned function that, given the current state of an environment and an action, predicts what happens next:
s_{t+1} = f(s_t, a_t)

where s_t is the state at time t, a_t is the action taken, and s_{t+1} is the predicted next state. The model learns f from data - by observing thousands or millions of state transitions.
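In the simplest case, "learning f from transitions" can be nothing more than memorising them. The sketch below does exactly that for a hypothetical one-dimensional world; real systems replace the lookup table with a neural network, but the interface is the same.

```python
# Learning f(s, a) -> s' from observed transitions, as a lookup table.
# The 1-D world here is hypothetical: states 0..4, actions -1 (left), +1 (right).

def true_dynamics(s, a):
    """The environment itself, which the model never sees directly."""
    return max(0, min(4, s + a))

# Collect experience: (state, action, next_state) triples.
transitions = [(s, a, true_dynamics(s, a)) for s in range(5) for a in (-1, +1)]

# "Training" = memorising every observed transition.
model = {(s, a): s_next for s, a, s_next in transitions}

def f(s, a):
    """The learned world model: predict the next state without touching the world."""
    return model[(s, a)]

assert f(2, +1) == 3   # matches the environment
assert f(0, -1) == 0   # it has even learned the wall at the boundary
```

A table only works because this world is tiny and deterministic; a neural network generalises the same mapping to states it has never seen.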
Try it yourself. Use the arrow keys or buttons below to move the agent through a grid world. The left panel shows the actual result; the right shows what the model predicted would happen.
The key insight: In a deterministic world, a world model only needs to learn the rules. In a stochastic world, it must learn a distribution over possible futures. This distinction drives the three major approaches to world modeling that have emerged in 2024-2026.
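What "learning a distribution over futures" means can be made concrete with a few lines. Below is a sketch under an assumed toy dynamics (move forward with probability 0.8, stay put with probability 0.2); the model estimates p(s' | s) from counts rather than predicting a single next state.

```python
# In a stochastic world, the model must predict a distribution over next
# states, not a point. Here the distribution is estimated from counts.
from collections import Counter
import random

random.seed(0)

def noisy_dynamics(s):
    """Assumed toy world: advance with prob 0.8, stay with prob 0.2."""
    return s + 1 if random.random() < 0.8 else s

# Observe many transitions out of state 3 and tally the outcomes.
counts = Counter(noisy_dynamics(3) for _ in range(10_000))
total = sum(counts.values())

# The learned model is p(s' | s=3): a distribution, not a single answer.
p_next = {s_next: n / total for s_next, n in counts.items()}
assert abs(p_next[4] - 0.8) < 0.05
assert abs(p_next[3] - 0.2) < 0.05
```

A deterministic model answering this world with one state would be wrong 20% of the time no matter which state it picked; the distribution is the correct object to learn.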
Three Philosophies
Three distinct research programs have emerged, each with a fundamentally different answer to the question: what should a world model predict?
Genie: Imagining Worlds Frame by Frame
Google DeepMind's Genie models (Genie 2, Genie 3, Project Genie) take a direct approach: generate the next video frame conditioned on the user's action. Give it an image of a world and a keyboard input, and it produces what you would see next.
Technically, Genie 2 is an autoregressive latent diffusion model. It compresses frames into a latent space using an autoencoder, then a transformer predicts the next latent frame given past frames and the action. The latent is decoded back into pixels.
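The data flow - encode, predict the next latent conditioned on the action, decode - can be sketched with stand-in components. In the sketch below, fixed random matrices play the roles of the trained tokenizer, transformer, and decoder; this shows only the shape of the computation, not DeepMind's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trained networks: fixed random matrices, purely illustrative.
LATENT, PIXELS, ACTIONS = 8, 64, 4
E = rng.normal(size=(LATENT, PIXELS))        # "encoder":  pixels -> latent
D = rng.normal(size=(PIXELS, LATENT))        # "decoder":  latent -> pixels
W_hist = rng.normal(size=(LATENT, LATENT))   # "transformer" weights
W_act = rng.normal(size=(LATENT, ACTIONS))   # action conditioning

def predict_next_latent(history, action_onehot):
    """Autoregressive step: the next latent depends on all past latents + action."""
    context = np.mean(history, axis=0)        # crude stand-in for attention
    return np.tanh(W_hist @ context + W_act @ action_onehot)

frame = rng.normal(size=PIXELS)               # the initial image of a world
history = [E @ frame]                         # its latent representation
for t in range(5):
    action = np.eye(ACTIONS)[t % ACTIONS]     # the user's keyboard input
    z_next = predict_next_latent(history, action)
    history.append(z_next)                    # later frames condition on it
    frame = D @ z_next                        # what the user "sees" next

assert frame.shape == (PIXELS,) and len(history) == 6
```

Note the autoregressive structure: every predicted latent is appended to the history that conditions the next prediction, which is also why errors can accumulate over long rollouts.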
Below, you can step through a simulated version of this process. Pick an action at each step and watch the "world" generate the next frame. Each frame is generated autoregressively - it depends on every frame before it.
JEPA: Predicting Without Generating
Yann LeCun argues that predicting pixels is wasteful. Most pixel-level detail is irrelevant for understanding - the exact texture of a leaf does not matter for predicting that a ball will fall. JEPA (Joint Embedding Predictive Architecture) instead predicts in representation space.
Two encoders map observations into embeddings: a context encoder produces s_x = E_ctx(x) and a target encoder produces s_y = E_tgt(y). A predictor module then learns to predict the target embedding from the context embedding:

ŝ_y = P(s_x)
The loss minimizes the distance between the predicted and actual target embeddings. Crucially, predictions happen in an abstract space that discards irrelevant detail - the model learns what matters for prediction.
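The key point - that the loss lives in embedding space, never in pixel space - is easy to show with stand-in components. The linear maps below are illustrative placeholders for the trained encoders and predictor, not a real JEPA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in linear maps for the two encoders and the predictor; illustrative only.
DIM_OBS, DIM_EMB = 32, 8
ctx_encoder = rng.normal(size=(DIM_EMB, DIM_OBS))   # E_ctx
tgt_encoder = rng.normal(size=(DIM_EMB, DIM_OBS))   # E_tgt
predictor = rng.normal(size=(DIM_EMB, DIM_EMB))     # P

x = rng.normal(size=DIM_OBS)   # context observation (e.g. the current frame)
y = rng.normal(size=DIM_OBS)   # target observation (e.g. the next frame)

s_x = ctx_encoder @ x          # embed the context
s_y = tgt_encoder @ y          # embed the target
s_y_hat = predictor @ s_x      # predict the target embedding from the context

# The JEPA loss is a distance between embeddings - pixels never appear in it.
loss = np.sum((s_y_hat - s_y) ** 2)
assert s_y_hat.shape == s_y.shape == (DIM_EMB,)
assert loss >= 0.0
```

Because the encoders are trained jointly with the predictor, they are free to discard any detail of x and y that does not help reduce this loss - which is exactly how the model learns what matters.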
The visualization below shows this idea. Points in 2D represent embeddings. The context encoder maps an observation to a point. The predictor maps it to where it thinks the next observation's embedding will be. Drag the context point to explore the prediction landscape.
Why not just predict pixels? Consider predicting the next frame of a video showing a person at a crossroads. They might go left or right - both equally likely. A pixel-level model averaging these futures would produce a blurry ghost going in both directions. JEPA avoids this by predicting in abstract space where "left" and "right" are discrete representations, not overlapping pixel clouds.
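The blurry-ghost problem is concrete enough to compute. Below, two equally likely futures are crisp one-dimensional "frames"; a pixel-space model trained with a mean-squared-error loss is pushed toward their average. The setup is a deliberately simplified illustration.

```python
import numpy as np

# Two equally likely futures: the person went left, or the person went right.
# Each future is a crisp 1-D "frame" with a bright pixel at their position.
left = np.zeros(9);  left[1] = 1.0
right = np.zeros(9); right[7] = 1.0

# Minimising mean-squared error against both futures drives a pixel-space
# model toward their average:
pixel_prediction = (left + right) / 2

# The result is a half-intensity "ghost" in both places at once.
assert pixel_prediction[1] == 0.5 and pixel_prediction[7] == 0.5
assert pixel_prediction.max() < 1.0   # no crisp person anywhere

# An abstract representation can instead keep the two outcomes discrete,
# e.g. as separate modes of a distribution: uncertainty without blur.
p_futures = {"left": 0.5, "right": 0.5}
assert max(p_futures.values()) == 0.5
```

The averaging behaviour is a property of the loss, not of any particular architecture: any pixel-level model trained to minimise squared error on multimodal futures will blur them.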
Spatial Intelligence: Worlds in 3D
Fei-Fei Li's World Labs takes yet another angle: true understanding of the world requires spatial intelligence - grasping the 3D geometry, layout, and physical structure of scenes. Their model, Marble, generates persistent 3D environments that can be explored from any viewpoint.
This is fundamentally different from generating flat video frames. A spatially intelligent model knows that walking around a chair reveals its back, that objects have volume, that light comes from somewhere. It builds an internal 3D scene graph.
The demo below illustrates this. The same scene is shown from different viewpoints. A 2D model would need to "hallucinate" each view independently. A spatial model maintains a coherent 3D structure that you can orbit, revealing consistent geometry from any angle.
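What "maintaining a coherent 3D structure" buys you can be sketched directly: store the scene once as 3D points, and derive every view from it by rotation and projection. The scene and camera model below are assumptions for illustration (a pinhole camera with focal length 1), not World Labs' method.

```python
import numpy as np

# A spatial model stores one 3-D structure and derives every view from it.
# Toy scene: four 3-D points (say, the corners of a chair seat), stored once.
scene = np.array([
    [-0.5, 0.0, 2.0],
    [ 0.5, 0.0, 2.0],
    [ 0.5, 1.0, 2.0],
    [-0.5, 1.0, 2.0],
])

def view(points, theta):
    """Orbit the camera by theta (rotate about the vertical axis), then
    project to 2-D with an assumed pinhole camera of focal length 1."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    rotated = points @ R.T
    return rotated[:, :2] / rotated[:, 2:3]   # perspective divide by depth

front = view(scene, 0.0)
orbit = view(scene, 0.3)

# The two renders differ, because the viewpoint moved ...
assert not np.allclose(front, orbit)
# ... but both are projections of the same geometry: 3-D distances never change.
assert np.isclose(np.linalg.norm(scene[0] - scene[1]), 1.0)
```

A 2D generative model has no such invariant to lean on: each hallucinated view must independently re-infer geometry that a spatial model gets for free.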
Why It Matters
World models represent a fundamental shift in AI: from systems that process language to systems that understand physical reality. The implications span robotics (simulate before acting), gaming (infinite procedural worlds), autonomous vehicles (predict other drivers), and scientific discovery (simulate experiments).
The convergence is striking: three different philosophies, each attracting billions in funding within months of each other. The field has moved from academic curiosity to one of AI's most heavily invested frontiers.
Open Questions
World models are advancing rapidly, but fundamental challenges remain unsolved.
Long-horizon consistency
Genie 3 maintains visual coherence for a few minutes. But a useful world model for robotics or long-form planning needs to stay consistent for hours or days. Errors compound autoregressively - small drift becomes catastrophic over time.
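The compounding arithmetic is worth seeing. With an assumed 1% drift per predicted frame at 30 frames per second (both numbers are made up for illustration), fidelity decays geometrically:

```python
# Autoregressive rollouts feed each prediction back in as input, so a small
# per-step error compounds geometrically. Toy numbers for illustration only:
PER_STEP_ERROR = 0.01   # assume 1% drift per predicted frame
FPS = 30                # assume 30 predicted frames per second

def accumulated_error(seconds):
    """Fraction of the rollout lost to drift after the given duration."""
    steps = seconds * FPS
    return 1.0 - (1.0 - PER_STEP_ERROR) ** steps

# Even tiny per-frame drift dominates within seconds:
assert accumulated_error(1) > 0.25    # over a quarter gone after one second
assert accumulated_error(10) > 0.95   # almost everything gone after ten
```

This is why long-horizon consistency is not just an engineering detail: halving per-step error only roughly doubles the usable horizon, while useful robotic planning may need orders of magnitude more.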
Grounding in physics
Current models learn statistical correlations from video, not actual physics. They might learn that dropped objects fall, but fail on novel scenarios - a ball in zero gravity, an unfamiliar material. True physical reasoning versus pattern matching remains an open problem.
The representation question
Should world models predict pixels (Genie), embeddings (JEPA), or 3D geometry (World Labs)? Each has tradeoffs. Pixel prediction is grounded but wasteful. Embedding prediction is efficient but hard to interpret. 3D prediction is structured but makes strong assumptions about the world.
Perhaps the answer is all three, at different levels of a hierarchy - and indeed LeCun's Hierarchical JEPA (H-JEPA) proposes exactly this: different abstraction levels for different prediction horizons.
Evaluation
How do you measure the quality of a world model? FID scores for video quality? Task success rates for robotics? Physics prediction accuracy? The field lacks a standard benchmark, making it difficult to compare approaches directly.
The world model race is not just a technical competition - it reflects a deeper question about intelligence itself. Is understanding the world fundamentally about seeing (pixels), knowing (representations), or structuring (geometry)? The answer may reshape how we build AI for decades to come.