World Models:
How AI Learns to Simulate Reality
Language models predict the next word. World models predict the next moment. They are AI systems that learn how reality works - not by reading about it, but by building an internal simulation of the world that can be run forward, explored, and acted upon.
Close your eyes and imagine pushing a ball off a table. You can see it falling, hear the bounce, predict where it rolls. You did not need to actually push the ball - your brain ran a simulation. This internal simulator is what cognitive scientists call a world model.
Now AI is learning to do the same thing. Below is a tiny world - a grid with a ball that obeys gravity. Click Step to advance time. The orange overlay shows what a simple predictive model thinks will happen next.
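The toy demo reduces to a few lines of code. This is a minimal sketch; all names here (step, predict, FLOOR) are illustrative, not taken from any real demo implementation.

```python
# A minimal sketch of the toy world: a ball on a grid that obeys gravity.
# Names (step, predict, FLOOR) are illustrative only.

GRID_HEIGHT = 8
FLOOR = GRID_HEIGHT - 1

def step(ball_row):
    """Ground-truth dynamics: the ball falls one cell per tick until it rests."""
    return min(ball_row + 1, FLOOR)

def predict(ball_row):
    """The 'world model' for this world is trivial: it has learned the same rule."""
    return min(ball_row + 1, FLOOR)

# Roll the world forward and check the model's one-step predictions
# (the orange overlay in the demo).
row = 0
for _ in range(10):
    guess = predict(row)   # what the model expects to see next
    row = step(row)        # what actually happens
    assert guess == row    # in a deterministic toy world, prediction is exact

print(row)  # → 7 (the ball has come to rest on the floor)
```

The interesting part is not this code but what replaces it: in a learned world model, predict is not hand-written - it is trained from observations.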
In this toy example, prediction is trivial - gravity is a simple rule. But what about a world with hundreds of objects, occlusion, friction, other agents making decisions? The prediction problem becomes enormously complex. This is the challenge that world models tackle.
The Core Idea
A world model is a learned function that, given the current state of an environment and an action, predicts what happens next:
s_{t+1} = f(s_t, a_t)

where s_t is the state at time t, a_t is the action taken, and s_{t+1} is the predicted next state. The model learns f from data - by observing thousands or millions of state transitions.
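In the simplest case, "learning f from transitions" can be nothing more than memorising them. The sketch below does exactly that for a hypothetical one-dimensional world; real systems replace the lookup table with a neural network, but the interface is the same.

```python
# Learning f(s, a) -> s' from observed transitions, as a lookup table.
# The 1-D world here is hypothetical: states 0..4, actions -1 (left), +1 (right).

def true_dynamics(s, a):
    """The environment itself, which the model never sees directly."""
    return max(0, min(4, s + a))

# Collect experience: (state, action, next_state) triples.
transitions = [(s, a, true_dynamics(s, a)) for s in range(5) for a in (-1, +1)]

# "Training" = memorising every observed transition.
model = {(s, a): s_next for s, a, s_next in transitions}

def f(s, a):
    """The learned world model: predict the next state without touching the world."""
    return model[(s, a)]

assert f(2, +1) == 3   # matches the environment
assert f(0, -1) == 0   # it has even learned the wall at the boundary
```

A table only works because this world is tiny and deterministic; a neural network generalises the same mapping to states it has never seen.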
Try it yourself. Use the arrow keys or buttons below to move the agent through a grid world. The left panel shows the actual result; the right shows what the model predicted would happen.
The key insight: In a deterministic world, a world model only needs to learn the rules. In a stochastic world, it must learn a distribution over possible futures. This distinction drives the three major approaches to world modeling that have emerged in 2024-2026.
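What "learning a distribution over futures" means can be made concrete with a few lines. Below is a sketch under an assumed toy dynamics (move forward with probability 0.8, stay put with probability 0.2); the model estimates p(s' | s) from counts rather than predicting a single next state.

```python
# In a stochastic world, the model must predict a distribution over next
# states, not a point. Here the distribution is estimated from counts.
from collections import Counter
import random

random.seed(0)

def noisy_dynamics(s):
    """Assumed toy world: advance with prob 0.8, stay with prob 0.2."""
    return s + 1 if random.random() < 0.8 else s

# Observe many transitions out of state 3 and tally the outcomes.
counts = Counter(noisy_dynamics(3) for _ in range(10_000))
total = sum(counts.values())

# The learned model is p(s' | s=3): a distribution, not a single answer.
p_next = {s_next: n / total for s_next, n in counts.items()}
assert abs(p_next[4] - 0.8) < 0.05
assert abs(p_next[3] - 0.2) < 0.05
```

A deterministic model answering this world with one state would be wrong 20% of the time no matter which state it picked; the distribution is the correct object to learn.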
Three Philosophies
Three distinct research programs have emerged, each with a fundamentally different answer to the question: what should a world model predict?
Genie: Imagining Worlds Frame by Frame
Google DeepMind's Genie models (Genie 2, Genie 3, Project Genie) take a direct approach: generate the next video frame conditioned on the user's action. Give it an image of a world and a keyboard input, and it produces what you would see next.
Technically, Genie 2 is an autoregressive latent diffusion model. It compresses frames into a latent space using an autoencoder, then a transformer predicts the next latent frame given past frames and the action. The latent is decoded back into pixels.
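The data flow - encode, predict the next latent conditioned on the action, decode - can be sketched with stand-in components. In the sketch below, fixed random matrices play the roles of the trained tokenizer, transformer, and decoder; this shows only the shape of the computation, not DeepMind's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trained networks: fixed random matrices, purely illustrative.
LATENT, PIXELS, ACTIONS = 8, 64, 4
E = rng.normal(size=(LATENT, PIXELS))        # "encoder":  pixels -> latent
D = rng.normal(size=(PIXELS, LATENT))        # "decoder":  latent -> pixels
W_hist = rng.normal(size=(LATENT, LATENT))   # "transformer" weights
W_act = rng.normal(size=(LATENT, ACTIONS))   # action conditioning

def predict_next_latent(history, action_onehot):
    """Autoregressive step: the next latent depends on all past latents + action."""
    context = np.mean(history, axis=0)        # crude stand-in for attention
    return np.tanh(W_hist @ context + W_act @ action_onehot)

frame = rng.normal(size=PIXELS)               # the initial image of a world
history = [E @ frame]                         # its latent representation
for t in range(5):
    action = np.eye(ACTIONS)[t % ACTIONS]     # the user's keyboard input
    z_next = predict_next_latent(history, action)
    history.append(z_next)                    # later frames condition on it
    frame = D @ z_next                        # what the user "sees" next

assert frame.shape == (PIXELS,) and len(history) == 6
```

Note the autoregressive structure: every predicted latent is appended to the history that conditions the next prediction, which is also why errors can accumulate over long rollouts.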
Below, you can step through a simulated version of this process. Pick an action at each step and watch the "world" generate the next frame. Each frame is generated autoregressively - it depends on every frame before it.
JEPA: Predicting Without Generating
Yann LeCun argues that predicting pixels is wasteful. Most pixel-level detail is irrelevant for understanding - the exact texture of a leaf does not matter for predicting that a ball will fall. JEPA (Joint Embedding Predictive Architecture) instead predicts in representation space.
Two encoders map observations into embeddings: a context encoder produces s_x = E_ctx(x) and a target encoder produces s_y = E_tgt(y). A predictor module then learns to predict the target embedding from the context embedding:

ŝ_y = P(s_x)
The loss minimizes the distance between the predicted and actual target embeddings. Crucially, predictions happen in an abstract space that discards irrelevant detail - the model learns what matters for prediction.
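The key point - that the loss lives in embedding space, never in pixel space - is easy to show with stand-in components. The linear maps below are illustrative placeholders for the trained encoders and predictor, not a real JEPA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in linear maps for the two encoders and the predictor; illustrative only.
DIM_OBS, DIM_EMB = 32, 8
ctx_encoder = rng.normal(size=(DIM_EMB, DIM_OBS))   # E_ctx
tgt_encoder = rng.normal(size=(DIM_EMB, DIM_OBS))   # E_tgt
predictor = rng.normal(size=(DIM_EMB, DIM_EMB))     # P

x = rng.normal(size=DIM_OBS)   # context observation (e.g. the current frame)
y = rng.normal(size=DIM_OBS)   # target observation (e.g. the next frame)

s_x = ctx_encoder @ x          # embed the context
s_y = tgt_encoder @ y          # embed the target
s_y_hat = predictor @ s_x      # predict the target embedding from the context

# The JEPA loss is a distance between embeddings - pixels never appear in it.
loss = np.sum((s_y_hat - s_y) ** 2)
assert s_y_hat.shape == s_y.shape == (DIM_EMB,)
assert loss >= 0.0
```

Because the encoders are trained jointly with the predictor, they are free to discard any detail of x and y that does not help reduce this loss - which is exactly how the model learns what matters.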
The visualization below shows this idea. Points in 2D represent embeddings. The context encoder maps an observation to a point. The predictor maps it to where it thinks the next observation's embedding will be. Drag the context point to explore the prediction landscape.
Why not just predict pixels? Consider predicting the next frame of a video showing a person at a crossroads. They might go left or right - both equally likely. A pixel-level model averaging these futures would produce a blurry ghost going in both directions. JEPA avoids this by predicting in abstract space where "left" and "right" are discrete representations, not overlapping pixel clouds.
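The blurry-ghost problem is concrete enough to compute. Below, two equally likely futures are crisp one-dimensional "frames"; a pixel-space model trained with a mean-squared-error loss is pushed toward their average. The setup is a deliberately simplified illustration.

```python
import numpy as np

# Two equally likely futures: the person went left, or the person went right.
# Each future is a crisp 1-D "frame" with a bright pixel at their position.
left = np.zeros(9);  left[1] = 1.0
right = np.zeros(9); right[7] = 1.0

# Minimising mean-squared error against both futures drives a pixel-space
# model toward their average:
pixel_prediction = (left + right) / 2

# The result is a half-intensity "ghost" in both places at once.
assert pixel_prediction[1] == 0.5 and pixel_prediction[7] == 0.5
assert pixel_prediction.max() < 1.0   # no crisp person anywhere

# An abstract representation can instead keep the two outcomes discrete,
# e.g. as separate modes of a distribution: uncertainty without blur.
p_futures = {"left": 0.5, "right": 0.5}
assert max(p_futures.values()) == 0.5
```

The averaging behaviour is a property of the loss, not of any particular architecture: any pixel-level model trained to minimise squared error on multimodal futures will blur them.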
Spatial Intelligence: Worlds in 3D
Fei-Fei Li's World Labs takes yet another angle: true understanding of the world requires spatial intelligence - grasping the 3D geometry, layout, and physical structure of scenes. Their model, Marble, generates persistent 3D environments that can be explored from any viewpoint.
This is fundamentally different from generating flat video frames. A spatially intelligent model knows that walking around a chair reveals its back, that objects have volume, that light comes from somewhere. It builds an internal 3D scene graph.
The demo below illustrates this. The same scene is shown from different viewpoints. A 2D model would need to "hallucinate" each view independently. A spatial model maintains a coherent 3D structure that you can orbit, revealing consistent geometry from any angle.
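What "maintaining a coherent 3D structure" buys you can be sketched directly: store the scene once as 3D points, and derive every view from it by rotation and projection. The scene and camera model below are assumptions for illustration (a pinhole camera with focal length 1), not World Labs' method.

```python
import numpy as np

# A spatial model stores one 3-D structure and derives every view from it.
# Toy scene: four 3-D points (say, the corners of a chair seat), stored once.
scene = np.array([
    [-0.5, 0.0, 2.0],
    [ 0.5, 0.0, 2.0],
    [ 0.5, 1.0, 2.0],
    [-0.5, 1.0, 2.0],
])

def view(points, theta):
    """Orbit the camera by theta (rotate about the vertical axis), then
    project to 2-D with an assumed pinhole camera of focal length 1."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    rotated = points @ R.T
    return rotated[:, :2] / rotated[:, 2:3]   # perspective divide by depth

front = view(scene, 0.0)
orbit = view(scene, 0.3)

# The two renders differ, because the viewpoint moved ...
assert not np.allclose(front, orbit)
# ... but both are projections of the same geometry: 3-D distances never change.
assert np.isclose(np.linalg.norm(scene[0] - scene[1]), 1.0)
```

A 2D generative model has no such invariant to lean on: each hallucinated view must independently re-infer geometry that a spatial model gets for free.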
Why It Matters
World models represent a fundamental shift in AI: from systems that process language to systems that understand physical reality. The implications span robotics (simulate before acting), gaming (infinite procedural worlds), autonomous vehicles (predict other drivers), and scientific discovery (simulate experiments).
The convergence is striking: three different philosophies, each attracting billions in funding within months of each other. The field has moved from academic curiosity to one of AI's most heavily invested frontiers.
Open Questions
World models are advancing rapidly, but fundamental challenges remain unsolved.
Long-horizon consistency
Genie 3 maintains visual coherence for a few minutes. But a useful world model for robotics or long-form planning needs to stay consistent for hours or days. Errors compound autoregressively - small drift becomes catastrophic over time.
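The compounding arithmetic is worth seeing. With an assumed 1% drift per predicted frame at 30 frames per second (both numbers are made up for illustration), fidelity decays geometrically:

```python
# Autoregressive rollouts feed each prediction back in as input, so a small
# per-step error compounds geometrically. Toy numbers for illustration only:
PER_STEP_ERROR = 0.01   # assume 1% drift per predicted frame
FPS = 30                # assume 30 predicted frames per second

def accumulated_error(seconds):
    """Fraction of the rollout lost to drift after the given duration."""
    steps = seconds * FPS
    return 1.0 - (1.0 - PER_STEP_ERROR) ** steps

# Even tiny per-frame drift dominates within seconds:
assert accumulated_error(1) > 0.25    # over a quarter gone after one second
assert accumulated_error(10) > 0.95   # almost everything gone after ten
```

This is why long-horizon consistency is not just an engineering detail: halving per-step error only roughly doubles the usable horizon, while useful robotic planning may need orders of magnitude more.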
Grounding in physics
Current models learn statistical correlations from video, not actual physics. They might learn that dropped objects fall, but fail on novel scenarios - a ball in zero gravity, an unfamiliar material. True physical reasoning versus pattern matching remains an open problem.
The representation question
Should world models predict pixels (Genie), embeddings (JEPA), or 3D geometry (World Labs)? Each has tradeoffs. Pixel prediction is grounded but wasteful. Embedding prediction is efficient but hard to interpret. 3D prediction is structured but makes strong assumptions about the world.
Perhaps the answer is all three, at different levels of a hierarchy - and indeed LeCun's Hierarchical JEPA (H-JEPA) proposes exactly this: different abstraction levels for different prediction horizons.
Evaluation
How do you measure the quality of a world model? FID scores for video quality? Task success rates for robotics? Physics prediction accuracy? The field lacks a standard benchmark, making it difficult to compare approaches directly.
The world model race is not just a technical competition - it reflects a deeper question about intelligence itself. Is understanding the world fundamentally about seeing (pixels), knowing (representations), or structuring (geometry)? The answer may reshape how we build AI for decades to come.