GRASP: Overcoming Optimization Barriers in Long-Horizon World Model Planning

Introduction

Large learned world models are rapidly becoming more powerful. They can predict extended sequences of future observations in complex visual spaces and generalize across diverse tasks in ways that seemed impossible just a few years ago. As these models scale, they increasingly resemble general-purpose simulators rather than task-specific predictors.

(Figure. Source: bair.berkeley.edu)

However, having a strong predictive model does not automatically translate into effective control or planning. In practice, long-horizon planning with modern world models remains fragile: optimization becomes ill-conditioned, non-greedy structures create stubborn local minima, and high-dimensional latent spaces introduce subtle failure modes. This article explores the challenges that motivated our project and presents GRASP, a new gradient-based planner designed to make long-horizon planning practical.

GRASP addresses these issues through three key innovations: (1) lifting the trajectory into virtual states to parallelize optimization across time, (2) adding stochasticity directly to state iterates for robust exploration, and (3) reshaping gradients so actions receive clean signals while avoiding brittle state-input gradients through high-dimensional vision models.

The Challenge of Long-Horizon Planning

Planning over many steps with learned dynamics models is inherently difficult. When the horizon grows, the optimization landscape becomes increasingly ill-conditioned. Gradients from early actions can vanish or explode as they propagate through many time steps, making it hard to discover good sequences of actions.
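To see how this ill-conditioning shows up in practice, here is a small sketch (using PyTorch, with a hypothetical contractive linear dynamics chosen purely for illustration) that rolls a state forward through many steps, backpropagates a terminal cost, and compares the gradient magnitude reaching early versus late actions:

```python
import torch

# Toy sketch: backprop through a sequential rollout and inspect the
# gradient magnitude reaching each action. The dynamics are a hypothetical
# contractive linear map, chosen only to show how gradients to early
# actions shrink as the horizon grows.
torch.manual_seed(0)
T, d = 50, 8                        # horizon, state dimension
A = 0.9 * torch.eye(d)              # contractive dynamics (|eigenvalues| < 1)
B = torch.randn(d, d) * 0.1         # action-to-state coupling

actions = torch.zeros(T, d, requires_grad=True)
s = torch.zeros(d)
for t in range(T):
    s = s @ A.T + actions[t] @ B.T  # step t depends on every earlier step
cost = (s - torch.ones(d)).pow(2).sum()  # terminal cost on the final state
cost.backward()

grad_norms = actions.grad.norm(dim=1)
print("grad norm, first action:", grad_norms[0].item())
print("grad norm, last action :", grad_norms[-1].item())
# With contractive dynamics the first action's gradient is orders of
# magnitude smaller; with expansive dynamics it explodes instead.
```

With a contraction factor of 0.9 over 50 steps, the signal reaching the first action is attenuated by roughly 0.9^49, which is exactly the kind of conditioning problem that makes naive sequential backpropagation through a rollout unreliable.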

Moreover, non-greedy plans—where actions taken now only pay off far in the future—create deep valleys and narrow basins in the objective function. Standard gradient-based optimizers can easily get trapped in local minima, especially when combined with high-dimensional latent representations that add noise and complexity.

These issues compound: a world model that predicts accurately for short horizons may still be unusable for planning over dozens or hundreds of steps. This is the real stress test for modern world models, and it demands new algorithmic thinking.

What Are World Models?

The term "world model" is used broadly today. It can refer to an explicit dynamics model or to the implicit internal state a generative model relies on (for instance, whether an LLM generating chess moves maintains an internal representation of the board). For our purposes, we adopt a loose working definition.

Suppose an agent takes actions a_t ∈ A and observes states s_t ∈ S (images, latent vectors, proprioception). A world model is a learned model that, given the current state and a sequence of future actions, predicts what will happen next. Formally, it defines a predictive distribution over the next observed state:

P_θ(s_{t+1} | s_{t−h:t}, a_t)

This distribution approximates the environment's true dynamics. World models are trained on experience, often using large datasets of agent interactions. Their power lies in their ability to imagine future outcomes, which is essential for planning and reinforcement learning.
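To make that interface concrete, here is a minimal hypothetical sketch in PyTorch: an encoder maps observations to a latent state, and a dynamics head predicts the next latent given a short history window and an action. None of these names or architectural choices come from the GRASP paper, and the sketch predicts a point estimate rather than a full distribution, purely to keep it short.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Minimal hypothetical world model: encode observations, predict the next latent."""

    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 64, history: int = 4):
        super().__init__()
        self.history = history
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        # Dynamics head consumes a flattened window of past latents plus one action.
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim * history + act_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def encode(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

    def predict_next(self, latent_history: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # latent_history: (batch, history, latent_dim), action: (batch, act_dim)
        flat = latent_history.flatten(start_dim=1)
        return self.dynamics(torch.cat([flat, action], dim=-1))

# Usage: predict one step ahead from a window of past observations.
model = LatentWorldModel(obs_dim=32, act_dim=4)
obs_window = torch.randn(1, 4, 32)          # s_{t-h:t}, here as flat feature vectors
latents = model.encode(obs_window)          # encode each observation in the window
next_latent = model.predict_next(latents, torch.randn(1, 4))
```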

GRASP's Three Key Innovations

GRASP builds on gradient-based trajectory optimization but adds three specific modifications that dramatically improve robustness for long horizons.

1. Virtual States for Parallel Optimization

Instead of optimizing a single trajectory sequentially, GRASP lifts the trajectory into a set of virtual states. These virtual states represent the system at different future times, but they are treated as independent optimization variables. The world model is then used to enforce consistency between consecutive virtual states via a penalty or constraint. This reformulation allows the optimization to be parallelized across time, breaking the sequential dependency that makes gradients ill-conditioned.
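A minimal sketch of this lifting, under my own assumptions about the formulation (the paper's exact constraint handling may differ): both the virtual states z_1..z_T and the actions become free optimization variables, the world model f only has to enforce one-step consistency z_{t+1} ≈ f(z_t, a_t), and all one-step penalties can be evaluated and differentiated in parallel rather than through a sequential rollout.

```python
import torch

def lifted_planning_loss(f, z0, virtual_states, actions, task_cost, rho=10.0):
    """Hypothetical lifted objective: task cost plus one-step consistency penalties.

    virtual_states: (T, d) free optimization variables z_1..z_T
    actions:        (T, act_dim) free optimization variables a_0..a_{T-1}
    f:              world model, f(z_t, a_t) -> predicted z_{t+1}
    """
    # Previous state for each step: z_0 followed by z_1..z_{T-1}.
    prev = torch.cat([z0.unsqueeze(0), virtual_states[:-1]], dim=0)
    # Every one-step prediction is independent of the others, so one batched
    # call replaces a sequential T-step rollout.
    pred = f(prev, actions)
    consistency = (virtual_states - pred).pow(2).sum()
    return task_cost(virtual_states, actions) + rho * consistency

# Usage with a toy linear world model and a toy goal-reaching cost.
d, act_dim, T = 8, 4, 30
A, B = 0.95 * torch.eye(d), torch.randn(d, act_dim) * 0.1
f = lambda z, a: z @ A.T + a @ B.T
goal = torch.ones(d)
task_cost = lambda zs, acts: (zs[-1] - goal).pow(2).sum() + 1e-3 * acts.pow(2).sum()

z0 = torch.zeros(d)
zs = torch.zeros(T, d, requires_grad=True)
acts = torch.zeros(T, act_dim, requires_grad=True)
opt = torch.optim.Adam([zs, acts], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = lifted_planning_loss(f, z0, zs, acts, task_cost)
    loss.backward()
    opt.step()
```

The key design point is that no gradient ever has to traverse more than one application of the world model, so the conditioning of the problem no longer degrades with horizon length.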

(Figure. Source: bair.berkeley.edu)

2. Stochasticity in State Iterates

To escape bad local minima and explore the space of trajectories, GRASP injects stochasticity directly into the state iterates. During optimization, each virtual state receives a small random perturbation. This noise helps the optimizer avoid shallow basins and find better long-horizon plans. The stochasticity is carefully controlled—it is reduced over time (like simulated annealing) to ensure convergence to a high-quality solution.
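Here is one way such a perturbation could look, assuming a simple exponential annealing schedule (the schedule below is an illustrative choice, not the one specified in the paper): at each optimizer step the virtual states are nudged with Gaussian noise whose scale decays toward zero, so early iterations explore while late iterations converge.

```python
import torch

def perturb_virtual_states(virtual_states, step, sigma0=0.1, decay=0.99):
    """Add annealed Gaussian noise to the state iterates (hypothetical schedule).

    sigma0: initial noise scale; decay: multiplicative decay per optimizer step.
    The noise is applied to the iterates themselves, not to the gradients, so
    the optimizer explores nearby trajectories before the consistency penalty
    pulls it back onto a dynamically feasible plan.
    """
    sigma = sigma0 * (decay ** step)
    with torch.no_grad():
        virtual_states.add_(sigma * torch.randn_like(virtual_states))
    return sigma

# Inside a planning loop like the lifted objective above, one might call
#   sigma = perturb_virtual_states(zs, step)
# just before computing the loss, and monitor sigma to confirm it anneals.
```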

3. Gradient Reshaping for Clean Action Signals

A major challenge in planning with high-dimensional visual world models is that gradients backpropagated through the perception network can be noisy or brittle. GRASP reshapes the gradients so that the actions receive clean, informative signals. Instead of relying on full state-input gradients through the vision model, it uses an approximation that decouples the action optimization from the perception path. This reduces variance and makes the planning process much more stable, especially when the world model operates in latent spaces.
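One way to read this decoupling, sketched below under my own interpretation (GRASP's exact reshaping rule may differ), is to give states and actions different gradient paths: the virtual states absorb the task-cost gradient that flows through the vision or cost network, while the actions receive gradients only through the one-step dynamics model, never through the perception path.

```python
import torch

def reshaped_grads(f, vision_cost, z0, zs, acts, rho=10.0):
    """Hypothetical gradient-reshaping step for the lifted formulation.

    zs, acts: leaf tensors with requires_grad=True (virtual states and actions).
    vision_cost: task cost evaluated through a (possibly brittle) vision model.
    Actions get gradients only through the dynamics model f; virtual states
    absorb the gradient that flows through the vision/cost network.
    """
    prev = torch.cat([z0.unsqueeze(0), zs[:-1]], dim=0)

    # (a) State update signal: task cost plus consistency, full gradient path.
    state_loss = vision_cost(zs) + rho * (zs - f(prev, acts)).pow(2).sum()
    grad_z, = torch.autograd.grad(state_loss, zs)

    # (b) Action update signal: consistency only, with states detached, so the
    #     gradient reaches the actions purely through f and never touches
    #     input-gradients of the vision model.
    action_loss = (zs.detach() - f(prev.detach(), acts)).pow(2).sum()
    grad_a, = torch.autograd.grad(action_loss, acts)

    return grad_z, grad_a
```

The two gradients can then be applied with separate optimizers or step sizes; the point of the sketch is only that the action update never depends on differentiating the vision model with respect to its state inputs.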

How GRASP Improves Robustness

The combination of these three innovations leads to a planner that works reliably even for very long horizons. In our experiments, GRASP consistently finds good plans where standard gradient-based methods fail. It achieves lower cumulative cost, fewer constraint violations, and better exploration of alternative strategies.

Importantly, GRASP is agnostic to the specific architecture of the world model. It can be applied to both recurrent and transformer-based predictors, and it scales gracefully with horizon length. This makes it a practical tool for using large learned models in downstream control tasks.

Conclusion

Planning with powerful world models is a critical capability for autonomous agents. However, long horizons expose fundamental weaknesses in naive gradient-based approaches. GRASP addresses these weaknesses through a principled combination of virtual states, stochastic optimization, and gradient reshaping. The result is a robust planner that unlocks the potential of modern world models for complex, multi-step decision-making.

This project—done in collaboration with Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar—shows that gradient-based planning can be made practical for long horizons with the right design choices. For more details, please see the full paper and code release.
