{"id":14076,"date":"2026-04-24T02:18:13","date_gmt":"2026-04-24T02:18:13","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=14076"},"modified":"2026-04-24T02:18:14","modified_gmt":"2026-04-24T02:18:14","slug":"gradient-based-planning-for-world-fashions-at-longer-horizons-the-berkeley-synthetic-intelligence-analysis-weblog","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=14076","title":{"rendered":"Gradient-based Planning for World Fashions at Longer Horizons \u2013 The Berkeley Synthetic Intelligence Analysis Weblog"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div id=\"\">\n  <br \/>\n<meta name=\"twitter:title\" content=\"Gradient-based Planning for World Models at Longer Horizons\"\/><\/p>\n<p><meta name=\"twitter:card\" content=\"summary_large_image\"\/><\/p>\n<p><meta name=\"twitter:image\" content=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/pusht_zoomout.gif\"\/><\/p>\n<p><meta name=\"keywords\" content=\"world models, planning, adversarial robustness, gradient-based optimization\"\/><\/p>\n<p><meta name=\"description\" content=\"GRASP is a new gradient-based planner for learned dynamics (a world model) that makes long-horizon planning practical by lifting trajectories into virtual states, stochastic state iterates for exploration, and gradient reshaping so action signals stay clean.\"\/><\/p>\n<p><meta name=\"author\" content=\"Michael Psenka, Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, Amir Bar\"\/><\/p>\n<div style=\"display: flex; flex-direction: column; align-items: center; gap: 1em; margin-bottom: 1.5em;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/ballnav_demo.gif\" alt=\"BallNav demo\" style=\"max-width: 60%;\"\/><br \/>\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/pusht_zoomout.gif\" alt=\"Push-T demo\" style=\"max-width: 90%;\"\/>\n<\/div>\n<p><strong>GRASP<\/strong> is a brand new gradient-based planner for realized dynamics (a 
\u201cworld model\u201d) that makes long-horizon planning practical by (1) lifting the trajectory into virtual states so optimization is parallel across time, (2) adding stochasticity directly to the state iterates for exploration, and (3) reshaping gradients so actions get clean signals while we avoid brittle \u201cstate-input\u201d gradients through high-dimensional vision models.<\/p>\n<p><\/p>\n<p>Large, learned world models are becoming increasingly capable. They can predict long sequences of future observations in high-dimensional visual spaces and generalize across tasks in ways that were difficult to imagine a few years ago. As these models scale, they start to look less like task-specific predictors and more like general-purpose simulators.<\/p>\n<p>But having a powerful predictive model is not the same as being able to use it effectively for control\/learning\/planning.
In practice, long-horizon planning with modern world models remains fragile: optimization becomes ill-conditioned, non-greedy structure creates bad local minima, and high-dimensional latent spaces introduce subtle failure modes.<\/p>\n<p>In this blog post, I describe the problems that motivated this project and our approach to addressing them: why planning with modern world models can be surprisingly fragile, why long horizons are the real stress test, and what we changed to make gradient-based planning far more robust.<\/p>\n<hr\/>\n<blockquote>\n<p>This blog post discusses work done with Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar (* denotes equal advisorship), where we propose GRASP.<\/p>\n<\/blockquote>\n<hr\/>\n<h2 id=\"what-is-a-world-model\">What&#8217;s a world model?<\/h2>\n<p>These days, the term \u201cworld model\u201d is quite overloaded, and depending on the context can either mean an explicit dynamics model or some implicit, reliable internal state that a generative model relies on (e.g. when an LLM generates chess moves, whether there is some internal representation of the board). We give our loose working definition below.<\/p>\n<p>Suppose you take actions $a_t \in \mathcal{A}$ and observe states $s_t \in \mathcal{S}$ (images, latent vectors, proprioception). A <strong>world model<\/strong> is a learned model that, given the current state and a sequence of future actions, predicts what will happen next. Formally, it defines a predictive distribution conditioned on a sequence of observed states $s_{t-h:t}$ and the current action $a_t$:<\/p>\n<p>\[P_\theta(s_{t+1} \mid s_{t-h:t},\; a_t)\]<\/p>\n<p>that approximates the environment\u2019s true conditional $P(s_{t+1} \mid s_{t-h:t},\; a_t)$.
For this blog post, we\u2019ll assume a Markovian model $P_\theta(s_{t+1} \mid s_t,\; a_t)$ for simplicity (all results here can be extended to the more general case), and when the model is deterministic it reduces to a map over states:<\/p>\n<p>\[s_{t+1} = F_\theta(s_t, a_t).\]<\/p>\n<p>In practice the state $s_t$ is often a learned latent representation (e.g., encoded from pixels), so the model operates in a (theoretically) compact, differentiable space. The key point is that a world model gives you a <em>differentiable simulator<\/em>; you can roll it forward under hypothetical action sequences and backpropagate through the predictions.<\/p>\n<hr\/>\n<h2 id=\"planning-choosing-actions-by-optimizing-through-the-model\">Planning: choosing actions by optimizing through the model<\/h2>\n<p>Given a start $s_0$ and a goal $g$, the simplest planner chooses an action sequence $\mathbf{a}=(a_0,\dots,a_{T-1})$ by rolling out the model and minimizing terminal error:<\/p>\n<p>\[\min_{\mathbf{a}} \; \| s_T(\mathbf{a}) - g \|_2^2, \quad \text{where } s_T(\mathbf{a}) = \mathcal{F}_{\theta}^{T}(s_0,\mathbf{a}).\]<\/p>\n<p>Here we use $\mathcal{F}^T$ as shorthand for the full rollout through the world model (dependence on model parameters $\theta$ is implicit):<\/p>\n<p>\[\mathcal{F}_{\theta}^{T}(s_0, \mathbf{a}) = F_\theta(F_\theta(\cdots F_\theta(s_0, a_0), \cdots, a_{T-2}), a_{T-1}).\]<\/p>\n<p>At short horizons and in low-dimensional systems, this can work reasonably well.
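<p>To make the rollout ("shooting") formulation concrete, here is a minimal, self-contained sketch, not the paper's code: a hypothetical scalar linear model $F(s, a) = A s + a$ stands in for a learned $F_\theta$, and the horizon, learning rate, and dynamics coefficient are all illustrative assumptions.<\/p>

```python
# Hypothetical stand-in for a learned world model: scalar linear dynamics.
# A "shooting" planner: roll the model out, then gradient-descend on actions.
A = 0.9  # state transition coefficient (assumed, illustrative)

def F(s, a):
    """One-step world model F(s, a) -> next state."""
    return A * s + a

def rollout(s0, actions):
    """Full rollout F^T(s0, a): the model applied to its own output."""
    s = s0
    for a in actions:
        s = F(s, a)
    return s

def plan(s0, goal, T=10, steps=200, lr=0.05):
    """Minimize ||s_T(a) - g||^2 over the action sequence.

    For this linear model the chain rule gives d s_T / d a_t = A**(T-1-t):
    exactly the Jacobian product through the rollout, whose magnitude
    scales exponentially with the horizon."""
    actions = [0.0] * T
    for _ in range(steps):
        err = 2.0 * (rollout(s0, actions) - goal)
        for t in range(T):
            actions[t] -= lr * err * A ** (T - 1 - t)
    return actions

actions = plan(s0=0.0, goal=5.0)
print(abs(rollout(0.0, actions) - 5.0) < 1e-6)  # the plan reaches the goal
```

For this short horizon and benign model the optimization converges cleanly; the rest of the post is about why this stops being true at scale.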
But as horizons grow and models become larger and more expressive, its weaknesses become amplified.<\/p>\n<p>So why doesn\u2019t this just work at scale?<\/p>\n<hr\/>\n<h2 id=\"why-long-horizon-planning-is-hard-even-when-everything-is-differentiable\">Why long-horizon planning is hard (even when everything is differentiable)<\/h2>\n<p>There are two separate pain points for the more general world model, plus a third that&#8217;s specific to learned, deep learning-based models.<\/p>\n<h3 id=\"1-long-horizon-rollouts-create-deep-ill-conditioned-computation-graphs\">1) Long-horizon rollouts create deep, ill-conditioned computation graphs<\/h3>\n<p>Those familiar with backprop through time (BPTT) may notice that we\u2019re differentiating through a model applied to itself repeatedly, which will lead to the <strong>exploding\/vanishing gradients<\/strong> problem. Specifically, if we take derivatives (note we\u2019re differentiating vector-valued functions, resulting in Jacobians that we denote with $D_x (\cdots)$) with respect to earlier actions (e.g. $a_0$):<\/p>\n<p>\[D_{a_0} \mathcal{F}_{\theta}^{T}(s_0, \mathbf{a}) = \Bigl(\prod_{t=1}^T D_s F_\theta(s_t, a_t)\Bigr) D_{a_0}F_\theta(s_0, a_0).\]<\/p>\n<p>We see that the Jacobian\u2019s conditioning scales exponentially with time $T$:<\/p>\n<p>\[\sigma_{\text{max\/min}}(D_{a_0}\mathcal{F}_{\theta}^{T}) \sim \sigma_{\text{max\/min}}(D_s F_\theta)^{T-1},\]<\/p>\n<p>leading to exploding or vanishing gradients.<\/p>\n<h3 id=\"2-the-landscape-is-non-greedy-and-full-of-traps\">2) The landscape is non-greedy and full of traps<\/h3>\n<p>At short horizons, the greedy solution, where we move straight toward the goal at every step, is often sufficient. If you only need to plan a few steps ahead, the optimal trajectory usually doesn\u2019t deviate much from \u201chead toward $g$\u201d at each step.<\/p>\n<p>As horizons grow, two things happen.
First, longer tasks are more likely to require <em>non-greedy<\/em> behavior: going around a wall, repositioning before pushing, backing up to take a better path. And as horizons grow, more of these non-greedy steps are typically needed. Second, the optimization space itself scales with the horizon: $\mathrm{dim}(\mathcal{A} \times \cdots \times \mathcal{A}) = T\,\mathrm{dim}(\mathcal{A})$, further expanding the space of local minima for the optimization problem.<\/p>\n<figure style=\"text-align: center;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/loss-landscape.jpg\" alt=\"Loss landscape\" style=\"max-width: 80%;\"\/><figcaption><em>Distance to the goal along the optimal path is non-monotonic, and the resulting loss landscape can be rough.<\/em><\/figcaption><\/figure>\n<hr\/>\n<h2 id=\"a-long-horizon-fix-lifting-the-dynamics-constraint\">A long-horizon fix: lifting the dynamics constraint<\/h2>\n<p>Suppose we treat the dynamics constraint $s_{t+1} = F_{\theta}(s_t, a_t)$ as a soft constraint, and instead optimize the following penalty function over both actions $(a_0,\ldots,a_{T-1})$ and states $(s_0,\ldots,s_T)$:<\/p>\n<p>\[\min_{\mathbf{s},\mathbf{a}} \mathcal{L}(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2,<br \/>\n\quad \text{with } s_0 \text{ fixed and } s_T=g.\]<\/p>\n<p>This is also commonly known as <em>collocation<\/em> in the planning\/robotics literature. Note the lifted formulation shares the same <em>global<\/em> minimizers as the original rollout objective (both are zero exactly when the trajectory is dynamically feasible).
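<p>As a toy illustration of the lifted objective (again on a hypothetical scalar linear model, not the paper's implementation), we can optimize virtual states and actions jointly; note that every penalty term touches only the neighboring variables $(s_t, a_t, s_{t+1})$.<\/p>

```python
# Toy collocation: lift states to free variables, penalize dynamics violation.
# The scalar linear model F(s, a) = A*s + a is an assumed stand-in for F_theta.
A = 0.9

def collocation_plan(s0, goal, T=10, steps=500, lr=0.05):
    """Minimize sum_t (F(s_t, a_t) - s_{t+1})^2 over states AND actions,
    with s_0 and s_T = g pinned. Each residual depends only on
    (s_t, a_t, s_{t+1}), so all T terms could be evaluated in parallel."""
    s = [s0 + (goal - s0) * t / T for t in range(T + 1)]  # straight-line init
    a = [0.0] * T
    for _ in range(steps):
        r = [A * s[t] + a[t] - s[t + 1] for t in range(T)]  # local residuals
        for t in range(T):
            a[t] -= lr * 2.0 * r[t]                   # d r_t / d a_t = 1
        for t in range(1, T):                         # endpoints stay pinned
            s[t] -= lr * (2.0 * A * r[t] - 2.0 * r[t - 1])
    return s, a

s, a = collocation_plan(0.0, 5.0)
# Feasibility check: re-rolling the actions through the model reaches the goal.
x = 0.0
for t in range(10):
    x = A * x + a[t]
print(abs(x - 5.0) < 1e-6)
```

Because each term is local, the Jacobian product from the serial rollout never appears; each update uses only single-step gradients.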
But the optimization landscapes are very different, and we get two immediate benefits:<\/p>\n<ul>\n<li>Each world model evaluation $F_{\theta}(s_t,a_t)$ depends only on local variables, so all $T$ terms can be computed <em>in parallel across time<\/em>, resulting in a large speed-up for longer horizons, and<\/li>\n<li>You no longer backpropagate through a single deep $T$-step composition to get a learning signal, as the earlier product of Jacobians now splits into a sum of local terms, e.g.:<\/li>\n<\/ul>\n<p>\[D_{a_0} \mathcal{L} = 2\bigl(F_\theta(s_0, a_0) - s_1\bigr)^{\top} D_{a_0} F_\theta(s_0, a_0).\]<\/p>\n<p>Being able to optimize states directly also helps with exploration, as we can temporarily navigate through unphysical domains to find the optimal plan:<\/p>\n<figure style=\"text-align: center;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/ballnav_demo.gif\" alt=\"Collocation planning in BallNav\" style=\"max-width: 60%;\"\/><figcaption><em>Collocation-based planning lets us directly perturb states and explore midpoints more effectively.<\/em><\/figcaption><\/figure>\n<p>However, lunch isn&#8217;t free. And indeed, especially for deep learning-based world models, there is a critical issue that makes the above optimization quite difficult in practice.<\/p>\n<h2 id=\"an-issue-for-deep-learning-based-world-models-sensitivity-of-state-input-gradients\">An issue for deep learning-based world models: sensitivity of state-input gradients<\/h2>\n<p>The <strong>tl;dr<\/strong> of this section is: directly optimizing states through a deep learning-based $F_{\theta}$ is highly brittle, \u00e0 la <em>adversarial robustness<\/em>.
Even if you train your world model in a lower-dimensional state space, the training process for the world model makes unseen state landscapes very sharp, whether it is an unseen state itself or simply a normal\/orthogonal direction to the data manifold.<\/p>\n<h3 id=\"adversarial-robustness-and-the-dimpled-manifold-model\">Adversarial robustness and the \u201cdimpled manifold\u201d model<\/h3>\n<p>Adversarial robustness originally looked at classification models $f_\theta : \mathbb{R}^{w\times h \times c} \to \mathbb{R}^K$, and showed that by following the gradient of a particular logit $\nabla f_\theta^k$ from a base image $x$ (not of class $k$), you didn&#8217;t have to move far along $x' = x + \epsilon\nabla f_\theta^k$ to make $f_\theta$ classify $x'$ as $k$ (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1312.6199\">Szegedy et al., 2014<\/a>; <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1412.6572\">Goodfellow et al., 2015<\/a>):<\/p>\n<figure style=\"text-align: center;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/adversarial_animated.gif\" alt=\"Adversarial example\" style=\"max-width: 70%;\"\/><figcaption><em>Depiction of the classic example from (Goodfellow et al., 2015).<\/em><\/figcaption><\/figure>\n<p>Later work has painted a geometric picture of what\u2019s happening: for data near a low-dimensional manifold $\mathcal{M}$, the training process controls behavior in tangential directions, but doesn&#8217;t regularize behavior in orthogonal directions, thus leading to sensitive behavior (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1812.00740\">Stutz et al., 2019<\/a>).
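<p>The classic fast-gradient-sign construction can be sketched on a toy linear scorer; the weights, input, and budget below are made-up numbers chosen to mirror the many-small-features argument from (Goodfellow et al., 2015), not a real trained model.<\/p>

```python
# Toy version of the fast-gradient-sign construction on a linear scorer.
# The point: many tiny per-coordinate changes aligned against the gradient
# add up linearly, flipping the output while each coordinate barely moves.
import math

d = 100                 # input dimension (e.g. a flattened image)
w = [0.2] * d           # hypothetical trained weights: many weak features
x = [0.05] * d          # clean input, confidently scored positive

def score(v):
    return sum(wi * vi for wi, vi in zip(w, v))

eps = 0.06              # per-coordinate perturbation budget (tiny)
x_adv = [vi - eps * math.copysign(1.0, wi) for wi, vi in zip(w, x)]

print(round(score(x), 6))      # 100 * 0.2 * 0.05 = 1.0   (positive)
print(round(score(x_adv), 6))  # 1.0 - 100 * 0.2 * 0.06 = -0.2  (flipped)
```

No single coordinate moved by more than 0.06, yet the score crossed zero; this is the linear mechanism behind the sharp off-manifold behavior discussed above.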
Stated another way: $f_\theta$ has a reasonable Lipschitz constant when considering only tangential directions to the data manifold $\mathcal{M}$, but can have very high Lipschitz constants in normal directions. In fact, it often benefits the model to be sharper in these normal directions, so it can fit more complicated functions more precisely.<\/p>\n<figure style=\"text-align: center;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/manifold_adversarial.gif\" alt=\"Adversarial perturbations leave the data manifold\" style=\"max-width: 70%;\"\/><br \/>\n<\/figure>\n<p>As a result, such adversarial examples are extremely common, even for a single given model. Further, this isn&#8217;t just a computer vision phenomenon; adversarial examples also appear in LLMs (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1908.07125\">Wallace et al., 2019<\/a>) and in RL (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1905.10615\">Gleave et al., 2019<\/a>).<\/p>\n<p>While there are methods to train more adversarially robust models, there is a known trade-off between model performance and adversarial robustness (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1805.12152\">Tsipras et al., 2019<\/a>): especially in the presence of many weakly-correlated variables, the model <em>must<\/em> be sharper to achieve higher performance. Indeed, most modern training algorithms, whether in computer vision or LLMs, don&#8217;t train adversarial robustness out.
Thus, at least until deep learning sees a major regime change, <strong>this is a problem we\u2019re stuck with<\/strong>.<\/p>\n<h3 id=\"why-is-adversarial-robustness-an-issue-for-world-model-planning\">Why is adversarial robustness an issue for world model planning?<\/h3>\n<p>Consider a single component of the dynamics loss we\u2019re optimizing in the lifted-state approach:<\/p>\n<p>\[\min_{s_t, a_t, s_{t+1}} \|F_\theta(s_t, a_t) - s_{t+1}\|_2^2\]<\/p>\n<p>Let\u2019s further focus on just the base state:<\/p>\n<p>\[\min_{s_t} \|F_\theta(s_t, a_t) - s_{t+1}\|_2^2.\]<\/p>\n<p>Since world models are typically trained on state\/action trajectories $(s_1, a_1, s_2, a_2, \ldots)$, the state-data manifold for $F_{\theta}$ has dimensionality bounded by the action space:<\/p>\n<p>\[\mathrm{dim}(\mathcal{M}_s) \le \mathrm{dim}(\mathcal{A}) + 1 + \mathrm{dim}(\mathcal{R}),\]<\/p>\n<p>where $\mathcal{R}$ is some optional space of augmentations (e.g. translations\/rotations).
Thus, we can typically expect $\mathrm{dim}(\mathcal{M}_s)$ to be much lower than $\mathrm{dim}(\mathcal{S})$, and thus: <strong>it is very easy to find adversarial examples that hack any state to any other desired state.<\/strong><\/p>\n<p>As a result, the dynamics optimization<\/p>\n<p>\[\sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2\]<\/p>\n<p>feels extremely \u201csticky,\u201d as the base points $s_t$ can easily trick $F_{\theta}$ into thinking it\u2019s already reached its local goal.<sup><a rel=\"nofollow\" target=\"_blank\" href=\"#fn1\" id=\"ref1\">1<\/a><\/sup><\/p>\n<figure style=\"text-align: center;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/pusht_adversarial.gif\" alt=\"Adversarial world model example\" style=\"max-width: 70%;\"\/><br \/>\n<\/figure>\n<hr\/>\n<div id=\"fn1\" style=\"font-size: 0.88em; margin: 0.75em 0; padding-left: 1em; border-left: 3px solid #ddd; color: #5f5f5f;\">\n<p><strong>1.<\/strong> This adversarial robustness issue, while particularly bad for lifted-state approaches, is not unique to them. Even for serial optimization methods that optimize through the full rollout map $\mathcal{F}^T$, it&#8217;s possible to get into unseen states, where it is very easy to have a normal component fed into the sensitive normal components of $D_s F_{\theta}$. The action Jacobian&#8217;s chain rule expansion is<\/p>\n<p>\[\Bigl(\prod_{t=1}^T D_s F_\theta(s_t, a_t)\Bigr) D_{a_0}F_\theta(s_0, a_0).\]<\/p>\n<p>See what happens if any stage of the product has any component normal to the data manifold. <a rel=\"nofollow\" target=\"_blank\" href=\"#ref1\" style=\"color: #4d6b92;\">\u21a9<\/a><\/p>\n<\/div>\n<hr\/>\n<h3 id=\"our-fix\">Our fix<\/h3>\n<p>This is where our new planner GRASP comes in.
The main observation: while $D_s F_{\theta}$ is untrustworthy and adversarial, the action space is usually low-dimensional and exhaustively trained, so $D_a F_{\theta}$ is actually reasonable to optimize through and doesn\u2019t suffer from the adversarial robustness issue!<\/p>\n<figure style=\"text-align: center;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/network_diagram.jpg\" alt=\"Network diagram showing high-dim state vs low-dim action\" style=\"max-width: 65%;\"\/><figcaption><em>The action input is usually lower-dimensional and densely trained (the model has seen every action direction), so action gradients are much better behaved.<\/em><\/figcaption><\/figure>\n<p>At its core, <strong>GRASP builds a first-order lifted state \/ collocation-based planner that depends only on action Jacobians through the world model.<\/strong> We thus exploit the differentiability of learned world models $F_{\theta}$, while not falling victim to the inherent sensitivity of the state Jacobians $D_s F_{\theta}$.<\/p>\n<h2 id=\"grasp-gradient-relaxed-stochastic-planner\">GRASP: Gradient <strong>RelAxed<\/strong> <strong>S<\/strong>tochastic <strong>P<\/strong>lanner<\/h2>\n<p>As noted before, we start with the collocation planning objective, where we lift the states and relax the dynamics into a penalty:<\/p>\n<p>\[\min_{\mathbf{s},\mathbf{a}} \mathcal{L}(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2,<br \/>\n\quad \text{with } s_0 \text{ fixed and } s_T=g.\]<\/p>\n<p>We then make two key additions.<\/p>\n<h2 id=\"ingredient-1-exploration-by-noising-the-state-iterates\">Ingredient 1: Exploration by noising the <strong>state iterates<\/strong><\/h2>\n<p>Even with a smoother objective, planning is nonconvex.
We introduce exploration by injecting Gaussian noise into the <strong>virtual state updates<\/strong> during optimization.<\/p>\n<p>A simple version:<\/p>\n<p>\[s_t \leftarrow s_t - \eta_s \nabla_{s_t}\mathcal{L} + \sigma_{\text{state}} \xi, \qquad \xi\sim\mathcal{N}(0,I).\]<\/p>\n<p>Actions are still updated by non-stochastic descent:<\/p>\n<p>\[a_t \leftarrow a_t - \eta_a \nabla_{a_t}\mathcal{L}.\]<\/p>\n<p>The state noise helps you \u201chop\u201d between basins in the lifted space, while the actions remain guided by gradients. We found that specifically noising the states here (as opposed to the actions) strikes a balance between exploration and the ability to find sharper minima.<sup><a rel=\"nofollow\" target=\"_blank\" href=\"#fn2\" id=\"ref2\">2<\/a><\/sup><\/p>\n<hr\/>\n<div id=\"fn2\" style=\"font-size: 0.88em; margin: 0.75em 0; padding-left: 1em; border-left: 3px solid #ddd; color: #5f5f5f;\">\n<p><strong>2.<\/strong> Because we only noise the states (and not the actions), the corresponding dynamics are not truly Langevin dynamics. <a rel=\"nofollow\" target=\"_blank\" href=\"#ref2\" style=\"color: #4d6b92;\">\u21a9<\/a><\/p>\n<\/div>\n<hr\/>\n<h2 id=\"ingredient-2-reshape-gradients-stop-brittle-state-input-gradients-keep-action-gradients\">Ingredient 2: Reshape gradients: stop brittle state-input gradients, keep action gradients<\/h2>\n<p>As discussed, the sensitive pathway is the gradient that flows <em>into the state input<\/em> of the world model, <span>\(D_s F_{\theta}\)<\/span>.
The most straightforward way to do this initially is to simply stop state gradients into <span>\(F_{\theta}\)<\/span> directly:<\/p>\n<ul>\n<li>Let $\bar{s}_t$ be the same value as $s_t$, but with gradients stopped.<\/li>\n<\/ul>\n<p>Define the <strong>stop-gradient dynamics loss<\/strong>:<\/p>\n<p>\[\mathcal{L}_{\text{dyn}}^{\text{sg}}(\mathbf{s},\mathbf{a})<br \/>\n= \sum_{t=0}^{T-1} \big\|F_\theta(\bar{s}_t, a_t) - s_{t+1}\big\|_2^2.\]<\/p>\n<p>This alone doesn&#8217;t work. Notice that states now only track the previous state\u2019s prediction, with nothing forcing the base states to chase the next ones. As a result, there are trivial minima where the trajectory simply stops at the origin, with only the final action attempting to reach the goal in a single step.<\/p>\n<h3 id=\"dense-goal-shaping\">Dense goal shaping<\/h3>\n<p>We can view the above issue as the goal\u2019s signal being cut off entirely from earlier states. One way to fix this is to simply add a dense goal term throughout the prediction:<\/p>\n<p>\[\mathcal{L}_{\text{goal}}^{\text{sg}}(\mathbf{s},\mathbf{a})<br \/>\n= \sum_{t=0}^{T-1} \big\|F_\theta(\bar{s}_t, a_t) - g\big\|_2^2.\]<\/p>\n<p>In normal settings this would over-bias towards the greedy solution of directly chasing the goal, but this is balanced in our setting by the stop-gradient dynamics loss\u2019s bias towards feasible dynamics.
The final objective is then as follows:<\/p>\n<p>\[\mathcal{L}(\mathbf{s},\mathbf{a}) = \mathcal{L}_{\text{dyn}}^{\text{sg}}(\mathbf{s},\mathbf{a}) + \gamma \, \mathcal{L}_{\text{goal}}^{\text{sg}}(\mathbf{s},\mathbf{a}).\]<\/p>\n<p>The result is a planning optimization objective with no dependence on state-input gradients.<\/p>\n<hr\/>\n<h2 id=\"periodic-sync-briefly-return-to-true-rollout-gradients\">Periodic \u201csync\u201d: briefly return to true rollout gradients<\/h2>\n<p>The lifted stop-gradient objective is great for <strong>fast, guided exploration<\/strong>, but it\u2019s still an approximation of the original serial rollout objective.<\/p>\n<p>So every $K_{\text{sync}}$ iterations, GRASP does a short refinement phase:<\/p>\n<ol>\n<li>Roll out from $s_0$ using the current actions $\mathbf{a}$, and take a few small gradient steps on the original serial loss:<\/li>\n<\/ol>\n<p>\[\mathbf{a} \leftarrow \mathbf{a} - \eta_{\text{sync}}\,\nabla_{\mathbf{a}}\,\|s_T(\mathbf{a})-g\|_2^2.\]<\/p>\n<p>The lifted-state optimization still provides the core of the optimization, while this refinement step adds some support to keep states and actions grounded towards real trajectories. This refinement step can of course be replaced with a serial planner of your choice (e.g. CEM); the core idea is to still get some of the benefit of the full-path synchronization of serial planners, while largely retaining the benefits of lifted-state planning.<\/p>\n<hr\/>\n<h2 id=\"how-grasp-addresses-long-range-planning\">How GRASP addresses long-range planning<\/h2>\n<p>Collocation-based planners offer a natural fix for long-horizon planning, but this optimization is quite difficult through modern world models due to adversarial robustness issues. <em>GRASP proposes a simple solution for a smoother collocation-based planner, alongside stable stochasticity for exploration<\/em>.
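<p>Putting the pieces together, here is a toy sketch of a GRASP-style inner loop (stop-gradient dynamics loss, dense goal term, noised state iterates, periodic sync) on the same kind of hypothetical scalar linear model used for illustration; all hyperparameters (gamma, noise scale, sync schedule) are made-up values, not the paper's.<\/p>

```python
# Toy GRASP-style inner loop on an assumed scalar linear world model
# F(s, a) = A*s + a. The stop-gradient is written by hand: state iterates
# only receive gradient as *targets*, never through the state input.
import random
random.seed(0)

A, T, s0, goal = 0.9, 10, 0.0, 5.0
gamma = 0.1                       # weight on the dense goal term (illustrative)

s = [s0 + (goal - s0) * t / T for t in range(T + 1)]  # virtual states, ends pinned
a = [0.0] * T

def sync(a, steps=50, lr=0.05):
    """Brief refinement on the true serial rollout loss ||s_T(a) - g||^2."""
    for _ in range(steps):
        x = s0
        for t in range(T):
            x = A * x + a[t]
        err = 2.0 * (x - goal)
        for t in range(T):
            a[t] -= lr * err * A ** (T - 1 - t)

n_iters = 300
for i in range(n_iters):
    r = [A * s[t] + a[t] - s[t + 1] for t in range(T)]  # dynamics residuals
    q = [A * s[t] + a[t] - goal for t in range(T)]      # dense goal residuals
    sigma = 0.05 * (1.0 - i / n_iters)                  # annealed exploration noise
    for t in range(T):
        # actions: trusted action-input gradients from both loss terms
        a[t] -= 0.05 * (2.0 * r[t] + 2.0 * gamma * q[t])
    for t in range(1, T):
        # states: appear only as targets (state-input path stopped), plus noise
        s[t] -= 0.05 * (-2.0 * r[t - 1]) + sigma * random.gauss(0.0, 1.0)
    if (i + 1) % 100 == 0:
        sync(a)                                         # periodic return to rollout gradients

x = s0
for t in range(T):
    x = A * x + a[t]
print(abs(x - goal) < 1e-3)  # the refined plan reaches the goal
```

In a real latent-space world model the "by-hand" stop-gradient would instead detach the state input before the forward pass, and the updates would run over batched tensors rather than Python floats.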
As a result, longer-horizon planning ends up not only succeeding more often, but also finding those successes faster:<\/p>\n<figure style=\"text-align: center; margin: 1.25em 0;\">\n  <img decoding=\"async\" src=\"https:\/\/bair.berkeley.edu\/static\/blog\/grasp\/pusht_zoomout.gif\" alt=\"Push-T planning demo\" style=\"max-width: 90%; height: auto;\"\/><figcaption style=\"font-size: 0.95em; margin-top: 0.5em;\"><em>Push-T demo: longer-horizon planning with GRASP.<\/em><\/figcaption><\/figure>\n<div class=\"grasp-results-table\" style=\"overflow-x: auto; margin: 1em 0;\">\n<table>\n<thead>\n<tr>\n<th>Horizon<\/th>\n<th>CEM<\/th>\n<th>GD<\/th>\n<th>LatCo<\/th>\n<th><strong>GRASP<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>H=40<\/td>\n<td><strong>61.4%<\/strong> \/ 35.3s<\/td>\n<td>51.0% \/ 18.0s<\/td>\n<td>15.0% \/ 598.0s<\/td>\n<td>59.0% \/ <strong>8.5s<\/strong><\/td>\n<\/tr>\n<tr>\n<td>H=50<\/td>\n<td>30.2% \/ 96.2s<\/td>\n<td>37.6% \/ 76.3s<\/td>\n<td>4.2% \/ 1114.7s<\/td>\n<td><strong>43.4%<\/strong> \/ <strong>15.2s<\/strong><\/td>\n<\/tr>\n<tr>\n<td>H=60<\/td>\n<td>7.2% \/ 83.1s<\/td>\n<td>16.4% \/ 146.5s<\/td>\n<td>2.0% \/ 231.5s<\/td>\n<td><strong>26.2%<\/strong> \/ <strong>49.1s<\/strong><\/td>\n<\/tr>\n<tr>\n<td>H=70<\/td>\n<td>7.8% \/ 156.1s<\/td>\n<td>12.0% \/ 103.1s<\/td>\n<td>0.0% \/ \u2014<\/td>\n<td><strong>16.0%<\/strong> \/ <strong>79.9s<\/strong><\/td>\n<\/tr>\n<tr>\n<td>H=80<\/td>\n<td>2.8% \/ 132.2s<\/td>\n<td>6.4% \/ 161.3s<\/td>\n<td>0.0% \/ \u2014<\/td>\n<td><strong>10.4%<\/strong> \/ <strong>58.9s<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p style=\"text-align: center; margin-top: 0.75em;\"><em>Push-T results. Success rate (%) \/ median time to success. Bold = best in row.
Note that the median success time biases higher as the success rate increases; GRASP manages to be faster despite a higher success rate.<\/em><\/p>\n<hr\/>\n<h2 id=\"whats-next\">What\u2019s next?<\/h2>\n<p>There is still plenty of work to be done on modern world model planners. We want to exploit the gradient structure of learned world models, and collocation (lifted-state optimization) is a natural approach for long-horizon planning, but it is critical to understand the typical gradient structure here: smooth and informative action gradients and brittle state gradients. We view GRASP as an initial iteration of such planners.<\/p>\n<p>Extensions to diffusion-based world models (deeper latent timesteps can be seen as smoothed versions of the world model itself), more sophisticated optimizers and noising strategies, and integrating GRASP into either a closed-loop system or RL policy learning for adaptive long-horizon planning are all natural and interesting next steps.<\/p>\n<p>I do genuinely think it\u2019s an exciting time to be working on world model planners. It\u2019s a funny sweet spot where the background literature (planning and control overall) is extremely mature and well-developed, but the current setting (pure planning optimization over modern, large-scale world models) is still heavily underexplored.
But once we figure out all the right ideas, world model planners will likely become as commonplace as RL.<\/p>\n<hr\/>\n<p>For more details, read the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2602.00475\">full paper<\/a> or visit the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.michaelpsenka.io\/grasp\/\">project website<\/a>.<\/p>\n<hr\/>\n<h2 id=\"citation\">Citation<\/h2>\n<div class=\"language-bibtex highlighter-rouge\">\n<div class=\"highlight\">\n<pre class=\"highlight\"><code><span class=\"nc\">@article<\/span><span class=\"p\">{<\/span><span class=\"nl\">psenka2026grasp<\/span><span class=\"p\">,<\/span>\n  <span class=\"na\">title<\/span><span class=\"p\">=<\/span><span class=\"s\">{Parallel Stochastic Gradient-Based Planning for World Models}<\/span><span class=\"p\">,<\/span>\n  <span class=\"na\">author<\/span><span class=\"p\">=<\/span><span class=\"s\">{Michael Psenka and Michael Rabbat and Aditi Krishnapriyan and Yann LeCun and Amir Bar}<\/span><span class=\"p\">,<\/span>\n  <span class=\"na\">year<\/span><span class=\"p\">=<\/span><span class=\"s\">{2026}<\/span><span class=\"p\">,<\/span>\n  <span class=\"na\">eprint<\/span><span class=\"p\">=<\/span><span class=\"s\">{2602.00475}<\/span><span class=\"p\">,<\/span>\n  <span class=\"na\">archivePrefix<\/span><span class=\"p\">=<\/span><span class=\"s\">{arXiv}<\/span><span class=\"p\">,<\/span>\n  <span class=\"na\">primaryClass<\/span><span class=\"p\">=<\/span><span class=\"s\">{cs.LG}<\/span><span class=\"p\">,<\/span>\n  <span class=\"na\">url<\/span><span class=\"p\">=<\/span><span class=\"s\">{https:\/\/arxiv.org\/abs\/2602.00475}<\/span>\n<span class=\"p\">}<\/span>\n<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>GRASP is a new gradient-based planner for learned dynamics (a \u201cworld model\u201d) that makes long-horizon planning practical by (1)
lifting the trajectory into virtual states so optimization is parallel across time, (2) adding stochasticity directly to the state iterates for exploration, and (3) reshaping gradients so actions get clean signals while we avoid [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":14078,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[311,310,110,8785,6229,312,2319,266,1366,193,720],"class_list":["post-14076","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-artificial","tag-berkeley","tag-blog","tag-gradientbased","tag-horizons","tag-intelligence","tag-longer","tag-models","tag-planning","tag-research","tag-world"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14076","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=14076"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14076\/revisions"}],"predecessor-version":[{"id":14077,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14076\/revisions\/14077"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/14078"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=14076"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=140
76"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=14076"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69d9690a190636c2e0989534. Config Timestamp: 2026-04-10 21:18:02 UTC, Cached Timestamp: 2026-04-24 13:09:30 UTC -->