This paper was accepted at the Workshop on Latent & Implicit Thinking – Going Beyond CoT Reasoning at ICLR 2026.
Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although highly scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting on multiple plausible continuations. Moreover, compute allocation across tokens is uniform; every token is produced by a single forward pass, potentially limiting the model's expressiveness in cases where difficult tokens inherently require more compute. Towards addressing these limitations, we introduce latent lookahead, a training method that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for τ steps, investing additional compute in predicting that token. This produces τ latent predictions that are supervised against the next τ ground-truth tokens, encouraging the model to "look ahead" and refine its prediction. We show that latent lookahead significantly outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
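The abstract describes the core mechanism: at a chosen position, the model recurses in latent space for τ steps, feeding its hidden state back as input instead of a sampled token, and the resulting τ latent predictions are supervised against the next τ ground-truth tokens. The toy NumPy sketch below illustrates that recursion under stated assumptions; the linear "model" (`E`, `W`, `U`), the context encoding, and all names are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 5, 8
E = rng.normal(size=(vocab, d))         # token embeddings (stand-in)
W = rng.normal(size=(d, d)) * 0.1       # hidden-state transition (stand-in for a transformer block)
U = rng.normal(size=(d, vocab)) * 0.1   # output head

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def latent_lookahead(context_ids, tau):
    """At the current position, recurse in latent space for tau steps:
    feed the hidden state back as input instead of sampling a token,
    producing tau latent predictions (distributions over the vocab)."""
    h = np.tanh(E[context_ids].mean(axis=0) @ W)  # toy context encoding
    preds = []
    for _ in range(tau):
        preds.append(softmax(h @ U))  # latent prediction for a future token
        h = np.tanh(h @ W)            # feed the hidden state back in; no discrete sampling
    return preds

def lookahead_loss(context_ids, target_ids):
    """Cross-entropy of the tau latent predictions against the
    next tau ground-truth tokens, as the abstract's supervision describes."""
    preds = latent_lookahead(context_ids, tau=len(target_ids))
    return -sum(np.log(p[t]) for p, t in zip(preds, target_ids))

loss = lookahead_loss([0, 1, 2], target_ids=[3, 4])
```

The key property of the sketch is that the loop never commits to a discrete token: extra forward steps (and hence extra compute) are spent refining predictions for positions the model has not yet emitted.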
** Work done while at Apple






