Verlog is a multi-turn reinforcement learning framework built for long-horizon LLM-agentic tasks with highly variable episode lengths. Extending VeRL and BALROG while following the proven design principles of pytorch-a2c-ppo-acktr-gail, it introduces specialized optimizations for stable and efficient training when episodes span from short interactions to hundreds of turns.
While prior frameworks like VeRL and RAGEN effectively handle tasks with ~10 turns, and verl-agent scales up to 50 turns, Verlog is designed to operate in environments with over 400 turns, making it uniquely suited to complex, long-term decision-making. This capability has been validated across challenging domains such as BabyAI, BabaIsAI, and Crafter. In Crafter, for instance, episode lengths range from 70 to 400 steps with an average of about 190. On these challenging domains, Verlog consistently achieves strong performance out of the box.
Key Features
🧠 Turn-Level Abstraction: To handle extremely long episodes, we treat each turn as an independent training sample. This eliminates the need to encode the full trajectory into a single context window and allows for modular, customizable memory architectures.
🎯 Fixed-Turn Batching: To manage the high variance in episode lengths across environments, we use fixed-turn batching. Each training batch contains a fixed number of turns. For incomplete episodes, we substitute value function estimates for final rewards as the supervision signal.
🛠️ Tailored for Multi-Turn RL: To address the unique challenges of multi-turn RL, we introduce a set of targeted techniques such as Dual Discounting GAE and Critic Pre-training, combined with carefully tuned hyperparameters to ensure efficient and stable learning.
Main Results
We evaluate Verlog on three challenging benchmarks that highlight different aspects of long-horizon multi-turn RL:
- Crafter: Episodes range from 70 to over 400 steps, with extremely high variance in length. Rewards are sparse, typically appearing only once every ~20 steps.
- BabyAI and BabaIsAI: Episodes are shorter (up to ~100–128 steps), but rewards are provided only at the end of the trajectory, making credit assignment particularly challenging.
These domains are especially difficult due to their combination of long horizons, sparse reward signals, and highly variable episode lengths.
Results.
All experiments use PPO with Qwen2.5-Instruct models (3B or 7B depending on the domain). In all plots, the x-axis shows PPO training steps, the y-axis shows the task-specific reward or success rate, and shaded areas represent the standard deviation across three random seeds.
For Crafter, we train the Qwen2.5-7B-Instruct model on 8×H100 (82 GB) GPUs for ~36 hours. For BabyAI and BabaIsAI, we use the Qwen2.5-3B-Instruct model trained on 4×A40 (48 GB) GPUs for ~24 hours.
Across all three domains, Verlog demonstrates its ability to train reliably under long horizons, sparse rewards, and variable episode lengths, showing that the framework scales naturally from short to very long multi-turn tasks.
In the following sections, we outline our design choices and implementation details, and explore potential research questions that our framework may help address.
Model & Prompt
Instruct Model
We begin with the Instruct variant of Qwen-2.5 (Qwen-2.5-3B/7B-Instruct), rather than the base model, for two key reasons. First, it enables seamless integration with BALROG, a framework designed to evaluate the zero-shot performance of instruct models across a range of benchmarks. Second, it allows us to use the benchmark's prompts with minimal modification.
Memory Mechanism
Rather than placing the entire trajectory into the context window, we include only the latest \(n+1\) turns. Each turn, i.e., data \(= (\text{history}_t, s_t, \text{think}_t, a_t)\), with \(\text{history}_t = \{s_{t-n}, \text{think}_{t-n}, a_{t-n}, \ldots, s_{t-1}, \text{think}_{t-1}, a_{t-1}\}\), is treated as an individual training data point. As a result, each training batch consists of `batch_size` individual turns, not `batch_size` full trajectories.
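As an illustrative sketch (the field and function names below are hypothetical, not Verlog's actual API), a single training sample might be assembled from the history buffer like this:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One past turn: observation, the model's reasoning, and the chosen action."""
    obs: str
    think: str
    action: str

def build_turn_sample(history: List[Turn], obs_t: str, n: int) -> str:
    """Build the prompt for the current turn from the latest n turns plus the new observation."""
    parts = []
    for turn in history[-n:]:  # keep only the most recent n turns
        parts.append(f"[USER] {turn.obs}")
        parts.append(f"[ASSISTANT] THINK: {turn.think} ACTION: {turn.action}")
    parts.append(f"[USER] {obs_t}")
    return "\n".join(parts)
```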
The results show that for the 3B Qwen model, performance peaks at \(n = 1\) or \(2\) and degrades as \(n\) increases to \(4\) or \(8\). We hypothesize that this decline is due to the 3B model's limited capacity to handle long contexts; for example, \(n = 8\) yields a prompt of roughly 4.6k tokens. Whether this trend holds for larger models is an open question. Notably, the tasks we evaluate can be framed as Markov Decision Processes (MDPs). In more complex or partially observable tasks, a larger \(n\) may help.
Limits of Multi-Turn Memory
We observed two notable issues related to the multi-turn memory mechanism.
- ✴️ Mimicking prior reasoning patterns: The model tends to replicate reasoning styles from earlier turns, reducing the diversity of its thought processes.
- 💭 Multi-turn hallucinations: The model struggles to distinguish between actions imagined during reasoning and actual events in the environment. For example, it may plan to “chop a tree then craft a pickaxe” but fail to find a tree in reality, yet it still acts as if the plan succeeded. This is a unique challenge for agentic tasks.
We conducted preliminary experiments to address these issues, testing:
- A variant that includes only the final action in the history: data \(= (\text{history}_t, s_t, \text{think}_t, a_t)\), with \(\text{history}_t = \{s_{t-n}, a_{t-n}, \ldots, s_{t-1}, a_{t-1}\}\),
- A variant that periodically clears the history buffer (every 5 steps).
Both approaches led to worse performance.
Prompt Template
Below is the prompt template used for BabyAI. The prompts are adapted from BALROG.
[SYSTEM] You are an agent playing a simple navigation game. Your goal is to {MISSION}. The following are the possible actions you can take in the game, followed by a short description of each action: {AVAILABLE ACTIONS}. In a moment I will present you an observation. Tips: {TIPS}. PLAY!
[USER] {OBSERVATION}
[ASSISTANT] THINK: {THINK} ACTION: {ACTION}
[USER] {OBSERVATION}. What will you do next? Please respond in the following format: THINK: step-by-step reasoning. ACTION: One valid action from the allowed set.
We recommend always analyzing the model's zero-shot outputs before training. Specifically, this means evaluating: (1) whether the reasoning paths are diverse, (2) whether the model reasons sufficiently before deciding on an action, (3) the ratio of valid actions, and (4) the types of failure cases. These checks ensure the model understands the environment from the prompt. If not, revise the prompt before fine-tuning.
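As a rough illustration of such a pre-training check (the action set and helper names below are assumptions, not Verlog's code), one can sample zero-shot completions and summarize the valid-action ratio and the most common failure cases:

```python
import re
from collections import Counter

# Hypothetical action set; in practice, use the environment's allowed actions.
VALID_ACTIONS = {"turn left", "turn right", "go forward", "pick up", "drop", "toggle"}

def parse_action(output: str):
    """Extract the text following 'ACTION:' from a model completion, if present."""
    match = re.search(r"ACTION:\s*(.+)", output)
    return match.group(1).strip().lower() if match else None

def zero_shot_report(outputs):
    """Summarize the valid-action ratio and the most frequent invalid outputs."""
    parsed = [parse_action(o) for o in outputs]
    invalid = Counter(a if a else "<missing ACTION field>" for a in parsed if a not in VALID_ACTIONS)
    valid_ratio = sum(a in VALID_ACTIONS for a in parsed) / max(len(outputs), 1)
    return {"valid_ratio": valid_ratio, "top_invalid": invalid.most_common(5)}
```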
Environment
Verlog uses a highly abstract game as its testbed, reducing the need for prompt engineering and allowing researchers to focus on algorithmic design. We detail all engineering aspects below.
Valid Action
Improving the valid action ratio through prompt engineering is the simplest and most effective way to boost performance. In our setup, we ensure the model produces valid actions over 95% of the time using the following techniques:
- 📋 Hardcoded action translation: Certain invalid actions are frequently produced by zero-shot LLMs (e.g., “Move forward” and “Go forward”). We implement a handcrafted translation function to map these to valid actions, preventing them from lowering the valid action ratio.
- ❌ Replace invalid actions with a default action: When the LLM outputs an invalid action, the environment rejects it and executes a predefined default action instead. Simultaneously, we replace the invalid action with the default one before appending it to the history buffer. This prevents the agent from mimicking the invalid action in subsequent steps.
We observe that truncating the trajectory upon encountering an invalid action leads to worse performance, whereas replacing invalid actions with a default action yields better performance. In this work, we apply a 0.1 penalty to invalid actions. However, with a high valid action ratio, the format penalty has minimal impact on overall performance.
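A minimal sketch of these two techniques follows; the translation table, default action, and penalty wiring are illustrative assumptions, not Verlog's exact configuration:

```python
# Illustrative mapping from common zero-shot paraphrases to valid environment actions.
ACTION_TRANSLATION = {
    "move forward": "go forward",
    "turn to the left": "turn left",
}
DEFAULT_ACTION = "go forward"
INVALID_PENALTY = -0.1

def sanitize_action(raw: str, valid_actions: set):
    """Map a raw LLM action to a valid one; fall back to the default with a small penalty.

    The returned action is both executed and written to the history buffer, so the
    agent never sees (and therefore never imitates) its own invalid output.
    """
    action = raw.strip().lower()
    action = ACTION_TRANSLATION.get(action, action)  # hardcoded action translation
    if action in valid_actions:
        return action, 0.0
    return DEFAULT_ACTION, INVALID_PENALTY           # default-action replacement
```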
Reward
Rewards are rule-based and provided by the environment. In BabyAI and BabaIsAI, we adopt a binary trajectory-level reward scheme: 1 for a successful trajectory, 0 for a failed one. Combined with dual-discounting GAE, this setup ensures that earlier steps in suboptimal trajectories receive lower credit than those in optimal ones. For Crafter, we use the native environment rewards directly.
We observed a frustrating issue when training on Crafter: the score improvement comes primarily from reinforcing skills the base model already possessed, rather than from learning new skills. For example, before fine-tuning, agents rarely placed a furnace, but after fine-tuning they successfully place one in almost every episode. Skills that were previously unknown, such as crafting an iron sword, remain unlearned even after fine-tuning.
This suggests that the current version of RL fails to teach agents new skills on these tasks, but instead primarily sharpens the action distribution toward behaviors with higher immediate rewards.
Batch Environment (Fixed-Turn Batching)
Our framework supports asynchronous rollouts and works with any environment that uses the OpenAI Gym interface. Each training batch size is `n_env` × `e_len`, where:
- `n_env` = number of parallel environments
- `e_len` = episode length per rollout

Note: `e_len` can be smaller than the environment's maximum trajectory length. For example, in BabyAI we set `e_len = 8` with a max trajectory length of 128. For early-truncated trajectories, we leverage the value function to guide training. A longer `e_len` (and smaller `n_env`) generally leads to better performance, albeit at the cost of lower token throughput.
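A simplified, synchronous sketch of fixed-turn batch collection under these assumptions (Verlog's actual rollout is asynchronous, and the names here are illustrative):

```python
# Environments are assumed to follow the classic OpenAI Gym reset/step interface.
def collect_fixed_turn_batch(envs, policy, e_len):
    batch = []
    obs = [env.reset() for env in envs]
    for _ in range(e_len):                       # a fixed number of turns per environment
        for i, env in enumerate(envs):
            action = policy.act(obs[i])
            next_obs, reward, done, info = env.step(action)
            batch.append((obs[i], action, reward, done))
            obs[i] = env.reset() if done else next_obs
    # The batch always holds n_env * e_len turns, regardless of episode lengths;
    # episodes truncated at the batch boundary are bootstrapped with the value function.
    return batch
```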
Algorithm
Dual Discounting GAE
To incentivize agents to solve tasks in fewer environment steps, we decouple token-level discounting \((\gamma_{\text{token}}, \lambda_{\text{token}})\) from step-level discounting \((\gamma_{\text{step}}, \lambda_{\text{step}})\). We set:
- \(\gamma_{\text{step}} = 0.99\), \(\lambda_{\text{step}} = 0.95\)
- \(\gamma_{\text{token}} = 1.0\), \(\lambda_{\text{token}} = 1.0\)
The GAE is computed recursively:
$$\hat{A}_t=\gamma\lambda\hat{A}_{t+1}+\delta_t^V$$
where:
- \(\gamma\lambda = \gamma_{\text{step}} \lambda_{\text{step}}\) if tokens are from different turns,
- \(\gamma\lambda = \gamma_{\text{token}} \lambda_{\text{token}}\) otherwise,
- and \(\delta_t^V = -V(s_t) + r_t + \gamma V(s_{t+1})\).
The recursion starts from the last token of the final turn and proceeds backward. Once all tokens in the final turn are processed, we move to the last token of the second-to-last turn, and continue this process recursively. Throughout, all state tokens are skipped.
If a trajectory is truncated at step \(T\), we store the next state \(s_{T+1}\) but do not sample \(a_{T+1}\). Instead, we use the final token of \(s_{T+1}\) to estimate \(V(s_{T+1})\), which serves as the bootstrap value in GAE.
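A minimal sketch of this backward recursion over one trajectory, under the conventions above; the nested-list data layout and function name are assumptions, not Verlog's actual API:

```python
# values[i][j]: critic value of the j-th generated (think/action) token of turn i
# rewards[i]:   environment reward received after turn i
# bootstrap:    V(s_{T+1}) for truncated trajectories, 0.0 if terminal
def dual_discount_gae(values, rewards, bootstrap,
                      gamma_step=0.99, lam_step=0.95,
                      gamma_token=1.0, lam_token=1.0):
    advantages = [[0.0] * len(turn) for turn in values]
    next_adv, next_value = 0.0, bootstrap
    for i in reversed(range(len(values))):            # turns, last to first
        for j in reversed(range(len(values[i]))):     # tokens within a turn, last to first
            if j == len(values[i]) - 1:
                # Turn boundary: the next token belongs to the next turn, so apply the
                # step-level discount and attach the turn's environment reward.
                delta = rewards[i] + gamma_step * next_value - values[i][j]
                next_adv = delta + gamma_step * lam_step * next_adv
            else:
                # Within a turn: token-level discount, no intermediate reward.
                delta = gamma_token * next_value - values[i][j]
                next_adv = delta + gamma_token * lam_token * next_adv
            advantages[i][j] = next_adv
            next_value = values[i][j]
    return advantages
```

With \(\gamma_{\text{token}} = \lambda_{\text{token}} = 1.0\) and a single terminal reward, this recursion reproduces the closed-form advantages discussed in the next section.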
Value Function Estimation
- When both \(\gamma\) and \(\lambda\) are set to 1.0, the value function serves purely as a baseline in PPO's advantage estimation. Specifically, the advantage for the \(t\)-th token in the last turn is defined as \(A_{-1,t} = r - V_{-1,t}\), where \(r\) is the trajectory reward and \(V_{-1,t}\) is the value estimate for the \(t\)-th token in the last turn.
- When \(\lambda\) is less than 1.0, the value function contributes to the GAE target beyond serving as a simple baseline. For instance, in our setting with \(\lambda_{\text{step}} = 0.95\), \(\gamma_{\text{token}} = 1.0\), \(\lambda_{\text{token}} = 1.0\), and a reward \(r\) that is zero along the trajectory except at the final turn, the advantage for the \(t\)-th token in the second-to-last turn is given by \(A_{-2,t} = \gamma_{\text{step}}[\lambda_{\text{step}} r + (1-\lambda_{\text{step}}) V_{-1,0}] - V_{-2,t}\). This implies that, in our setting, the value function of the first token in each turn is used to bootstrap the GAE target for the preceding turn.
- Since the value of the first token of each turn carries more semantic significance than the subsequent tokens, we assign it a higher weight when training the critic network.
Critic Warmup
In our setting, we warm up the critic before fine-tuning, as it is used both for bootstrapping truncated trajectories and for computing GAE. That is, we freeze the actor and update only the critic at the beginning of training. Specifically, we collect `w_epoch × batch_size` turns of data up front. For each warmup iteration, we compute the GAE target with the current critic, sample one tenth of the collected data, train the critic, and repeat this process for `w_iter` iterations. We set `w_epoch = 40` and `w_iter = 5` in our experiments and make sure the critic loss converges to a small value before fine-tuning the actor.
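A schematic of this warmup loop, with an assumed critic interface (`gae_targets`, `update`) standing in for the actual implementation:

```python
import random

def warmup_critic(critic, rollout_turns, w_iter=5, sample_fraction=0.1):
    """Warm up the critic on pre-collected turns while the actor stays frozen.

    rollout_turns holds w_epoch * batch_size turns collected with the frozen actor.
    The critic is assumed to expose gae_targets(turns) and update(turns, targets);
    these method names are illustrative, not Verlog's actual interface.
    """
    for _ in range(w_iter):
        # Recompute GAE targets with the current critic so the targets track its updates.
        targets = critic.gae_targets(rollout_turns)
        # Train on a random one-tenth subsample of the collected data, then repeat.
        k = max(1, int(sample_fraction * len(rollout_turns)))
        idx = random.sample(range(len(rollout_turns)), k)
        critic.update([rollout_turns[i] for i in idx], [targets[i] for i in idx])
    # Before unfreezing the actor, verify that the critic loss has converged to a small value.
```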
KL-Divergence in Reward
Adding a KL-divergence term \(KL(\pi \mid \pi_0)\) to the reward stabilizes training. Without it, the policy quickly drifts away from \(\pi_0\) and converges to poor solutions. A KL penalty encourages local exploration around \(\pi_0\) before divergence. We note an interesting observation related to the KL-divergence.
📈 Action Hacking: The LLM's output can be decomposed into a reasoning path and a final action. We plot the average KL-divergence between \(\pi\) and \(\pi_0\) separately for the reasoning-path tokens and the final-action tokens. A common failure mode in Crafter arises when the KL divergence of the final-action tokens increases significantly faster than that of the reasoning-path tokens. In this case, the agent learns to exploit easily accessible rewards early in training by modifying only the final action, without meaningfully improving its underlying reasoning. This leads to poor exploration.
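One way to monitor for this failure mode is to log the KL separately over reasoning tokens and action tokens. A rough PyTorch-style sketch (the tensor layout and mask construction are assumptions):

```python
import torch

def segment_kl(logp_pi, logp_ref, action_mask):
    """Track an approximate KL(pi || pi_0) separately for reasoning and final-action tokens.

    logp_pi, logp_ref: per-token log-probs of the sampled tokens under the current and
    reference policies (1-D tensors); action_mask: 1.0 for tokens after 'ACTION:', else 0.0.
    The per-token difference logp_pi - logp_ref is a common sample-based KL estimate.
    """
    kl = logp_pi - logp_ref
    reason_mask = 1.0 - action_mask
    kl_action = (kl * action_mask).sum() / action_mask.sum().clamp(min=1.0)
    kl_reason = (kl * reason_mask).sum() / reason_mask.sum().clamp(min=1.0)
    return kl_reason.item(), kl_action.item()
```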
Conclusion
Verlog addresses several core engineering challenges in building LLM agents for long-horizon, multi-turn tasks, including:
- Handling long interaction histories through memory mechanisms and turn-level abstraction.
- Stabilizing training on sparse rewards with dual-discounting GAE and critic pre-training.
- Managing variable trajectory lengths via fixed-turn batching and bootstrapped value estimation.
- Improving action validity through targeted prompt engineering and default-action replacement, ensuring >95% valid actions during training.
- Mitigating policy collapse with KL regularization, while identifying novel failure modes such as action hacking, where agents alter only the final action tokens without improving their reasoning.
Moving forward, Verlog provides a foundation for exploring core research problems in LLM-based reinforcement learning, including:
- Memory design – studying how different memory abstractions (short vs. long histories, structured buffers) affect generalization across partially observable tasks.
- Exploration strategies – developing methods that allow RL to continually acquire new skills through interaction with the environment, while avoiding premature convergence in open-ended settings.
- Diversity of behaviors – designing mechanisms that promote a broad range of strategies and skills, preventing mode collapse and encouraging robustness across tasks.
- Value function learning – improving the stability and representation quality of critics when trajectories are long and rewards are sparse.
- Handling off-policyness – investigating better reuse of informative trajectories from diverse, asynchronous rollouts without destabilizing learning.
By addressing these challenges, Verlog positions itself as a flexible research platform for advancing long-horizon LLM-agentic RL.