Verlog

A Multi-turn RL framework for LLM agents

GitHub Source Code · W&B Experiment Logs

Verlog is a multi-turn reinforcement learning framework built for long-horizon LLM-agentic tasks with highly variable episode lengths. Extending VeRL and BALROG while following the proven design principles of pytorch-a2c-ppo-acktr-gail, it introduces specialized optimizations for stable and efficient training when episodes span from short interactions to hundreds of turns. Whereas prior frameworks like VeRL and RAGEN effectively handle tasks with ~10 turns, and verl-agent scales up to 50 turns, Verlog is designed to operate in environments with over 400 turns, making it uniquely suited for complex, long-term decision-making. This capability has been validated across challenging domains such as BabyAI, BabaIsAI, and Crafter, where it consistently achieves strong performance out of the box. In Crafter, for instance, episode lengths range from 70 to 400 steps with an average of about 190.

Key features:

🧠 Turn-Level Abstraction: To handle extremely long episodes, we treat each turn as an independent training sample. This eliminates the need to encode the entire trajectory into a single context window and allows for modular, customizable memory architectures.

🎯 Fixed-Turn Batching: To address the high variance in episode lengths across environments, we use fixed-turn batching. Each training batch contains a fixed number of turns. For incomplete episodes, we replace final rewards with value function estimates as the supervision signal.

🛠️ Tailored for Multi-Turn RL: To address the unique challenges of multi-turn RL, we introduce a set of targeted techniques such as Dual Discounting GAE and Critic Pre-training, combined with carefully tuned hyperparameters to ensure efficient and stable learning.

Main Results

Crafter Results:

| Metric | Instruct-model (zero-shot) | Verlog, Ours (fine-tuned) |
|---|---|---|
| Rewards | 5.80 | 10.44 |
| Trajectory Length | 172.23 | 196.42 |

The Crafter experiments use the Qwen2.5-7B-Instruct model trained with PPO on 8×H100 GPUs (82 GB memory) for ~36 hours, corresponding to 170 PPO updates.

(Figures: training curves for BabyAI `open` and BabaIsAI `two_room-maybe_break_stop-goto_win`.)

BabaIsAI Results (win rate)

Task legend: goto_win → 🏁; distr_obj → 🎁; two_room → 🚪; distr_obj_rule → 📏; maybe_break_stop → ⚠️

| Model | 🏁+🎁 | 🚪+🏁 | 🚪+🏁+📏 | 🚪+⚠️+🏁 |
|---|---|---|---|---|
| Instruct-model | 0.66 ± 0.08 | 0.03 ± 0.03 | 0.22 ± 0.07 | 0.19 ± 0.07 |
| Verlog (Ours) | 1.00 ± 0.00 | 0.89 ± 0.17 | 0.89 ± 0.11 | 0.36 ± 0.07 |

The BabaIsAI experiments use the Qwen2.5-3B-Instruct model trained with PPO on 4×A40 GPUs (48 GB memory) for ~24 hours, corresponding to 300 PPO updates. The maximum episode length was set to 100.

BabyAI Results (win rate)

| Model | goto | pickup | pick_up_seq_go_to | open |
|---|---|---|---|---|
| Instruct-model | 0.88 ± 0.06 | 0.41 ± 0.09 | 0.22 ± 0.07 | 0.09 ± 0.05 |
| Verlog (Ours) | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.65 ± 0.16 | 0.94 ± 0.07 |

The BabyAI experiments use the Qwen2.5-3B-Instruct model trained with PPO on 4×A40 GPUs (48 GB memory) for ~24 hours, corresponding to 300 PPO updates. The maximum episode length was set to 128.

Technical Report

In the following sections, we outline our design choices and implementation details, and explore research questions that our framework may help address.

Model & Prompt

Instruct Model

We begin with the Instruct variant of Qwen-2.5 (Qwen-2.5-3B/7B-Instruct) rather than the base model, for two key reasons. First, it enables seamless integration with BALROG, a framework designed to evaluate the zero-shot performance of instruct models across a range of benchmarks. Second, it allows us to reuse the benchmark's prompts with minimal modification.

Memory Mechanism

Rather than placing the entire trajectory into the context window, we include only the latest \(n+1\) turns. Each turn, i.e., \(\text{data} = (\text{history}_t, s_t, \text{think}_t, a_t)\) with \(\text{history}_t = \{s_{t-n}, \text{think}_{t-n}, a_{t-n}, \dots, s_{t-1}, \text{think}_{t-1}, a_{t-1}\}\), is treated as an individual training data point. As a result, each training batch consists of batch_size individual turns, not batch_size full trajectories.
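
As a concrete sketch, a training sample could be assembled into a chat-style prompt roughly as follows; the `Turn` dataclass, `SYSTEM_PROMPT`, and `build_sample` are illustrative names rather than Verlog's actual API:

```python
from dataclasses import dataclass

# Placeholder system prompt; see the BabyAI prompt template below.
SYSTEM_PROMPT = "You are an agent playing a simple navigation game. ..."

@dataclass
class Turn:
    obs: str     # s_t: textual observation returned by the environment
    think: str   # think_t: the model's reasoning for this turn
    action: str  # a_t: the action string sent to the environment

def build_sample(history: list[Turn], obs_t: str, n: int = 2) -> list[dict]:
    """Assemble one training sample from the latest n past turns plus the
    current observation, as a chat-style message list."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for turn in history[-n:]:  # only the most recent n turns are kept
        messages.append({"role": "user", "content": turn.obs})
        messages.append(
            {"role": "assistant", "content": f"THINK: {turn.think} ACTION: {turn.action}"}
        )
    # Current observation s_t; the model's reply supplies think_t and a_t.
    messages.append({"role": "user", "content": f"{obs_t} What will you do next?"})
    return messages
```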

The results show that for the 3B Qwen model, performance peaks at \(n = 1\) or \(2\) and degrades as \(n\) increases to \(4\) or \(8\). We hypothesize that this decline is due to the 3B model’s limited capacity to handle long contexts—for example, \(n = 8\) yields a prompt of approximately 4.6k tokens. Whether this trend holds for larger models is an open question. Notably, the tasks we evaluate can be framed as Markov Decision Processes (MDPs). In more complex or partially observable tasks, a larger \(n\) may help.

We observed two notable issues related to the multi-turn memory mechanism:

We conducted preliminary experiments to address these issues: (1) We tested a variant whose history contains only states and actions (no think tokens): \(\text{data} = (\text{history}_t, s_t, \text{think}_t, a_t)\) with \(\text{history}_t = \{s_{t-n}, a_{t-n}, \dots, s_{t-1}, a_{t-1}\}\). (2) We tested a variant that clears the history buffer every 5 steps. Both approaches led to worse performance.

Prompt Template

Below is the prompt template used for BabyAI. The prompts are adapted from BALROG.

[SYSTEM] You are an agent playing a simple navigation game. Your goal is to {MISSION}. The following are the possible actions you can take in the game, followed by a short description of each action: {AVAILABLE ACTIONS}. In a moment I will present you an observation. Tips: {TIPS}. PLAY!
[USER] {OBSERVATION}
[ASSISTANT] THINK: {THINK} ACTION: {ACTION}
[USER] {OBSERVATION}. What will you do next? Please respond in the following format: THINK: step-by-step reasoning. ACTION: One valid action from the allowed set.

We recommend always examining the model’s zero-shot outputs before training. Specifically, evaluate: (1) Whether reasoning paths are diverse, (2) whether the model reasons sufficiently before selecting an action, (3) the ratio of valid actions, and (4) the types of failure cases. These checks ensure the model understands the environment from the prompt. If not, revise the prompt before fine-tuning.
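
For example, a small script along the following lines can automate checks (3) and (4) over a batch of zero-shot generations; the regex, function name, and report fields here are illustrative assumptions rather than part of Verlog:

```python
import re
from collections import Counter

ACTION_RE = re.compile(r"ACTION:\s*(.+)", re.IGNORECASE)

def zero_shot_report(outputs: list[str], allowed_actions: set[str]) -> dict:
    """Summarize zero-shot generations before fine-tuning: valid-action ratio,
    a rough action-diversity histogram, and the fraction of unparsable replies."""
    actions, unparsable = [], 0
    for text in outputs:
        match = ACTION_RE.search(text)
        if match is None:
            unparsable += 1          # failure case: no "ACTION:" field at all
            continue
        actions.append(match.group(1).strip().lower())
    valid = [a for a in actions if a in allowed_actions]
    total = max(len(outputs), 1)
    return {
        "valid_action_ratio": len(valid) / total,
        "unparsable_ratio": unparsable / total,
        "action_histogram": Counter(actions),  # a very skewed histogram suggests low diversity
    }
```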

Environment

Verlog uses highly abstract games as its testbeds, reducing the need for prompt engineering and allowing researchers to focus on algorithmic design. We detail all engineering aspects below:

Valid Action

Improving the valid action ratio through prompt engineering is the simplest and most effective way to boost performance. In our setup, we ensure the model produces valid actions over 95% of the time using the following strategies:

We observe that truncating the trajectory upon encountering an invalid action leads to worse performance; replacing invalid actions with a default action yields better results. In this work, we additionally apply a 0.1 reward penalty to invalid actions, though with a high valid-action ratio this format penalty has minimal impact on overall performance.
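
A minimal sketch of this behavior as a Gym-style wrapper (the class, its defaults, and the classic 4-tuple `step` signature are illustrative assumptions, not Verlog's implementation):

```python
import gym

class ValidActionWrapper(gym.Wrapper):
    """Replace invalid actions with a default action and apply a small reward
    penalty, instead of truncating the trajectory."""

    def __init__(self, env, allowed_actions, default_action, penalty=0.1):
        super().__init__(env)
        self.allowed_actions = set(allowed_actions)
        self.default_action = default_action
        self.penalty = penalty

    def step(self, action):
        is_valid = action in self.allowed_actions
        if not is_valid:
            action = self.default_action      # keep the episode going
        obs, reward, done, info = self.env.step(action)
        if not is_valid:
            reward -= self.penalty            # format penalty for invalid output
        info["action_was_valid"] = is_valid
        return obs, reward, done, info
```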

Reward

Rewards are rule-based and provided by the environment. In BabyAI and BabaIsAI, we adopt a binary trajectory-level reward: 1 for a successful trajectory and 0 for a failed one. Combined with dual-discounting GAE, this setup ensures that earlier steps in suboptimal (longer) trajectories receive less credit than those in optimal ones. For Crafter, we use the native environment rewards directly.
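
As a toy illustration, assume a step-level discount of \(\gamma_{\text{step}} = 0.95\) (an example value, not necessarily the one used in our experiments). With a terminal reward of 1, the credit propagated back to the first step of a trajectory that succeeds after \(T\) steps scales roughly as \(\gamma_{\text{step}}^{\,T-1}\):

\[0.95^{9} \approx 0.63 \ \text{(10-step success)} \qquad \text{vs.} \qquad 0.95^{39} \approx 0.14 \ \text{(40-step success)},\]

so earlier actions in longer, less efficient trajectories receive noticeably less credit.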

We observed a frustrating issue when training on Crafter:

Batch Environment (Fixed-Turn Batching)

Description

Our framework supports asynchronous rollouts and works with any environment that exposes the OpenAI Gym interface. Each training batch contains n_env × e_len turns, where n_env is the number of parallel environments and e_len is the number of turns collected from each environment per batch.

Note: e_len can be smaller than the environment's maximum trajectory length; for example, we set e_len = 8 and the max trajectory length to 128 in BabyAI. For trajectories truncated early (i.e., not finished within the batch), we leverage the value function to guide the training process. A longer e_len (smaller n_env) often leads to better performance, albeit at the cost of lower token throughput.
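
A simplified, synchronous sketch of this collection loop (the `policy.act`, `env.current_obs`, and `env.reset` interfaces are hypothetical stand-ins; the actual framework runs rollouts asynchronously):

```python
def collect_batch(envs, policy, e_len):
    """Collect a fixed-turn batch: each of the n_env environments contributes
    exactly e_len turns, so the batch holds n_env * e_len training samples."""
    batch = []
    obs = [env.current_obs for env in envs]  # running episodes persist across batches
    for _ in range(e_len):
        for i, env in enumerate(envs):
            think, action = policy.act(obs[i])
            next_obs, reward, done, info = env.step(action)
            batch.append({"obs": obs[i], "think": think, "action": action,
                          "reward": reward, "done": done})
            obs[i] = env.reset() if done else next_obs
    # Episodes cut off at the batch boundary are not discarded: their final
    # transition is later bootstrapped with the critic's estimate V(s_{T+1}).
    return batch
```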

Algorithm

Dual Discounting GAE

To incentivize agents to solve tasks in fewer environment steps, we decouple token-level discounting \((\gamma_{\text{token}}, \lambda_{\text{token}})\) from step-level discounting \((\gamma_{\text{step}}, \lambda_{\text{step}})\). We set:

The GAE is computed recursively:

\[\hat{A}_t = \gamma\lambda \hat{A}_{t+1} + \delta_t^V\]

where \(\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error, and \((\gamma, \lambda)\) take the token-level values \((\gamma_{\text{token}}, \lambda_{\text{token}})\) for transitions between tokens within a turn and the step-level values \((\gamma_{\text{step}}, \lambda_{\text{step}})\) for transitions across turns.

The recursion starts from the last token of the final turn and proceeds backward. Once all tokens in the final turn are processed, we move to the last token of the second-to-last turn, and continue this process recursively. During this process, all state tokens are skipped. If a trajectory is truncated at step \(T\), we store the next state \(s_{T+1}\) but do not sample \(a_{T+1}\). Instead, we use the final token of \(s_{T+1}\) to estimate \(V(s_{T+1})\), used as the bootstrap value in GAE.
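
The sketch below implements one reading of this recursion: only response tokens carry values, the environment reward is attached to each turn's last response token, and step-level discounts apply exactly when the backward pass crosses a turn boundary. The data layout and function name are assumptions, not Verlog's exact code.

```python
import torch

def dual_discount_gae(turns, gamma_token, lam_token, gamma_step, lam_step,
                      bootstrap_value=0.0):
    """turns: list of dicts, one per environment step, each holding
         "values": 1-D tensor of critic values for the turn's response tokens
                   (state/prompt tokens are already excluded), and
         "reward": scalar environment reward for that step.
       bootstrap_value: V(s_{T+1}) for truncated trajectories, else 0.0.
       Returns one per-token advantage tensor per turn."""
    advantages = []
    next_adv, next_value = 0.0, bootstrap_value
    gamma, lam = gamma_step, lam_step            # first backward step crosses a turn boundary
    for turn in reversed(turns):
        values = turn["values"]
        adv = torch.zeros_like(values)
        for i in reversed(range(len(values))):
            reward = turn["reward"] if i == len(values) - 1 else 0.0
            delta = reward + gamma * next_value - values[i]
            adv[i] = delta + gamma * lam * next_adv
            next_adv, next_value = adv[i], values[i]
            gamma, lam = gamma_token, lam_token  # subsequent tokens stay inside the turn
        gamma, lam = gamma_step, lam_step        # moving to the previous turn crosses a boundary
        advantages.append(adv)
    advantages.reverse()
    return advantages
```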

Value Function Estimation

Critic Warmup

In our setting, we warm up the critic before fine-tuning, since it is used both for bootstrapping truncated trajectories and for computing GAE. That is, we freeze the actor and update only the critic at the beginning of training. Specifically, we collect w_epoch × batch_size turns of data up front. In each warmup iteration, we compute the GAE targets with the current critic, sample one tenth of the collected data, and train the critic on it; this is repeated for w_iter iterations. We use w_epoch = 40 and w_iter = 5 in our experiments, and make sure the critic loss converges to a small value before fine-tuning the actor.
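
A condensed sketch of this schedule (assuming a `collect_batch_fn` rollout callable, a `critic` that scores a list of turns, and a hypothetical `compute_value_targets` helper that produces the GAE-based targets):

```python
import torch
import torch.nn.functional as F

def warmup_critic(collect_batch_fn, critic, critic_optimizer,
                  compute_value_targets, w_epoch=40, w_iter=5, sample_frac=0.1):
    """Freeze the actor and pre-train the critic before any policy update."""
    # Collect w_epoch batches of turns once, with the frozen (zero-shot) actor.
    data = [turn for _ in range(w_epoch) for turn in collect_batch_fn()]
    for _ in range(w_iter):
        # Recompute the GAE-based value targets with the *current* critic, then
        # regress the critic on a random tenth of the collected turns.
        targets = compute_value_targets(data, critic)           # tensor, one target per turn
        idx = torch.randperm(len(data))[: int(sample_frac * len(data))]
        pred = critic([data[i] for i in idx])                   # predicted V for sampled turns
        loss = F.mse_loss(pred, targets[idx])
        critic_optimizer.zero_grad()
        loss.backward()
        critic_optimizer.step()
    return critic
```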

KL-Divergence in Reward

Adding a KL-divergence term \(\mathrm{KL}(\pi \,\|\, \pi_0)\) to the reward stabilizes training. Without it, the policy quickly drifts away from \(\pi_0\) and converges to poor solutions. The KL penalty encourages local exploration around \(\pi_0\) before the policy diverges. We made an interesting observation related to the KL divergence:

Conclusion

Verlog solves most of the engineering challenges in building LLM agents for long-horizon, multi-turn tasks. Moving forward, we hope to use this framework to explore core research problems in LLM agents, such as memory design, exploration strategies, value function learning, and handling off-policyness.

If you find Verlog useful, please consider citing our workshop paper. The full version is coming soon!

@inproceedings{chen2025contextlite,
  title={Context-lite Multi-turn Reinforcement Learning for {LLM} Agents},
  author={Wentse Chen and Jiayu Chen and Hao Zhu and Jeff Schneider},
  booktitle={ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models},
  year={2025},
  url={https://openreview.net/forum?id=6CE5PLsZdW}
}