Recipe: CollabLLM
Last updated: 09/22/2025.
Open-Source Algorithm Implementation & Experiment Running: Haiquan Chen, Shirley Wu
🏠 Homepage | 📝 Paper | 🤗 Datasets & Models | ⭐️ Original Implementation
verl provides a recipe for the Outstanding Paper at ICML 2025, “CollabLLM: From Passive Responders to Active Collaborators”. CollabLLM is a unified fine-tuning framework that optimizes LLMs for effective and efficient multiturn collaboration with users.
Core Idea: Models are rewarded based on how well their responses enable effective future collaboration with users.
Paper Authors: Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
Quick Start
0. Environment
Make sure the required packages for verl are installed. Additionally, install litellm and export the required API keys. The API model will be used for user simulators and, optionally, LLM Judges (see the Configuration section below).
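A quick preflight check can catch a missing package or API key before a long training job starts. The sketch below is a minimal illustration and not part of the recipe: the helper names are hypothetical, and OPENAI_API_KEY is only an assumed example, since the key you need depends on the provider you route through litellm.

```python
import importlib.util
import os


def litellm_installed():
    """Return True if the litellm package can be found in this environment."""
    return importlib.util.find_spec("litellm") is not None


def missing_keys(required_keys, env=None):
    """Return the names in required_keys that are unset or empty in env."""
    env = os.environ if env is None else env
    return [key for key in required_keys if not env.get(key)]


if __name__ == "__main__":
    # Assumption: an OpenAI-hosted user simulator; swap in your provider's key name.
    required = ["OPENAI_API_KEY"]
    if not litellm_installed():
        print("litellm is not installed")
    for key in missing_keys(required):
        print(f"environment variable {key} is not set")
```

Running the file directly prints one line per problem and prints nothing when the environment looks ready.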
1. Prepare Your Dataset
First, process your dataset using the provided script (see example commands and usage in process_dataset.py):
python process_dataset.py --dataset <> ... --dataset_type <sft or rl>
Requirements:
Input: A Hugging Face multiturn dataset. Existing datasets:
collabllm/collabllm-multiturn-$DATASET, with DATASET in one of [math-hard(-large), medium(-large), bigcodebench(-large)] (*-large are the datasets used in the CollabLLM paper)
Example format: See collabllm-multiturn-math-hard
To generate your own dataset: Use build_dataset.py from the original CollabLLM repository
2. Train Your Model
(Optional) For Supervised Fine-Tuning (SFT):
bash train_sft_collabllm.sh
For Reinforcement Learning (RL):
bash train_rl_collabllm.sh
The RL script shows an example of training CollabLLM on math-hard-large.
The configuration for sampling future conversations is in recipe/collabllm/config/collabllm_interaction_config.yaml.
The Multiturn-aware Reward is aggregated from these three conversation-level rewards:
+reward_model.reward_kwargs.metric_weights.accuracy=1 \
+reward_model.reward_kwargs.metric_weights.interactivity=1 \
+reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \
You can remove, add, or modify the weights depending on your task. The metrics that are already implemented are listed under recipe/collabllm/metrics. For example, on medium-large, you can replace accuracy with bleu_score via
+reward_model.reward_kwargs.metric_weights.bleu_score=1
which instead applies the BLEU score to the sampled future conversations.
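To make the role of the weights concrete, here is a minimal sketch that combines per-conversation metric scores into one reward. It assumes the aggregation is a plain weighted sum, which the text above does not spell out; the function name aggregate_metrics and the example scores are hypothetical and do not come from the recipe's code.

```python
def aggregate_metrics(metric_scores, metric_weights):
    """Weighted sum over every metric that has a configured weight.

    Assumption: aggregation is a simple weighted sum. Metrics without a
    weight are ignored; a weight without a score is treated as an error.
    """
    missing = [name for name in metric_weights if name not in metric_scores]
    if missing:
        raise KeyError(f"no score for weighted metrics: {missing}")
    return sum(weight * metric_scores[name] for name, weight in metric_weights.items())


# Weights copied from the launch flags above.
weights = {"accuracy": 1.0, "interactivity": 1.0, "token_amount": -0.0001}

# Hypothetical scores for one sampled conversation.
scores = {"accuracy": 1.0, "interactivity": 0.5, "token_amount": 2000}

reward = aggregate_metrics(scores, weights)
print(round(reward, 4))
```

With these numbers the token penalty subtracts about 0.2 from the reward, which shows why a small negative weight is enough: token counts are orders of magnitude larger than the other scores.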
Algorithm
| Step | Name | Description |
|---|---|---|
| 1 | Model response generation | The model generates multiple responses for each prompt in a batch. |
| 2 | Collaborative simulation | A user simulator (e.g., GPT or Claude) samples |
| 3 | Compute Multiturn-aware Reward | Customized conversational reward functions are applied to the sampled conversations. Rewards are aggregated, then averaged across rollouts. |
| 4 | Update model | The model weights are updated using the computed multiturn-aware rewards. |
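Steps 1 to 3 can be sketched as a small loop: for each candidate response, simulate several future conversations, score each one, and average the scores. This is a toy illustration under stated assumptions, not the recipe's implementation: every function name here is hypothetical, the simulator and scorer are stand-ins for the LLM-based user simulator and the configured metrics, and the weight update in step 4 is left to the trainer.

```python
from statistics import mean


def multiturn_aware_reward(prompt, response, simulate, score, num_rollouts):
    """Average a conversation-level score over several simulated futures."""
    rewards = []
    for rollout_idx in range(num_rollouts):
        conversation = simulate(prompt, response, rollout_idx)  # step 2
        rewards.append(score(conversation))                     # step 3
    return mean(rewards)


def score_responses(prompt, responses, simulate, score, num_rollouts=3):
    """Return one averaged reward per candidate response (candidates come from step 1)."""
    return [
        multiturn_aware_reward(prompt, response, simulate, score, num_rollouts)
        for response in responses
    ]


def toy_simulate(prompt, response, rollout_idx):
    """Stand-in simulator: a clarifying question shortens the rest of the dialogue."""
    extra_turns = 1 if "?" in response else 2 + rollout_idx
    return [prompt, response] + ["follow-up"] * extra_turns


def toy_score(conversation):
    """Stand-in metric: shorter conversations score higher."""
    return 1.0 / len(conversation)


candidates = ["Which edition do you mean?", "Here is an answer."]
rewards = score_responses("Summarize the book.", candidates, toy_simulate, toy_score)
print(rewards)
```

In this toy setup the clarifying question earns the higher reward because its simulated futures are shorter, which mirrors the idea of rewarding responses for the collaboration they enable rather than for the single turn alone.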
Configuration
The primary configuration is managed through the launch script train_rl_collabllm.sh and the YAML file recipe/collabllm/config/collabllm_interaction_config.yaml. Key configuration sections:
| Section | Key Parameters / Notes |
|---|---|
| | Paths to training/validation files, batch sizes, sequence lengths. |
| | Base model path (used for actor + initial reference), FSDP settings, optimization (LR, scheduler). |
| | Hyperparameters under |
| | Defined in |
| | Manager set to |
| | GRPO-specific hyperparameters such as |
| | Distributed training (nodes, GPUs per node), logging (WandB), checkpointing frequency. |
Key Files
| File Path | Purpose |
|---|---|
| | Main logic to sample future conversations, using |
| | Computes rewards for future conversations, leveraging |
Acknowledgement
We sincerely thank the verl community and advisors for their contributions and guidance!