Trainer Interface

Last updated: 06/08/2025 (API docstrings are auto-generated).

Trainers drive the training loop. Adding a new trainer class is encouraged when a new training paradigm calls for one.

Core APIs

Utils for tokenization.

verl.trainer.ppo.reward.compute_reward(data: DataProto, reward_fn: AbstractRewardManager) → tuple[Tensor, dict[str, Any]]

Compute reward for a batch of data.

Parameters:

data – DataProto object containing the input data.
reward_fn – Reward function to compute the reward.

Returns:

Tuple of reward tensor and extra info dictionary.
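The call contract above (a batch and a reward function go in, a reward tensor and an extra-info dictionary come out) can be illustrated with a minimal sketch. It does not import verl: `ToyBatch`, `toy_reward_fn`, and `compute_reward_sketch` are hypothetical stand-ins, the dictionary keys are made up for illustration, and plain Python lists stand in for the reward tensor. It shows only the shape of the return value, not the library's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a DataProto batch."""
    responses: list[str]
    ground_truths: list[str]


def toy_reward_fn(batch: ToyBatch) -> dict[str, Any]:
    """Hypothetical reward function: 1.0 for an exact match, else 0.0."""
    scores = [
        float(resp.strip() == truth.strip())
        for resp, truth in zip(batch.responses, batch.ground_truths)
    ]
    # The keys "reward" and "extra_info" are illustrative assumptions.
    return {"reward": scores, "extra_info": {"acc": scores}}


def compute_reward_sketch(
    batch: ToyBatch, reward_fn: Callable[[ToyBatch], dict[str, Any]]
) -> tuple[list[float], dict[str, Any]]:
    """Return (rewards, extra info), mirroring the documented return tuple."""
    result = reward_fn(batch)
    return result["reward"], result.get("extra_info", {})


batch = ToyBatch(responses=["4", "7"], ground_truths=["4", "8"])
rewards, extra = compute_reward_sketch(batch, toy_reward_fn)
print(rewards)  # one score per sample in the batch
```

Returning the extra information separately lets a caller log per-sample diagnostics without changing the reward values used for training.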

verl.trainer.ppo.reward.load_reward_manager(config: DictConfig, tokenizer: Any, num_examine: int, **reward_kwargs: Any) → AbstractRewardManager

Load and initialize a reward manager based on the configuration.

Parameters:

config – PPO trainer configuration object containing reward_model fields.
tokenizer – Tokenizer object used for processing text.
num_examine – Number of samples to examine.
**reward_kwargs – Additional keyword arguments for the reward manager.

Returns:

An instance of the specified reward manager class.
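A configuration-driven loader of this kind is often a name-to-class lookup. The sketch below is one way it could look, not verl's code: the `reward_manager` config key, the names `"naive"` and `"dapo"`, and every class and function here are hypothetical, and a plain dict stands in for the DictConfig.

```python
from typing import Any


class ToyNaiveManager:
    """Hypothetical manager that just stores its constructor arguments."""

    def __init__(self, tokenizer: Any, num_examine: int, **kwargs: Any) -> None:
        self.tokenizer = tokenizer
        self.num_examine = num_examine
        self.kwargs = kwargs


class ToyDapoManager(ToyNaiveManager):
    """Hypothetical second manager, used only to show the dispatch."""


# Assumed registry: maps a config string to a manager class.
REGISTRY: dict[str, type] = {"naive": ToyNaiveManager, "dapo": ToyDapoManager}


def load_manager_sketch(
    config: dict[str, Any], tokenizer: Any, num_examine: int, **reward_kwargs: Any
) -> ToyNaiveManager:
    """Pick a manager class by name and instantiate it."""
    # "reward_manager" as the key name is an assumption for this sketch.
    name = config.get("reward_model", {}).get("reward_manager", "naive")
    if name not in REGISTRY:
        raise ValueError(f"unknown reward manager: {name!r}")
    return REGISTRY[name](tokenizer=tokenizer, num_examine=num_examine, **reward_kwargs)


manager = load_manager_sketch(
    {"reward_model": {"reward_manager": "dapo"}},
    tokenizer=None,
    num_examine=1,
    max_resp_len=1024,
)
print(type(manager).__name__)
```

Extra keyword arguments pass straight through to the constructor, which is how manager-specific options can reach one class without widening the loader's own signature.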

class verl.workers.reward_manager.NaiveRewardManager(tokenizer, num_examine, compute_score=None, reward_fn_key='data_source')

The reward manager.
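The `compute_score` argument accepts a user-supplied scoring callable. The sketch below shows what such a callable might look like. The argument list (`data_source`, `solution_str`, `ground_truth`, `extra_info`) is an assumption not stated on this page, and `exact_match_score` is a hypothetical name; check the manager's source for the exact calling convention before relying on it.

```python
from typing import Any, Optional


def exact_match_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[dict[str, Any]] = None,
) -> float:
    """Hypothetical scorer: 1.0 if the last non-empty line equals the answer."""
    lines = [line.strip() for line in solution_str.splitlines() if line.strip()]
    final_answer = lines[-1] if lines else ""
    return 1.0 if final_answer == ground_truth.strip() else 0.0


print(exact_match_score("toy_math", "Let me think.\n42", "42"))

# With verl installed, the callable would be passed through the documented
# constructor argument, for example:
# NaiveRewardManager(tokenizer=tokenizer, num_examine=1,
#                    compute_score=exact_match_score)
```

The default `reward_fn_key='data_source'` suggests that scoring is keyed on the dataset a sample came from, which is why the sketch accepts a `data_source` argument even though this toy scorer ignores it.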

class verl.workers.reward_manager.DAPORewardManager(tokenizer, num_examine, compute_score=None, reward_fn_key='data_source', max_resp_len=None, overlong_buffer_cfg=None)

The reward manager.
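The `max_resp_len` and `overlong_buffer_cfg` parameters point to a length-based penalty on overlong responses. The sketch below shows a linear soft penalty of that general kind. The formula, the `buffer_len` and `penalty_factor` names, and the function itself are assumptions for illustration; the page does not document the fields of `overlong_buffer_cfg` or how the manager combines the penalty with the score.

```python
def overlong_penalty(
    response_len: int,
    max_resp_len: int,
    buffer_len: int,
    penalty_factor: float = 1.0,
) -> float:
    """Hypothetical soft penalty for responses that run into the buffer zone.

    Responses up to (max_resp_len - buffer_len) tokens are not penalized.
    Beyond that, the penalty grows linearly and reaches -penalty_factor
    when the response length equals max_resp_len.
    """
    if buffer_len <= 0:
        raise ValueError("buffer_len must be positive")
    expected_len = max_resp_len - buffer_len
    exceed_len = response_len - expected_len
    return min(-exceed_len / buffer_len * penalty_factor, 0.0)


# A response well inside the limit, one halfway into the buffer, one at the cap.
for length in (100, 175, 200):
    print(length, overlong_penalty(length, max_resp_len=200, buffer_len=50))
```

In a scheme like this the penalty would be added to the task score, so a correct but overlong answer still earns less than a correct answer of acceptable length.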