Frequently Asked Questions

Last updated: 09/24/2025.

Distributed training

How to run multi-node post-training with Ray?

You can start a Ray cluster and submit a Ray job by following the official Ray guide: https://docs.ray.io/en/latest/ray-core/starting-ray.html

Then, in the configuration, set trainer.nnodes to the number of machines for your job.
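As a rough illustration of the two steps above, the sketch below builds a training command with the node count passed as a config override and submits it through Ray's Python job submission API. The dashboard address, the verl.trainer.main_ppo entrypoint, the override keys trainer.nnodes and trainer.n_gpus_per_node, and the helper names are assumptions made for illustration; adapt them to your own cluster and training script.

```python
import shlex


def build_entrypoint(nnodes: int, gpus_per_node: int) -> str:
    """Build the training command with the node count as a config override."""
    args = [
        "python3", "-m", "verl.trainer.main_ppo",  # assumed entrypoint
        f"trainer.nnodes={nnodes}",                 # number of machines
        f"trainer.n_gpus_per_node={gpus_per_node}",  # assumed override key
    ]
    return shlex.join(args)


def submit(nnodes: int, gpus_per_node: int,
           address: str = "http://127.0.0.1:8265") -> str:
    """Submit the job to an already running Ray cluster; return the job ID."""
    # Imported lazily so build_entrypoint works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)  # assumed dashboard address
    return client.submit_job(entrypoint=build_entrypoint(nnodes, gpus_per_node))


if __name__ == "__main__":
    # Only prints the command; call submit(...) once the cluster is up.
    print(build_entrypoint(nnodes=2, gpus_per_node=8))
```

Keeping the command builder separate from the submission call makes it easy to inspect the exact overrides before anything is sent to the cluster.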

How to use verl on a Slurm-managed cluster?

Ray provides an official tutorial for starting a Ray cluster on top of Slurm. We have verified the GSM8K example on a Slurm cluster in a multi-node setting with the following steps.

1. [Optional] If your cluster supports Apptainer or Singularity and you wish to use it, convert verl’s Docker image to an Apptainer image. Alternatively, set up the environment with the package manager available on your cluster, or use another container runtime available to you (e.g. through Slurm’s OCI support).

apptainer pull /your/dest/dir/vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3.sif docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3

2. Follow the GSM8K example to prepare the dataset and model checkpoints.
3. Modify examples/slurm/ray_on_slurm.slurm with your cluster’s own information.
4. Submit the job script to the Slurm cluster with sbatch.

Please note that Slurm cluster setup may vary. If you encounter any issues, please refer to Ray’s Slurm user guide for common caveats.

If you changed Slurm resource specifications, please make sure to update the environment variables in the job script if necessary.
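One way to keep the training overrides consistent with the Slurm allocation is to derive them from the variables Slurm exports inside a job. The sketch below is only an illustration: it assumes SLURM_JOB_NUM_NODES and SLURM_GPUS_PER_NODE are set in the job environment (the latter only when GPUs were requested per node), and the function name and default GPU count are hypothetical. It does not describe what the example job script itself does.

```python
import os


def overrides_from_slurm(env=None, default_gpus_per_node=8):
    """Derive node and GPU overrides from Slurm's job environment."""
    env = os.environ if env is None else env

    # Fall back to a single node when not running inside a Slurm job.
    nnodes = int(env.get("SLURM_JOB_NUM_NODES", "1"))

    # The value may carry a GPU type prefix, e.g. "a100:4"; keep the count.
    raw_gpus = env.get("SLURM_GPUS_PER_NODE")
    gpus = int(raw_gpus.split(":")[-1]) if raw_gpus else default_gpus_per_node

    return [f"trainer.nnodes={nnodes}", f"trainer.n_gpus_per_node={gpus}"]


if __name__ == "__main__":
    print(" ".join(overrides_from_slurm()))
```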

Illegal memory access

If you encounter an error message such as "CUDA error: an illegal memory access was encountered" during rollout, please check the vLLM documentation for troubleshooting steps specific to your vLLM version.

Checkpoints

To convert a model checkpoint into the Hugging Face safetensors format, please refer to verl/model_merger.

Triton `compile_module_from_src` error

If you encounter a Triton compilation error similar to the stack trace below, set the use_torch_compile flag as described in https://verl.readthedocs.io/en/latest/examples/config.html to disable just-in-time compilation for fused kernels.

File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
  return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 338, in run
  return self.fn.run(*args, **kwargs)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
  device = driver.active.get_current_device()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
  self._initialize_obj()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
  self._obj = self._init_fn()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
  return actives[0]()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
  self.utils = CudaUtils()  # TODO: make static
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
  mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
  so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
  ret = subprocess.check_call(cc_cmd)
File "/data/lbh/conda_envs/verl/lib/python3.10/subprocess.py", line 369, in check_call
  raise CalledProcessError(retcode, cmd)

What is the meaning of train batch size, mini batch size, and micro batch size?

This figure illustrates the relationship between different batch size configurations.

https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA

https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d
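The arithmetic behind the three sizes can also be worked through in a few lines. The sketch below is a simplified reading, not verl's exact logic: it assumes plain data parallelism, a train batch counted in prompts, a mini batch that is the global slice used for one optimizer update, and a micro batch that is the per-GPU chunk processed in one forward/backward pass. The function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchPlan:
    samples_per_step: int    # responses collected per training step
    updates_per_step: int    # optimizer updates per training step
    mini_batch_per_gpu: int  # samples each GPU handles per update
    grad_accum_steps: int    # forward/backward passes per update


def plan_batches(train_batch_size, mini_batch_size, micro_batch_size_per_gpu,
                 world_size, rollout_n=1):
    """Relate train, mini, and micro batch sizes (simplified assumptions)."""
    samples_per_step = train_batch_size * rollout_n
    mini_batch_samples = mini_batch_size * rollout_n

    if samples_per_step % mini_batch_samples:
        raise ValueError("train batch must be divisible by mini batch")
    if mini_batch_samples % world_size:
        raise ValueError("mini batch must be divisible by the number of GPUs")

    mini_batch_per_gpu = mini_batch_samples // world_size
    if mini_batch_per_gpu % micro_batch_size_per_gpu:
        raise ValueError("per-GPU mini batch must be divisible by micro batch")

    return BatchPlan(
        samples_per_step=samples_per_step,
        updates_per_step=samples_per_step // mini_batch_samples,
        mini_batch_per_gpu=mini_batch_per_gpu,
        grad_accum_steps=mini_batch_per_gpu // micro_batch_size_per_gpu,
    )


if __name__ == "__main__":
    print(plan_batches(train_batch_size=256, mini_batch_size=64,
                       micro_batch_size_per_gpu=4, world_size=8, rollout_n=4))
```

Under these assumptions, the micro batch size only changes memory use and the number of accumulation steps, while the mini batch size changes how many optimizer updates happen per training step.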

How to generate a Ray timeline to analyse the performance of a training job?

To generate the Ray timeline file, set the config option ray_init.timeline_file to a JSON file path. For example:

ray_init.timeline_file=/tmp/ray_timeline.json

The file is written to the specified path at the end of a training job. You can view it with tools such as chrome://tracing or the Perfetto UI.

This figure shows the Ray timeline generated from a training job on 1 node with 4 GPUs:

https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray_timeline.png?raw=true
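Besides viewing the timeline graphically, a quick textual summary can help spot the most expensive tasks. The sketch below assumes the file follows the Chrome trace-event layout that chrome://tracing reads, that is, a JSON list of events where complete events have "ph" equal to "X", a "name", and a "dur" in microseconds. The function name is hypothetical and the field handling may need adjusting for your Ray version.

```python
import json
import os
import sys
from collections import defaultdict


def summarize_timeline(path, top=10):
    """Return the `top` event names ranked by total duration in seconds."""
    with open(path) as f:
        events = json.load(f)

    totals_us = defaultdict(float)
    for event in events:
        # Only complete events ("ph" == "X") carry a duration.
        if event.get("ph") == "X" and "dur" in event:
            totals_us[event.get("name", "<unnamed>")] += event["dur"]

    ranked = sorted(totals_us.items(), key=lambda item: item[1], reverse=True)
    return [(name, dur_us / 1e6) for name, dur_us in ranked[:top]]


if __name__ == "__main__":
    timeline_path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/ray_timeline.json"
    if os.path.exists(timeline_path):
        for name, seconds in summarize_timeline(timeline_path):
            print(f"{seconds:10.3f}s  {name}")
```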

How to set a proxy only for wandb?

If you need a proxy to access wandb, add the config below to your training job script. Compared to setting the global https_proxy environment variable, this approach does not interfere with other HTTP requests, such as those made by ChatCompletionScheduler.

+trainer.wandb_proxy=http://<your proxy and port>

Mismatch between inference and training sequence (high actor/grad_norm)

If the actor/grad_norm metric increases continuously during training, the cause may be a significant precision mismatch between the inference engine and training. You can use the following parameter to confirm this:

actor_rollout_ref.rollout.calculate_log_probs=True

This parameter adds metrics such as training/rollout_probs_diff_mean, which can be used to verify whether there is a precision difference between inference and training.

Under normal circumstances, the value of training/rollout_probs_diff_mean should be below 0.005. If you observe this value to be higher than 0.01, it indicates a precision issue from the inference engine. The precision issue is known to occur under the following conditions:

Using non-Hopper architecture GPUs, such as A100, L20, B200, etc.
Using vLLM with issue 22103 as the inference engine.
The input and output texts are long, for example, in multi-turn scenarios using reasoning models like Qwen3 for RL training.

If all three conditions above are met and you observe that rollout_probs_diff_mean is too high, it is recommended to add the following parameter to resolve the precision issue:

+actor_rollout_ref.rollout.engine_kwargs.vllm.disable_cascade_attn=True

The root cause of this issue is a bug in the flash attention used by vLLM. Although it has been fixed, the fix has not yet been released in the latest version of vLLM (v0.10.2). For a more detailed explanation of this issue, please refer to Fix LSE output error in FA2 kv-split.

Until vLLM releases a new version with this fix, it is recommended to use the configuration above to disable cascade attention as a workaround.
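The decision rule described in this section can be summarized as a small check. The sketch below uses only the thresholds and conditions stated above (0.005, 0.01, and the three conditions); the function names and the "inconclusive" label for values between the two thresholds are assumptions added for illustration.

```python
def classify_probs_diff(diff_mean: float) -> str:
    """Classify training/rollout_probs_diff_mean using the stated thresholds."""
    if diff_mean < 0.005:
        return "normal"
    if diff_mean > 0.01:
        return "precision issue"
    # The range between the thresholds is not characterized in the text.
    return "inconclusive"


def should_disable_cascade_attn(diff_mean: float, non_hopper_gpu: bool,
                                affected_vllm: bool, long_sequences: bool) -> bool:
    """Suggest the workaround only when all three conditions hold
    and the metric indicates a precision issue."""
    all_conditions = non_hopper_gpu and affected_vllm and long_sequences
    return all_conditions and classify_probs_diff(diff_mean) == "precision issue"


if __name__ == "__main__":
    if should_disable_cascade_attn(0.02, True, True, True):
        print("+actor_rollout_ref.rollout.engine_kwargs.vllm.disable_cascade_attn=True")
```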

Ray related

How to add breakpoint for debugging with distributed Ray?

“Unable to register worker with raylet”