Data collection based on FSDP backend on Ascend devices (en)

Last updated: 08/14/2025.

This tutorial describes how to collect profiling data on Ascend devices when training with the GRPO or DAPO algorithm on the FSDP backend.

Configuration

Two levels of configuration control data collection:

  1. Global profiler control: Use parameters in ppo_trainer.yaml to control the collection mode and steps.

  2. Role profiler control: Use parameters in each role’s profiler field to control the collection mode for each role.

Global collection control

Use parameters in ppo_trainer.yaml to control the collection mode and steps.

  • global_profiler: Controls the profiling tool, the steps to collect, and the output path.

    • tool: The profiling tool to use, options are nsys, npu, torch, torch_memory.

    • steps: A list of training steps to collect, such as [2, 4], which collects steps 2 and 4. If set to null, no collection occurs.

    • save_path: The path to save the collected data. Default is “outputs/profile”.
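As a rough illustration of how the steps setting can be read, the sketch below treats global_profiler as a plain dictionary and checks whether a given training step would be collected. The function name should_collect_step and the dictionary layout are hypothetical and only mirror the fields listed above; the trainer's actual logic may differ.

```python
from typing import Any, Dict

# Hypothetical in-memory mirror of the global_profiler section of ppo_trainer.yaml.
global_profiler: Dict[str, Any] = {
    "tool": "npu",
    "steps": [2, 4],
    "save_path": "outputs/profile",
}

# Tool names as listed in this document.
ALLOWED_TOOLS = {"nsys", "npu", "torch", "torch_memory"}


def should_collect_step(cfg: Dict[str, Any], step: int) -> bool:
    """Return True if `step` is listed in cfg["steps"]; null (None) disables collection."""
    steps = cfg.get("steps")
    if steps is None:
        # steps: null means no collection occurs, regardless of the tool.
        return False
    if cfg.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"unknown profiling tool: {cfg.get('tool')!r}")
    return step in steps
```

With the dictionary above, steps 2 and 4 would be collected and every other step skipped.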

Role collection control

In each role’s profiler field, you can control the collection mode for that role.

  • enable: Whether to enable profiling for this role.

  • all_ranks: Whether to collect data from all ranks.

  • ranks: A list of ranks to collect data from. If empty, no data is collected.

  • tool_config: Configuration for the profiling tool used by this role.
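The interaction between enable, all_ranks, and ranks can be sketched as follows. This is an assumed reading of the fields above (enable gates everything, all_ranks takes precedence over ranks), not the library's implementation; rank_is_profiled is a hypothetical helper name.

```python
from typing import Any, Dict


def rank_is_profiled(profiler_cfg: Dict[str, Any], rank: int) -> bool:
    """Decide whether `rank` is profiled under the assumed semantics of the role fields."""
    if not profiler_cfg.get("enable", False):
        # Profiling is switched off for this role.
        return False
    if profiler_cfg.get("all_ranks", False):
        # Every rank is collected; the ranks list is ignored (assumption).
        return True
    # An empty or missing ranks list means no data is collected.
    return rank in (profiler_cfg.get("ranks") or [])
```

For example, a role configured with enable set to True, all_ranks set to False, and ranks set to [0, 4] would, under these assumptions, collect data only on ranks 0 and 4.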

Use parameters in each role’s profiler.tool_config.npu to control NPU profiler behavior:

  • level: Collection level. Options are level_none, level0, level1, and level2.

    • level_none: Disables all level-based data collection (turns off profiler_level).

    • level0: Collect high-level application data, underlying NPU data, and operator execution details on NPU.

    • level1: Extends level0 by adding CANN-layer AscendCL data and AI Core performance metrics on NPU.

    • level2: Extends level1 by adding CANN-layer Runtime data and AI CPU metrics.

  • contents: A list of options to control the collection content, such as npu, cpu, memory, shapes, module, stack.

    • npu: Whether to collect device-side performance data.

    • cpu: Whether to collect host-side performance data.

    • memory: Whether to enable memory analysis.

    • shapes: Whether to record tensor shapes.

    • module: Whether to record framework-layer Python call stack information.

    • stack: Whether to record operator call stack information.

  • analysis: Enables automatic data parsing.

  • discrete: Whether to enable discrete mode.
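Because level and contents accept only the values listed above, a small pre-flight check can catch typos before a run starts. The sketch below is illustrative only: check_npu_tool_config is a hypothetical helper, and it validates only the keys that are present.

```python
from typing import Any, Dict, List

# Allowed values as listed in this document.
ALLOWED_LEVELS = ("level_none", "level0", "level1", "level2")
ALLOWED_CONTENTS = ("npu", "cpu", "memory", "shapes", "module", "stack")


def check_npu_tool_config(npu_cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means nothing looked wrong."""
    problems: List[str] = []
    if "level" in npu_cfg and npu_cfg["level"] not in ALLOWED_LEVELS:
        problems.append(f"level must be one of {ALLOWED_LEVELS}, got {npu_cfg['level']!r}")
    for item in npu_cfg.get("contents", []):
        if item not in ALLOWED_CONTENTS:
            problems.append(f"unknown contents entry: {item!r}")
    for flag in ("analysis", "discrete"):
        if flag in npu_cfg and not isinstance(npu_cfg[flag], bool):
            problems.append(f"{flag} must be a boolean, got {npu_cfg[flag]!r}")
    return problems
```

A configuration such as level set to level1 with contents [npu, cpu] passes this check, while a misspelled level or an unlisted contents entry is reported.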

Examples

Disabling collection

global_profiler:
   steps: null # disable profile

End-to-End collection

global_profiler:
   steps: [1, 2, 5]
actor_rollout_ref:
   actor:
      profiler:
         enable: True
         all_ranks: True
         tool_config:
            npu:
               discrete: False
  # rollout & ref follow actor settings

Discrete Mode Collection

global_profiler:
   steps: [1, 2, 5]
actor_rollout_ref:
   actor:
      profiler:
         enable: True
         all_ranks: True
         tool_config:
            npu:
               discrete: True
  # rollout & ref follow actor settings
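If the trainer is launched with Hydra-style key=value overrides instead of an edited YAML file (an assumption not stated in this document), the nested settings above can be flattened into dotted overrides. The helper to_overrides below is hypothetical and uses only the standard library.

```python
import json
from typing import Any, Dict, List


def to_overrides(cfg: Dict[str, Any], prefix: str = "") -> List[str]:
    """Flatten a nested dict into dotted `key=value` strings, preserving key order."""
    overrides: List[str] = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        elif isinstance(value, str):
            overrides.append(f"{path}={value}")
        else:
            # json.dumps renders True as true, None as null, and lists without spaces.
            overrides.append(f"{path}={json.dumps(value, separators=(',', ':'))}")
    return overrides


# Mirror of the discrete mode example above.
discrete_example = {
    "global_profiler": {"steps": [1, 2, 5]},
    "actor_rollout_ref": {
        "actor": {
            "profiler": {
                "enable": True,
                "all_ranks": True,
                "tool_config": {"npu": {"discrete": True}},
            }
        }
    },
}

for line in to_overrides(discrete_example):
    print(line)
```

The first line printed is global_profiler.steps=[1,2,5]; whether the launcher accepts this exact syntax depends on how the trainer parses its arguments.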

Visualization

Collected data is stored in the user-defined save_path and can be visualized with the MindStudio Insight tool.

If the analysis parameter is set to False, offline parsing is required after data collection:

import torch_npu

# Set profiler_path to the parent directory of the "localhost.localdomain_<PID>_<timestamp>_ascend_pt" folder.
profiler_path = "/path/to/profile_output"  # placeholder; replace with the actual parent directory
torch_npu.profiler.profiler.analyse(profiler_path=profiler_path)
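When a run produces several result folders, locating the right parent directory by hand can be tedious. The sketch below searches save_path for folders whose names end in _ascend_pt and returns their parent directories. find_profiler_paths is a hypothetical helper, and the assumption that result folders may sit at any depth under save_path is not confirmed by this document.

```python
from pathlib import Path
from typing import List


def find_profiler_paths(save_path: str) -> List[Path]:
    """Return the sorted, de-duplicated parents of every '*_ascend_pt' folder under save_path."""
    root = Path(save_path)
    parents = {p.parent for p in root.rglob("*_ascend_pt") if p.is_dir()}
    return sorted(parents)
```

Each returned path is a candidate value for profiler_path in the offline parsing call.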