Conversation

@bithighrr

What does this PR do?

This PR addresses two functionality gaps in rollout discrete profiling for the sglang backend, by adding global step awareness and expanding support for flexible profile control parameters:

  1. Missing global step information resulted in disorganized profiling files;
  2. Lack of support for specifying critical sglang backend profile control parameters (including num_steps, profile_by_stage, and merge_profiles) led to overly large profile files and hindered convenient profiling analysis.

Solutions implemented:

  • Forward kwargs to the rollout server to enable global step awareness, and create an independent folder for each global step so that profiling data is organized in a structured way (see the sketch after this list);
  • Extend the profiling config by adding optional parameters to the contents field, and remove unnecessary restrictions on content values in TorchProfilerToolConfig so that sglang-specific profile control parameters (e.g., profile_by_stage, merge_profiles) can be specified.
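
A minimal sketch of the first point, assuming hypothetical names (the host class, profiler_config.save_path) and assuming sglang's tokenizer_manager.start_profile accepts an output_dir argument:

    import os

    async def start_profile(self, **kwargs):
        # Pull the global step out of the forwarded kwargs and create one
        # output folder per training step so traces from different steps
        # are not mixed in a single flat directory.
        global_step = kwargs.pop("global_step", None)
        output_dir = self.profiler_config.save_path  # assumed config field
        if global_step is not None:
            output_dir = os.path.join(output_dir, f"global_step_{global_step}")
        os.makedirs(output_dir, exist_ok=True)
        await self.tokenizer_manager.start_profile(output_dir=output_dir, **kwargs)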

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: PR #4320: [perf] feat: verl profiler system support Agent Loop scenario and integrate torch.profiler
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

To validate the functionality of each profile control parameter for the sglang rollout backend, orthogonal test cases were designed (minimizing redundant combinations while covering all key parameter values). The test results are as follows:

| step_start | step_end | profile-by-stage | merge-profiles | stack | shapes | cpu/cuda | Test Result |
|---|---|---|---|---|---|---|---|
| default | default | default (False) | default (False) | default (False) | default (False) | default | Pass |
| 0 (≥0) | 1 (≥1) | default | default | default | default | default | Pass |
| 0 (≥0) | default | default | default | default | default | default | Pass |
| default | 1 (≥1) | default | default | default | default | default | Pass |
| default | default | True | default | default | default | default | Pass |
| default | default | default | True | default | default | default | Pass |
| default | default | default | default | True | default | default | Pass |
| default | default | default | default | default | True | default | Pass |
| default | default | default | default | default | default | [] / [cpu] / [cuda] / [cpu,cuda] | Pass |

API and Usage Example

  rollout:
    quantization: null
    profiler:
      enable: True
      all_ranks: False
      ranks: [0, 1, 2] 
      tool_config:
        torch:
          discrete: True
          step_start: 0
          step_end: 1
          contents: [cpu, cuda, stack, shapes, profile_by_stage, merge_profiles]
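
As a rough illustration of how such a contents list could be consumed (a sketch only: profiler_config.contents and the exact argument names are assumptions; the PR's real mapping lives in _profile_args):

    from typing import Any

    def _profile_args(self, **kwargs) -> dict[str, Any]:
        # Translate the generic `contents` list into profile controls; the
        # names mirror torch.profiler / sglang options but are illustrative.
        contents = self.profiler_config.contents or []
        args: dict[str, Any] = {
            "activities": [c for c in contents if c in ("cpu", "cuda")] or None,
            "with_stack": "stack" in contents,
            "record_shapes": "shapes" in contents,
            "profile_by_stage": "profile_by_stage" in contents,
            "merge_profiles": "merge_profiles" in contents,
        }
        args.update(kwargs)  # e.g. num_steps derived from step_start/step_end
        return args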

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces global step awareness for rollout profiling and expands support for sglang-specific profile control parameters. The changes correctly integrate global_step into the profiling process and add new configuration options for TorchProfilerToolConfig. However, there are a couple of high-severity issues related to the profiling logic in async_sglang_server.py that need to be addressed to ensure correct and predictable behavior.

Comment on lines 427 to 428
rollout_start_step = max(1, config.step_start or 1)
rollout_end_step = max(2, config.step_end or 2)
Contributor

Severity: high

The hardcoded max(1, ...) and max(2, ...) for rollout_start_step and rollout_end_step can lead to unintended profiling ranges. For example, if a user sets config.step_start=0 and config.step_end=1 (intending to profile only step 0), rollout_end_step will be forced to 2, resulting in profiling steps 0 and 1. This overrides the user's explicit configuration. It's generally better to respect user-provided values or implement clearer default logic.

Suggested change:
- rollout_start_step = max(1, config.step_start or 1)
- rollout_end_step = max(2, config.step_end or 2)
+ rollout_start_step = config.step_start if config.step_start is not None else 0
+ rollout_end_step = config.step_end if config.step_end is not None else -1
+ if rollout_end_step != -1 and rollout_end_step < rollout_start_step:
+     raise ValueError("rollout_end_step cannot be less than rollout_start_step")
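
The root cause is that 0 is falsy in Python, so config.step_start or 1 already discards an explicit 0 before max is applied:

    step_start, step_end = 0, 1     # user intends to profile only step 0
    print(max(1, step_start or 1))  # 1, because `0 or 1` evaluates to 1
    print(max(2, step_end or 2))    # 2, because max() then raises 1 to 2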

Author

For SGLang profiling, the forward step is required to start from 1.

assert rollout_num_steps is None or rollout_num_steps > 0, (
    f"Rollout num steps must be greater than 0 for sglang, but got {rollout_num_steps}"
)
self._auto_stop_profiling = rollout_num_steps is not None
Contributor

Severity: high

The logic for _auto_stop_profiling seems inverted. self._auto_stop_profiling is set to True if rollout_num_steps is defined (i.e., config.step_start and config.step_end are provided). However, the stop_profile method's condition and not self._auto_stop_profiling means that if rollout_num_steps is set, the explicit self.tokenizer_manager.stop_profile() call will be skipped. This could prevent profiling sessions from being properly terminated, leading to resource leaks or corrupted profile files.

Suggested change:
- self._auto_stop_profiling = rollout_num_steps is not None
+ self._auto_stop_profiling = rollout_num_steps is not None and rollout_num_steps > 0

Collaborator

rollout_num_steps can only be a number here and will never be None, so _auto_stop_profiling is always True.

Author

Once the rollout_num_steps parameter is set, SGLang will automatically stop the profiling session.

Author

This has been fixed.


def __post_init__(self) -> None:
    """config validation logics go here"""
    __support_contents = ["cuda", "cpu", "memory", "shapes", "stack", "profile-by-stage", "merge-profiles"]
Collaborator

Please explain the two newly added contents and try to find an appropriate place to update the documentation (at least add it here as a code comment).
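
For example, the explanation could sit next to the list itself; the meanings below follow sglang's profiling options and are illustrative, not text from this PR:

    __support_contents = [
        "cuda", "cpu", "memory", "shapes", "stack",
        # sglang-specific controls (assumed semantics):
        "profile-by-stage",  # profile the prefill and decode stages separately
        "merge-profiles",    # merge the resulting traces into a single file
    ]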

Author

The purpose of the parameters is explained in the _profile_args interface. This may be added to the documentation in future work.

self.profiler_controller.check_enable()
and self.profiler_controller.check_this_rank()
and self.profiler_controller.is_discrete_mode()
and not self._auto_stop_profiling
Collaborator

If _auto_stop_profiling is always True, it seems that stop_profile will never execute? Please verify the logic related to _auto_stop_profiling.
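
One possible resolution, sketched under the reading above that auto-stop should apply only when the user explicitly bounded the profiling window:

    # Rely on sglang's auto-stop only when both bounds were given explicitly;
    # otherwise stop_profile() still needs to issue the explicit stop call.
    self._auto_stop_profiling = (
        config.step_start is not None and config.step_end is not None
    )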


async def start_profile(self, **kwargs):
    # TODO: Persist global_step to engine server-created file/path
    kwargs.pop("global_step")
Collaborator

Keep the TODO, and remove the pop operation.

An extra unused parameter should not cause any issue. However, if we need to use it later, we might spend time figuring out why global_step has vanished ^_^

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AsyncLLM.start_profile interface from vLLM does not support the global_step parameter.
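
Under that constraint, one way to follow the suggestion is to leave global_step in kwargs untouched and simply not forward it (a sketch; self.engine as the vLLM AsyncLLM handle is an assumption):

    async def start_profile(self, **kwargs):
        # TODO: Persist global_step to engine server-created file/path
        # global_step stays in kwargs for future use but is not forwarded,
        # since vLLM's AsyncLLM.start_profile takes no extra arguments.
        await self.engine.start_profile()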


return TokenOutput(token_ids=token_ids, log_probs=log_probs, routed_experts=routed_experts)

def _profile_args(self, **kwargs) -> dict[str, Any]:
Contributor

Keep Profiler Logic Cohesive: The profiler logic seems to be scattered across the main workflow. It would be better to encapsulate it to keep the main execution path clean and cohesive. Avoid adding too much profiling-specific code directly into the core rollout logic

Author

This is the SGLang implementation class, so interfaces that may differ across backends should be kept here; that leaves room for better encapsulation and abstraction design in the future.

Author

I've pushed a new commit to improve code cohesion and reduce coupling. Thank you for your advice.


def __post_init__(self) -> None:
    """config validation logics go here"""
    __support_contents = ["cuda", "cpu", "memory", "shapes", "stack", "profile-by-stage", "merge-profiles"]
Contributor

Stick to Common Parameters: Some parameters added here seem specific to the SGLang interface (e.g., profile-by-stage). I suggest removing them if they don't provide significant benefits for profiling analysis. It's better to stick to system parameters that are common across both vLLM and SGLang backends to maintain consistency and reduce maintenance overhead.

Author

The contents field should accept both generic parameters and profiling parameters specific to the rollout backend implementation; otherwise the rollout backend's own performance analysis capabilities would be unusable later on.
