[rollout, perf, cfg] fix: Add global step info and support more profile control params for rollout profiling (sglang backend) #5025
Conversation
Code Review
This pull request introduces global step awareness for rollout profiling and expands support for sglang-specific profile control parameters. The changes correctly integrate global_step into the profiling process and add new configuration options for TorchProfilerToolConfig. However, there are a couple of high-severity issues related to the profiling logic in async_sglang_server.py that need to be addressed to ensure correct and predictable behavior.
    rollout_start_step = max(1, config.step_start or 1)
    rollout_end_step = max(2, config.step_end or 2)
The hardcoded max(1, ...) and max(2, ...) for rollout_start_step and rollout_end_step can lead to unintended profiling ranges. For example, if a user sets config.step_start=0 and config.step_end=1 (intending to profile only step 0), rollout_end_step will be forced to 2, resulting in profiling steps 0 and 1. This overrides the user's explicit configuration. It's generally better to respect user-provided values or implement clearer default logic.
Suggested change:

    - rollout_start_step = max(1, config.step_start or 1)
    - rollout_end_step = max(2, config.step_end or 2)
    + rollout_start_step = config.step_start if config.step_start is not None else 0
    + rollout_end_step = config.step_end if config.step_end is not None else -1
    + if rollout_end_step != -1 and rollout_end_step < rollout_start_step:
    +     raise ValueError("rollout_end_step cannot be less than rollout_start_step")
For SGLang profiling, the forward step is required to start from 1.
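A minimal sketch of how that constraint could be enforced while still respecting the user's configured range (the validation policy below is an assumption for illustration, not the PR's actual code):

```python
def resolve_profile_steps(step_start: int | None, step_end: int | None) -> tuple[int, int]:
    """Honor SGLang's requirement that profiled forward steps start at 1,
    without silently widening the user's requested range (sketch, not PR code)."""
    start = step_start if step_start is not None else 1
    end = step_end if step_end is not None else start + 1
    if start < 1:
        raise ValueError(f"SGLang forward steps start from 1, got step_start={start}")
    if end <= start:
        raise ValueError("step_end must be greater than step_start")
    return start, end
```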
    assert rollout_num_steps is None or rollout_num_steps > 0, (
        f"Rollout num steps must be greater than 0 for sglang, but got {rollout_num_steps}"
    )
    self._auto_stop_profiling = rollout_num_steps is not None
The logic for _auto_stop_profiling seems inverted. self._auto_stop_profiling is set to True if rollout_num_steps is defined (i.e., config.step_start and config.step_end are provided). However, the stop_profile method's condition and not self._auto_stop_profiling means that if rollout_num_steps is set, the explicit self.tokenizer_manager.stop_profile() call will be skipped. This could prevent profiling sessions from being properly terminated, leading to resource leaks or corrupted profile files.
Suggested change:

    - self._auto_stop_profiling = rollout_num_steps is not None
    + self._auto_stop_profiling = rollout_num_steps is not None and rollout_num_steps > 0
rollout_num_steps can only be a number here and will never be None, so _auto_stop_profiling is always True.
Once the rollout_num_steps parameter is set, SGLang will automatically stop the profiling session.
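A hedged sketch of the mechanism being described here — passing `num_steps` so the engine stops the profiler by itself (the method name and the exact `start_profile` keyword names are assumed from context, not verified against SGLang):

```python
async def start_rollout_profile(self, profile_dir: str, start_step: int, end_step: int) -> None:
    """Sketch: with num_steps set, SGLang's profiler stops itself after that
    many forward steps, so no paired stop_profile() call is needed."""
    num_steps = end_step - start_step
    await self.tokenizer_manager.start_profile(
        output_dir=profile_dir,  # assumed parameter name
        num_steps=num_steps,     # triggers SGLang's auto-stop behavior
    )
    # Remember that the engine will stop on its own; the explicit
    # stop_profile() path can then be skipped.
    self._auto_stop_profiling = True
```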
This has been fixed.
    def __post_init__(self) -> None:
        """config validation logics go here"""
        __support_contents = ["cuda", "cpu", "memory", "shapes", "stack", "profile-by-stage", "merge-profiles"]
Please explain the two newly added contents and try to find an appropriate place to update the documentation (at least add it here as a code comment).
The purpose of these parameters is explained in the _profile_args interface. This may be added to the documentation in future work.
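For readers, a hypothetical sketch of the kind of mapping `_profile_args` performs — the keyword names and the helper's body are illustrative, not the PR's actual code:

```python
from typing import Any

def _profile_args(self, **kwargs) -> dict[str, Any]:
    """Translate generic profiler config into sglang-specific start_profile kwargs.

    The two new contents are sglang-only: "profile-by-stage" appears to record
    prefill and decode stages as separate traces, and "merge-profiles" appears
    to merge per-rank traces into one file. Key names below are illustrative.
    """
    contents = self.config.contents or []
    return {
        "activities": [c.upper() for c in contents if c in ("cuda", "cpu", "memory")],
        "with_stack": "stack" in contents,
        "record_shapes": "shapes" in contents,
        "profile_by_stage": "profile-by-stage" in contents,
        "merge_profiles": "merge-profiles" in contents,
        **kwargs,
    }
```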
    self.profiler_controller.check_enable()
    and self.profiler_controller.check_this_rank()
    and self.profiler_controller.is_discrete_mode()
    and not self._auto_stop_profiling
If _auto_stop_profiling is always True, it seems that stop_profile will never execute. Please verify the logic around _auto_stop_profiling.
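To illustrate the intended interplay (a sketch mirroring the snippet above; it assumes `_auto_stop_profiling` is only True when `num_steps` was actually passed through to SGLang):

```python
async def stop_profile(self) -> None:
    """Sketch: only stop explicitly when SGLang is not auto-stopping."""
    if (
        self.profiler_controller.check_enable()
        and self.profiler_controller.check_this_rank()
        and self.profiler_controller.is_discrete_mode()
        and not self._auto_stop_profiling  # engine stops itself when num_steps is set
    ):
        await self.tokenizer_manager.stop_profile()
```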
    async def start_profile(self, **kwargs):
        # TODO: Persist global_step to engine server-created file/path
        kwargs.pop("global_step")
Keep the TODO, but remove the pop operation. An extra unused parameter should not cause any issue; however, if we need to use it later, we might waste time figuring out why global_step has vanished ^_^
The AsyncLLM.start_profile interface from vLLM does not support the global_step parameter ~
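A small sketch of the defensive variant being discussed — using `pop` with a default so a missing key cannot raise before delegating to vLLM (the `self.engine` attribute name is an assumption):

```python
from typing import Any

async def start_profile(self, **kwargs: Any) -> None:
    # TODO: Persist global_step to engine server-created file/path
    # vLLM's AsyncLLM.start_profile takes no global_step parameter, so drop it
    # here; the default argument avoids a KeyError when the caller omits it.
    kwargs.pop("global_step", None)
    await self.engine.start_profile(**kwargs)  # "engine" attribute name assumed
```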
    return TokenOutput(token_ids=token_ids, log_probs=log_probs, routed_experts=routed_experts)

    def _profile_args(self, **kwargs) -> dict[str, Any]:
Keep profiler logic cohesive: the profiler logic seems to be scattered across the main workflow. It would be better to encapsulate it to keep the main execution path clean and cohesive. Avoid adding too much profiling-specific code directly into the core rollout logic.
This is the SGLang implementation class, where potentially divergent interfaces are deliberately kept so that better encapsulation and abstraction can be designed in the future.
I've pushed a new commit to improve code cohesion and reduce coupling. Thank you for your advice.
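One shape such encapsulation could take — a hypothetical wrapper so the rollout path only ever calls `maybe_start`/`maybe_stop` (all names here are illustrative, not the committed code):

```python
class RolloutProfiler:
    """Hypothetical wrapper keeping profiling concerns out of the rollout path."""

    def __init__(self, controller, tokenizer_manager, profile_args: dict):
        self.controller = controller
        self.tokenizer_manager = tokenizer_manager
        self.profile_args = profile_args
        self.auto_stop = "num_steps" in profile_args  # engine stops itself

    async def maybe_start(self, global_step: int) -> None:
        # global_step reserved for naming/persisting the trace output later
        if self.controller.check_enable() and self.controller.check_this_rank():
            await self.tokenizer_manager.start_profile(**self.profile_args)

    async def maybe_stop(self) -> None:
        if self.controller.check_enable() and not self.auto_stop:
            await self.tokenizer_manager.stop_profile()
```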
    def __post_init__(self) -> None:
        """config validation logics go here"""
        __support_contents = ["cuda", "cpu", "memory", "shapes", "stack", "profile-by-stage", "merge-profiles"]
Stick to Common Parameters: Some parameters added here seem specific to the SGLang interface (e.g., profile-by-stage). I suggest removing them if they don't provide significant benefits for profiling analysis. It's better to stick to system parameters that are common across both vLLM and SGLang backends to maintain consistency and reduce maintenance overhead.
This should include both generic parameters and profiling parameters specific to the rollout backend implementation, so that the rollout backend's performance-analysis capabilities remain usable later on.
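For illustration, one way the supported contents could be organized to make the common/backend-specific split explicit (a sketch; the grouping, constants, and simplified dataclass fields are assumptions):

```python
from dataclasses import dataclass, field

# Backend-agnostic torch-profiler contents vs. sglang-only extensions (sketch).
COMMON_CONTENTS = ["cuda", "cpu", "memory", "shapes", "stack"]
SGLANG_ONLY_CONTENTS = ["profile-by-stage", "merge-profiles"]

@dataclass
class TorchProfilerToolConfig:
    contents: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        """config validation logics go here"""
        supported = set(COMMON_CONTENTS) | set(SGLANG_ONLY_CONTENTS)
        unknown = [c for c in self.contents if c not in supported]
        if unknown:
            raise ValueError(f"Unsupported profiler contents: {unknown}")
```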
What does this PR do?
Checklist Before Starting
Title format: `[{modules}] {type}: {description}` (this will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If the PR introduces a breaking change, prepend `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
To validate the functionality of each profile control parameter for the sglang rollout backend, orthogonal test cases were designed (minimizing redundant combinations while covering all key parameter values). The test results are as follows:
API and Usage Example
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Once the PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.