rccl: add patch to limit max memory usage, reduce number of targets#3386
Open
Signed-off-by: Harmen Stoppels <me@harmenstoppels.nl>
@afzpatel just FYI. What's the idea then? Do you need 256GB of memory to build it in a "reasonable" time? Edit: it seems we only build for one AMD GPU target in the rest of the stack, so I'm applying that to rccl too.
With 1 target, peak memory is apparently 21.7GB, and 3 jobs were used. I guess we could budget 8GB per job per arch or something?
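The arithmetic behind that suggestion can be sketched as follows. This is a rough illustration only, not code from the patch: the function names are hypothetical, and it assumes memory use scales linearly with both the number of parallel jobs and the number of GPU architectures, which the single measurement above does not prove.

```python
def observed_per_job_gb(peak_gb: float, jobs: int, archs: int = 1) -> float:
    """Average memory per job per arch implied by one measured build."""
    return peak_gb / (jobs * archs)


def estimated_peak_gb(jobs: int, archs: int, per_job_per_arch_gb: float = 8.0) -> float:
    """Peak memory predicted by a flat per-job, per-arch budget (assumed linear)."""
    return jobs * archs * per_job_per_arch_gb


if __name__ == "__main__":
    # Measurement quoted in the comment: 21.7GB peak, 3 jobs, 1 target.
    print(f"observed: {observed_per_job_gb(21.7, 3):.2f} GB per job per arch")
    # An 8GB budget leaves some headroom over the observed average.
    print(f"budgeted: {estimated_peak_gb(3, 1):.1f} GB for 3 jobs, 1 arch")
```

Under these assumptions the measured build averages a little over 7GB per job, so 8GB per job per arch is a slightly conservative round number.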
zackgalbreath approved these changes on Feb 13, 2026
Giving @afzpatel, @renjithravindrankannath, and @srekolam an opportunity to look at this.
ref ROCm/rocm-systems#3231
Because we don't use cgroups v2, the computation of "available memory" doesn't
work, and we end up with up to 16 jobs. User reports claim 32GB per job is
needed, so that is 512GB of memory for the build...
This patch allows us to set a maximum on the amount of memory used, which I've
hard-coded to 128GB (which is 4x the claim we make in CI, but otherwise the build
takes well over 6 hours...)
Edit: also reduce the number of GPU architectures to just one.
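As a minimal sketch of the effect of such a cap (this is not the patch itself, which lives in rccl's build system; the function and parameter names here are hypothetical, and the 32GB per job figure is the user-reported claim, not a measured value):

```python
def capped_jobs(detected_gb: float,
                cap_gb: float = 128.0,
                per_job_gb: float = 32.0,
                max_jobs: int = 16) -> int:
    """Number of parallel jobs that fit in min(detected memory, cap)."""
    usable_gb = min(detected_gb, cap_gb)
    # Always allow at least one job, and never exceed the job ceiling.
    return max(1, min(max_jobs, int(usable_gb // per_job_gb)))


if __name__ == "__main__":
    # Without a cap, 16 jobs at 32GB each would need 16 * 32 = 512GB.
    print("uncapped requirement:", 16 * 32, "GB")
    # With the 128GB cap, a host that misreports a large amount of
    # available memory is still limited to 4 jobs.
    print("jobs with cap:", capped_jobs(detected_gb=1024))
```

With these numbers the cap reduces the build from 16 jobs to 4, which is where the trade-off between memory use and build time comes from.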