rccl: add patch to limit max memory usage, reduce number of targets#3386

Open
haampie wants to merge 4 commits into develop from hs/fix/rccl-archs

haampie commented Feb 12, 2026

ref ROCm/rocm-systems#3231

Because we don't use cgroups v2, the computation of "available memory" doesn't
work, and we end up with up to 16 jobs. User reports claim 32GB per job is
needed, which comes to 512GB of memory for the build...
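
For illustration only, here is a minimal sketch of how a build helper could probe the memory limit under both cgroup hierarchies. The function name and the `root` parameter are hypothetical, and this is not rccl's actual build logic; only the two file locations (`memory.max` for cgroups v2, `memory/memory.limit_in_bytes` for cgroups v1) are the standard kernel interfaces.

```python
from pathlib import Path
from typing import Optional


def cgroup_memory_limit_bytes(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if none is found.

    Hypothetical helper: checks the cgroups v2 file first, then falls back
    to the cgroups v1 location.
    """
    candidates = [
        Path(root) / "memory.max",                        # cgroups v2
        Path(root) / "memory" / "memory.limit_in_bytes",  # cgroups v1
    ]
    for path in candidates:
        try:
            text = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable, try the next layout
        if text == "max":
            return None  # cgroups v2 spelling of "no limit"
        # Note: cgroups v1 reports "no limit" as a very large integer.
        return int(text)
    return None
```

A tool that only reads the v2 file gets no answer on a v1 system, which is one way the "available memory" estimate can end up unbounded.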

This patch allows us to set a maximum on the amount of memory used, which I've
hard-coded to 128GB (which is 4x the claim we make in CI, but otherwise the build
takes well over 6 hours...)
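
To make the arithmetic concrete, here is a rough sketch of how a memory cap translates into a job count. The function name and parameters are made up for illustration and are not taken from the patch; only the figures (16 jobs, 32GB per job, 128GB cap) come from the description above.

```python
def jobs_under_memory_cap(requested_jobs: int,
                          mem_cap_gb: float,
                          mem_per_job_gb: float) -> int:
    """Clamp the number of parallel jobs so they fit under a memory cap.

    Hypothetical helper, assuming every job needs roughly the same memory.
    Always returns at least 1 so the build can make progress.
    """
    if mem_per_job_gb <= 0:
        raise ValueError("mem_per_job_gb must be positive")
    affordable = int(mem_cap_gb // mem_per_job_gb)
    return max(1, min(requested_jobs, affordable))


# Uncapped: 16 jobs at the reported 32GB each.
print(16 * 32)                             # 512 (GB)
# With the hard-coded 128GB cap.
print(jobs_under_memory_cap(16, 128, 32))  # 4
# With a 32GB cap the build is effectively serial.
print(jobs_under_memory_cap(16, 32, 32))   # 1
```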

Edit: also reduce the number of GPU architectures to just one.

Signed-off-by: Harmen Stoppels <me@harmenstoppels.nl>

haampie commented Feb 13, 2026

@afzpatel just FYI, rccl managed to make a node with 2TB of memory go OOM :) Can you potentially convince the relevant people to prioritize less resource-intensive builds? It's nice that the ecosystem is open source, but when nobody can build it... If we limit this to 32GB of memory, which requires a patch, we cannot build it because it takes 6h+.

What's the idea then? You have to have 256GB of memory to build it in "reasonable" time?

Edit: it seems we only build for one AMD GPU target in the rest of the stack, so I'm applying that to rccl too.

Signed-off-by: Harmen Stoppels <me@harmenstoppels.nl>
Signed-off-by: Harmen Stoppels <me@harmenstoppels.nl>
haampie changed the title from "rccl: add patch to limit max memory usage" to "rccl: add patch to limit max memory usage, reduce number of targets" on Feb 13, 2026

haampie commented Feb 13, 2026

With 1 target, peak memory is apparently 21.7GB and 3 jobs were used. I guess we could do 8GB per job per arch or something?
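
As a back-of-the-envelope check of that heuristic, here is a small sketch. The names are hypothetical, and the assumption that memory per job scales linearly with the number of GPU architectures is only a guess, not something measured here.

```python
def estimated_gb_per_job(num_archs: int,
                         gb_per_job_per_arch: float = 8.0) -> float:
    """Estimated memory per compile job, assuming linear scaling per arch."""
    return gb_per_job_per_arch * num_archs


def jobs_for_archs(mem_cap_gb: float,
                   num_archs: int,
                   gb_per_job_per_arch: float = 8.0,
                   max_jobs: int = 16) -> int:
    """Pick a job count that fits under mem_cap_gb for the given arch count."""
    if num_archs < 1:
        raise ValueError("num_archs must be at least 1")
    per_job = estimated_gb_per_job(num_archs, gb_per_job_per_arch)
    return max(1, min(max_jobs, int(mem_cap_gb // per_job)))


# Observed with 1 target: 21.7GB peak across 3 jobs.
print(round(21.7 / 3, 1))     # 7.2 (GB per job, just under the 8GB guess)
print(jobs_for_archs(32, 1))  # 4
print(jobs_for_archs(128, 4)) # 4
```

The observed 7.2GB per job sits close to the proposed 8GB figure, so the round number leaves a little headroom for a single target.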

Signed-off-by: Harmen Stoppels <me@harmenstoppels.nl>

haampie commented Feb 13, 2026

Giving @afzpatel @renjithravindrankannath and @srekolam an opportunity to look at this.
