TL/MLX5: various optimizations #1012

Open
wants to merge 6 commits into
base: master
from

samnordmann
Collaborator

@samnordmann samnordmann commented Aug 21, 2024

What

This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:

  1. support rectangular blocks
  2. additional configuration options for how the WQEs are posted:
    • iterate across nodes before blocks when posting the WQEs
    • reuse dm chunks
    • send blocks by batch
  3. knomial fan-in for the internode sync
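
For readers less familiar with k-nomial trees, here is a minimal sketch of how each rank's parent and children can be derived for a fan-in. It is only an illustration under stated assumptions: the tree is rooted at rank 0, and the names `knomial_parent` and `knomial_children` are hypothetical and not taken from the TL/MLX5 code. In a fan-in, each rank would wait for a notification from every child and then notify its parent.

```c
#include <assert.h>

/* Hypothetical helper: parent of `rank` in a radix-`radix` k-nomial tree
 * rooted at rank 0. Returns -1 for the root. */
static int knomial_parent(int rank, int radix)
{
    assert(rank >= 0 && radix >= 2);
    if (rank == 0) {
        return -1;
    }
    for (int dist = 1;; dist *= radix) {
        int span = dist * radix;
        /* The first level at which `rank` is not aligned to `span` is the
         * level at which it reports to its parent. */
        if (rank % span != 0) {
            return rank - (rank % span);
        }
    }
}

/* Hypothetical helper: writes up to `max_children` children of `rank` into
 * `children` and returns how many were written. `size` is the total number
 * of ranks taking part in the fan-in. */
static int knomial_children(int rank, int size, int radix,
                            int *children, int max_children)
{
    int n = 0;

    assert(rank >= 0 && rank < size && radix >= 2);
    for (int dist = 1; dist < size; dist *= radix) {
        /* A rank collects from children only at levels below the one
         * where it reports to its own parent. */
        if (rank % (dist * radix) != 0) {
            break;
        }
        for (int i = 1; i < radix; i++) {
            int child = rank + i * dist;
            if (child < size && n < max_children) {
                children[n++] = child;
            }
        }
    }
    return n;
}
```

With 8 ranks and radix 2, rank 0 collects from ranks 1, 2 and 4, while rank 4 collects from ranks 5 and 6 before notifying rank 0. A larger radix gives a shallower tree at the cost of more children to poll per rank.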

We can either merge this PR as is or split it into several smaller ones. In the meantime, this branch is a working version that can be used as is for performance experiments.

TODO:

One important optimization that is not yet implemented is support for multiple NICs. So far, the algorithm uses only one NIC.

cc @lappazos @x41lakazam

TL/MLX5: add npolls cfg for FANIN

TL/MLX5: knomial fanin

TL/MLX5: add prints and profile events

TL/MLX5: remove debug prints
tiny bit more robust

print blocks dimensions

fully working configurable batch_size, serialization, and pollings
clean and working

TL/MLX5: add more config for block dimensions

force longer by default