Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: better tuner dumper #717

Merged
merged 2 commits into from
Dec 12, 2024
Merged

Conversation

aws-nslick
Copy link
Contributor

No description provided.

@aws-nslick aws-nslick requested a review from a team as a code owner November 22, 2024 20:31
@aws-nslick aws-nslick force-pushed the rewrite-show-tuner-dec branch 3 times, most recently from 3a1110b to eade05a Compare November 22, 2024 20:47
@bwbarrett
Copy link
Contributor

This is marked wip:, but also not marked a draft PR. What's the goal / plan?

rajachan
rajachan previously approved these changes Dec 4, 2024
Copy link
Member

@rajachan rajachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this going in. But overriding product name is handy to have in the plugin for corner case testing too, so I don't feel strongly either way.

@aws-nslick
Copy link
Contributor Author

I was waiting for @arunkarthik-akkart to report whether he found it useful. This needs to be extended further but this is a better starting ground than the existing tool.

@arunkarthik-akkart
Copy link
Contributor

arunkarthik-akkart commented Dec 12, 2024

Thanks for the PR Nick. This is really useful.
Couple of things I noticed when I was trying to run the tests.

  1. Since we only print the algo-proto when there is change with this PR. We currently have this for the output.
Nodes Ranks message_size  Collective  Algo      Proto
2048  2048  128           AllReduce       Tree      Ll
            65537         AllReduce       Tree   Ll128
            12678238      AllReduce       Tree  Simple
            536870913     AllReduce       Ring   Ll128
            4616131146    AllReduce       Ring  Simple

Instead I would suggest we have something like this which shows the actual range of the message size in which we have a particular algo-proto selection.

Nodes Ranks message_size  Collective  Algo     Proto
2048  2048  0-128           AllReduce       Tree      Ll
            128-65537         AllReduce       Tree   Ll128
            65537-12678238      AllReduce       Tree  Simple
            12678238-536870913     AllReduce       Ring   Ll128
            536870913-4616131146    AllReduce       Ring  Simple
  1. It would be good to have a README file which would list the commands to install and run the tests.

Please let me know if you can do this change, if not I can help update this PR as well.
We can merge this and follow up with another PR as well

this was never a unit test, but was added here because it was a simple
way to add it to the build. Replace it with a python tool.

Signed-off-by: Nicholas Sielicki <[email protected]>
@@ -27,6 +27,11 @@ const char *nccl_net_ofi_get_product_name(void)
static char *product_name = NULL;
static pthread_mutex_t product_name_mutex = PTHREAD_MUTEX_INITIALIZER;

char* forced_pn = getenv("OFI_NCCL_FORCE_PRODUCT_NAME");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really be done in nccl_ofi_param.h for consistency.

@aws-nslick aws-nslick merged commit 0a6589c into aws:master Dec 12, 2024
22 of 23 checks passed
@aws-nslick aws-nslick deleted the rewrite-show-tuner-dec branch December 12, 2024 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants