[v2.0.x] fabtests: Bugfixes for neuron #10663

Merged: 1 commit, Jan 2, 2025

@sunkuamzn (Contributor)

This commit fixes the following bugs in the neuron fabtests:

  1. Neuron accelerator detection was broken on some OSs because the full path of the `neuron-ls` executable was not used.

  2. Before this commit, each pytest worker was assigned a single neuron core. This works for multi-node tests but fails for single-node tests, because a neuron core can only be opened by a single process. This commit assigns two different neuron cores to each pytest worker for client-server tests: one for the server and one for the client. Trn1 has 2 cores per neuron device and Trn2 has 8 cores per neuron device, so this assignment works for both.

  3. When running in serial mode, the env var PYTEST_XDIST_WORKER is not set, so the NEURON_RT_VISIBLE_CORES env var was also not set. This caused the server to occupy all neuron cores and the client to fail. This commit assigns device 0 to the server and client when running with one worker.
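
For readers who want a concrete picture of items 2 and 3, here is a minimal sketch of the assignment logic. It is not the code from this commit. The helper names `neuron_cores_for_worker` and `env_with_core` are hypothetical, and the mapping (worker `gwN` gets cores `2N` and `2N+1`, with serial mode treated as worker 0) is an assumption about how "two cores per worker" could be implemented.

```python
import os

CORES_PER_WORKER = 2  # one core for the server, one for the client


def neuron_cores_for_worker(env=None):
    """Return (server_core, client_core) for the current pytest worker.

    Hypothetical helper: pytest-xdist names workers "gw0", "gw1", ...
    and leaves PYTEST_XDIST_WORKER unset when tests run serially.
    """
    env = os.environ if env is None else env
    worker = env.get("PYTEST_XDIST_WORKER")
    # Serial mode: fall back to worker 0 so the cores are still pinned
    # (assumption: this is one way to keep both processes on device 0).
    index = 0 if worker is None else int(worker.replace("gw", ""))
    server_core = CORES_PER_WORKER * index
    return server_core, server_core + 1


def env_with_core(core, base=None):
    """Copy an environment and restrict the Neuron runtime to one core."""
    env = dict(os.environ if base is None else base)
    env["NEURON_RT_VISIBLE_CORES"] = str(core)
    return env


if __name__ == "__main__":
    server_core, client_core = neuron_cores_for_worker()
    print(env_with_core(server_core)["NEURON_RT_VISIBLE_CORES"])
    print(env_with_core(client_core)["NEURON_RT_VISIBLE_CORES"])
```

Because the server and client are always given distinct cores, neither process can claim a core the other needs, whether the test runs on one node or several.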

Signed-off-by: Sai Sunku [email protected]
(cherry picked from commit f893f5f)
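
As an illustration of item 1, the sketch below resolves `neuron-ls` to an absolute path before running it, instead of relying on `PATH` alone. It is not the code from this commit. The function names are hypothetical, `/opt/aws/neuron/bin/neuron-ls` is an assumed default install location, and treating a zero exit status as "accelerator present" is also an assumption.

```python
import os
import shutil
import subprocess

# Assumed default install location; adjust for your system.
NEURON_LS_DEFAULT = "/opt/aws/neuron/bin/neuron-ls"


def find_neuron_ls(default_path=NEURON_LS_DEFAULT):
    """Return an absolute path to neuron-ls, or None if it is not found."""
    if os.path.isfile(default_path) and os.access(default_path, os.X_OK):
        return default_path
    # Fall back to PATH lookup for systems that install it elsewhere.
    return shutil.which("neuron-ls")


def has_neuron(default_path=NEURON_LS_DEFAULT):
    """Return True if neuron-ls can be run and exits with status 0."""
    path = find_neuron_ls(default_path)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Checking the explicit path first avoids a false negative on systems where the Neuron tools directory is not on the `PATH` of the test process.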

@sunkuamzn (Contributor Author)

@j-xiong @aingerson could you please check if the Intel CI failure is relevant for this PR and #10662?

@j-xiong (Contributor) commented Jan 2, 2025

The failures were with oneCCL tests over tcp. I think they are unrelated to this PR.

@shijin-aws shijin-aws merged commit 4f1303b into ofiwg:v2.0.x Jan 2, 2025
8 of 9 checks passed