[v2.0.x] fabtests: Bugfixes for neuron #10663

sunkuamzn · 2024-12-30T17:44:12Z

This commit fixes the following bugs in neuron fabtests:

1. The neuron accelerator detection is broken on some OSs because the full path of the executable `neuron-ls` was not used.
2. Before this commit, each pytest worker was assigned a single neuron core. This works on multi-node tests but fails on single-node tests because a neuron core can only be opened by a single process. This commit assigns two different neuron cores to each pytest worker for client-server tests: one for the server and one for the client. Trn1 has 2 cores per neuron device and Trn2 has 8 cores per neuron device, so this assignment works for both.
3. When running in serial mode, the env var PYTEST_XDIST_WORKER is not set, so the NEURON_RT_VISIBLE_CORES env var is also not set. This causes the server to occupy all neuron cores and the client fails. So this commit assigns device 0 to the server and client when running with one worker.

Signed-off-by: Sai Sunku [email protected]
(cherry picked from commit f893f5f)
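
For readers who have not seen the diff, the core assignment described above (two cores per pytest worker, plus a fallback when PYTEST_XDIST_WORKER is unset) could look roughly like the sketch below. This is not the code from the commit: the helper names `worker_index`, `neuron_core_pair`, and `neuron_env` are hypothetical, and the mapping of worker N to cores 2N and 2N+1 is an assumption chosen so that a server/client pair never spans two devices when the per-device core count is even (2 on Trn1, 8 on Trn2).

```python
import os
import re


def worker_index(environ):
    """Return the pytest-xdist worker index, or 0 when running serially.

    pytest-xdist names its workers "gw0", "gw1", ... and exports the name
    in PYTEST_XDIST_WORKER. The variable is unset in serial mode.
    """
    worker = environ.get("PYTEST_XDIST_WORKER")
    if worker is None:
        return 0  # serial run: behave like a single worker
    match = re.fullmatch(r"gw(\d+)", worker)
    if match is None:
        raise ValueError("unexpected PYTEST_XDIST_WORKER value: %r" % worker)
    return int(match.group(1))


def neuron_core_pair(environ):
    """Pick distinct (server_core, client_core) indices for this worker.

    Assumption: worker N uses cores 2N and 2N+1. In serial mode this yields
    cores 0 and 1, which both live on device 0.
    """
    index = worker_index(environ)
    server_core = 2 * index
    client_core = server_core + 1
    return server_core, client_core


def neuron_env(core):
    """Environment fragment restricting a process to one neuron core."""
    return {"NEURON_RT_VISIBLE_CORES": str(core)}


if __name__ == "__main__":
    server_core, client_core = neuron_core_pair(os.environ)
    print("server:", neuron_env(server_core))
    print("client:", neuron_env(client_core))
```

Because the server and client each get their own NEURON_RT_VISIBLE_CORES value, neither process tries to open a core the other already holds, which is the single-node failure the description mentions.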

sunkuamzn · 2025-01-02T14:33:34Z

@j-xiong @aingerson could you please check if the Intel CI failure is relevant for this PR and #10662?

j-xiong · 2025-01-02T17:10:19Z

The failures were with oneCCL tests over tcp. I think they are unrelated to this PR.

shijin-aws approved these changes Dec 30, 2024

shijin-aws merged commit 4f1303b into ofiwg:v2.0.x Jan 2, 2025
8 of 9 checks passed
