
[BUG] Significant overhead on running neural network benchmark #1132

Open
aterrel opened this issue Apr 10, 2024 · 0 comments

aterrel commented Apr 10, 2024

Software versions

unknown

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

Rory Mitchell has two implementations of neural network training: one in cunumeric and one in native C++/CUDA (with NCCL). The two should have similar runtimes, with cunumeric running at perhaps 80% of the native version's speed.

Observed behavior

When he trains 10 (small) neural networks with hidden layer sizes (100, 100) in legateboost on 1 GPU, he sees an order-of-magnitude difference in training times:
Cunumeric: 79.447s
Native: 4.833s

Example code or instructions

https://github.com/rapidsai/legate-boost/pull/92
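
For context, here is a minimal sketch (not the code from the linked PR) of the kind of workload being benchmarked: a forward pass through two hidden layers of size 100. It assumes cunumeric's drop-in NumPy API and falls back to NumPy when cunumeric is unavailable; all names and sizes are illustrative.

```python
try:
    import cunumeric as np  # assumption: running under Legate for GPU execution
except ImportError:
    import numpy as np      # NumPy fallback so the sketch runs anywhere

def forward(X, W1, W2, W3):
    """Forward pass through two ReLU hidden layers of size 100."""
    h1 = np.maximum(X @ W1, 0.0)
    h2 = np.maximum(h1 @ W2, 0.0)
    return h2 @ W3

X = np.ones((64, 32))            # illustrative batch: 64 samples, 32 features
W1 = np.ones((32, 100)) * 0.01   # input -> hidden layer 1 (size 100)
W2 = np.ones((100, 100)) * 0.01  # hidden layer 1 -> hidden layer 2 (size 100)
W3 = np.ones((100, 1)) * 0.01    # hidden layer 2 -> scalar output
out = forward(X, W1, W2, W3)
print(out.shape)  # (64, 1)
```

Every array operation in such a loop is dispatched as a Legate task under cunumeric, which is why per-operation launch overhead matters for small networks.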

Stack traceback or browser console output

Slack thread:

Rory Mitchell
10 hours ago
I have two implementations of neural network training, one in cunumeric and one in native C++/CUDA (with nccl). If I train 10 (small) neural networks with hidden layer size (100,100) in legateboost on 1 GPU, on a dataset, I get the following training times:
Cunumeric: 79.447s
Native: 4.833s
In summary, the latency of cunumeric seems to be too high for this kind of application. It's too slow for me to run tests on CI when each run takes this long.

Andy Terrel
7 hours ago
Hi Rory,
Can you please share the code? I would like to make a github issue that will preserve the lineage of the issue better than a slack thread.

Rory Mitchell
7 hours ago
rapidsai/legate-boost#92

Wonchan Lee
1 hour ago
@Rory Mitchell
how hard is it to extract the nn piece and make it runnable for both cupy and cunumeric? this posting is quite timely, as we just started talking about optimizing single GPU execution: https://docs.google.com/document/d/1IGwvwaSi4Dh5vqK7Hq41k9A8gPWnwPict8rLrGOyQq4/edit#heading=h.4yizf0qjzlv7

Wonchan Lee
1 hour ago
this could be a great target for the fast path proposed in the doc

Wonchan Lee
1 hour ago
I'm also curious to see the execution profile if it's readily available. I want to make sure that we're limited by the overhead of launching cunumeric tasks

Rory Mitchell
31 minutes ago
I can probably pull it out for you with some effort.
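
Wonchan's question about whether per-task launch overhead dominates can be probed with a microbenchmark of many tiny elementwise operations, where launch cost rather than compute should dominate. This is a hedged sketch, not a profile from the issue: it assumes cunumeric is installed (falling back to NumPy so it runs anywhere), and the `x.sum()` at the end is there to force completion of the deferred task stream before the clock stops.

```python
import time

try:
    import cunumeric as np  # assumption: measures Legate task-launch cost on GPU
except ImportError:
    import numpy as np      # NumPy fallback so the sketch runs anywhere

x = np.zeros((100, 100))
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    x = x + 1.0  # each tiny op pays the full per-task launch cost
total = float(x.sum())  # force completion before stopping the clock
elapsed = time.perf_counter() - t0
print(f"~{elapsed / n * 1e6:.1f} us per elementwise op")
```

Comparing the reported per-op time under cunumeric against the same loop under CuPy or NumPy would show directly whether launch overhead explains the 79s-vs-5s gap.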
