Software versions
unknown
Jupyter notebook / Jupyter Lab version
No response
Expected behavior
Rory Mitchell has two implementations of neural network training, one in cunumeric and one in native C++/CUDA (with nccl). These two pieces of code should be similar in runtime with perhaps cunumeric being only 80% of the CUDA version.
Observed behavior
When he trains 10 (small) neural networks with hidden layer size (100,100) in legateboost on 1 GPU, on a dataset, he gets an order difference in training times:
Cunumeric: 79.447s
Native: 4.833s
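The original benchmark is not attached to this issue. As a rough, hypothetical illustration of the workload described above, here is a minimal NumPy sketch of one training step for an MLP with hidden layers (100, 100); all sizes and names besides the hidden-layer shape are made up. The point is that each line issues several small array operations, and under cunumeric each of those becomes a separate task launch, which is where fixed per-op latency would accumulate.

```python
import numpy as np  # stand-in for cunumeric's NumPy-compatible API

rng = np.random.default_rng(0)

def init_mlp(n_in, hidden=(100, 100), n_out=1):
    """Initialize weights for a small MLP; (100, 100) is the size from the report."""
    sizes = (n_in, *hidden, n_out)
    return [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes, sizes[1:])]

def train_step(Ws, X, y, lr=0.01):
    """One full-batch SGD step on MSE loss.

    Every matmul, tanh, and elementwise op here would be a separate
    cunumeric task launch, so per-op overhead multiplies quickly.
    """
    acts = [X]
    for W in Ws[:-1]:
        acts.append(np.tanh(acts[-1] @ W))
    pred = acts[-1] @ Ws[-1]
    grad = 2.0 * (pred - y) / len(y)                 # dLoss/dpred for MSE
    for i in range(len(Ws) - 1, -1, -1):
        gW = acts[i].T @ grad
        if i > 0:
            grad = (grad @ Ws[i].T) * (1.0 - acts[i] ** 2)  # tanh derivative
        Ws[i] -= lr * gW
    return float(np.mean((pred - y) ** 2))

X = rng.standard_normal((512, 20))
y = rng.standard_normal((512, 1))
Ws = init_mlp(20)
losses = [train_step(Ws, X, y) for _ in range(5)]
```

Even this toy step issues on the order of a dozen array operations; a real training loop over 10 networks repeats that thousands of times, so a per-task launch cost of even a millisecond would dominate the small kernels themselves.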
Example code or instructions
Stack traceback or browser console output
Slack thread:
Rory Mitchell
10 hours ago
I have two implementations of neural network training, one in cunumeric and one in native C++/CUDA (with nccl). If I train 10 (small) neural networks with hidden layer size (100,100) in legateboost on 1 GPU, on a dataset, I get the following training times:
Cunumeric: 79.447s
Native: 4.833s
In summary, the latency of cunumeric seems to be too high for this kind of application. It's too slow for me to run tests on CI when each run takes this long.
Andy Terrel
7 hours ago
Hi Rory,
Can you please share the code? I would like to make a github issue that will preserve the lineage of the issue better than a slack thread.
Rory Mitchell
7 hours ago
rapidsai/legate-boost#92
Wonchan Lee
1 hour ago
@Rory Mitchell
how hard is it to extract the nn piece and make it runnable for both cupy and cunumeric? this posting is quite timely, as we just started talking about optimizing single GPU execution: https://docs.google.com/document/d/1IGwvwaSi4Dh5vqK7Hq41k9A8gPWnwPict8rLrGOyQq4/edit#heading=h.4yizf0qjzlv7
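Making the benchmark runnable under both cupy and cunumeric, as suggested above, usually only requires parameterizing the code over the array module, since both libraries expose a NumPy-compatible API. A hypothetical sketch of that pattern, shown here with numpy as the default backend (the linear-regression body is illustrative, not the reporter's code):

```python
import numpy as np

def make_trainer(xp=np):
    """Build a training step bound to an array module.

    numpy, cupy, or cunumeric can all be passed as `xp`,
    since they share the same NumPy-style interface.
    """
    def step(W, X, y, lr=0.1):
        pred = X @ W
        grad = X.T @ (pred - y) * (2.0 / len(y))
        W -= lr * grad
        return xp.mean((pred - y) ** 2)
    return step

# Select the backend at import time; cupy/cunumeric are drop-in here:
# import cupy as xp        # or: import cunumeric as xp
xp = np
step = make_trainer(xp)

rng = np.random.default_rng(1)
X = xp.asarray(rng.standard_normal((256, 8)))
W_true = xp.asarray(rng.standard_normal((8, 1)))
y = X @ W_true
W = xp.zeros((8, 1))
for _ in range(200):
    loss = step(W, X, y)
```

With the backend isolated like this, the same script can be timed under cupy (eager single-GPU) and cunumeric (Legate tasks) to attribute the gap.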
Wonchan Lee
1 hour ago
this could be a great target for the fast path proposed in the doc
Wonchan Lee
1 hour ago
I'm also curious to see the execution profile if it's readily available. I want to make sure that we're limited by the overhead of launching cunumeric tasks
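Short of a full execution profile, one quick way to test the task-launch-overhead hypothesis is a microbenchmark that times many tiny operations: if the per-operation cost stays roughly constant as the arrays shrink, the fixed launch latency, not kernel time, is the bottleneck. A hypothetical sketch, with numpy standing in for the backend under test:

```python
import time
import numpy as np  # swap for `import cunumeric as np` to measure its overhead

def per_op_overhead(xp=np, n_ops=1000, size=100):
    """Estimate the fixed per-operation cost from many tiny elementwise ops."""
    a = xp.ones((size, size))
    t0 = time.perf_counter()
    for _ in range(n_ops):
        a = a + 1.0  # each iteration is one task launch under cunumeric
    # Reading a scalar forces any deferred execution to finish
    # before the clock stops.
    float(a[0, 0])
    return (time.perf_counter() - t0) / n_ops

overhead_s = per_op_overhead()
```

Comparing `overhead_s` between cupy and cunumeric for the same `size` would show directly whether per-task launch latency accounts for the 79s-vs-5s gap reported above.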
Rory Mitchell
31 minutes ago
I can probably pull it out for you with some effort.