support GPU tensors in eager mode #1873

Open
martinResearch opened this issue Sep 22, 2024 · 6 comments
Labels
topic: discussion For discussion

Comments

@martinResearch

Eager mode is described in the docs as "mostly used to debug and check intermediate results are as expected". However, it seems to have much greater potential than that: with support for GPU tensors it could be used as an alternative to numpy and cupy, with these advantages over a numpy+cupy combination:

  • it would allow an easier switch between CPU and GPU execution using the same code
  • it could run on all GPU architectures supported by ONNX, not only Nvidia GPUs.

Is this something that could be considered for the roadmap? Are there any potential limitations?

@martinResearch
Author

martinResearch commented Sep 22, 2024

Maybe one way to implement this would be to store the data in the onnxscript Tensor class as an instance of OrtValue from onnxruntime.capi.onnxruntime_inference_collection instead of a numpy array?
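
A minimal sketch of that idea (the `Tensor` class and method names are illustrative, not the actual onnxscript class; only the `OrtValue` calls are real onnxruntime API), assuming a CUDA-capable build of onnxruntime:

```python
import numpy as np
import onnxruntime as ort


class Tensor:
    """Hypothetical eager-mode tensor backed by an OrtValue rather than a numpy array."""

    def __init__(self, value: ort.OrtValue):
        self._ortvalue = value

    @classmethod
    def from_numpy(cls, array: np.ndarray, device: str = "cuda", device_id: int = 0) -> "Tensor":
        # ortvalue_from_numpy copies the array onto the requested device,
        # so the data stays on the GPU between eager-mode op calls.
        return cls(ort.OrtValue.ortvalue_from_numpy(array, device, device_id))

    def numpy(self) -> np.ndarray:
        # Copies the data back to host memory.
        return self._ortvalue.numpy()
```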

@martinResearch
Author

martinResearch commented Sep 24, 2024

I experimented with using OrtValue instances instead of numpy arrays to store the data in the Tensor class. Here are the changes I made: https://github.com/martinResearch/onnxscript/pull/1/files
These changes allow me to keep OrtValue instances on the GPU and use the CUDA execution provider to execute the operations.
I compared execution with numpy and cupy using a simple elementwise multiplication with different tensor sizes, ignoring the duration of the first ONNX run as it takes much longer.
On the GPU the execution time is fairly constant with respect to the tensor size, but it is about 200x slower than cupy.
[figure: Figure_no_cache]
One important bottleneck for the onnxscript execution is that it creates a new ONNX session each time an operator is called. To mitigate this, I added an lru_cache decorator to reuse sessions when calling the same operator multiple times.
With this modification, onnxscript on the GPU is about 15x slower than cupy.
[figure: Figure_1]
Profiling shows that only about 20% of the time is spent in the function "run_with_iobinding", so there might be potential to speed things up by 5x, but that would still be 3x slower than cupy.
I wonder if there is anything that could be done on the onnxruntime side to make things faster.
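
A minimal sketch of the session-caching idea mentioned above, assuming the single-op model has already been serialized to bytes elsewhere; the function name and signature here are illustrative, not the exact change in the linked branch:

```python
import functools

import onnxruntime as ort


@functools.lru_cache(maxsize=None)
def get_cached_session(model_bytes: bytes, providers: tuple) -> ort.InferenceSession:
    # Reuse one InferenceSession per serialized single-op model instead of
    # creating a fresh session (and re-loading the model) on every eager call.
    # The arguments must be hashable to be usable as a cache key, hence bytes
    # for the model and a tuple for the providers.
    return ort.InferenceSession(model_bytes, providers=list(providers))
```

The cache key is the serialized model plus the provider list, so repeated calls to the same operator hit the cache instead of paying the session-creation cost again.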

@justinchuby added the topic: discussion For discussion label Sep 24, 2024
@justinchuby
Collaborator

FWIW, running ONNX ops via onnxscript may still be too expensive because the overhead is too great. For what you described, would the Array API be what you need? https://data-apis.org/array-api/latest/

@martinResearch
Author

martinResearch commented Sep 28, 2024

Adding some references to related projects for anyone interested in this issue:

It seems that bringing eager mode based on onnxruntime sessions to a speed competitive with cupy would be hard to achieve because:

  • InferenceSession has too much overhead when running just one kernel
  • DLPack does not allow transfer of ownership, which results in data copies

One approach to reduce Python code duplication when going from Python to ONNX would be to use cupy and numpy through the Array API standard interface (https://data-apis.org/array-api/latest/) and then use ndonnx or onnx-array-api to export the code to ONNX without much rewriting. Note that this would not allow using some of the advanced ONNX operators that are not in the Array API.
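
For illustration, a minimal sketch of the device-agnostic style this enables, using the array-api-compat helper package (an assumption about tooling; the same function should also work with an Array API-compliant ONNX array library such as ndonnx for the export path):

```python
import array_api_compat


def normalize(x):
    # Resolve the Array API namespace of whatever array was passed in
    # (numpy, cupy, ...), so the same code runs on CPU and GPU arrays.
    xp = array_api_compat.array_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)
```

Called with a numpy array it stays on the CPU; called with a cupy array the same function runs on the GPU, with no per-backend branches.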

@martinResearch
Author

@justinchuby do you think there would be interest in bringing the changes I made in https://github.com/martinResearch/onnxscript/pull/1/files into this repository? Although it does not match cupy's speed, it still significantly improves the speed of eager mode and adds support for GPU execution, which could be helpful when debugging a bug that appears only when executing on the GPU. If so, I could submit one or several PRs.

@justinchuby
Collaborator

Thank you! I will look deeper and let you know. As a note, we would not want a tight coupling between onnxscript and onnxruntime; onnxscript needs to work without onnxruntime.
