support GPU tensors in eager mode #1873

Open
martinResearch opened this issue Sep 22, 2024 · 6 comments
Labels
topic: discussion For discussion

Comments

@martinResearch

Eager mode is described in the docs as "mostly used to debug and check intermediate results are as expected". However, it seems to have much greater potential than that: with support for GPU tensors it could be used as an alternative to numpy and cupy, with these advantages over a numpy+cupy combination:

  • it would allow an easier switch between CPU and GPU execution using the same code
  • it could run on all GPU architectures supported by ONNX, not only Nvidia GPUs.

Is this something that could be considered for the roadmap? Are there any potential limitations?

@martinResearch
Author

martinResearch commented Sep 22, 2024

Maybe one way to implement this would be to store the data in the onnxscript Tensor class as an instance of OrtValue from onnxruntime.capi.onnxruntime_inference_collection instead of a numpy array?
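
A minimal sketch of that idea (the `Tensor` class and method names are illustrative, not the actual onnxscript class; only the `OrtValue` calls are real onnxruntime API), assuming a CUDA-capable build of onnxruntime:

```python
import numpy as np
import onnxruntime as ort


class Tensor:
    """Hypothetical eager-mode tensor backed by an OrtValue rather than a numpy array."""

    def __init__(self, value: ort.OrtValue):
        self._ortvalue = value

    @classmethod
    def from_numpy(cls, array: np.ndarray, device: str = "cuda", device_id: int = 0) -> "Tensor":
        # ortvalue_from_numpy copies the array onto the requested device,
        # so the data stays on the GPU between eager-mode op calls.
        return cls(ort.OrtValue.ortvalue_from_numpy(array, device, device_id))

    def numpy(self) -> np.ndarray:
        # Copies the data back to host memory.
        return self._ortvalue.numpy()
```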

@martinResearch
Author

martinResearch commented Sep 24, 2024

I experimented with using OrtValue instances instead of numpy arrays to store the data in the Tensor class. Here are the changes I made: https://github.com/martinResearch/onnxscript/pull/1/files
These changes allow me to keep OrtValue instances on the GPU and use the CUDA execution provider to execute the operations.
I compared execution with numpy and cupy using a simple elementwise multiplication with different tensor sizes, ignoring the duration of the first ONNX run as it takes much longer.
On the GPU the execution time is fairly constant with respect to the tensor size, but it is about 200x slower than cupy.
[figure: Figure_no_cache]
One important bottleneck for the onnxscript execution is that it creates a new ONNX session each time an operator is called. To mitigate this, I added an lru_cache decorator to reuse sessions when calling the same operator multiple times.
With this modification, onnxscript on the GPU is about 15x slower than cupy.
[figure: Figure_1]
Profiling shows that only about 20% of the time is spent in the function "run_with_iobinding", so there might be potential to speed things up by 5x, but that would still be 3x slower than cupy.
I wonder if there is anything that could be done on the onnxruntime side to make things faster.
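
A minimal sketch of the session-caching idea mentioned above, assuming the single-op model has already been serialized to bytes elsewhere; the function name and signature here are illustrative, not the exact change in the linked branch:

```python
import functools

import onnxruntime as ort


@functools.lru_cache(maxsize=None)
def get_cached_session(model_bytes: bytes, providers: tuple) -> ort.InferenceSession:
    # Reuse one InferenceSession per serialized single-op model instead of
    # creating a fresh session (and re-loading the model) on every eager call.
    # The arguments must be hashable to be usable as a cache key, hence bytes
    # for the model and a tuple for the providers.
    return ort.InferenceSession(model_bytes, providers=list(providers))
```

The cache key is the serialized model plus the provider list, so repeated calls to the same operator hit the cache instead of paying the session-creation cost again.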

@justinchuby added the topic: discussion For discussion label Sep 24, 2024
@justinchuby
Collaborator

FWIW, running ONNX ops via onnxscript may still be too expensive because the overhead is too great. For what you described, would the Array API be what you need? https://data-apis.org/array-api/latest/

@martinResearch
Author

martinResearch commented Sep 28, 2024

Adding some references to related projects for anyone interested in this issue:

It seems that bringing eager mode based on onnxruntime sessions to a speed competitive with cupy would be hard to achieve because:

  • InferenceSession has too much overhead when running just one kernel
  • DLPack does not allow transfer of ownership, which results in data copies

One approach to reduce Python code duplication when going from Python to ONNX would be to use cupy and numpy through the Array API standard interface (https://data-apis.org/array-api/latest/) and then use ndonnx or onnx-array-api to export the code to ONNX without much rewriting. Note that this would not allow using some of the advanced ONNX operators that are not in the Array API.
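
For illustration, a minimal sketch of the device-agnostic style this enables, using the array-api-compat helper package (an assumption about tooling; the same function should also work with an Array API-compliant ONNX array library such as ndonnx for the export path):

```python
import array_api_compat


def normalize(x):
    # Resolve the Array API namespace of whatever array was passed in
    # (numpy, cupy, ...), so the same code runs on CPU and GPU arrays.
    xp = array_api_compat.array_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)
```

Called with a numpy array it stays on the CPU; called with a cupy array the same function runs on the GPU, with no per-backend branches.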

@martinResearch
Author

@justinchuby do you think there would be interest in bringing the changes I made in https://github.com/martinResearch/onnxscript/pull/1/files into this repository? Although it does not match cupy's speed, it still significantly improves the speed of eager mode and adds support for GPU execution, which could be helpful when debugging a bug that appears only when executing on the GPU. If so, I could submit one or several PRs.

@justinchuby
Collaborator

Thank you! I will look deeper and let you know. As a note, we would not want a tight coupling between onnxscript and onnxruntime; onnxscript needs to work without onnxruntime.
