Conversion to/from tch Tensors #46

Closed
guillaume-be opened this issue Apr 30, 2023 · 7 comments
Labels: p: high (high priority), perf

@guillaume-be (Contributor)

Hello,

Thank you again for building these bindings.

I am working on integrating ONNX support into a project of mine (rust-bert). I have most of the existing pipelines working (from classification to text generation), but I am observing a severe performance degradation compared to the Libtorch backend used via the tch bindings.

Most of the pipeline logic is written using tch tensors, and I was hoping to re-use most of it for ONNX models. I suspect the performance hit comes from the conversion between tch::Tensor and ort::tensor::InputTensor.

The conversion I am currently using generally follows these steps:

1. tch to ort:

1. tch::Tensor to `Vec`
2. `Vec` to `ndarray::ArrayD`
3. `ndarray::ArrayD` to `InputTensor`

The actual implementation looks like:

// Copy the tensor's data into a Vec, build an ndarray from it, then hand it to ort.
let mut vec = vec![T::ZERO; num_elem];
tch_tensor.f_to_kind(T::KIND)?.f_copy_data(&mut vec, num_elem)?;
let shape: Vec<usize> = tch_tensor.size().iter().map(|s| *s as usize).collect();
let array = ndarray::ArrayD::from_shape_vec(ndarray::IxDyn(&shape), vec)?;
let input_tensor = InputTensor::from_array(array);

2. ort to tch:

1. Extract array from `DynOrtTensor`
2. Convert array to slice
3. Create tensor from slice

The actual implementation looks like:

// Extract the output into an owned ndarray, then copy it again into a tch tensor.
let array = dyn_ort_tensor.try_extract::<f32>()?.view().to_owned();
let shape: Vec<i64> = array.shape().iter().map(|s| *s as i64).collect();
let slice = array.as_slice().unwrap();
let tensor = Tensor::f_of_slice(slice)?;
tensor.f_reshape(shape)

This involves a lot of copies and memory allocations (especially given the intermediate slice representation). I was hoping to be able to convert from tch tensors to ort tensors ideally without a copy (creating the target tensor directly from the source tensor's data pointer), or at least without having to go through the intermediate slices.

I have tried a few things on the tch side, including converting a Tensor directly into an ndarray and skipping the intermediate Vec/slice creation, but this still copies the data over, and I am unsure whether there is a better way of doing so.

// Attempt at converting a tch::Tensor into an ndarray without the intermediate Vec,
// by copying directly into an uninitialized ArrayD. This still performs one copy.
// Assumes `at_copy_data` (from torch-sys) and the `Element` trait are in scope.
impl<T: Element + Copy> TryInto<ndarray::ArrayD<T>> for &Tensor {
    type Error = TchError;

    fn try_into(self) -> Result<ndarray::ArrayD<T>, Self::Error> {
        let num_elem = self.numel();
        let shape: Vec<usize> = self.size().iter().map(|s| *s as usize).collect();
        let array = unsafe {
            // Allocate the destination array without initializing it...
            let mut array = ndarray::ArrayD::uninit(ndarray::IxDyn(&shape));
            // ...then let libtorch copy the tensor's data straight into it.
            at_copy_data(
                self.to_kind(T::KIND).as_mut_ptr(),
                array.as_mut_ptr() as *const c_void,
                num_elem,
                T::KIND.elt_size_in_bytes(),
            );
            array.assume_init()
        };
        Ok(array)
    }
}
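
For context, a hypothetical call site for such an impl (reusing the variable names from the snippets above), replacing the first two steps of the tch → ort path, would look like:

// Skips the intermediate Vec, but still performs one copy inside try_into().
let array: ndarray::ArrayD<f32> = (&tch_tensor).try_into()?;
let input_tensor = InputTensor::from_array(array);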

I understand you may not be fully familiar with the tch project; any hints on the way forward would be appreciated.

For reference, the ONNX implementation I am working on is in guillaume-be/rust-bert#346

Thank you!

@decahedron1 (Member) commented Apr 30, 2023

There's definitely an unacceptable number of copies that could be causing the degradation in performance.

In the conversion to ort::InputTensor:

As for conversion from ort::OrtOwnedTensor:

  • tch::Tensor::f_of_slice copies data from the slice
  • The usage of .to_owned() on the ort tensor view holder could probably be removed, since tch::Tensor will copy the data anyway (see the sketch below)
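
For illustration, a hypothetical variant of the ort → tch snippet from above without the intermediate .to_owned() copy; this is only a sketch and assumes the extracted view is contiguous so that as_slice() returns Some:

let ort_tensor = dyn_ort_tensor.try_extract::<f32>()?;
let view = ort_tensor.view();
let shape: Vec<i64> = view.shape().iter().map(|&s| s as i64).collect();
let slice = view.as_slice().expect("expected a contiguous ort output");
// The single remaining copy happens here, directly from the borrowed view.
let tensor = Tensor::f_of_slice(slice)?.f_reshape(shape)?;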

I'm not too familiar with the tch crate, but I wonder if Tensor::data_ptr and Tensor::f_of_blob could be used to reduce copies.
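
As a rough sketch of that idea (hypothetical and untested; it assumes f_of_blob takes a raw data pointer plus size, strides, kind and device, which may differ between tch versions, and that the buffer outlives the tensor):

use tch::{Device, Kind, Tensor};

// Wrap an existing contiguous f32 buffer (e.g. memory backing an ort output)
// in a tch::Tensor without copying.
// Safety: `data` must stay alive and unmodified while the returned tensor is in use.
unsafe fn wrap_f32_buffer(data: &[f32], shape: &[i64]) -> Result<Tensor, tch::TchError> {
    // Row-major strides (in elements) matching a contiguous buffer.
    let mut strides = vec![1i64; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    Tensor::f_of_blob(
        data.as_ptr() as *const u8,
        shape,
        &strides,
        Kind::Float,
        Device::Cpu,
    )
}

In the other direction, Tensor::data_ptr() exposes the tch tensor's underlying buffer, which could similarly back a view handed to ort without an extra copy, as long as the tensor is contiguous and kept alive.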

@guillaume-be (Contributor, Author)

Thank you very much for the quick response. It seems that libraries which manipulate PyTorch tensors for pre/post-processing around inference on an ONNX Runtime model rely on IoBinding when the device is CUDA:
https://github.com/huggingface/optimum/blob/5285732115000b78803154809cf7c05c180568f9/optimum/onnxruntime/modeling_ort.py#LL755C8-L755C8, indeed using the tensor's data_ptr to avoid unnecessary copies.

I ran a few more experiments using either CPU or CUDA. On CPU the performance is very similar for the Libtorch and ONNX implementations of the models. On CUDA the Libtorch version is more than ~10x faster than the ONNX equivalent. This aligns with the optimum code linked above, which enables IoBinding by default for CUDA devices but otherwise relies on numpy conversion.

It seems that using IoBinding is probably critical for performance, to avoid cross-device copies. I see this is listed in the PR you linked; it would be great to be able to test it using the bindings you made available.

decahedron1 added the perf and p: high (high priority) labels on Jun 3, 2023
decahedron1 mentioned this issue on Jun 3, 2023
@SunDoge (Contributor) commented Jun 16, 2023

Maybe we can use DLPack for zero-copy tensor conversion. I'm working on this PR.

@decahedron1 (Member) commented Jun 29, 2023

> Maybe we can use DLPack for zero-copy tensor conversion. I'm working on this PR.

Unfortunately, it seems ONNX Runtime only exposes DLPack conversion through its Python bindings; there is no C API for it (at least at the moment), and it would only be available in training builds anyway.

But I am working on IoBinding right now to hopefully improve performance.

@decahedron1 (Member)

I released v1.15.0 with support for IoBinding. https://github.com/pykeio/ort/releases/tag/v1.15.0

@guillaume-be (Contributor, Author)

Thank you!
Note that this release includes breaking changes but is marked as a minor update; I will pin the ort version while working on the migration. If possible, it would be great to have future breaking changes included as part of major version releases.

@guillaume-be (Contributor, Author)

@decahedron1 I have migrated my crate to ort 2.0, but it is still unclear to me how to effectively use IoBinding with tensors that are already on the CUDA device. Would it be possible to provide documentation for this?
