Conversion to/from tch Tensors #46

Closed
guillaume-be opened this issue Apr 30, 2023 · 7 comments
Labels: p: high (high priority), perf

@guillaume-be (Contributor)

Hello,

Thank you again for building these bindings.

I am working on integrating ONNX support into a project of mine (rust-bert). I have most of the existing pipelines working (from classification to text generation), but I am observing a severe performance degradation compared to the Libtorch backend used via the tch bindings.

Most of the pipeline logic is written using tch tensors, and I was hoping to re-use most of it for ONNX models. I suspect the performance hit comes from the conversion between tch::Tensor and ort::tensor::InputTensor.

The conversion I am currently using generally follows these steps:

1. tch to ort:

1. tch::Tensor to `Vec`
2. `Vec` to `ndarray::ArrayD`
3. `ndarray::ArrayD` to `InputTensor`

The actual implementation looks like:

// Copy the tensor's data into a Vec, build an ndarray from it, then hand it to ort.
let mut vec = vec![T::ZERO; num_elem];
tch_tensor.f_to_kind(T::KIND)?.f_copy_data(&mut vec, num_elem)?;
let shape: Vec<usize> = tch_tensor.size().iter().map(|s| *s as usize).collect();
let array = ndarray::ArrayD::from_shape_vec(ndarray::IxDyn(&shape), vec)?;
let input_tensor = InputTensor::from_array(array);

2. ort to tch:

1. Extract array from `DynOrtTensor`
2. Convert array to slice
3. Create tensor from slice

The actual implementation looks like:

// Extract the output into an owned ndarray, then copy it again into a tch tensor.
let array = dyn_ort_tensor.try_extract::<f32>()?.view().to_owned();
let shape: Vec<i64> = array.shape().iter().map(|s| *s as i64).collect();
let slice = array.as_slice().unwrap();
let tensor = Tensor::f_of_slice(slice)?;
tensor.f_reshape(shape)

This involves a lot of copies and memory allocations (especially given the intermediate slice representation). I was hoping to be able to convert from tch tensors to ort tensors ideally without a copy (creating the target tensor directly from the source tensor's data pointer), or at least without having to go through the intermediate slices.

I have tried a few things on the tch side, including converting a Tensor directly into an ndarray and skipping the intermediate Vec/slice creation, but this still copies the data over, and I am unsure whether there is a better way of doing so.

// Attempt at converting a tch::Tensor into an ndarray without the intermediate Vec,
// by copying directly into an uninitialized ArrayD. This still performs one copy.
// Assumes `at_copy_data` (from torch-sys) and the `Element` trait are in scope.
impl<T: Element + Copy> TryInto<ndarray::ArrayD<T>> for &Tensor {
    type Error = TchError;

    fn try_into(self) -> Result<ndarray::ArrayD<T>, Self::Error> {
        let num_elem = self.numel();
        let shape: Vec<usize> = self.size().iter().map(|s| *s as usize).collect();
        let array = unsafe {
            // Allocate the destination array without initializing it...
            let mut array = ndarray::ArrayD::uninit(ndarray::IxDyn(&shape));
            // ...then let libtorch copy the tensor's data straight into it.
            at_copy_data(
                self.to_kind(T::KIND).as_mut_ptr(),
                array.as_mut_ptr() as *const c_void,
                num_elem,
                T::KIND.elt_size_in_bytes(),
            );
            array.assume_init()
        };
        Ok(array)
    }
}
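
For context, a hypothetical call site for such an impl (reusing the variable names from the snippets above), replacing the first two steps of the tch → ort path, would look like:

// Skips the intermediate Vec, but still performs one copy inside try_into().
let array: ndarray::ArrayD<f32> = (&tch_tensor).try_into()?;
let input_tensor = InputTensor::from_array(array);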

I understand you may not be fully familiar with the tch project; any hints on the way forward would be appreciated.

For reference, the ONNX implementation I am working on is in guillaume-be/rust-bert#346

Thank you!

@decahedron1 (Member) commented Apr 30, 2023

There's definitely an unacceptable number of copies that could be causing the degradation in performance.

In the conversion to ort::InputTensor:

As for conversion from ort::OrtOwnedTensor:

  • tch::Tensor::f_of_slice copies data from the slice
  • The usage of .to_owned() on the ort tensor view holder could probably be removed, since tch::Tensor will copy the data anyway (see the sketch below)
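
For illustration, a hypothetical variant of the ort → tch snippet from above without the intermediate .to_owned() copy; this is only a sketch and assumes the extracted view is contiguous so that as_slice() returns Some:

let ort_tensor = dyn_ort_tensor.try_extract::<f32>()?;
let view = ort_tensor.view();
let shape: Vec<i64> = view.shape().iter().map(|&s| s as i64).collect();
let slice = view.as_slice().expect("expected a contiguous ort output");
// The single remaining copy happens here, directly from the borrowed view.
let tensor = Tensor::f_of_slice(slice)?.f_reshape(shape)?;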

I'm not too familiar with the tch crate, but I wonder if Tensor::data_ptr and Tensor::f_of_blob could be used to reduce copies.
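
As a rough sketch of that idea (hypothetical and untested; it assumes f_of_blob takes a raw data pointer plus size, strides, kind and device, which may differ between tch versions, and that the buffer outlives the tensor):

use tch::{Device, Kind, Tensor};

// Wrap an existing contiguous f32 buffer (e.g. memory backing an ort output)
// in a tch::Tensor without copying.
// Safety: `data` must stay alive and unmodified while the returned tensor is in use.
unsafe fn wrap_f32_buffer(data: &[f32], shape: &[i64]) -> Result<Tensor, tch::TchError> {
    // Row-major strides (in elements) matching a contiguous buffer.
    let mut strides = vec![1i64; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    Tensor::f_of_blob(
        data.as_ptr() as *const u8,
        shape,
        &strides,
        Kind::Float,
        Device::Cpu,
    )
}

In the other direction, Tensor::data_ptr() exposes the tch tensor's underlying buffer, which could similarly back a view handed to ort without an extra copy, as long as the tensor is contiguous and kept alive.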

@guillaume-be (Contributor, Author)

Thank you very much for the quick response. It seems that libraries which manipulate PyTorch tensors for pre/post-processing around inference on an ONNX Runtime model rely on IoBinding when the device is CUDA:
https://github.com/huggingface/optimum/blob/5285732115000b78803154809cf7c05c180568f9/optimum/onnxruntime/modeling_ort.py#LL755C8-L755C8, indeed using the tensor's data_ptr to avoid unnecessary copies.

I ran a few more experiments using either CPU or CUDA. On CPU the performance is very similar for the Libtorch and ONNX implementations of the models. On CUDA the Libtorch version is more than ~10x faster than the ONNX equivalent. This aligns with the optimum code linked above, which enables IoBinding by default for CUDA devices but otherwise relies on numpy conversion.

It seems that using IoBinding is probably critical for performance, to avoid cross-device copies. I see this is listed in the PR you linked; it would be great to be able to test it using the bindings you made available.

decahedron1 added the perf and p: high (high priority) labels on Jun 3, 2023
decahedron1 mentioned this issue on Jun 3, 2023
@SunDoge (Contributor) commented Jun 16, 2023

Maybe we can use DLPack for zero-copy tensor conversion. I'm working on this PR.

@decahedron1 (Member) commented Jun 29, 2023

> Maybe we can use DLPack for zero-copy tensor conversion. I'm working on this PR.

Unfortunately, it seems ONNX Runtime only exposes DLPack conversion through its Python bindings; there is no C API for it (at least at the moment), and it would only be available in training builds anyway.

But I am working on IoBinding right now to hopefully improve performance.

@decahedron1 (Member)

I released v1.15.0 with support for IoBinding. https://github.com/pykeio/ort/releases/tag/v1.15.0

@guillaume-be (Contributor, Author)

Thank you!
Note that this release includes breaking changes but is marked as a minor update; I will pin the ort version while working on the migration. If possible, it would be great to have future breaking changes included as part of major version releases.

@guillaume-be (Contributor, Author)

@decahedron1 I have migrated my crate to ort 2.0, but it is still unclear to me how to effectively use IoBinding with tensors that are already on the CUDA device. Would it be possible to provide documentation for this?
