
Conversation

qjia7 (Contributor) commented Jan 14, 2026

This pull request introduces enhanced support for copying tensor data with fine-grained control over source/destination offsets and copy sizes, both in the ONNX Runtime C API and the WebGPU provider. The changes add new API methods and extend existing interfaces to allow partial or offset-based tensor copies, with corresponding updates to the CPU and WebGPU data transfer implementations and buffer management logic.

Copilot AI (Contributor) left a comment

Pull request overview

This pull request extends the ONNX Runtime C API and internal data transfer infrastructure to support fine-grained tensor copying with source/destination offsets and custom copy sizes. The changes add a new CopyTensorsEx API function, extend the OrtDataTransferImpl interface to accept offset parameters, and update CPU and WebGPU data transfer implementations to handle offset-based copying.

Changes:

  • Added new CopyTensorsEx C API function with offset and size parameters for partial tensor copies
  • Extended OrtDataTransferImpl::CopyTensors to accept source_offsets, destination_offsets, and sizes arrays
  • Updated WebGPU BufferManager and DataTransfer to support offset-based memory operations
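To make the new surface concrete, here is a hedged usage sketch assembled from the CopyTensorsEx doc comment quoted later in this review; the tensor values and offsets are placeholders, and the exact OrtApi member layout is whatever the PR finally lands:

#include "onnxruntime_c_api.h"

// Sketch only: assumes CopyTensorsEx is exposed on OrtApi as documented below.
const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

const OrtValue* src_tensors[1] = {src_value};  // placeholder source OrtValue*
OrtValue* dst_tensors[1] = {dst_value};        // placeholder destination OrtValue*
size_t src_offsets[1] = {0};                   // start of the source region, in bytes
size_t dst_offsets[1] = {128};                 // write starting 128 bytes into the destination
size_t sizes[1] = {64};                        // copy 64 bytes; nullptr would mean "whole tensor"

OrtStatus* status = g_ort->CopyTensorsEx(env, src_tensors, dst_tensors,
                                         src_offsets, dst_offsets, sizes,
                                         /*stream*/ nullptr, /*num_tensors*/ 1);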

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

Summary per file:

  • include/onnxruntime/core/session/onnxruntime_c_api.h: Adds the CopyTensorsEx API declaration with offset/size parameters
  • include/onnxruntime/core/session/onnxruntime_ep_c_api.h: Updates the OrtDataTransferImpl::CopyTensors signature with offset parameters
  • onnxruntime/core/session/ort_apis.h: Declares the CopyTensorsEx implementation function
  • onnxruntime/core/session/onnxruntime_c_api.cc: Implements CopyTensors and CopyTensorsEx using a shared helper function
  • onnxruntime/core/framework/data_transfer.h: Adds offset/size fields to SrcDstPair and a new CopyTensor overload
  • onnxruntime/core/framework/data_transfer.cc: Implements the offset-aware CopyTensor for the CPU and base interfaces
  • onnxruntime/core/framework/plugin_data_transfer.cc: Updates calls to CopyTensors with offset parameters
  • onnxruntime/core/providers/webgpu/data_transfer.h: Declares the offset-aware CopyTensor overload
  • onnxruntime/core/providers/webgpu/data_transfer.cc: Implements offset-based tensor copying for WebGPU
  • onnxruntime/core/providers/webgpu/buffer_manager.h: Updates Upload/MemCpy/Download signatures with offset parameters
  • onnxruntime/core/providers/webgpu/buffer_manager.cc: Implements offset support in buffer operations
  • onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc: Updates WebGpuDataTransferImpl to extract and pass offset parameters
  • onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.h: Updates the example EP data transfer signature
  • onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.cc: Implements offset handling in the example EP
  • onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.h: Updates the example EP kernel registry data transfer signature
  • onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.cc: Implements offset handling in the example EP kernel registry
  • onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/kernels/utils.h: Updates the CopyTensor call with null offset parameters
  • onnxruntime/test/shared_lib/test_data_copy.cc: Adds a comment about the backward compatibility test
Comments suppressed due to low confidence (2)

onnxruntime/core/providers/webgpu/data_transfer.cc:46

  • Missing bounds validation for offset and size parameters. The function should validate that src_offset + bytes does not exceed src.SizeInBytes() and dst_offset + bytes does not exceed dst.SizeInBytes() before performing the copy operations. This is especially important for CPU to GPU and GPU to CPU transfers where buffer overflow could occur.
common::Status DataTransfer::CopyTensor(const Tensor& src, Tensor& dst, size_t src_offset, size_t dst_offset, size_t size) const {
  size_t bytes = size > 0 ? size : src.SizeInBytes();
  if (bytes > 0) {
    void const* src_data = src.DataRaw();
    void* dst_data = dst.MutableDataRaw();

    auto& src_device = src.Location().device;
    auto& dst_device = dst.Location().device;

    if (dst_device.Type() == OrtDevice::GPU) {
      if (src_device.Type() == OrtDevice::GPU) {
        // copy from GPU to GPU
        buffer_manager_.MemCpy(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      } else {
        // copy from CPU to GPU
        buffer_manager_.Upload(const_cast<void*>(src_data),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      }
    } else /* if (src_device.Type() == OrtDevice::GPU) */ {
      // copy from GPU to CPU
      buffer_manager_.Download(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               dst_data, bytes, src_offset, dst_offset);
    }
  }

  return Status::OK();
}

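A minimal sketch of the validation the comment asks for, placed before the device dispatch; ORT_RETURN_IF is the usual helper for this in ORT, and the message wording is illustrative:

// Hypothetical bounds checks; not part of the PR as reviewed.
// (A production check would also guard against size_t overflow in offset + bytes.)
ORT_RETURN_IF(src_offset + bytes > src.SizeInBytes(),
              "Source copy range exceeds source tensor size: offset=", src_offset,
              ", bytes=", bytes, ", tensor size=", src.SizeInBytes());
ORT_RETURN_IF(dst_offset + bytes > dst.SizeInBytes(),
              "Destination copy range exceeds destination tensor size: offset=", dst_offset,
              ", bytes=", bytes, ", tensor size=", dst.SizeInBytes());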
include/onnxruntime/core/session/onnxruntime_ep_c_api.h:142

  • The documentation for CopyTensors should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether implementations are expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for EP implementers to understand their responsibilities.
  /** \brief Copy tensors from src_tensors to dst_tensors using the provided streams.
   *
   * The implementation can use the provided streams to perform asynchronous copies if supported.
   * If a stream is not available, the copy is performed synchronously.
   *
   * \param[in] this_ptr Pointer to the OrtDataTransferImpl instance.
   * \param[in] src_tensors Array of source OrtValue pointers to copy from.
   * \param[in] dst_tensors Array of destination OrtValue pointers to copy to.
   * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
   * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
   * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
   * \param[in] streams Array of OrtSyncStream pointers for the copy operations, if the execution provider is stream
   *                    aware. nullptr if it is not.
   * \param[in] num_tensors Number of tensors to copy.
   *
   * \snippet{doc} snippets.dox OrtStatus Return Value
   *
   * \since Version 1.23.
   */
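Given the nullptr semantics above, an EP implementation would typically normalize the optional arrays per tensor before dispatching. A minimal sketch of that loop (CopyOneTensor is a hypothetical EP-specific helper, not part of the PR):

// Sketch: per-tensor normalization inside an EP's CopyTensors implementation.
for (size_t i = 0; i < num_tensors; ++i) {
  const size_t src_offset = source_offsets ? source_offsets[i] : 0;       // nullptr => all zeros
  const size_t dst_offset = destination_offsets ? destination_offsets[i] : 0;
  const size_t bytes = sizes ? sizes[i] : 0;                              // nullptr => whole tensor

  // CopyOneTensor is a hypothetical helper that performs the actual copy,
  // asynchronously on streams[i] if the EP is stream aware.
  OrtStatus* status = CopyOneTensor(src_tensors[i], dst_tensors[i],
                                    src_offset, dst_offset, bytes,
                                    streams ? streams[i] : nullptr);
  if (status != nullptr) return status;
}
return nullptr;  // success in the C API convention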


Comment on lines +481 to 496
void BufferManager::MemCpy(WGPUBuffer src, WGPUBuffer dst, size_t size, size_t src_offset, size_t dst_offset) const {
  ORT_ENFORCE(src != dst, "Source and destination buffers must be different.");
  EnforceBufferUnmapped(context_, src);
  EnforceBufferUnmapped(context_, dst);

  auto buffer_size = NormalizeBufferSize(size);
  auto src_size = static_cast<size_t>(wgpuBufferGetSize(src));
  auto dst_size = static_cast<size_t>(wgpuBufferGetSize(dst));
- ORT_ENFORCE(buffer_size <= src_size && buffer_size <= dst_size,
+ ORT_ENFORCE(src_offset + buffer_size <= src_size && dst_offset + buffer_size <= dst_size,
              "Source and destination buffers must have enough space for the copy operation. src_size=",
-             src_size, ", dst_size=", dst_size, ", copy_size=", buffer_size, ".");
+             src_size, ", dst_size=", dst_size, ", src_offset=", src_offset, ", dst_offset=", dst_offset, ", copy_size=", buffer_size, ".");

  auto& command_encoder = context_.GetCommandEncoder();
  context_.EndComputePass();
- command_encoder.CopyBufferToBuffer(src, 0, dst, 0, buffer_size);
+ command_encoder.CopyBufferToBuffer(src, src_offset, dst, dst_offset, buffer_size);
}
Copilot AI commented Jan 14, 2026:

Potential issue with buffer size normalization when using offsets. NormalizeBufferSize rounds 'size' up to 16-byte alignment, so the space a copy actually needs is dst_offset + buffer_size. The validation at line 489 checks exactly that against dst_size, which is correct, but the padded copy can still touch bytes beyond the range the caller requested: for example, size=4 normalizes to buffer_size=16, so with dst_offset=8 into a 32-byte destination the check passes (8 + 16 <= 32) while the copy writes bytes 8..23, 12 bytes past the 4 that were asked for. If the destination buffer's size was normalized independently, the aligned copy combined with the offset could exceed the actual buffer bounds.

Comment on lines 140 to 143
if (!device_tensors.empty()) {
// Test original CopyTensors (backward compatible)
ASSERT_CXX_ORTSTATUS_OK(ort_env->CopyTensors(cpu_tensors, device_tensors, stream));

Copilot AI commented Jan 14, 2026:

The new CopyTensorsEx API function is not covered by any tests. The PR adds this significant new functionality but there are no test cases that exercise copying tensors with offsets or custom sizes to verify the implementation works correctly.
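A test along these lines would close that gap; a hedged sketch using the C API directly (cpu_value and device_value are placeholder OrtValue pointers, env is the test's OrtEnv, and the assertion macro name is assumed to follow the patterns used elsewhere in this file):

// Sketch: exercise CopyTensorsEx with a non-zero destination offset.
const OrtValue* srcs[1] = {cpu_value};
OrtValue* dsts[1] = {device_value};
size_t src_offsets[1] = {0};
size_t dst_offsets[1] = {4 * sizeof(float)};  // start writing at element 4
size_t sizes[1] = {sizeof(float)};            // copy a single float
ASSERT_ORTSTATUS_OK(api->CopyTensorsEx(env, srcs, dsts, src_offsets, dst_offsets,
                                       sizes, /*stream*/ nullptr, /*num_tensors*/ 1));
// Then copy the device tensor back to CPU and verify only element 4 changed.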

Comment on lines +6788 to +6805
/** \brief Copy OrtValue instances containing Tensors between devices with offset and size control.
*
* Extended version of CopyTensors that supports copying with source/destination offsets and custom sizes.
* All offsets and sizes are in bytes.
*
* \param[in] env The OrtEnv instance to use.
* \param[in] src_tensors Array of OrtValue instances containing the source tensors to copy.
* \param[in] dst_tensors Array of OrtValue instances to copy the source tensors to.
* \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
* \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
* \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
* \param[in] stream Optional OrtSyncStream that can be used to perform the copy asynchronously. May be nullptr.
* \param[in] num_tensors The number of tensors to copy.
*
* \snippet{doc} snippets.dox OrtStatus Return Value
*
* \since Version 1.24
*/
Copilot AI commented Jan 14, 2026:

The documentation for CopyTensorsEx should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether the implementation is expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for API consumers to understand their responsibilities and avoid undefined behavior.

Copilot AI (Contributor) left a comment:

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

fs-eire (Contributor) commented Jan 14, 2026:

@tianleiwu @skottmckay

fs-eire (Contributor) left a comment:

Adding CopyTensorsEx should probably be fine, but modifying the signature of the existing OrtDataTransferImpl::CopyTensors will probably cause backward compatibility issues.

skottmckay (Contributor) commented:

> This pull request introduces enhanced support for copying tensor data with fine-grained control over source/destination offsets and copy sizes, both in the ONNX Runtime C API and the WebGPU provider. The changes add new API methods and extend existing interfaces to allow partial or offset-based tensor copies, with corresponding updates to the CPU and WebGPU data transfer implementations and buffer management logic.

Can you please explain 'why' this is required? Changing OrtDataTransferImpl::CopyTensors will impact every non-CPU EP.

A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?
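For context, for a CPU-addressable buffer the approach described here looks roughly like this (a sketch; GetExistingBuffer is hypothetical, and CreateTensorWithDataAsOrtValue wraps caller-owned memory without copying):

#include "onnxruntime_c_api.h"

const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

OrtMemoryInfo* mem_info = nullptr;
g_ort->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &mem_info);

float* base = GetExistingBuffer();  // hypothetical: caller-owned backing memory
const int64_t shape[1] = {64};      // view 64 floats starting at element 128
OrtValue* subset = nullptr;
OrtStatus* status = g_ort->CreateTensorWithDataAsOrtValue(
    mem_info, base + 128, 64 * sizeof(float), shape, 1,
    ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT, &subset);
// `subset` can now be passed to the existing CopyTensors unchanged.

Note this relies on the data pointer being real addressable memory, which is exactly the limitation fs-eire raises below for WebGPU.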

tianleiwu (Contributor) commented:

> A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?

I agree. From an API point of view, we only need an API that copies a specified number of bytes from a source location to a target location.

A user can add a helper function for sub-tensor copies; it does not need to be exposed as an API. The helper can be shared by EPs if it is used internally.

fs-eire (Contributor) commented Jan 15, 2026:

> A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?

For WebGPU, a buffer is just a handle. Unlike CUDA, whose memory model allows adding an offset to a pointer, there is no way to represent that in WebGPU.

skottmckay (Contributor) commented:

> A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?
>
> For WebGPU, a buffer is just a handle. Unlike CUDA, whose memory model allows adding an offset to a pointer, there is no way to represent that in WebGPU.

It's up to the EP (and its data transfer implementation) to interpret what the tensor's void* data pointer represents. e.g. for the DML EP it is a struct with metadata, not just a raw pointer to memory. The WebGPU EP could do something similar and use a struct with the handle and offset info. The user would need to know how to create that struct, but could use CreateTensorWithDataAsOrtValue and pass it in as the void* p_data.

Adding a whole new API to ORT for this sort of thing just for the WebGPU EP feels like the wrong place to be doing it. The ORT API in general deals at the granularity of tensors, not small chunks of data within a tensor. What's the use-case where you need to copy a subset of data from an OrtValue?

qjia7 (Contributor, Author) commented Jan 15, 2026:

> Adding a whole new API to ORT for this sort of thing just for the WebGPU EP feels like the wrong place to be doing it. The ORT API in general deals at the granularity of tensors, not small chunks of data within a tensor. What's the use-case where you need to copy a subset of data from an OrtValue?

Two use cases in onnxruntime-genai:

  1. Support CopyFrom(size_t begin_dest, DeviceBuffer& source, size_t begin_source, size_t size_in_bytes) on GPU. I commented on this at https://github.com/microsoft/onnxruntime-genai/pull/1848/files#r2513245293.
  2. Support UpdateAttentionMask on GPU. When graph capture is enabled, the attention mask lives on GPU with shape [batch_size, max_sequence_length]; we can use CopyTensors with an offset to update the mask at a specific position, e.g. copy a 1 into attention_mask[total_sequence_length-1] (see the offset sketch below).
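A hedged sketch of the offset arithmetic for use case 2 (all names are placeholders; the mask is assumed to hold int32 elements):

#include <cstdint>

// attention_mask has shape [batch_size, max_sequence_length] and lives on GPU.
const size_t batch = 0;                            // row being updated
const size_t element = total_sequence_length - 1;  // column to set to 1
const size_t dst_offset_bytes =
    (batch * max_sequence_length + element) * sizeof(int32_t);

// With the proposed API: copy sizeof(int32_t) bytes from a CPU tensor holding {1}
// into the GPU mask at dst_offset_bytes, leaving the rest of the mask untouched.
size_t src_offsets[1] = {0};
size_t dst_offsets[1] = {dst_offset_bytes};
size_t sizes[1] = {sizeof(int32_t)};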

skottmckay (Contributor) commented:

What's the motivation for other EP authors to handle offset-based copies in their IDataTransfer implementation? It would be a bit of a smell, suggesting this doesn't belong in the ORT API, if only the CPU and WebGPU EPs support it.

FWIW it feels a little loose to be taking arbitrary offsets and sizes given the source and destination are Tensor instances with specific shapes. Could we use something like an axis and index for the source and target locations to ensure the copy makes sense?

If this is purely to enable some copies in genai, is another option to do that via a model, by either augmenting the original model with the model editor API or having a small helper model that is used? e.g. something like ScatterElements or ScatterND might be applicable. If the model input and output for the copy share the same buffer, the writes should happen in place.

Is another alternative having a CreateTensor variant that takes an offset to support the void* not necessarily being an addressable pointer to memory?
