[webgpu] Support CopyTensors with offsets #27004
base: main
Conversation
Pull request overview
This pull request extends the ONNX Runtime C API and internal data transfer infrastructure to support fine-grained tensor copying with source/destination offsets and custom copy sizes. The changes add a new CopyTensorsEx API function, extend the OrtDataTransferImpl interface to accept offset parameters, and update CPU and WebGPU data transfer implementations to handle offset-based copying.
Changes:
- Added new `CopyTensorsEx` C API function with offset and size parameters for partial tensor copies
- Extended `OrtDataTransferImpl::CopyTensors` to accept source_offsets, destination_offsets, and sizes arrays
- Updated WebGPU `BufferManager` and `DataTransfer` to support offset-based memory operations
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| include/onnxruntime/core/session/onnxruntime_c_api.h | Adds CopyTensorsEx API declaration with offset/size parameters |
| include/onnxruntime/core/session/onnxruntime_ep_c_api.h | Updates OrtDataTransferImpl::CopyTensors signature with offset parameters |
| onnxruntime/core/session/ort_apis.h | Declares CopyTensorsEx implementation function |
| onnxruntime/core/session/onnxruntime_c_api.cc | Implements CopyTensors and CopyTensorsEx using shared helper function |
| onnxruntime/core/framework/data_transfer.h | Adds offset/size fields to SrcDstPair and new CopyTensor overload |
| onnxruntime/core/framework/data_transfer.cc | Implements offset-aware CopyTensor for CPU and base interface |
| onnxruntime/core/framework/plugin_data_transfer.cc | Updates to call CopyTensors with offset parameters |
| onnxruntime/core/providers/webgpu/data_transfer.h | Declares offset-aware CopyTensor overload |
| onnxruntime/core/providers/webgpu/data_transfer.cc | Implements offset-based tensor copying for WebGPU |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Updates Upload/MemCpy/Download signatures with offset parameters |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Implements offset support in buffer operations |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Updates WebGpuDataTransferImpl to extract and pass offset parameters |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.h | Updates example EP data transfer signature |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.cc | Implements offset handling in example EP |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.h | Updates example EP kernel registry data transfer signature |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.cc | Implements offset handling in example EP kernel registry |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/kernels/utils.h | Updates CopyTensor call with null offset parameters |
| onnxruntime/test/shared_lib/test_data_copy.cc | Adds comment about backward compatibility test |
Comments suppressed due to low confidence (2)
onnxruntime/core/providers/webgpu/data_transfer.cc:46
- Missing bounds validation for offset and size parameters. The function should validate that src_offset + bytes does not exceed src.SizeInBytes() and dst_offset + bytes does not exceed dst.SizeInBytes() before performing the copy operations. This is especially important for CPU to GPU and GPU to CPU transfers where buffer overflow could occur.
```cpp
common::Status DataTransfer::CopyTensor(const Tensor& src, Tensor& dst,
                                        size_t src_offset, size_t dst_offset, size_t size) const {
  size_t bytes = size > 0 ? size : src.SizeInBytes();
  if (bytes > 0) {
    void const* src_data = src.DataRaw();
    void* dst_data = dst.MutableDataRaw();
    auto& src_device = src.Location().device;
    auto& dst_device = dst.Location().device;

    if (dst_device.Type() == OrtDevice::GPU) {
      if (src_device.Type() == OrtDevice::GPU) {
        // copy from GPU to GPU
        buffer_manager_.MemCpy(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      } else {
        // copy from CPU to GPU
        buffer_manager_.Upload(const_cast<void*>(src_data),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      }
    } else /* if (src_device.Type() == OrtDevice::GPU) */ {
      // copy from GPU to CPU
      buffer_manager_.Download(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               dst_data, bytes, src_offset, dst_offset);
    }
  }
  return Status::OK();
}
```
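The bounds check that the review comment asks for could be factored into a small helper that runs before the copy dispatches to the buffer manager. The following is a minimal sketch under that assumption; the helper names are hypothetical and not part of this PR:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true only when the byte range
// [offset, offset + copy_bytes) fits inside a buffer of total_bytes.
// The addition is checked for overflow first, since offset + copy_bytes
// can wrap around for hostile size_t inputs.
bool RangeFitsInBuffer(size_t total_bytes, size_t offset, size_t copy_bytes) {
  if (copy_bytes > SIZE_MAX - offset) {
    return false;  // offset + copy_bytes would overflow size_t
  }
  return offset + copy_bytes <= total_bytes;
}

// Validate both sides of a copy, mirroring the checks suggested for
// DataTransfer::CopyTensor: each offset plus the effective byte count
// must stay within the corresponding tensor's SizeInBytes().
bool CopyRangesValid(size_t src_total, size_t dst_total,
                     size_t src_offset, size_t dst_offset, size_t copy_bytes) {
  return RangeFitsInBuffer(src_total, src_offset, copy_bytes) &&
         RangeFitsInBuffer(dst_total, dst_offset, copy_bytes);
}
```

A real implementation would return a `Status` with a descriptive message rather than a bool, but the range arithmetic would be the same.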
include/onnxruntime/core/session/onnxruntime_ep_c_api.h:142
- The documentation for CopyTensors should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether implementations are expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for EP implementers to understand their responsibilities.
```cpp
/** \brief Copy tensors from src_tensors to dst_tensors using the provided streams.
 *
 * The implementation can use the provided streams to perform asynchronous copies if supported.
 * If a stream is not available, the copy is performed synchronously.
 *
 * \param[in] this_ptr Pointer to the OrtDataTransferImpl instance.
 * \param[in] src_tensors Array of source OrtValue pointers to copy from.
 * \param[in] dst_tensors Array of destination OrtValue pointers to copy to.
 * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
 * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
 * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
 * \param[in] streams Array of OrtSyncStream pointers for the copy operations, if the execution provider is stream
 *                    aware. nullptr if it is not.
 * \param[in] num_tensors Number of tensors to copy.
 *
 * \snippet{doc} snippets.dox OrtStatus Return Value
 *
 * \since Version 1.23.
 */
```
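Because the offset and size arrays are optional, an EP implementation first has to resolve per-tensor effective values before copying. A hedged sketch of that resolution logic follows; the struct and function names are illustrative only and do not come from the ORT headers:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative resolution of the optional parameters documented above:
// a nullptr offsets array means offset 0 for every tensor, and a nullptr
// sizes array (or, following the CPU implementation in this PR, a 0 entry)
// means "copy the whole source tensor".
struct ResolvedCopy {
  size_t src_offset;
  size_t dst_offset;
  size_t bytes;
};

ResolvedCopy ResolveCopyArgs(size_t i,
                             const size_t* source_offsets,
                             const size_t* destination_offsets,
                             const size_t* sizes,
                             size_t src_size_in_bytes) {
  ResolvedCopy r;
  r.src_offset = source_offsets ? source_offsets[i] : 0;
  r.dst_offset = destination_offsets ? destination_offsets[i] : 0;
  r.bytes = (sizes && sizes[i] > 0) ? sizes[i] : src_size_in_bytes;
  return r;
}
```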
```diff
 void BufferManager::MemCpy(WGPUBuffer src, WGPUBuffer dst, size_t size, size_t src_offset, size_t dst_offset) const {
   ORT_ENFORCE(src != dst, "Source and destination buffers must be different.");
   EnforceBufferUnmapped(context_, src);
   EnforceBufferUnmapped(context_, dst);

   auto buffer_size = NormalizeBufferSize(size);
   auto src_size = static_cast<size_t>(wgpuBufferGetSize(src));
   auto dst_size = static_cast<size_t>(wgpuBufferGetSize(dst));
-  ORT_ENFORCE(buffer_size <= src_size && buffer_size <= dst_size,
+  ORT_ENFORCE(src_offset + buffer_size <= src_size && dst_offset + buffer_size <= dst_size,
               "Source and destination buffers must have enough space for the copy operation. src_size=",
-              src_size, ", dst_size=", dst_size, ", copy_size=", buffer_size, ".");
+              src_size, ", dst_size=", dst_size, ", src_offset=", src_offset, ", dst_offset=", dst_offset, ", copy_size=", buffer_size, ".");

   auto& command_encoder = context_.GetCommandEncoder();
   context_.EndComputePass();
-  command_encoder.CopyBufferToBuffer(src, 0, dst, 0, buffer_size);
+  command_encoder.CopyBufferToBuffer(src, src_offset, dst, dst_offset, buffer_size);
 }
```
Copilot AI · Jan 14, 2026
Potential issue with buffer size normalization when using offsets. The NormalizeBufferSize function rounds up 'size' to be aligned to 16 bytes, but when dst_offset is applied, the total required buffer space is actually dst_offset + buffer_size. The current validation at line 489 checks dst_offset + buffer_size against dst_size, which is correct. However, if the destination buffer was created with a size that was normalized independently, there could be cases where the aligned buffer_size causes the operation to exceed the actual buffer bounds when combined with the offset.
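The interaction this comment describes can be shown with the rounding arithmetic alone. Assuming `NormalizeBufferSize` rounds the requested size up to a 16-byte multiple (as the comment states), a copy that fits for the requested size can still fail the offset-adjusted check after normalization. This is a self-contained sketch of the arithmetic, not the actual BufferManager code:

```cpp
#include <cassert>
#include <cstddef>

// Assumed behaviour of NormalizeBufferSize, per the review comment:
// round the copy size up to the next multiple of 16 bytes.
size_t NormalizeBufferSize16(size_t size) {
  return (size + 15) / 16 * 16;
}

// True when a normalized copy of `size` bytes placed at `dst_offset`
// still fits in a destination buffer of `dst_size` bytes.
bool NormalizedCopyFits(size_t dst_size, size_t dst_offset, size_t size) {
  return dst_offset + NormalizeBufferSize16(size) <= dst_size;
}
```

For example, 20 requested bytes at offset 48 in a 68-byte region fit before normalization (48 + 20 = 68) but not after (48 + 32 = 80), which is exactly the hazard the comment raises.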
```cpp
if (!device_tensors.empty()) {
  // Test original CopyTensors (backward compatible)
  ASSERT_CXX_ORTSTATUS_OK(ort_env->CopyTensors(cpu_tensors, device_tensors, stream));
```
Copilot AI · Jan 14, 2026
The new CopyTensorsEx API function is not covered by any tests. The PR adds this significant new functionality but there are no test cases that exercise copying tensors with offsets or custom sizes to verify the implementation works correctly.
```cpp
/** \brief Copy OrtValue instances containing Tensors between devices with offset and size control.
 *
 * Extended version of CopyTensors that supports copying with source/destination offsets and custom sizes.
 * All offsets and sizes are in bytes.
 *
 * \param[in] env The OrtEnv instance to use.
 * \param[in] src_tensors Array of OrtValue instances containing the source tensors to copy.
 * \param[in] dst_tensors Array of OrtValue instances to copy the source tensors to.
 * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
 * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
 * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
 * \param[in] stream Optional OrtSyncStream that can be used to perform the copy asynchronously. May be nullptr.
 * \param[in] num_tensors The number of tensors to copy.
 *
 * \snippet{doc} snippets.dox OrtStatus Return Value
 *
 * \since Version 1.24
 */
```
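The byte-level semantics documented above (independent source and destination offsets, with `sizes` selecting the range) can be sketched against plain byte buffers. This is an illustration of the documented contract, not the ORT implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative byte-level contract of an offset-based copy: `size` bytes
// starting at src_offset in `src` land at dst_offset in `dst`; all other
// destination bytes are left untouched. The caller is assumed to have
// validated that both ranges are in bounds.
void CopyBytesWithOffsets(const std::vector<unsigned char>& src,
                          std::vector<unsigned char>& dst,
                          size_t src_offset, size_t dst_offset, size_t size) {
  std::memcpy(dst.data() + dst_offset, src.data() + src_offset, size);
}
```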
Copilot AI · Jan 14, 2026
The documentation for CopyTensorsEx should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether the implementation is expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for API consumers to understand their responsibilities and avoid undefined behavior.
fs-eire left a comment
Adding CopyTensorsEx should probably be fine, but modifying the signature of the existing OrtDataTransferImpl::CopyTensors will probably cause backward compatibility issues.
Can you please explain why this is required? A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?
I agree. From an API point of view, we only need an API to copy a specified number of bytes from a source location to a target location. A user can add a helper function for sub-tensor copy, but it does not need to be exposed as an API. The helper function could be shared by some EPs if it is used internally.
For WebGPU, a buffer is just a handle. Unlike CUDA, whose memory model allows adding an offset to a pointer, there is no way to represent that in WebGPU.
It's up to the EP (and its data transfer implementation) to interpret what that means. Adding a whole new API to ORT for this sort of thing just for the WebGPU EP feels like the wrong place to be doing it. The ORT API in general deals in a granularity of tensors, not small chunks of data within a tensor. What's the use-case where you need to copy a subset of data from an OrtValue?
Two use cases in onnxruntime-genai:
What's the motivation for other EP authors to handle offset-based copy in their IDataTransfer implementation? It would be a bit of a smell, suggesting this doesn't belong in the ORT API, if only the CPU and WebGPU EPs support it.

FWIW, it feels a little loose to be taking arbitrary offsets and sizes given that the source and destination are Tensor instances with specific shapes. Could we use something like an axis and index for the source and target locations to ensure the copy makes sense?

If this is purely to enable some copies in genai, is another option to do that via a model, either by augmenting the original model with the model editor API or by having a small helper model? e.g. something like ScatterElements or ScatterND might be applicable. If the model input and output for the copy were the same buffer, it should do the writes in place.

Is another alternative having a CreateTensor variant that takes an offset to support this use case?
This pull request introduces enhanced support for copying tensor data with fine-grained control over source/destination offsets and copy sizes, both in the ONNX Runtime C API and the WebGPU provider. The changes add new API methods and extend existing interfaces to allow partial or offset-based tensor copies, with corresponding updates to the CPU and WebGPU data transfer implementations and buffer management logic.