Add CreateTensor with byte offset support for WebGPU sub-buffer views #27004
Conversation
Pull request overview
This pull request extends the ONNX Runtime C API and internal data transfer infrastructure to support fine-grained tensor copying with source/destination offsets and custom copy sizes. The changes add a new CopyTensorsEx API function, extend the OrtDataTransferImpl interface to accept offset parameters, and update CPU and WebGPU data transfer implementations to handle offset-based copying.
Changes:
- Added new `CopyTensorsEx` C API function with offset and size parameters for partial tensor copies
- Extended `OrtDataTransferImpl::CopyTensors` to accept `source_offsets`, `destination_offsets`, and `sizes` arrays
- Updated WebGPU BufferManager and DataTransfer to support offset-based memory operations
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| include/onnxruntime/core/session/onnxruntime_c_api.h | Adds CopyTensorsEx API declaration with offset/size parameters |
| include/onnxruntime/core/session/onnxruntime_ep_c_api.h | Updates OrtDataTransferImpl::CopyTensors signature with offset parameters |
| onnxruntime/core/session/ort_apis.h | Declares CopyTensorsEx implementation function |
| onnxruntime/core/session/onnxruntime_c_api.cc | Implements CopyTensors and CopyTensorsEx using shared helper function |
| onnxruntime/core/framework/data_transfer.h | Adds offset/size fields to SrcDstPair and new CopyTensor overload |
| onnxruntime/core/framework/data_transfer.cc | Implements offset-aware CopyTensor for CPU and base interface |
| onnxruntime/core/framework/plugin_data_transfer.cc | Updates to call CopyTensors with offset parameters |
| onnxruntime/core/providers/webgpu/data_transfer.h | Declares offset-aware CopyTensor overload |
| onnxruntime/core/providers/webgpu/data_transfer.cc | Implements offset-based tensor copying for WebGPU |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Updates Upload/MemCpy/Download signatures with offset parameters |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Implements offset support in buffer operations |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Updates WebGpuDataTransferImpl to extract and pass offset parameters |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.h | Updates example EP data transfer signature |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.cc | Implements offset handling in example EP |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.h | Updates example EP kernel registry data transfer signature |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.cc | Implements offset handling in example EP kernel registry |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/kernels/utils.h | Updates CopyTensor call with null offset parameters |
| onnxruntime/test/shared_lib/test_data_copy.cc | Adds comment about backward compatibility test |
Comments suppressed due to low confidence (2)
onnxruntime/core/providers/webgpu/data_transfer.cc:46
- Missing bounds validation for offset and size parameters. The function should validate that src_offset + bytes does not exceed src.SizeInBytes() and dst_offset + bytes does not exceed dst.SizeInBytes() before performing the copy operations. This is especially important for CPU to GPU and GPU to CPU transfers where buffer overflow could occur.
```cpp
common::Status DataTransfer::CopyTensor(const Tensor& src, Tensor& dst,
                                        size_t src_offset, size_t dst_offset, size_t size) const {
  size_t bytes = size > 0 ? size : src.SizeInBytes();
  if (bytes > 0) {
    void const* src_data = src.DataRaw();
    void* dst_data = dst.MutableDataRaw();

    auto& src_device = src.Location().device;
    auto& dst_device = dst.Location().device;

    if (dst_device.Type() == OrtDevice::GPU) {
      if (src_device.Type() == OrtDevice::GPU) {
        // copy from GPU to GPU
        buffer_manager_.MemCpy(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      } else {
        // copy from CPU to GPU
        buffer_manager_.Upload(const_cast<void*>(src_data),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      }
    } else /* if (src_device.Type() == OrtDevice::GPU) */ {
      // copy from GPU to CPU
      buffer_manager_.Download(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               dst_data, bytes, src_offset, dst_offset);
    }
  }

  return Status::OK();
}
```
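The validation the review comment asks for can be sketched as a standalone predicate, using the same size-selection rule as the snippet above (`size == 0` means "copy the whole source tensor"). The function name and signature are illustrative, not ORT code:

```cpp
#include <cstddef>

// Returns true iff the requested region fits in both tensors' byte ranges.
// Comparisons are written as subtractions so that offset + bytes cannot
// overflow the unsigned type before the check.
inline bool OffsetsInBounds(std::size_t src_size, std::size_t dst_size,
                            std::size_t src_offset, std::size_t dst_offset,
                            std::size_t size) {
  const std::size_t bytes = size > 0 ? size : src_size;  // 0 => whole tensor
  return src_offset <= src_size && bytes <= src_size - src_offset &&
         dst_offset <= dst_size && bytes <= dst_size - dst_offset;
}
```

A check along these lines, performed before dispatching to `MemCpy`/`Upload`/`Download`, would let the copy fail with a status error instead of overflowing a buffer.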
include/onnxruntime/core/session/onnxruntime_ep_c_api.h:142
- The documentation for CopyTensors should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether implementations are expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for EP implementers to understand their responsibilities.
```cpp
/** \brief Copy tensors from src_tensors to dst_tensors using the provided streams.
 *
 * The implementation can use the provided streams to perform asynchronous copies if supported.
 * If a stream is not available, the copy is performed synchronously.
 *
 * \param[in] this_ptr Pointer to the OrtDataTransferImpl instance.
 * \param[in] src_tensors Array of source OrtValue pointers to copy from.
 * \param[in] dst_tensors Array of destination OrtValue pointers to copy to.
 * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
 * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
 * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
 * \param[in] streams Array of OrtSyncStream pointers for the copy operations, if the execution provider is stream
 *                    aware. nullptr if it is not.
 * \param[in] num_tensors Number of tensors to copy.
 *
 * \snippet{doc} snippets.dox OrtStatus Return Value
 *
 * \since Version 1.23.
 */
```
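The nullptr conventions documented above (offsets default to zero, sizes default to full tensors) can be illustrated with a small sketch. This is not ORT code; `src_full_sizes[i]` stands in for the i-th source tensor's `SizeInBytes()`, and a plain `memcpy` stands in for the device copy:

```cpp
#include <cstddef>
#include <cstring>

// Interprets the optional arrays the way the doc comment describes:
// a nullptr offsets array is treated as all zeros, and a nullptr sizes
// array means "copy each tensor in full".
inline void CopyWithOptionalOffsets(const void* const* srcs, void* const* dsts,
                                    const std::size_t* src_offsets,
                                    const std::size_t* dst_offsets,
                                    const std::size_t* sizes,
                                    const std::size_t* src_full_sizes,
                                    std::size_t num_tensors) {
  for (std::size_t i = 0; i < num_tensors; ++i) {
    const std::size_t src_off = src_offsets ? src_offsets[i] : 0;
    const std::size_t dst_off = dst_offsets ? dst_offsets[i] : 0;
    const std::size_t bytes = sizes ? sizes[i] : src_full_sizes[i];
    std::memcpy(static_cast<char*>(dsts[i]) + dst_off,
                static_cast<const char*>(srcs[i]) + src_off, bytes);
  }
}
```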
```diff
 void BufferManager::MemCpy(WGPUBuffer src, WGPUBuffer dst, size_t size, size_t src_offset, size_t dst_offset) const {
   ORT_ENFORCE(src != dst, "Source and destination buffers must be different.");
   EnforceBufferUnmapped(context_, src);
   EnforceBufferUnmapped(context_, dst);

   auto buffer_size = NormalizeBufferSize(size);
   auto src_size = static_cast<size_t>(wgpuBufferGetSize(src));
   auto dst_size = static_cast<size_t>(wgpuBufferGetSize(dst));
-  ORT_ENFORCE(buffer_size <= src_size && buffer_size <= dst_size,
+  ORT_ENFORCE(src_offset + buffer_size <= src_size && dst_offset + buffer_size <= dst_size,
               "Source and destination buffers must have enough space for the copy operation. src_size=",
-              src_size, ", dst_size=", dst_size, ", copy_size=", buffer_size, ".");
+              src_size, ", dst_size=", dst_size, ", src_offset=", src_offset,
+              ", dst_offset=", dst_offset, ", copy_size=", buffer_size, ".");

   auto& command_encoder = context_.GetCommandEncoder();
   context_.EndComputePass();
-  command_encoder.CopyBufferToBuffer(src, 0, dst, 0, buffer_size);
+  command_encoder.CopyBufferToBuffer(src, src_offset, dst, dst_offset, buffer_size);
 }
```
Potential issue with buffer size normalization when using offsets. The NormalizeBufferSize function rounds up 'size' to be aligned to 16 bytes, but when dst_offset is applied, the total required buffer space is actually dst_offset + buffer_size. The current validation at line 489 checks dst_offset + buffer_size against dst_size, which is correct. However, if the destination buffer was created with a size that was normalized independently, there could be cases where the aligned buffer_size causes the operation to exceed the actual buffer bounds when combined with the offset.
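The interplay between alignment and offsets can be shown numerically. Here `NormalizeBufferSize` is assumed to round the copy size up to a 16-byte multiple, per the comment above; this stand-in is not the actual WebGPU EP implementation:

```cpp
#include <cstddef>

// Assumed behavior: round size up to the next multiple of 16 bytes.
inline std::size_t NormalizeBufferSize(std::size_t size) {
  return (size + 15) & ~static_cast<std::size_t>(15);
}
// With a 32-byte destination buffer, copying 20 bytes at dst_offset 8:
// the raw request (8 + 20 = 28) fits, but the aligned request
// (8 + NormalizeBufferSize(20) = 8 + 32 = 40) exceeds the buffer, so the
// offset-aware ORT_ENFORCE would fire even though the caller's copy fits.
```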
```cpp
if (!device_tensors.empty()) {
  // Test original CopyTensors (backward compatible)
  ASSERT_CXX_ORTSTATUS_OK(ort_env->CopyTensors(cpu_tensors, device_tensors, stream));
```
The new CopyTensorsEx API function is not covered by any tests. The PR adds this significant new functionality but there are no test cases that exercise copying tensors with offsets or custom sizes to verify the implementation works correctly.
```cpp
/** \brief Copy OrtValue instances containing Tensors between devices with offset and size control.
 *
 * Extended version of CopyTensors that supports copying with source/destination offsets and custom sizes.
 * All offsets and sizes are in bytes.
 *
 * \param[in] env The OrtEnv instance to use.
 * \param[in] src_tensors Array of OrtValue instances containing the source tensors to copy.
 * \param[in] dst_tensors Array of OrtValue instances to copy the source tensors to.
 * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
 * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
 * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
 * \param[in] stream Optional OrtSyncStream that can be used to perform the copy asynchronously. May be nullptr.
 * \param[in] num_tensors The number of tensors to copy.
 *
 * \snippet{doc} snippets.dox OrtStatus Return Value
 *
 * \since Version 1.24
 */
```
The documentation for CopyTensorsEx should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether the implementation is expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for API consumers to understand their responsibilities and avoid undefined behavior.
fs-eire left a comment:
Adding CopyTensorsEx should probably be fine, but modifying the signature of the existing OrtDataTransferImpl::CopyTensors would probably cause backward compatibility issues.
Can you please explain why this is required? A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?
I agree. From an API point of view, we only need an API to copy a specified number of bytes from a source location to a target location. A user can add a helper function for sub-tensor copy; it doesn't need to be exposed as an API. The helper function can be shared by some EPs if it is used internally.
For WebGPU, a buffer is just a handle. Unlike CUDA, which uses a memory model that allows adding an offset to a pointer, there is no way to represent that in WebGPU.
It's up to the EP (and its data transfer implementation) to interpret the data pointer. Adding a whole new API to ORT for this sort of thing just for the WebGPU EP feels like the wrong place to be doing it. The ORT API in general deals in a granularity of tensors, not small chunks of data within a tensor. What's the use-case where you need to copy a subset of data from an OrtValue?
Two use cases in onnxruntime-genai:
What's the motivation for other EP authors to handle offset-based copy in their IDataTransfer implementation? It would be a bit of a smell that it doesn't belong in the ORT API if only the CPU and WebGPU EPs support it.

FWIW it feels a little loose to be taking arbitrary offsets and sizes given that the source and destination are Tensor instances with specific shapes. Could we use something like an axis and index for the source and target locations to ensure the copy makes sense?

If this is purely to enable some copies in genai, is another option to do that via a model, by either augmenting the original model with the model editor API or having a small helper model that is used? e.g. something like ScatterElements or ScatterND might be applicable. If the model input and output for the copy was the same buffer it should make the writes in place.

Is another alternative having a CreateTensor variant that takes an offset?
You're absolutely right. From an implementation perspective, CPU/CUDA already support creating tensors with offsets implicitly (via pointer arithmetic), while handle-based EPs like WebGPU cannot. However, from a user's perspective, we should provide a unified API that works consistently across all EPs. Your suggestions of either a CreateTensor variant that takes an offset or a CreateSubTensor that takes an OrtValue would achieve this nicely.
I agree that using axis+index would provide better type safety. If we align on the overall direction, I'm happy to work out the specific API design with you—whether that's axis/index based or offset based with proper validation.
I've prototyped the model-based approach (in fact, I'm using it for Cast operations in https://github.com/microsoft/onnxruntime-genai/pull/1895). However, for frequent small updates like CopyFrom and UpdateAttentionMask, my benchmarks show that direct copying significantly outperforms running a dedicated ONNX model, even a single-op one.
Both approaches would work well for the WebGPU use case. From the EP's perspective, they achieve the same goal: enabling partial tensor updates for both pointer-based EPs (CPU, CUDA) and handle-based EPs (WebGPU, and potentially Vulkan/Metal). To confirm the path forward: Do I understand correctly that you're supportive of adding new API functionality to enable partial tensor updates across all EPs, and we need to finalize whether to use:
Happy to prototype whichever approach you think fits best with ORT's API design principles.
One of the challenges with creating a new Tensor instance pointing to a subset of an existing tensor is the path to EP-specific logic to interpret the handle. Typically that sort of logic is in the IAllocator for the OrtDevice, but as we're not allocating in this scenario that doesn't quite fit. It doesn't quite fit to add to IDataTransfer either, given we're not doing a data transfer at that point. One option might be the new external resource importer, where you can import memory and create an OrtValue from it. Added recently in #26828. That already supports a […]. Maybe a choice between that and #3. Would like to know what others on the ORT team think about the options. @adrianlizarraga @edgchen1 @yuslepukhin
Ping @adrianlizarraga @edgchen1 @yuslepukhin. Any suggestions on Scott's comment above?
- Add CreateTensorWithDataAsOrtValueWithByteOffset to OrtApi (v26, offset 418)
  - WebGPU branch: stores raw WGPUBuffer handle and offset via CreateTensorImplWithByteOffset
  - CPU/CUDA branch: advances pointer directly before delegating to CreateTensorImpl
- Add typed (element offset) and void* (byte offset) C++ wrapper overloads in onnxruntime_cxx_api.h / onnxruntime_cxx_inline.h
- Update WebGPU DataTransfer (data_transfer.cc) to recover the base WGPUBuffer from DataRaw() - ByteOffset() for all three copy directions (GPU->GPU, CPU->GPU, GPU->CPU)
- Update LaunchComputePipeline to accept bind_buffers_byte_offsets for per-binding WGPUBindGroupEntry offsets in webgpu_context.h / webgpu_context.cc
- Remove indirect_buffer_offset from CapturedCommandInfo: indirect buffers are always allocated internally by the WebGPU EP and never have a non-zero ByteOffset()
- Add tests in test_data_copy.cc covering typed/void* offset overloads and CopyTensors
```cpp
// dst.MutableDataRaw() for a CPU tensor is a real addressable pointer.
void* dst_data = dst.MutableDataRaw();
buffer_manager_.Download(src_buf, static_cast<uint8_t*>(dst_data) + dst_offset,
                         bytes, actual_src_offset, 0);
```
```cpp
if (uniform_buffer) {
  bind_buffers.push_back(uniform_buffer);
  bind_buffers_byte_offsets.push_back(0);  // uniform buffer has no byte offset
  bind_buffers_segments.push_back(1);      // uniform buffer defaults to 1 segment
```
…fer views

CopyTensorsEx is superseded by CreateTensorWithDataAsOrtValueWithByteOffset: callers now create an offset tensor view once and pass it to the regular CopyTensors.

Changes:
- Remove OrtApi::CopyTensorsEx entry and vtable slot (offset 417 freed); CreateTensorWithDataAsOrtValueWithByteOffset moves to offset 417
- Remove the CopyTensorsEx forward declaration from ort_apis.h and both implementations (full build and minimal-build stub) from onnxruntime_c_api.cc
- Remove source_offsets/destination_offsets/sizes params from OrtDataTransferImpl::CopyTensors in onnxruntime_ep_c_api.h (EP C API)
- Revert plugin_data_transfer.cc to not propagate offset arrays
- Revert framework IDataTransfer: remove CopyTensor/CopyTensorAsync overloads with offset params; remove source_offset/destination_offset/size from SrcDstPair; simplify the CopyTensors loop back to the original; remove the CPUDataTransfer offset overload
- Revert WebGPU DataTransfer: remove the CopyTensor(+offsets) overload; fold ByteOffset recovery into the single CopyTensor(src, dst), which uses the tensor's own ByteOffset() to recover the base WGPUBuffer and pass correct offsets to BufferManager::MemCpy/Upload/Download (buffer_manager offsets kept)
- Revert webgpu_provider_factory.cc CopyTensorsImpl to the simpler signature and call CopyTensor(src, dst) directly
- Revert the example EP test files (ep_data_transfer.h/.cc ×2, utils.h)
@skottmckay, I've added a new CreateTensor interface that supports byte offsets, replacing the earlier CopyTensors-with-byte-offsets approach. This change should have no side effects for the CPU and CUDA EPs. At the same time, it enables the WebGPU EP to support sub-buffer views by allowing us to recover the original WGPUBuffer correctly.
My preferred approach is to build this work on top of the external resource importer and extend the CopyTensors() implementation to support these tensor types, without requiring any changes to the existing API signature.
My concern would be that this creates a hidden requirement for any EP with an opaque handle as its data pointer. e.g. would a WebGPU kernel break if it received a Tensor instance with an offset? That doesn't feel overly safe. If an offset can't be directly applied to the data pointer because it's opaque, would it be better to always store a tuple of <WGPUBuffer, offset> in the data pointer so the Tensor class is not aware of that?
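The tuple idea floated above can be sketched as follows: keep the Tensor's data pointer pointing at a small view struct, so generic code never applies pointer arithmetic to an opaque handle. `WGPUBufferStandIn` and the struct/function names are placeholders for illustration; none of this is actual ORT or Dawn code:

```cpp
#include <cstddef>

using WGPUBufferStandIn = void*;  // stand-in for the real WGPUBuffer handle

struct WebGpuBufferView {
  WGPUBufferStandIn buffer;  // opaque handle, never offset directly
  std::size_t byte_offset;   // interpreted only by WebGPU-aware code
};

// A WebGPU-aware consumer unpacks the view instead of doing pointer math.
inline std::size_t EffectiveOffset(const void* tensor_data) {
  return static_cast<const WebGpuBufferView*>(tensor_data)->byte_offset;
}
```

The trade-off is that every WebGPU code path consuming a tensor's data pointer must know to unpack the view, but the Tensor class itself stays offset-unaware.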
Using a […]

Can we add a new method like […]?

EDIT: another proposal is to modify the definition of […].

EDIT2: it seems that currently only training code uses offset, so either way is unlikely to break existing inference usage.
It's more the future usage I'm concerned about. You're adding a new function to the public API to create a Tensor that has an offset. Anyone can call that to create an OrtValue that is used during inferencing. IIUC, any EP that has an opaque value in […]
Problem
The existing `CreateTensorWithDataAsOrtValue` API supports sub-buffer views and zero-copy transfers by accepting a pre-advanced pointer: the caller performs `(char*)p_data + byte_offset` before passing it in. This works for CPU and CUDA, where `p_data` is a real, addressable memory pointer.

However, for backends such as WebGPU, `p_data` is an opaque, non-addressable buffer handle (`WGPUBuffer`). Applying pointer arithmetic to such a handle produces a corrupted value that is no longer a valid buffer handle, making it impossible to express sub-buffer tensor views or perform zero-copy transfers over regions of a larger GPU allocation.

Solution

This PR introduces a new C API entry point, `CreateTensorWithDataAsOrtValueWithByteOffset`.
Instead of advancing the pointer before creating the tensor, the byte offset is stored directly in the tensor's internal `byte_offset_` field. For addressable backends (CPU, CUDA), `DataRaw()` / `MutableData()` already apply the offset via `(char*)p_data_ + byte_offset_`, so behavior is identical to the old approach. For non-addressable backends (WebGPU), the raw handle remains intact in `p_data_`, and the WebGPU DataTransfer and kernel dispatch code recover the correct buffer region at runtime using the stored `ByteOffset()`.
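The addressable-backend behavior described above can be captured in a minimal sketch: the byte offset lives in a `byte_offset_` field and is applied at access time, so the stored pointer itself stays un-advanced. The field and method names mirror the description; this is illustrative, not ORT's actual Tensor class:

```cpp
#include <cstddef>

struct TensorSketch {
  void* p_data_;             // base pointer (or, for WebGPU, an opaque handle)
  std::size_t byte_offset_;  // offset stored separately, never folded into p_data_
  // Addressable backends apply the offset on access:
  const void* DataRaw() const {
    return static_cast<const char*>(p_data_) + byte_offset_;
  }
  // Handle-based backends instead read the offset explicitly:
  std::size_t ByteOffset() const { return byte_offset_; }
};
```

Because `DataRaw()` already folds in the offset, CPU/CUDA callers see exactly the same addresses as with the old pre-advanced-pointer approach, while WebGPU code can keep `p_data_` as a valid handle and pass `ByteOffset()` to the buffer manager.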