Add CreateTensor with byte offset support for WebGPU sub-buffer views #27004
Conversation
Pull request overview
This pull request extends the ONNX Runtime C API and internal data transfer infrastructure to support fine-grained tensor copying with source/destination offsets and custom copy sizes. The changes add a new CopyTensorsEx API function, extend the OrtDataTransferImpl interface to accept offset parameters, and update CPU and WebGPU data transfer implementations to handle offset-based copying.
Changes:
- Added new `CopyTensorsEx` C API function with offset and size parameters for partial tensor copies
- Extended `OrtDataTransferImpl::CopyTensors` to accept `source_offsets`, `destination_offsets`, and `sizes` arrays
- Updated WebGPU BufferManager and DataTransfer to support offset-based memory operations
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| include/onnxruntime/core/session/onnxruntime_c_api.h | Adds CopyTensorsEx API declaration with offset/size parameters |
| include/onnxruntime/core/session/onnxruntime_ep_c_api.h | Updates OrtDataTransferImpl::CopyTensors signature with offset parameters |
| onnxruntime/core/session/ort_apis.h | Declares CopyTensorsEx implementation function |
| onnxruntime/core/session/onnxruntime_c_api.cc | Implements CopyTensors and CopyTensorsEx using shared helper function |
| onnxruntime/core/framework/data_transfer.h | Adds offset/size fields to SrcDstPair and new CopyTensor overload |
| onnxruntime/core/framework/data_transfer.cc | Implements offset-aware CopyTensor for CPU and base interface |
| onnxruntime/core/framework/plugin_data_transfer.cc | Updates to call CopyTensors with offset parameters |
| onnxruntime/core/providers/webgpu/data_transfer.h | Declares offset-aware CopyTensor overload |
| onnxruntime/core/providers/webgpu/data_transfer.cc | Implements offset-based tensor copying for WebGPU |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Updates Upload/MemCpy/Download signatures with offset parameters |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Implements offset support in buffer operations |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Updates WebGpuDataTransferImpl to extract and pass offset parameters |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.h | Updates example EP data transfer signature |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.cc | Implements offset handling in example EP |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.h | Updates example EP kernel registry data transfer signature |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.cc | Implements offset handling in example EP kernel registry |
| onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/kernels/utils.h | Updates CopyTensor call with null offset parameters |
| onnxruntime/test/shared_lib/test_data_copy.cc | Adds comment about backward compatibility test |
Comments suppressed due to low confidence (2)
onnxruntime/core/providers/webgpu/data_transfer.cc:46
- Missing bounds validation for offset and size parameters. The function should validate that src_offset + bytes does not exceed src.SizeInBytes() and dst_offset + bytes does not exceed dst.SizeInBytes() before performing the copy operations. This is especially important for CPU to GPU and GPU to CPU transfers where buffer overflow could occur.
```cpp
common::Status DataTransfer::CopyTensor(const Tensor& src, Tensor& dst,
                                        size_t src_offset, size_t dst_offset, size_t size) const {
  size_t bytes = size > 0 ? size : src.SizeInBytes();
  if (bytes > 0) {
    void const* src_data = src.DataRaw();
    void* dst_data = dst.MutableDataRaw();

    auto& src_device = src.Location().device;
    auto& dst_device = dst.Location().device;

    if (dst_device.Type() == OrtDevice::GPU) {
      if (src_device.Type() == OrtDevice::GPU) {
        // copy from GPU to GPU
        buffer_manager_.MemCpy(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      } else {
        // copy from CPU to GPU
        buffer_manager_.Upload(const_cast<void*>(src_data),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      }
    } else /* if (src_device.Type() == OrtDevice::GPU) */ {
      // copy from GPU to CPU
      buffer_manager_.Download(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               dst_data, bytes, src_offset, dst_offset);
    }
  }

  return Status::OK();
}
```
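The validation the review comment asks for can be sketched as a standalone predicate, using the same size-selection rule as the snippet above (`size == 0` means "copy the whole source tensor"). The function name and signature are illustrative, not ORT code:

```cpp
#include <cstddef>

// Returns true iff the requested region fits in both tensors' byte ranges.
// Comparisons are written as subtractions so that offset + bytes cannot
// overflow the unsigned type before the check.
inline bool OffsetsInBounds(std::size_t src_size, std::size_t dst_size,
                            std::size_t src_offset, std::size_t dst_offset,
                            std::size_t size) {
  const std::size_t bytes = size > 0 ? size : src_size;  // 0 => whole tensor
  return src_offset <= src_size && bytes <= src_size - src_offset &&
         dst_offset <= dst_size && bytes <= dst_size - dst_offset;
}
```

A check along these lines, performed before dispatching to `MemCpy`/`Upload`/`Download`, would let the copy fail with a status error instead of overflowing a buffer.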
include/onnxruntime/core/session/onnxruntime_ep_c_api.h:142
- The documentation for CopyTensors should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether implementations are expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for EP implementers to understand their responsibilities.
```cpp
/** \brief Copy tensors from src_tensors to dst_tensors using the provided streams.
 *
 * The implementation can use the provided streams to perform asynchronous copies if supported.
 * If a stream is not available, the copy is performed synchronously.
 *
 * \param[in] this_ptr Pointer to the OrtDataTransferImpl instance.
 * \param[in] src_tensors Array of source OrtValue pointers to copy from.
 * \param[in] dst_tensors Array of destination OrtValue pointers to copy to.
 * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
 * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
 * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
 * \param[in] streams Array of OrtSyncStream pointers for the copy operations, if the execution provider is stream
 *                    aware. nullptr if it is not.
 * \param[in] num_tensors Number of tensors to copy.
 *
 * \snippet{doc} snippets.dox OrtStatus Return Value
 *
 * \since Version 1.23.
 */
```
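The nullptr conventions documented above (offsets default to zero, sizes default to full tensors) can be illustrated with a small sketch. This is not ORT code; `src_full_sizes[i]` stands in for the i-th source tensor's `SizeInBytes()`, and a plain `memcpy` stands in for the device copy:

```cpp
#include <cstddef>
#include <cstring>

// Interprets the optional arrays the way the doc comment describes:
// a nullptr offsets array is treated as all zeros, and a nullptr sizes
// array means "copy each tensor in full".
inline void CopyWithOptionalOffsets(const void* const* srcs, void* const* dsts,
                                    const std::size_t* src_offsets,
                                    const std::size_t* dst_offsets,
                                    const std::size_t* sizes,
                                    const std::size_t* src_full_sizes,
                                    std::size_t num_tensors) {
  for (std::size_t i = 0; i < num_tensors; ++i) {
    const std::size_t src_off = src_offsets ? src_offsets[i] : 0;
    const std::size_t dst_off = dst_offsets ? dst_offsets[i] : 0;
    const std::size_t bytes = sizes ? sizes[i] : src_full_sizes[i];
    std::memcpy(static_cast<char*>(dsts[i]) + dst_off,
                static_cast<const char*>(srcs[i]) + src_off, bytes);
  }
}
```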
```diff
 void BufferManager::MemCpy(WGPUBuffer src, WGPUBuffer dst, size_t size, size_t src_offset, size_t dst_offset) const {
   ORT_ENFORCE(src != dst, "Source and destination buffers must be different.");
   EnforceBufferUnmapped(context_, src);
   EnforceBufferUnmapped(context_, dst);

   auto buffer_size = NormalizeBufferSize(size);
   auto src_size = static_cast<size_t>(wgpuBufferGetSize(src));
   auto dst_size = static_cast<size_t>(wgpuBufferGetSize(dst));
-  ORT_ENFORCE(buffer_size <= src_size && buffer_size <= dst_size,
+  ORT_ENFORCE(src_offset + buffer_size <= src_size && dst_offset + buffer_size <= dst_size,
               "Source and destination buffers must have enough space for the copy operation. src_size=",
-              src_size, ", dst_size=", dst_size, ", copy_size=", buffer_size, ".");
+              src_size, ", dst_size=", dst_size, ", src_offset=", src_offset,
+              ", dst_offset=", dst_offset, ", copy_size=", buffer_size, ".");

   auto& command_encoder = context_.GetCommandEncoder();
   context_.EndComputePass();
-  command_encoder.CopyBufferToBuffer(src, 0, dst, 0, buffer_size);
+  command_encoder.CopyBufferToBuffer(src, src_offset, dst, dst_offset, buffer_size);
 }
```
Potential issue with buffer size normalization when using offsets. The NormalizeBufferSize function rounds up 'size' to be aligned to 16 bytes, but when dst_offset is applied, the total required buffer space is actually dst_offset + buffer_size. The current validation at line 489 checks dst_offset + buffer_size against dst_size, which is correct. However, if the destination buffer was created with a size that was normalized independently, there could be cases where the aligned buffer_size causes the operation to exceed the actual buffer bounds when combined with the offset.
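The interplay between alignment and offsets can be shown numerically. Here `NormalizeBufferSize` is assumed to round the copy size up to a 16-byte multiple, per the comment above; this stand-in is not the actual WebGPU EP implementation:

```cpp
#include <cstddef>

// Assumed behavior: round size up to the next multiple of 16 bytes.
inline std::size_t NormalizeBufferSize(std::size_t size) {
  return (size + 15) & ~static_cast<std::size_t>(15);
}
// With a 32-byte destination buffer, copying 20 bytes at dst_offset 8:
// the raw request (8 + 20 = 28) fits, but the aligned request
// (8 + NormalizeBufferSize(20) = 8 + 32 = 40) exceeds the buffer, so the
// offset-aware ORT_ENFORCE would fire even though the caller's copy fits.
```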
```cpp
if (!device_tensors.empty()) {
  // Test original CopyTensors (backward compatible)
  ASSERT_CXX_ORTSTATUS_OK(ort_env->CopyTensors(cpu_tensors, device_tensors, stream));
```
The new CopyTensorsEx API function is not covered by any tests. The PR adds this significant new functionality but there are no test cases that exercise copying tensors with offsets or custom sizes to verify the implementation works correctly.
```cpp
/** \brief Copy OrtValue instances containing Tensors between devices with offset and size control.
 *
 * Extended version of CopyTensors that supports copying with source/destination offsets and custom sizes.
 * All offsets and sizes are in bytes.
 *
 * \param[in] env The OrtEnv instance to use.
 * \param[in] src_tensors Array of OrtValue instances containing the source tensors to copy.
 * \param[in] dst_tensors Array of OrtValue instances to copy the source tensors to.
 * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
 * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
 * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
 * \param[in] stream Optional OrtSyncStream that can be used to perform the copy asynchronously. May be nullptr.
 * \param[in] num_tensors The number of tensors to copy.
 *
 * \snippet{doc} snippets.dox OrtStatus Return Value
 *
 * \since Version 1.24
 */
```
The documentation for CopyTensorsEx should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether the implementation is expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for API consumers to understand their responsibilities and avoid undefined behavior.
fs-eire left a comment:
Adding CopyTensorsEx should probably be fine, but modifying the signature of the existing OrtDataTransferImpl::CopyTensors would probably cause backward compatibility issues.
Can you please explain why this is required? A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?
I agree. From an API point of view, we only need an API to copy a specified number of bytes from a source location to a target location. A user can add a helper function for sub-tensor copy; it doesn't need to be exposed as an API. The helper function can be shared by some EPs if it is used internally.
For WebGPU, a buffer is just a handle. Unlike CUDA, which uses a memory model that allows adding an offset to a pointer, there is no way to represent that in WebGPU.
It's up to the EP (and its data transfer implementation) to interpret the data pointer. Adding a whole new API to ORT for this sort of thing just for the WebGPU EP feels like the wrong place to be doing it. The ORT API in general deals in a granularity of tensors, not small chunks of data within a tensor. What's the use-case where you need to copy a subset of data from an OrtValue?
Two use cases in onnxruntime-genai:
What's the motivation for other EP authors to handle offset-based copy in their IDataTransfer implementation? It would be a bit of a smell that it doesn't belong in the ORT API if only the CPU and WebGPU EPs support it.

FWIW it feels a little loose to be taking arbitrary offsets and sizes given that the source and destination are Tensor instances with specific shapes. Could we use something like an axis and index for the source and target locations to ensure the copy makes sense?

If this is purely to enable some copies in genai, is another option to do that via a model, by either augmenting the original model with the model editor API or having a small helper model that is used? e.g. something like ScatterElements or ScatterND might be applicable. If the model input and output for the copy was the same buffer it should make the writes in place.

Is another alternative having a CreateTensor variant that takes an offset?
You're absolutely right. From an implementation perspective, CPU/CUDA already support creating tensors with offsets implicitly (via pointer arithmetic), while handle-based EPs like WebGPU cannot. However, from a user's perspective, we should provide a unified API that works consistently across all EPs. Your suggestions of either a CreateTensor variant that takes an offset or a CreateSubTensor that takes an OrtValue would achieve this nicely.
I agree that using axis+index would provide better type safety. If we align on the overall direction, I'm happy to work out the specific API design with you—whether that's axis/index based or offset based with proper validation.
I've prototyped the model-based approach (in fact, I'm using it for Cast operations in https://github.com/microsoft/onnxruntime-genai/pull/1895). However, for frequent small updates like CopyFrom and UpdateAttentionMask, my benchmarks show that direct copying significantly outperforms running a dedicated ONNX model, even a single-op one.
Both approaches would work well for the WebGPU use case. From the EP's perspective, they achieve the same goal: enabling partial tensor updates for both pointer-based EPs (CPU, CUDA) and handle-based EPs (WebGPU, and potentially Vulkan/Metal). To confirm the path forward: Do I understand correctly that you're supportive of adding new API functionality to enable partial tensor updates across all EPs, and we need to finalize whether to use:
Happy to prototype whichever approach you think fits best with ORT's API design principles.
One of the challenges with creating a new Tensor instance pointing to a subset of an existing tensor is the path to EP-specific logic to interpret the handle. Typically that sort of logic is in the IAllocator for the OrtDevice, but as we're not allocating in this scenario that doesn't quite fit. It doesn't quite fit to add to IDataTransfer either, given we're not doing a data transfer at that point. One option might be the new external resource importer, where you can import memory and create an OrtValue from it. Added recently in #26828. That already supports a […]. Maybe a choice between that and #3. Would like to know what others on the ORT team think about the options. @adrianlizarraga @edgchen1 @yuslepukhin
Ping @adrianlizarraga @edgchen1 @yuslepukhin. Any suggestions on Scott's comment above?
- Add CreateTensorWithDataAsOrtValueWithByteOffset to OrtApi (v26, offset 418)
  - WebGPU branch: stores raw WGPUBuffer handle and offset via CreateTensorImplWithByteOffset
  - CPU/CUDA branch: advances pointer directly before delegating to CreateTensorImpl
- Add typed (element offset) and void* (byte offset) C++ wrapper overloads in onnxruntime_cxx_api.h / onnxruntime_cxx_inline.h
- Update WebGPU DataTransfer (data_transfer.cc) to recover the base WGPUBuffer from DataRaw() - ByteOffset() for all three copy directions (GPU->GPU, CPU->GPU, GPU->CPU)
- Update LaunchComputePipeline to accept bind_buffers_byte_offsets for per-binding WGPUBindGroupEntry offsets in webgpu_context.h / webgpu_context.cc
- Remove indirect_buffer_offset from CapturedCommandInfo: indirect buffers are always allocated internally by the WebGPU EP and never have a non-zero ByteOffset()
- Add tests in test_data_copy.cc covering typed/void* offset overloads and CopyTensors
```cpp
// dst.MutableDataRaw() for a CPU tensor is a real addressable pointer.
void* dst_data = dst.MutableDataRaw();
buffer_manager_.Download(src_buf, static_cast<uint8_t*>(dst_data) + dst_offset,
                         bytes, actual_src_offset, 0);
```
```cpp
if (uniform_buffer) {
  bind_buffers.push_back(uniform_buffer);
  bind_buffers_byte_offsets.push_back(0);  // uniform buffer has no byte offset
  bind_buffers_segments.push_back(1);      // uniform buffer defaults to 1 segment
```
…fer views

CopyTensorsEx is superseded by CreateTensorWithDataAsOrtValueWithByteOffset: callers now create an offset tensor view once and pass it to the regular CopyTensors.

Changes:
- Remove OrtApi::CopyTensorsEx entry and vtable slot (offset 417 freed); CreateTensorWithDataAsOrtValueWithByteOffset moves to offset 417
- Remove the CopyTensorsEx forward declaration from ort_apis.h and both implementations (full build and minimal-build stub) from onnxruntime_c_api.cc
- Remove source_offsets/destination_offsets/sizes params from OrtDataTransferImpl::CopyTensors in onnxruntime_ep_c_api.h (EP C API)
- Revert plugin_data_transfer.cc to not propagate offset arrays
- Revert framework IDataTransfer: remove CopyTensor/CopyTensorAsync overloads with offset params; remove source_offset/destination_offset/size from SrcDstPair; simplify the CopyTensors loop back to the original; remove the CPUDataTransfer offset overload
- Revert WebGPU DataTransfer: remove the CopyTensor(+offsets) overload; fold ByteOffset recovery into the single CopyTensor(src, dst), which uses the tensor's own ByteOffset() to recover the base WGPUBuffer and pass correct offsets to BufferManager::MemCpy/Upload/Download (buffer_manager offsets kept)
- Revert webgpu_provider_factory.cc CopyTensorsImpl to the simpler signature and call CopyTensor(src, dst) directly
- Revert the example EP test files (ep_data_transfer.h/.cc ×2, utils.h)
@skottmckay, I've added a new CreateTensor interface that supports byte offsets, replacing the earlier CopyTensors-with-byte-offsets approach. This change should have no side effects for the CPU and CUDA EPs. At the same time, it enables the WebGPU EP to support sub-buffer views by allowing us to recover the original WGPUBuffer correctly.
My preferred approach is to build this work on top of the external resource importer and extend the CopyTensors() implementation to support these tensor types, without requiring any changes to the existing API signature.
My concern would be that this creates a hidden requirement for any EP with an opaque handle as its data pointer. e.g. would a WebGPU kernel break if it received a Tensor instance with an offset? That doesn't feel overly safe. If an offset can't be directly applied to the data pointer because it's opaque, would it be better to always store a tuple of <WGPUBuffer, offset> in the data pointer so the Tensor class is not aware of that?
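The tuple idea floated above can be sketched as follows: keep the Tensor's data pointer pointing at a small view struct, so generic code never applies pointer arithmetic to an opaque handle. `WGPUBufferStandIn` and the struct/function names are placeholders for illustration; none of this is actual ORT or Dawn code:

```cpp
#include <cstddef>

using WGPUBufferStandIn = void*;  // stand-in for the real WGPUBuffer handle

struct WebGpuBufferView {
  WGPUBufferStandIn buffer;  // opaque handle, never offset directly
  std::size_t byte_offset;   // interpreted only by WebGPU-aware code
};

// A WebGPU-aware consumer unpacks the view instead of doing pointer math.
inline std::size_t EffectiveOffset(const void* tensor_data) {
  return static_cast<const WebGpuBufferView*>(tensor_data)->byte_offset;
}
```

The trade-off is that every WebGPU code path consuming a tensor's data pointer must know to unpack the view, but the Tensor class itself stays offset-unaware.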
Using a […]

Can we add a new method like […]?

EDIT: another proposal is to modify the definition of […].

EDIT2: it seems that currently only training code uses offset, so either way is unlikely to break existing inference usage.
It's more the future usage I'm concerned about. You're adding a new function to the public API to create a Tensor that has an offset. Anyone can call that to create an OrtValue that is used during inferencing. IIUC, any EP that has an opaque value in […]
Problem
The existing `CreateTensorWithDataAsOrtValue` API supports sub-buffer views and zero-copy transfers by accepting a pre-advanced pointer: the caller performs `(char*)p_data + byte_offset` before passing it in. This works for CPU and CUDA, where `p_data` is a real, addressable memory pointer.

However, for backends such as WebGPU, `p_data` is an opaque, non-addressable buffer handle (`WGPUBuffer`). Applying pointer arithmetic to such a handle produces a corrupted value that is no longer a valid buffer handle, making it impossible to express sub-buffer tensor views or perform zero-copy transfers over regions of a larger GPU allocation.

Solution

This PR introduces a new C API entry point, `CreateTensorWithDataAsOrtValueWithByteOffset`.
Instead of advancing the pointer before creating the tensor, the byte offset is stored directly in the tensor's internal `byte_offset_` field. For addressable backends (CPU, CUDA), `DataRaw()` / `MutableData()` already apply the offset via `(char*)p_data_ + byte_offset_`, so behavior is identical to the old approach. For non-addressable backends (WebGPU), the raw handle remains intact in `p_data_`, and the WebGPU DataTransfer and kernel dispatch code recover the correct buffer region at runtime using the stored `ByteOffset()`.
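The addressable-backend behavior described above can be captured in a minimal sketch: the byte offset lives in a `byte_offset_` field and is applied at access time, so the stored pointer itself stays un-advanced. The field and method names mirror the description; this is illustrative, not ORT's actual Tensor class:

```cpp
#include <cstddef>

struct TensorSketch {
  void* p_data_;             // base pointer (or, for WebGPU, an opaque handle)
  std::size_t byte_offset_;  // offset stored separately, never folded into p_data_
  // Addressable backends apply the offset on access:
  const void* DataRaw() const {
    return static_cast<const char*>(p_data_) + byte_offset_;
  }
  // Handle-based backends instead read the offset explicitly:
  std::size_t ByteOffset() const { return byte_offset_; }
};
```

Because `DataRaw()` already folds in the offset, CPU/CUDA callers see exactly the same addresses as with the old pre-advanced-pointer approach, while WebGPU code can keep `p_data_` as a valid handle and pass `ByteOffset()` to the buffer manager.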