diff --git a/index.bs b/index.bs
index b55bc9f5..375f22fd 100644
--- a/index.bs
+++ b/index.bs
@@ -401,7 +401,7 @@ As a future-proofing measure, the API design allows certain operations that can

Issue: Investigate side channel attack feasibility considering the current state where CPU is shared between processes running renderers.

-In order to not allow an attacker to target a specific implementation that may contain a flaw, the [[#programming-model-device-selection]] mechanism is a hint only, and the concrete device selection is left to the implementation - a user agent could for instance choose never to run a model on a device with known vulnerabilities. As a further mitigation, no device enumeration mechanism is defined.
+In order to not allow an attacker to target a specific implementation that may contain a flaw, the [[#programming-model-context-device-association]] mechanism is a hint only, and the concrete device selection is left to the implementation - a user agent could for instance choose never to run a model on a device with known vulnerabilities. As a further mitigation, no device enumeration mechanism is defined.

Issue: Hinting partially mitigates the concern. Investigate additional mitigations.

@@ -442,9 +442,7 @@ Unlike WebGPU, this API does not intrinsically support custom shader authoring;

The WebGPU API identifies machine-specific artifacts as a privacy consideration. Given the WebNN API defines means to record an ML workload onto a WebGPU-compatible {{GPUCommandBuffer}}, compute unit scheduling may under certain circumstances introduce a fingerprint. However, similarly to WebGPU, such fingerprints are identical across most or all of the devices of each vendor, mitigating the concern. Furthermore, software implementations can be used to further eliminate such artifacts.
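For orientation, the hint-only power preference discussed in this hunk is surfaced through {{MLContextOptions}}. A minimal usage sketch follows; the call shape is illustrative, whether `navigator.ml.createContext()` returns the context directly or a promise has varied across drafts, and a WebNN-capable user agent is assumed:

```javascript
// Sketch: request a context with a power-consumption hint.
// The hint is advisory only; the user agent makes the final device
// choice, so the hint does not add fingerprinting entropy.
function createLowPowerContext() {
  // 'default', 'low-power', and 'high-performance' are the
  // MLPowerPreference values in this era of the spec.
  return navigator.ml.createContext({ powerPreference: 'low-power' });
}
```

Because the user agent may ignore the hint, callers should not assume a particular device was selected.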
-The WebNN API defines two developer-settable preferences to help inform [[#programming-model-device-selection]] and allow the implementation to better select the most appropriate underlying execution device for the workload. [=Device type=] normatively indicates the kind of device and is either "cpu" or "gpu". If this type cannot be satisfied, an "{{OperationError}}" {{DOMException}} is thrown, thus this type can in some cases add two bits of entropy to the fingerprint. [=Power preference=] indicates preference as related to the power consumption and is considered a hint only and as such does not increase entropy of the fingerprint.
-
-If a future version of this specification introduces support for new a [=device type=] that can only support a subset of {{MLOperandType}}s, that may introduce a new fingerprint.
+The WebNN API defines a developer-settable preference to help inform [[#programming-model-context-device-association]] and allow the implementation to select the most appropriate underlying execution device for the workload. [=Power preference=] indicates a preference related to power consumption; it is a hint only and as such does not increase the entropy of the fingerprint.

In general, implementers of this API are expected to apply WebGPU Privacy Considerations to their implementations where applicable.

@@ -491,49 +489,58 @@ that shares the same buffer as the input tensor. (In the case of reshape or sque
the entire data is shared, while in the case of slice, a part of the input data is shared.) The implementation may use views, as above, for intermediate values.

-Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized. The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.
+Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized.
+The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.

-There are multiple ways by which the graph may be compiled. The {{MLGraphBuilder}}.{{MLGraphBuilder/build()}} method compiles the graph in the background without blocking the calling thread, and returns a {{Promise}} that resolves to an {{MLGraph}}. The {{MLGraphBuilder}}.{{MLGraphBuilder/buildSync()}} method compiles the graph immediately on the calling thread, which must be a worker thread running on CPU or GPU device, and returns an {{MLGraph}}. Both compilation methods produce an {{MLGraph}} that represents a compiled graph for optimal execution.
+There are multiple ways by which the graph may be compiled. The {{MLGraphBuilder}}.{{MLGraphBuilder/build()}} method compiles the graph
+in the background without blocking the calling thread, and returns a {{Promise}} that resolves to an {{MLGraph}}. The
+{{MLGraphBuilder}}.{{MLGraphBuilder/buildSync()}} method compiles the graph immediately on the calling thread, which must be a worker
+thread, and returns an {{MLGraph}}. Both compilation methods produce an {{MLGraph}} that represents a compiled graph for optimal execution.

Once the {{MLGraph}} is constructed, there are multiple ways by which the graph may be executed. The
{{MLContext}}.{{MLContext/computeSync()}} method represents a way the execution of the graph is carried out immediately
-on the calling thread, which must also be a worker thread, either on a CPU or GPU device. The execution
-produces the results of the computation from all the inputs bound to the graph.
+on the calling thread, which must also be a worker thread. The execution produces the results of the computation from all the inputs bound to the graph.
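The build()/buildSync() split described in this hunk can be sketched as follows. Operand names, shapes, and the operand descriptor shape are illustrative, and a WebNN-capable user agent is assumed:

```javascript
// Sketch: describe a small graph c = a + b and compile it.
async function compileGraph(context) {
  const builder = new MLGraphBuilder(context);
  const desc = { type: 'float32', dimensions: [2, 2] };
  const a = builder.input('a', desc);
  const b = builder.input('b', desc);
  const c = builder.add(a, b);
  // build() compiles in the background without blocking the calling
  // thread and resolves to an MLGraph.
  // On a worker thread, builder.buildSync({ c }) would instead
  // compile immediately on the calling thread.
  const graph = await builder.build({ c });
  return graph;
}
```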
-The {{MLContext}}.{{MLContext/compute()}} method represents a way the execution of the graph is performed asynchronously
-either on a parallel timeline in a separate worker thread for the CPU execution or on a GPU timeline in a GPU
-command queue. This method returns immediately without blocking the calling thread while the actual execution is
-offloaded to a different timeline. This type of execution is appropriate when the responsiveness of the calling
-thread is critical to good user experience. The computation results will be placed at the bound outputs at the
-time the operation is successfully completed on the offloaded timeline at which time the calling thread is
-signaled. This type of execution supports both the CPU and GPU device.
+The {{MLContext}}.{{MLContext/compute()}} method represents a way the execution of the graph is performed asynchronously either on a parallel
+timeline in a CPU worker thread or on a GPU timeline executing a GPU command queue. This method returns immediately without blocking the calling
+thread while the actual execution is offloaded to a different timeline. This type of execution is appropriate when the responsiveness of the calling
+thread is critical to good user experience. The computation results will be placed at the bound outputs at the time the operation is successfully
+completed on the offloaded timeline, at which time the calling thread is signaled.

In both the {{MLContext}}.{{MLContext/compute()}} and {{MLContext}}.{{MLContext/computeSync()}} execution methods, the caller supplies
-the input values using {{MLNamedArrayBufferViews}}, binding the input {{MLOperand}}s to their values. The caller
-then supplies pre-allocated buffers for output {{MLOperand}}s using {{MLNamedArrayBufferViews}}.
+the input values using {{MLNamedArrayBufferViews}}, binding the input {{MLOperand}}s to their values. The caller then supplies pre-allocated
+buffers for output {{MLOperand}}s using {{MLNamedArrayBufferViews}}.
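The compute() flow with named input and output buffers might look like the following sketch. It assumes results are written into the caller's pre-allocated buffers, as this hunk describes; the operand names match the earlier compilation sketch, and a WebNN-capable user agent is required to actually run it:

```javascript
// Sketch: execute a compiled MLGraph, binding inputs and
// pre-allocated outputs by name (MLNamedArrayBufferViews).
async function runGraph(context, graph) {
  const inputs = {
    a: new Float32Array([1, 2, 3, 4]),
    b: new Float32Array([5, 6, 7, 8]),
  };
  // The caller supplies pre-allocated buffers for the outputs.
  const outputs = { c: new Float32Array(4) };
  // Asynchronous execution: the calling thread is not blocked, and
  // the results are placed at the bound outputs once the offloaded
  // timeline completes. On a worker thread,
  // context.computeSync(graph, inputs, outputs) would block instead.
  await context.compute(graph, inputs, outputs);
  return outputs.c;
}
```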
The {{MLCommandEncoder}} interface created by the {{MLContext}}.{{MLContext/createCommandEncoder()}} method supports a graph execution method that provides the maximum flexibility to callers that also utilize WebGPU in their application. It does this by placing the workload required to initialize and compute the results of the operations in the graph onto a {{GPUCommandBuffer}}. The callers are responsible for the eventual submission of this workload on the {{GPUQueue}} through the WebGPU queue submission mechanism. Once the submitted workload
-is completely executed, the result is avaialble in the bound output buffers.
-
-## Device Selection ## {#programming-model-device-selection}
+is completely executed, the result is available in the bound output buffers.

-An {{MLContext}} interface represents a global state of neural network execution. One of the important context states is the underlying execution device that manages the resources and facilitates the compilation and the eventual execution of the neural network graph. In addition to the default method of creation with {{MLContextOptions}}, an {{MLContext}} could also be created from a specific {{GPUDevice}} that is already in use by the application, in which case the corresponding {{GPUBuffer}} resources used as graph constants, as well as the {{GPUTexture}} as graph inputs must also be created from the same device. In a multi-adapter configuration, the device used for {{MLContext}} must be created from the same adapter as the device used to allocate the resources referenced in the graph.
+## Context and Device Association ## {#programming-model-context-device-association}

-In a situation when a GPU context executes a graph with a constant or an input in the system memory as an {{ArrayBufferView}}, the input content is automatically uploaded from the system memory to the GPU memory, and downloaded back to the system memory of an {{ArrayBufferView}} output buffer at the end of the graph execution.
This data upload and download cycles will only occur whenever the execution device requires the data to be copied out of and back into the system memory, such as in the case of the GPU. It doesn't occur when the device is a CPU device. Additionally, the result of the graph execution is in a known layout format. While the execution may be optimized for a native memory access pattern in an intermediate result within the graph, the output of the last operation of the graph must convert the content back to a known layout format at the end of the graph in order to maintain the expected behavior from the caller's perspective.

+An {{MLContext}} interface represents the state of a neural network execution. An important function of this state is to manage resources used in the compilation
+and execution of the neural network graph. A user agent may implement this state in terms of CPU resources and execution when an {{MLContext}}
+is created without an explicit association with a hardware device (a *"default context"*). However, when an {{MLContext}} is created from a WebGPU {{GPUDevice}}
+(a *"GPU context"*), the implementation uses the specified GPU device as a resource domain for the subsequent compilation and execution of the graph.
+Any GPU resource, such as a {{GPUBuffer}} or {{GPUTexture}} created from the same {{GPUDevice}}, is therefore considered a native resource that
+can be used to store a graph constant, input, or output operand.

-When an {{MLContext}} is created with {{MLContextOptions}}, the user agent selects and creates the underlying execution device by taking into account the application's [=power preference=] and [=device type=] specified in the {{MLPowerPreference}} and {{MLDeviceType}} options.
+When a GPU context executes a graph with a constant, input, or output allocated in the system memory as an {{ArrayBufferView}}, the content
+is automatically uploaded from the system memory to the GPU memory, and downloaded back to the system memory of an {{ArrayBufferView}} output buffer at the end of
+the graph execution. These automatic data upload and download cycles only occur when the executing GPU context determines that the data must be copied out of or back
+into the system memory as part of the execution. Additionally, the eventual result of the execution must also be in a known layout format. While the internal execution
+technique may be optimized for a native memory access pattern in an intermediate result within the graph, the output of the last operation of the graph must convert
+the content back to a known layout format at the end of the graph in order to maintain interoperability as expected by the caller.

-The following table summarizes the types of resource supported by the context created through different method of creation:
+The following table summarizes the types of resource supported by the context created through different methods of creation:
-| Creation method | ArrayBufferView | GPUBuffer | GPUTexture |
-|---|---|---|---|
-| MLContextOptions | Yes | No | No |
-| GPUDevice | Yes | Yes | Yes |
+| Context Type | ArrayBufferView | GPUBuffer | GPUTexture |
+|---|---|---|---|
+| Default Context | Yes | No | No |
+| GPU Context | Yes | Yes | Yes |
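The two context types summarized in the table can be illustrated with a creation sketch. It assumes both WebGPU and WebNN support in the user agent; the `createContext()` overloads shown follow this draft's description, and the exact shapes may differ across revisions:

```javascript
// Sketch: the two context types from the table above.
async function createContexts() {
  // Default context: no explicit device association, so only
  // ArrayBufferView resources can be bound to the graph.
  const defaultContext = navigator.ml.createContext();

  // GPU context: created from an existing WebGPU GPUDevice and
  // sharing its resource domain, so GPUBuffer and GPUTexture
  // resources created from that device can also be bound.
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter.requestDevice();
  const gpuContext = navigator.ml.createContext(device);

  return { defaultContext, gpuContext };
}
```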