diff --git a/docs/_static/img/arch.png b/docs/_static/img/arch.png
new file mode 100644
index 000000000..72a0c42ec
Binary files /dev/null and b/docs/_static/img/arch.png differ
diff --git a/docs/_static/img/qnn-trace-execute-seq.png b/docs/_static/img/qnn-trace-execute-seq.png
new file mode 100644
index 000000000..784ef2bb0
Binary files /dev/null and b/docs/_static/img/qnn-trace-execute-seq.png differ
diff --git a/docs/arch/arch.rst b/docs/arch/arch.rst
new file mode 100644
index 000000000..9989913d2
--- /dev/null
+++ b/docs/arch/arch.rst
@@ -0,0 +1,346 @@
MLLM Framework Core Architecture
================================

Overview
--------

The MLLM framework employs a hierarchical execution model with three main components:

* **Module**: High-level abstraction for neural network modules
* **Layer**: Abstraction for individual operations/layers
* **Dispatcher**: Execution engine for different backends

This architecture supports both regular operation execution and intermediate representation (IR) tracing workflows, enabling flexible deployment across multiple hardware backends, including CPU, QNN (also known as Qualcomm AI Engine Direct/QAIRT), and custom accelerators.

.. figure:: ../_static/img/arch.png
   :width: 100%
   :alt: Overview
   :align: center

   Figure 1: MLLM Framework Core Architecture.

Core Components
---------------

Module
~~~~~~

The ``Module`` class serves as the top-level container for neural network components. Key responsibilities include:

* **Hierarchical Organization**: Modules can contain other modules and layers, forming a tree structure
* **Parameter Management**: Loading and saving model parameters from/to files
* **Device Management**: Moving modules and their components across different devices
* **Forward Execution**: Orchestrating the execution flow through child components

**Key Methods:**

.. code-block:: cpp

   class Module {
     std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
                                 const std::vector<...>& args);
     void to(DeviceTypes device_type);
     void load(const ParameterFile::ptr_t& param_file);

     // Named module registration (similar to PyTorch's named_modules)
     template <typename T, typename... Args>
     auto reg(const std::string& name, Args&&... args);
   };

**Named Module Registration:**

The ``reg()`` method provides functionality similar to PyTorch's ``named_modules()``, enabling hierarchical module organization with automatic name management in C++:

.. code-block:: cpp

   class MyModel : public nn::Module {
    public:
     MyModel(const std::string& name) : nn::Module(name) {
       // Register sub-modules with names
       encoder_ = reg<EncoderModule>("encoder", config);
       decoder_ = reg<DecoderModule>("decoder", config);

       // Register layers with names
       linear1_ = reg<nn::Linear>("fc1", 768, 3072, false);
       linear2_ = reg<nn::Linear>("fc2", 3072, 768, false);
     }

    private:
     EncoderModule encoder_;  // Absolute name: "model.encoder"
     DecoderModule decoder_;  // Absolute name: "model.decoder"
     nn::Linear linear1_;     // Absolute name: "model.fc1"
     nn::Linear linear2_;     // Absolute name: "model.fc2"
   };

**Key Features:**

* **Automatic Name Hierarchy**: Constructs fully-qualified names (e.g., ``"model.encoder.layer0.attention"``)
* **Parameter Mapping**: Links module names to parameter files for loading/saving
* **Device Management**: Enables selective device placement by module name
* **Type Safety**: Template-based registration with compile-time type checking

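A typical caller only touches these few methods. The sketch below shows the intended flow under stated assumptions: the ``ParameterFile::load()`` factory, the header path, and the input shape are placeholders inferred from the snippets in this document, not verified signatures.

.. code-block:: cpp

   // Hypothetical usage sketch of the Module API quoted above.
   #include "mllm/nn/Module.hpp"   // assumed header layout

   int main() {
     // "model" becomes the root of every absolute name built by reg().
     MyModel model("model");

     // Parameter entries such as "model.fc1.weight" are resolved against
     // the names produced by reg(). ParameterFile::load() is assumed here.
     auto params = ParameterFile::load("my_model.mllm");
     model.load(params);

     // Place the whole module tree on the CPU backend.
     model.to(DeviceTypes::kCPU);

     // Run one forward pass with a single input tensor.
     std::vector<Tensor> inputs = {Tensor::empty({1, 768}, kFloat32, kCPU).alloc()};
     auto outputs = model.forward(inputs, /*args=*/{});
     return 0;
   }
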
**Comparison with PyTorch:**

.. code-block:: python

   # PyTorch
   class MyModel(nn.Module):
       def __init__(self):
           super().__init__()
           self.encoder = EncoderModule()  # Automatically named "encoder"
           self.decoder = DecoderModule()  # Automatically named "decoder"

   # Print all named modules
   model = MyModel()
   for name, module in model.named_modules():
       print(f"{name}: {module}")

.. code-block:: cpp

   // MLLM Framework
   class MyModel : public nn::Module {
    public:
     MyModel(const std::string& name) : nn::Module(name) {
       encoder_ = reg<EncoderModule>("encoder");  // Explicitly named "encoder"
       decoder_ = reg<DecoderModule>("decoder");  // Explicitly named "decoder"
     }
   };

   // Names are automatically constructed: "model.encoder", "model.decoder"
   // Used for parameter loading: params->load("model.encoder.weight")

The ``reg()`` method bridges the gap between Python's dynamic attribute naming and C++'s static type system, providing a clean API for building hierarchical neural networks.

Layer Abstraction
~~~~~~~~~~~~~~~~~

The ``Layer`` class represents an individual operation or layer within a module:

* **Operation Encapsulation**: Wraps backend-specific operations (``BaseOp``)
* **Device Abstraction**: Handles operation instantiation for different backends
* **Task Creation**: Creates execution tasks for the dispatcher system

**Key Methods:**

.. code-block:: cpp

   class Layer {
     std::vector<Tensor> __main(const std::vector<Tensor>& inputs);
     Layer& to(DeviceTypes device_type);
     OpTypes opType() const;
   };

Dispatcher System
~~~~~~~~~~~~~~~~~

The dispatcher system provides backend-specific execution engines:

**CPUDispatcher**
   Handles CPU-based operation execution with the full operation lifecycle:

   * ``reshape()``: Tensor shape computation
   * ``setup()``: Operation initialization
   * ``forward()``: Actual computation

**IRTraceDispatcher**
   Captures execution traces for IR generation:

   * Records operation calls and tensor flows
   * Enables graph optimization and analysis
   * Supports compilation workflows

**QNNDispatcher**
   Manages QNN backend execution:

   * Specialized for QNN graph execution
   * Handles module-level execution for QNN graphs
   * Selective operation execution (X2X, Embedding ops)

Execution Workflows
-------------------

Op Execution Workflow
~~~~~~~~~~~~~~~~~~~~~

The standard execution path for neural network inference:

.. code-block:: text

   Module::forward()
     │
     ├─── Module::__main()
     │      │
     │      ├─── Task::createExecuteModuleTask()
     │      │
     │      └─── DispatcherManager::submit()
     │             │
     │             └─── [CPU|QNN]Dispatcher::receive()
     │                    │
     │                    └─── [CPU|QNN]Dispatcher::process()
     │
     └─── Layer::__main()
            │
            ├─── Task::createExecuteOpTask()
            │
            └─── DispatcherManager::submit()
                   │
                   └─── [CPU|QNN]Dispatcher::receive()
                          │
                          └─── [CPU|QNN]Dispatcher::process()
                                 │
                                 ├─── Op::reshape()
                                 ├─── Op::setup()
                                 └─── Op::forward()

**Execution Flow Details:**

1. **Module Entry**: ``Module::forward()`` is called with input tensors
2. **Task Creation**: Creates ``kExecuteModule`` or ``kExecuteOp`` tasks
3. **Dispatcher Selection**: Routes to the appropriate backend dispatcher based on device type
4. **Backend Processing**: The dispatcher executes the operation using backend-specific logic
5. **Result Return**: Output tensors are returned through the task system

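To make the per-operation path concrete, the sketch below shows what a minimal synchronous dispatcher could look like, following the reshape/setup/forward lifecycle described for ``CPUDispatcher`` above. The override set and field names are assumptions drawn from the snippets in this document; the asynchronous entry point is omitted here and discussed later.

.. code-block:: cpp

   // Hypothetical sketch of a synchronous backend dispatcher.
   #include <cassert>

   class MyCPUDispatcher : public Dispatcher {
    public:
     void receive(const Task::ptr_t& task) override {
       // Synchronous mode: block until the task has been processed.
       process(task);
     }

     void process(const Task::ptr_t& task) override {
       assert(task->type == TaskTypes::kExecuteOp);
       // 1. Derive output shapes from the input shapes.
       task->op->reshape(task->inputs, task->outputs);
       // 2. One-time initialization (buffer allocation, weight packing, ...).
       task->op->setup(task->inputs, task->outputs);
       // 3. Run the actual computation.
       task->op->forward(task->inputs, task->outputs);
     }

     void syncWait() override {
       // Nothing to wait for: every task completed inside receive().
     }
   };
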
IR Execution Workflow
~~~~~~~~~~~~~~~~~~~~~

When trace mode is enabled, the framework captures an intermediate representation:

.. code-block:: text

   Module::forward()  [trace_mode=true]
     │
     ├─── Module::__trace()
     │      │
     │      ├─── IRContext::create()   (SubGraphOp / CallGraphOp nodes)
     │      ├─── IRContext::create()
     │      │
     │      ├─── Module::forward()  [recursive]
     │      │
     │      └─── IRContext::create()   (ReturnOp finalizes the subgraph)
     │
     └─── Layer::__main()  [trace_mode=true]
            │
            ├─── Task::createExecuteOpTask()
            │      └─── task->custom_context_ptr = ir_context
            │
            └─── IRTraceDispatcher::receive()
                   │
                   └─── IRTraceDispatcher::process()
                          │
                          ├─── Op::reshape()
                          └─── Op::trace()

**IR Workflow Details:**

1. **Trace Initialization**: ``Context::thisThread()->trace_mode`` enables IR capture
2. **Graph Construction**: Creates IR graph nodes (``CallGraphOp``, ``SubGraphOp``)
3. **Operation Tracing**: Each operation call is recorded in the IR graph
4. **Graph Completion**: ``ReturnOp`` finalizes the subgraph structure
5. **IR Output**: The complete computational graph is available for optimization/compilation

For more details on IR tracing and compilation, refer to the :doc:`MLLM IR <../compile/ir>` section.

Synchronous vs Asynchronous Execution
-------------------------------------

Synchronous Execution
~~~~~~~~~~~~~~~~~~~~~

Currently, the primary execution mode uses synchronous task processing:

.. code-block:: cpp

   // In Dispatcher::receive()
   void CPUDispatcher::receive(const Task::ptr_t& task) {
     process(task);  // Blocks until completion
   }

**Characteristics:**

* **Blocking Operation**: Each task completes before returning
* **Simple Flow Control**: Sequential execution guarantees
* **Immediate Results**: Output tensors are available immediately after task submission

Asynchronous Execution (Future Enhancement)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The framework includes infrastructure for asynchronous execution:

.. code-block:: cpp

   // In Dispatcher::asyncReceive()
   TaskResult::sender_t CPUDispatcher::asyncReceive(const Task::ptr_t& task) {
     auto scheduler = thread_pool_.get_scheduler();
     return stdexec::schedule(scheduler) |
            stdexec::then([this, task] { process(task); });
   }

**Design Features:**

* **Non-blocking Submission**: Tasks return immediately with a sender/future
* **Thread Pool Integration**: Uses ``exec::static_thread_pool`` for parallel execution
* **Sender/Receiver Pattern**: Based on the C++26 sender/receiver async model
* **Pipeline Capability**: Enables operation pipelining and overlapping

**Current Status:**

The asynchronous execution path is implemented but not fully integrated:

* ``IRTraceDispatcher::asyncReceive()`` returns an error
* Most dispatchers have placeholder async implementations
* Synchronization points (``syncWait()``) are not fully implemented

Task System Architecture
------------------------

The task system provides a unified interface for operation execution:

**Task Types:**

* ``kExecuteOp``: Single operation execution
* ``kExecuteModule``: Module-level execution (for QNN graphs)

**Task Structure:**

.. code-block:: cpp

   struct Task {
     TaskTypes type;
     BaseOp::ptr_t op;             // Operation to execute
     std::vector<Tensor> inputs;   // Input tensors
     std::vector<Tensor> outputs;  // Output tensors
     std::vector<...> args;        // Additional arguments
     void* custom_context_ptr;     // Backend-specific context
   };

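Tasks flow through dispatchers either synchronously or, in the future, asynchronously. The sketch below is a hedged illustration of how the asynchronous path could be driven once it is fully integrated; only ``asyncReceive()``, ``syncWait()``, and ``stdexec::sync_wait()`` are taken from APIs mentioned in this document or the stdexec reference implementation.

.. code-block:: cpp

   // Hypothetical sketch: driving the asynchronous path with stdexec.
   #include <stdexec/execution.hpp>

   void runAsync(Dispatcher& dispatcher, const Task::ptr_t& task) {
     // Submit without blocking; the dispatcher returns a sender.
     auto work = dispatcher.asyncReceive(task);

     // Compose with further senders, or block here for the result.
     stdexec::sync_wait(std::move(work));

     // Once per-dispatcher synchronization is implemented, a coarse
     // barrier would look like this instead:
     // dispatcher.syncWait();
   }
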
**Dispatcher Interface:**

.. code-block:: cpp

   class Dispatcher {
     virtual void receive(const Task::ptr_t& task) = 0;
     virtual TaskResult::sender_t asyncReceive(const Task::ptr_t& task) = 0;
     virtual void process(const Task::ptr_t& task) = 0;
     virtual void syncWait() = 0;
   };

Backend Integration
-------------------

The framework supports multiple execution backends through the dispatcher pattern:

**CPU Backend**
   * Full operation support with the reshape/setup/forward lifecycle
   * Direct tensor computation on CPU
   * Perfetto tracing integration for performance analysis

**QNN Backend**
   * Optimized execution for Qualcomm Neural Processing Units
   * Graph-level execution for improved performance
   * Selective operation fallback to CPU when needed

**IR Tracing Backend**
   * Captures computational graphs for analysis and optimization
   * Enables ahead-of-time compilation workflows
   * Supports graph transformation and optimization passes

This architecture provides a flexible foundation for deploying neural networks across diverse hardware platforms while maintaining a consistent programming interface.
\ No newline at end of file
diff --git a/docs/arch/index.rst b/docs/arch/index.rst
index 3cea39e0c..619bc019c 100644
--- a/docs/arch/index.rst
+++ b/docs/arch/index.rst
@@ -4,5 +4,7 @@ Architectures
 .. toctree::
    :maxdepth: 2
 
+   arch
    tensor
    support_ops
+   op_plugin_system
diff --git a/docs/arch/op_plugin_system.rst b/docs/arch/op_plugin_system.rst
new file mode 100644
index 000000000..fedf53592
--- /dev/null
+++ b/docs/arch/op_plugin_system.rst
@@ -0,0 +1,2 @@
Op Plugin System
================
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
index 8135b933f..ddbd668a0 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -346,6 +346,11 @@ Documents
 
    cpu_backend/index
 
+.. toctree::
+   :maxdepth: 2
+
+   qnn_backend/index
+
 .. toctree::
    :maxdepth: 2
 
diff --git a/docs/qnn_backend/core_design.rst b/docs/qnn_backend/core_design.rst
new file mode 100644
index 000000000..1822b7d0c
--- /dev/null
+++ b/docs/qnn_backend/core_design.rst
@@ -0,0 +1,785 @@
QNN Backend Design
==================

Overview
--------

The QNN (Qualcomm Neural Network) backend provides optimized execution of neural network models through Qualcomm's AI Engine Direct SDK (formerly the QNN SDK, the successor to SNPE). This backend enables efficient deployment on Qualcomm-powered devices, including smartphones, embedded systems, and edge AI platforms.

**Key Features:**

* **Hardware Acceleration**: Leverages Qualcomm's Hexagon DSP and HTP (Hexagon Tensor Processor)
* **Graph-Level Optimization**: Executes entire subgraphs as optimized QNN graphs
* **Mixed Precision Support**: INT8/INT16 quantization with dynamic scale propagation
* **Context Caching**: Serializes compiled graphs to binary format for fast loading
* **Custom Operations**: Extensible custom op support through QNN op packages

.. figure:: ../_static/img/qnn-trace-execute-seq.png
   :width: 90%
   :alt: Overview
   :align: center

   Figure 1: QNN Backend Execution Sequence.


Architecture Components
-----------------------

The QNN backend architecture consists of several key components working together:

.. code-block:: text

   ┌──────────────────────────────────────────────────────────────┐
   │                        MLLM Framework                        │
   │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
   │   │    Module    │    │    Layer     │    │  Dispatcher  │   │
   │   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘   │
   └──────────┼───────────────────┼───────────────────┼───────────┘
              │                   │                   │
              └───────────────────┴───────────────────┘
                                  │
   ┌──────────────────────────────▼───────────────────────────────┐
   │                  QNN Backend Infrastructure                   │
   │                                                               │
   │   ┌────────────────────────────────────────────────────────┐ │
   │   │              QNNBackend (Core Manager)                 │ │
   │   │   - Runtime Management      - Context Management       │ │
   │   │   - Graph Registry          - Tensor Management        │ │
   │   └─────────┬──────────────────────────────────────────────┘ │
   │             │                                                 │
   │    ┌────────┴─────────┬───────────────┬──────────────────┐   │
   │    ▼                  ▼               ▼                  ▼   │
   │  QNNRuntime        QNNModel     QNNDispatcher   QNNGraphBuildPass
   │  (SDK Interface)  (Graph Mgmt)  (Execution)     (Compilation) │
   │                                                               │
   └──────────────────────────────┬────────────────────────────────┘
                                  │
   ┌──────────────────────────────▼────────────────────────────────┐
   │                       Qualcomm QNN SDK                         │
   │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
   │   │ QNN Interface│    │ QNN Context  │    │  QNN Graph   │     │
   │   └──────────────┘    └──────────────┘    └──────────────┘     │
   │                                                                 │
   │   ┌─────────────────────────────────────────────────────────┐  │
   │   │             Hardware Backends (HTP/DSP)                 │  │
   │   └─────────────────────────────────────────────────────────┘  │
   └─────────────────────────────────────────────────────────────────┘

QNNBackend: Core Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The ``QNNBackend`` class serves as the central orchestrator for QNN operations:

**Responsibilities:**

* **Runtime Initialization**: Manages QNN SDK initialization and device configuration
* **Context Management**: Creates and maintains QNN execution contexts
* **Graph Registry**: Maps graph names to ``QNNModel`` instances
* **Tensor Management**: Handles tensor creation, quantization, and data transfer
* **Performance Tuning**: Configures power profiles and performance settings

**Key Methods:**

.. code-block:: cpp

   class QNNBackend : public Backend {
     // Graph lifecycle management
     std::shared_ptr<QNNModel> createQnnGraph(const std::string& graphName);
     bool graphFinalize(const std::string& graphName);
     void graphExecute(const std::string& graphName,
                       std::vector<Tensor>& inputs,
                       std::vector<Tensor>& outputs);

     // Tensor management
     bool addTensor(const std::string& graphName,
                    const std::string& tensorName,
                    Qnn_TensorType_t type,
                    const Tensor& tensor,
                    Qnn_QuantizeParams_t quantize);

     // Component access
     const QNN_INTERFACE_VER_TYPE& qnnInterface() const;
     Qnn_BackendHandle_t backendHandle() const;
     Qnn_ContextHandle_t context() const;
   };

QNNRuntime: SDK Interface Layer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Manages low-level QNN SDK initialization and resource lifecycle:

**Components:**

* **Interface Loading**: Dynamically loads QNN library symbols
* **Backend Selection**: Initializes the appropriate backend (HTP/DSP/GPU)
* **Device Management**: Configures device-specific settings
* **Logging & Profiling**: Optional debug and performance profiling

**Initialization Flow:**

.. code-block:: cpp

   // Create runtime with profiling
   auto runtime = QNNRuntime::create(
       ProfilingLevel::BASIC,   // Enable profiling
       QNN_LOG_LEVEL_WARN       // Set log level
   );

   // Create execution context
   Qnn_ContextHandle_t context;
   runtime->createContext(context);

QNNModel: Graph Management
~~~~~~~~~~~~~~~~~~~~~~~~~~

Represents a single QNN computational graph with complete lifecycle management:

**Graph Lifecycle:**

1. **Initialization**: Create the graph with a name and configuration
2. **Tensor Addition**: Register input/output/intermediate tensors
3. **Node Addition**: Add QNN operations with parameters
4. **Finalization**: Compile and optimize the graph
5. **Execution**: Run inference with input data

**Key Operations:**

.. code-block:: cpp

   class QNNModel {
     // Initialization
     ModelError_t initialize(const Qnn_ContextHandle_t& context,
                             const char* graphName,
                             bool debug);

     // Tensor management
     ModelError_t addTensor(const std::string& tensorName,
                            Qnn_TensorType_t type,
                            const Tensor& tensor,
                            Qnn_QuantizeParams_t quantize);

     ModelError_t addStaticTensor(const std::string& tensorName,
                                  const Tensor& tensor,
                                  Qnn_QuantizeParams_t quantize);

     std::shared_ptr<QNNTensorWrapper> getTensorWrapper(
         const std::string& tensorName);

     // Node management
     ModelError_t addNode(Qnn_OpConfigVersion_t version,
                          const std::string& name,
                          const std::string& packageName,
                          const std::string& type,
                          const std::vector<...>& tensorParams,
                          const std::vector<...>& scalarParams,
                          const std::vector<std::string>& inputNames,
                          const std::vector<std::string>& outputNames);

     // Finalization and execution
     ModelError_t finalizeGraph(Qnn_ProfileHandle_t profileHandle,
                                Qnn_SignalHandle_t signalHandle);

     bool isGraphFinalized() const;
   };

**Tensor Wrappers:**

The backend uses C++ RAII wrappers to manage QNN's C-style resources:

* ``QNNTensorWrapper``: Manages tensor metadata and data buffers
* ``QNNParamTensorWrapper``: Wraps constant tensor parameters
* ``QNNParamScalarWrapper``: Wraps scalar parameters

QNNDispatcher: Execution Engine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Handles task execution routing between CPU and QNN:

**Execution Strategy:**

.. code-block:: cpp

   void QNNDispatcher::process(const Task::ptr_t& task) {
     switch (task->type) {
       case TaskTypes::kExecuteOp: {
         // Selective execution: only X2X and Embedding run on QNN
         task->op->reshape(task->inputs, task->outputs);
         if (task->op->getOpType() == OpTypes::kX2X ||
             task->op->getOpType() == OpTypes::kEmbedding) {
           task->op->setup(task->inputs, task->outputs);
           task->op->forward(task->inputs, task->outputs);
         }
         break;
       }
       case TaskTypes::kExecuteModule: {
         // Full module execution on QNN
         // (the module is recovered from task->custom_context_ptr)
         auto qnnBackend = getBackend(kQNN);
         auto moduleName = getModuleName(task);

         // Forward pass to populate outputs
         task->outputs = module->forward(task->inputs, task->args);

         // Execute the QNN graph
         qnnBackend->graphExecute(moduleName,
                                  task->inputs,
                                  task->outputs);
         break;
       }
     }
   }

**Execution Modes:**

* **Op-Level**: Individual operations (X2X, Embedding) executed separately
* **Module-Level**: Entire subgraphs executed as optimized QNN graphs

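Which of these two modes applies is decided by where a module is placed. The sketch below assumes a hypothetical module class ``MyQnnBlock`` built with ``reg()`` as in the architecture document; the class name, config type, and tensor shapes are placeholders, not framework APIs.

.. code-block:: cpp

   // Hypothetical sketch: routing one submodule through the QNN dispatcher.
   std::vector<Tensor> runQnnBlock(MyQnnBlock& block, const Tensor& hidden_states) {
     // Placing the module on kQNN means its forward() is wrapped in a single
     // kExecuteModule task and executed through QNNBackend::graphExecute().
     block.to(DeviceTypes::kQNN);

     // Ops left on kCPU (for example KV-cache updates) keep going through
     // kExecuteOp tasks on the CPU dispatcher.
     return block.forward({hidden_states}, /*args=*/{});
   }
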
QNNGraphBuildPass: Compilation Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Transforms MLLM IR into executable QNN graphs through pattern matching:

**Compilation Flow:**

1. **IR Traversal**: Iterate through ``SubGraphOp`` nodes marked for QNN
2. **Pattern Matching**: Match MLLM operations to QNN operation patterns
3. **Graph Construction**: Build the QNN graph with nodes and tensors
4. **Optimization**: Apply QNN SDK optimizations
5. **Finalization**: Compile the graph for the target hardware

**Pattern Registration:**

.. code-block:: cpp

   class QNNGraphBuildPass : public Pass {
     QNNGraphBuildPass() {
       // Register operation patterns
       regPattern();

       // Register custom ops
       patterns_.emplace(
           customOpId("DequantizeAdd"),
           std::make_shared<...>()
       );
     }
   };

**Pattern Example:**

.. code-block:: cpp

   class QNNLinearPattern : public QNNOpPattern {
     bool addNode(const std::string& graphName,
                  const ir::linalg::LinalgIROp::ptr_t& op,
                  const std::vector<...>& inputs,
                  const std::vector<...>& outputs) override {
       // Add input tensors
       addTensor(graphName, inputs[0], QNN_TENSOR_TYPE_NATIVE);
       addTensor(graphName, inputs[1], QNN_TENSOR_TYPE_STATIC);

       // Add output tensor
       addTensor(graphName, outputs[0], QNN_TENSOR_TYPE_NATIVE);

       // Create QNN FullyConnected node
       backend->graphAddNode(
           graphName,
           op->name(),
           "FullyConnected",
           {inputs[0]->name(), inputs[1]->name()},
           {outputs[0]->name()},
           {},   // tensor params
           {}    // scalar params
       );

       return true;
     }
   };

Execution Workflows
-------------------

Compilation Workflow
~~~~~~~~~~~~~~~~~~~~

The QNN backend compilation workflow transforms traced IR into executable graphs:

.. code-block:: text

   User Model (Python/C++)
     │
     ├─── Model::trace()  [trace_mode=true]
     │      └─── Creates IR representation
     │
     ▼
   IR Module (mllm::ir::ModuleOp)
     │
     ├─── Contains SubGraphOp(s) marked as DeviceTypes::kQNN
     │
     ▼
   QNNGraphBuildPass::run()
     │
     ├─── For each QNN SubGraphOp:
     │      │
     │      ├─── backend->createQnnGraph(graphName)
     │      │      └─── Creates QNNModel instance
     │      │
     │      ├─── Add graph input tensors
     │      │      └─── qnnModel->addTensor(..., QNN_TENSOR_TYPE_APP_WRITE)
     │      │
     │      ├─── Traverse IR operations
     │      │      │
     │      │      ├─── Match to QNN patterns
     │      │      │      └─── pattern->addNode(graphName, op, inputs, outputs)
     │      │      │
     │      │      └─── Create QNN ops with:
     │      │             ├─── Tensor parameters (weights, constants)
     │      │             ├─── Scalar parameters (hyperparameters)
     │      │             ├─── Input tensor names
     │      │             └─── Output tensor names
     │      │
     │      └─── backend->graphFinalize(graphName)
     │             │
     │             ├─── qnnModel->finalizeGraph(...)
     │             │      └─── Calls qnnInterface.graphFinalize()
     │             │             └─── QNN SDK optimizes and compiles the graph
     │             │
     │             └─── Graph ready for execution
     │
     ▼
   Compiled QNN Graphs (ready for inference)

**Code Example:**

.. code-block:: cpp

   // In QNNGraphBuildPass::buildQnnGraph()
   void QNNGraphBuildPass::buildQnnGraph(
       const ir::graph::SubGraphOp::ptr_t& sub_graph_op) {

     auto qnn_backend = getQNNBackend();
     std::string graph_name = sub_graph_op->getSymbolAttr()->str();

     // Create QNN model
     auto qnn_model = qnn_backend->createQnnGraph(graph_name);

     // Add graph inputs
     for (auto& input : sub_graph_op->inputs()) {
       auto input_tensor = input->cast_<...>();
       auto quantize_param = createQuantizeParams(input_tensor->tensor_);
       qnn_model->addTensor(input_tensor->name(),
                            QNN_TENSOR_TYPE_APP_WRITE,
                            input_tensor->tensor_,
                            quantize_param);
     }

     // Process operations
     for (auto& region_op : graph_region->ops()) {
       if (auto linalg_op = cast<ir::linalg::LinalgIROp>(region_op)) {
         auto op_types = linalg_op->getAOpTypes();
         if (patterns_.contains(op_types)) {
           patterns_[op_types]->addNode(
               graph_name, linalg_op,
               op_inputs, op_outputs
           );
         }
       }
     }

     // Finalize graph
     qnn_backend->graphFinalize(graph_name);
   }

Runtime Execution Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard inference execution through the dispatcher system:

.. code-block:: text

   Application::forward()
     │
     ├─── Module::forward()  [DeviceTypes::kQNN]
     │      │
     │      ├─── Module::__main()
     │      │      │
     │      │      ├─── Task::createExecuteModuleTask()
     │      │      │      └─── task->custom_context_ptr = module
     │      │      │
     │      │      └─── DispatcherManager::submit(qnn_dispatcher_id, task)
     │      │
     │      ▼
     │    QNNDispatcher::receive(task)
     │      │
     │      └─── QNNDispatcher::process(task)
     │             │
     │             ├─── case kExecuteModule:
     │             │      │
     │             │      ├─── Extract module name
     │             │      │
     │             │      ├─── Call module->forward() to set up outputs
     │             │      │      └─── Creates output tensor shapes
     │             │      │
     │             │      └─── qnnBackend->graphExecute(moduleName, inputs, outputs)
     │             │             │
     │             │             ├─── Look up QNNModel by name
     │             │             │
     │             │             ├─── Copy input data to QNN tensors
     │             │             │      └─── Handles quantization if needed
     │             │             │
     │             │             ├─── qnnInterface.graphExecute()
     │             │             │      └─── QNN SDK executes on HTP/DSP
     │             │             │
     │             │             └─── Copy output data from QNN tensors
     │             │                    └─── Handles dequantization if needed
     │             │
     │             └─── case kExecuteOp:
     │                    └─── Execute X2X/Embedding ops individually
     │
     ▼
   Output Tensors (returned to the application)


Quantization Support
--------------------

The QNN backend provides comprehensive quantization support for efficient inference:

Quantization Metadata Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Quantization scales are attached to tensors as metadata:

.. code-block:: cpp

   // Set quantization scale
   inline void setQuantScale(Tensor& tensor, float scale) {
     auto scale_view = std::make_shared<Tensor>(
         Tensor::empty({1}, kFloat32, kCPU).alloc()
     );
     scale_view->ptr<float>()[0] = scale;
     tensor.attachedViews()[QNN_QUANT_SCALE_NAME] = scale_view;
   }

   // Get quantization scale
   inline float getQuantScale(Tensor& tensor) {
     if (!tensor.attachedViews().contains(QNN_QUANT_SCALE_NAME)) {
       return 1.0f;  // Default scale
     }
     return tensor.attachedViews()[QNN_QUANT_SCALE_NAME]->ptr<float>()[0];
   }

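To make the metadata concrete: under the zero-offset scale/offset encoding used below in ``createQuantizeParams()``, a quantized element ``q`` with scale ``s`` and offset ``z`` represents the real value ``s * (q - z)``. The helpers in the following sketch are illustrative only and not part of the backend API; the calibration value is made up.

.. code-block:: cpp

   // Illustrative only: numeric meaning of an attached scale.
   inline float dequantizeScaleOffset(int8_t q, float scale, int32_t offset = 0) {
     return scale * (static_cast<float>(q) - static_cast<float>(offset));
   }

   // Typical flow when tagging an int8 tensor before it reaches the graph builder.
   void tagActivation(Tensor& activation) {
     if (activation.dtype() == kInt8) {
       setQuantScale(activation, 0.0123f);   // attach calibrated scale (example value)
     }
     float s = getQuantScale(activation);    // falls back to 1.0f if nothing attached
     (void)dequantizeScaleOffset(/*q=*/-42, s);
   }
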
QNN Quantization Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Convert MLLM quantization metadata to the QNN format:

.. code-block:: cpp

   Qnn_QuantizeParams_t createQuantizeParams(const Tensor& tensor) {
     if (tensor.dtype() == kInt8 || tensor.dtype() == kInt16) {
       float scale = getQuantScale(tensor);
       return Qnn_QuantizeParams_t{
           QNN_DEFINITION_DEFINED,
           QNN_QUANTIZATION_ENCODING_SCALE_OFFSET,
           {.scaleOffsetEncoding = {
               .scale = scale,
               .offset = 0   // Zero-point offset
           }}
       };
     }
     // Undefined quantization for float tensors
     return DEFAULT_QUANTIZE_PARAMS;
   }

Scale Propagation
~~~~~~~~~~~~~~~~~

Quantization scales propagate through reshape operations:

.. code-block:: cpp

   void propagateQuantScale(const Tensor& input, Tensor& output) {
     if (input.dtype() == kInt8 || input.dtype() == kInt16) {
       float scale = getQuantScale(input);
       setQuantScale(output, scale);
     }
   }

Custom Operations
-----------------

The QNN backend supports custom operations through the QNN op package mechanism:

DequantizeAdd Custom Op
~~~~~~~~~~~~~~~~~~~~~~~

A custom fused operation combining dequantization and addition:

**Purpose:**

* Fuse int8 dequantization with element-wise addition
* Improve accuracy for quantized models

**Usage Example:**

.. code-block:: cpp

   // In QwenAttentionProjNPU
   class QwenAttentionProjNPU : public nn::Module {
     nn::qnn::DequantizeAdd q_proj_dequantize_add_;
     nn::qnn::DequantizeAdd k_proj_dequantize_add_;
     nn::qnn::DequantizeAdd v_proj_dequantize_add_;

     QwenAttentionProjNPU(const std::string& name, const QwenNPUConfig& cfg)
         : nn::Module(name) {
       // Register custom ops
       q_proj_dequantize_add_ = reg<nn::qnn::DequantizeAdd>(
           "self_attn.q_proj_dequantize_add"
       );
       // ...
     }
   };

**Pattern Registration:**

.. code-block:: cpp

   // In QNNGraphBuildPass constructor
   patterns_.emplace(
       Context::instance().lookupCustomizedOpId(kQNN, "DequantizeAdd"),
       std::make_shared<...>()
   );

Performance Optimization
------------------------

Power Configuration
~~~~~~~~~~~~~~~~~~~

The QNN backend provides power profile management:

.. code-block:: cpp

   class QNNPerf {
     void setPowerConfigBurst() {
       // High performance mode
       // - Maximum clock frequencies
       // - Higher power consumption
       // - Lower latency
     }

     void setPowerConfigBalanced() {
       // Balanced mode
       // - Moderate clock frequencies
       // - Balanced power/performance
       // - Medium latency
     }

     void setRpcLatencyAndPolling() {
       // Configure RPC latency for HTP communication
     }
   };

Profiling Support (TODO)
~~~~~~~~~~~~~~~~~~~~~~~~

.. note:: This is not yet implemented. Better profiling info printing should be added.

Enable detailed profiling for performance analysis:

.. code-block:: cpp

   enum class ProfilingLevel {
     OFF,        // No profiling
     BASIC,      // Basic timing information
     DETAILED,   // Detailed layer-wise profiling
     INVALID
   };

   // Create runtime with profiling
   auto runtime = QNNRuntime::create(
       ProfilingLevel::DETAILED,
       QNN_LOG_LEVEL_INFO
   );

Context Serialization
~~~~~~~~~~~~~~~~~~~~~

.. note:: TODO: Context retrieval should support lookup by file name and dynamic switching.

Serialize compiled graphs to avoid recompilation:

.. code-block:: cpp

   // Save context to a binary file
   qnn_backend->saveContext("qnn_context.bin");

   // Load a pre-compiled context
   Qnn_ContextHandle_t context;
   std::vector<std::shared_ptr<QNNModel>> models;
   runtime->retrieveContext(context, models);

Best Practices
--------------

Graph Partitioning
~~~~~~~~~~~~~~~~~~

For optimal performance, partition your model strategically:

**Guidelines:**

* **QNN Subgraphs**: Place compute-intensive operations (Linear, Conv, Attention) on QNN
* **CPU Operations**: Keep dynamic operations (KVCache, RoPE) on CPU
* **Minimize Data Transfer**: Reduce tensor copies between QNN and CPU

**Example Partitioning:**

.. code-block:: cpp

   class QwenDecoder : public Module {
     // QNN: Attention projections
     QwenAttentionProjNPU self_attn_proj_;     // -> kQNN

     // CPU: KV cache and RoPE
     QwenAttentionMatmul self_attn_matmul_;    // -> kCPU

     // QNN: Output projection and MLP
     QwenOutProjAndMLP self_attn_out_mlp_;     // -> kQNN
   };

Quantization Strategy
~~~~~~~~~~~~~~~~~~~~~

**Recommendations:**

1. **Per-Tensor Quantization**: Attach scales to input/output tensors
2. **Scale Initialization**: Set scales during model loading
3. **Dynamic Range**: Use calibration data to determine optimal scales
4. **Precision**: INT8 for most operations, INT16 for critical layers

.. code-block:: cpp

   // During model loading
   void loadQuantizedModel(const ParameterFile::ptr_t& params) {
     for (auto& [name, tensor] : *params) {
       if (tensor.dtype() == kInt8) {
         // Scale stored in the parameter file
         float scale = params->getScale(name);
         setQuantScale(tensor, scale);
       }
     }
   }

Error Handling
~~~~~~~~~~~~~~

Always check return codes from QNN operations:

.. code-block:: cpp

   #define CALL_QNN(apiCall) do {                                    \
       int errorCode = ((apiCall) & 0xFFFF);                         \
       if (errorCode != QNN_SUCCESS) {                               \
         MLLM_ERROR("QNN Error in {}, line {}: error code {}",       \
                    __FILE__, __LINE__, errorCode);                  \
         assert(errorCode == QNN_SUCCESS);                           \
       }                                                             \
     } while (0)

   // Usage
   CALL_QNN(qnnInterface.graphFinalize(graph, nullptr, nullptr));

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Issue: Graph finalization fails**

* **Cause**: Incompatible tensor dimensions or unsupported operations
* **Solution**: Check the QNN SDK documentation for supported ops and constraints

**Issue: Incorrect output values**

* **Cause**: Quantization scale mismatch or missing scale propagation
* **Solution**: Verify quantization scales are correctly set and propagated

**Issue: Performance degradation**

* **Cause**: Excessive CPU-QNN data transfers or suboptimal partitioning
* **Solution**: Profile with Perfetto and optimize graph boundaries

Debug Logging
~~~~~~~~~~~~~

Enable verbose QNN logging:

.. code-block:: cpp

   auto runtime = QNNRuntime::create(
       ProfilingLevel::DETAILED,
       QNN_LOG_LEVEL_VERBOSE   // Maximum verbosity
   );

API Reference
-------------

QNNBackend API
~~~~~~~~~~~~~~

.. code-block:: cpp

   class QNNBackend : public Backend {
    public:
     // Graph lifecycle
     std::shared_ptr<QNNModel> createQnnGraph(const std::string& graphName);
     bool graphFinalize(const std::string& graphName);
     void graphExecute(const std::string& graphName,
                       std::vector<Tensor>& inputs,
                       std::vector<Tensor>& outputs);

     // Tensor management
     bool addTensor(const std::string& graphName,
                    const std::string& tensorName,
                    Qnn_TensorType_t type,
                    const Tensor& tensor,
                    Qnn_QuantizeParams_t quantize = DEFAULT_QUANTIZE_PARAMS);

     bool addStaticTensor(const std::string& graphName,
                          const std::string& tensorName,
                          const Tensor& tensor,
                          Qnn_QuantizeParams_t quantize = DEFAULT_QUANTIZE_PARAMS);

     std::shared_ptr<QNNTensorWrapper> getTensorWrapper(
         const std::string& graphName,
         const std::string& tensorName);

     // Node management
     void graphAddNode(const std::string& graphName,
                       const std::string& nodeName,
                       const std::string& nodeType,
                       const std::vector<std::string>& inputTensorNames,
                       const std::vector<std::string>& outputTensorNames,
                       const std::vector<std::shared_ptr<QNNParamTensorWrapper>>& tensorParams,
                       const std::vector<std::shared_ptr<QNNParamScalarWrapper>>& scalarParams,
                       const std::string& packageName = "qti.aisw");

     // Properties
     bool isWeightOnDevice() override;
     const QNN_INTERFACE_VER_TYPE& qnnInterface() const;
     Qnn_BackendHandle_t backendHandle() const;
     Qnn_ContextHandle_t context() const;
   };

For more information on the overall framework architecture, see :doc:`../arch/arch`.

diff --git a/docs/qnn_backend/index.rst b/docs/qnn_backend/index.rst
new file mode 100644
index 000000000..b7092f938
--- /dev/null
+++ b/docs/qnn_backend/index.rst
@@ -0,0 +1,9 @@
QNN Backend
===========

.. toctree::
   :maxdepth: 2

   setup_env
   core_design
   qnn_model_convert
diff --git a/docs/qnn_backend/qnn_model_convert.rst b/docs/qnn_backend/qnn_model_convert.rst
new file mode 100644
index 000000000..f157c7f2a
--- /dev/null
+++ b/docs/qnn_backend/qnn_model_convert.rst
@@ -0,0 +1,2 @@
QNN Model Conversion
====================
\ No newline at end of file
diff --git a/docs/qnn_backend/setup_env.rst b/docs/qnn_backend/setup_env.rst
new file mode 100644
index 000000000..8b5554533
--- /dev/null
+++ b/docs/qnn_backend/setup_env.rst
@@ -0,0 +1,136 @@
QNN Environment Setup
=====================

Overview
--------

This section describes how to set up the QNN development environment, following the official QNN documentation. For more details, see the `QNN Linux Setup `_ guide.

Prerequisites
-------------

The QNN backend relies on two main SDKs:

- **Qualcomm QNN SDK**: Required for QNN backend compilation
- **Hexagon SDK**: Required for QNN custom operator (LLaMAOpPackage in mllm) compilation

Version Requirements
~~~~~~~~~~~~~~~~~~~~

- **QNN**: Linux v2.34+
- **Hexagon SDK**: Linux 5.x

.. warning::
   Some accounts may not have permission to access the Hexagon SDK and may need to contact Qualcomm for support.

SDK Download and Installation
-----------------------------

QNN SDK Installation
~~~~~~~~~~~~~~~~~~~~

1. Download the QNN SDK from the `official Qualcomm website `_
2. Unzip the downloaded file
3. Set the environment variable ``QNN_SDK_ROOT`` to point to the unzipped directory

Hexagon SDK Installation
~~~~~~~~~~~~~~~~~~~~~~~~

The `Hexagon SDK `_ is Qualcomm's official development environment for programming and optimizing applications on the Hexagon DSP, the core processor architecture used in Snapdragon chips for efficient, low-power computation.

By installing and sourcing the Hexagon SDK, developers can build the `custom op package `_ (the LLaMAOpPackage in this project), enabling HVX capabilities.

To install the Hexagon SDK, follow these steps:

1. Download the Hexagon SDK using `QPM `_ (Qualcomm Package Manager)
2. Install the SDK following the QPM instructions

Environment Setup
-----------------

After downloading and installing both SDKs, set up the environment by running the following commands:

.. code-block:: bash

   # Set up QNN SDK environment
   source <path-to-qnn-sdk>/bin/envsetup.sh

   # Set up Hexagon SDK environment
   source <path-to-hexagon-sdk>/setup_sdk_env.source

Environment Variables Verification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After setting up the environment, verify that the following environment variables are correctly set:

.. code-block:: bash

   echo $QNN_SDK_ROOT      # Should point to /path/to/your/qnn/sdk
   echo $HEXAGON_SDK_ROOT  # Should point to /path/to/your/hexagon/sdk

.. note::
   These environment variables are essential for the QNN op package compilation process.

Op Package Compilation
----------------------

To use QNN offload, both the CPU and HTP QNN op packages are required. The following steps build the QNN op packages needed by the project.

Prerequisites for Compilation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure the following environment variables are set:

- ``QNN_SDK_ROOT``
- ``HEXAGON_SDK_ROOT``
- ``ANDROID_NDK_ROOT``

Compilation Commands
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cd mllm/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/
   make htp_aarch64 && make htp_v75

This builds the necessary QNN op packages for both the AArch64 and HVX v75 targets.

Development Tips
----------------

LSP Configuration for HVX Development
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To enable Language Server Protocol (LSP) support for HVX development, configure clangd to use the Hexagon toolchain:

1. Create or edit ``.vscode/settings.json`` in your project root
2. Add the following configuration:

.. code-block:: json

   {
     "clangd.path": "$HEXAGON_SDK_ROOT/tools/HEXAGON_Tools/8.7.06/Tools/bin/hexagon-clangd"
   }

Generating Compilation Database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To generate the ``compile_commands.json`` file for the op package:

.. code-block:: bash

   cd mllm/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/
   compiledb make htp_v75 -C .

This compilation database is useful for IDE features like code completion and error highlighting.

Next Steps
----------

After completing the environment setup, you can proceed to:

- Model conversion and quantization
- Building the project with the QNN backend
- Running QNN-accelerated models

For detailed instructions on these steps, refer to the respective documentation sections.