diff --git a/docs/_static/img/arch.png b/docs/_static/img/arch.png
new file mode 100644
index 000000000..72a0c42ec
Binary files /dev/null and b/docs/_static/img/arch.png differ
diff --git a/docs/_static/img/qnn-trace-execute-seq.png b/docs/_static/img/qnn-trace-execute-seq.png
new file mode 100644
index 000000000..784ef2bb0
Binary files /dev/null and b/docs/_static/img/qnn-trace-execute-seq.png differ
diff --git a/docs/arch/arch.rst b/docs/arch/arch.rst
new file mode 100644
index 000000000..9989913d2
--- /dev/null
+++ b/docs/arch/arch.rst
@@ -0,0 +1,346 @@
MLLM Framework Core Architecture
================================

Overview
--------

The MLLM framework employs a hierarchical execution model with three main components:

* **Module**: High-level abstraction for neural network modules
* **Layer**: Abstraction for individual operations/layers
* **Dispatcher**: Execution engine for different backends

This architecture supports both regular operation execution and intermediate representation (IR) tracing workflows, enabling flexible deployment across multiple hardware backends, including CPU, QNN (also known as Qualcomm AI Engine Direct/QAIRT), and custom accelerators.

.. figure:: ../_static/img/arch.png
   :width: 100%
   :alt: Overview
   :align: center

   Figure 1: MLLM Framework Core Architecture.

Core Components
---------------

Module
~~~~~~

The ``Module`` class serves as the top-level container for neural network components. Key responsibilities include:

* **Hierarchical Organization**: Modules can contain other modules and layers, forming a tree structure
* **Parameter Management**: Loading and saving model parameters from/to files
* **Device Management**: Moving modules and their components across different devices
* **Forward Execution**: Orchestrating the execution flow through child components

**Key Methods:**

.. code-block:: cpp

   class Module {
     std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
                                 const std::vector<...>& args);
     void to(DeviceTypes device_type);
     void load(const ParameterFile::ptr_t& param_file);

     // Named module registration (similar to PyTorch's named_modules)
     template <typename T, typename... Args>
     auto reg(const std::string& name, Args&&... args);
   };

**Named Module Registration:**

The ``reg()`` method provides functionality similar to PyTorch's ``named_modules()``, enabling hierarchical module organization with automatic name management in C++:

.. code-block:: cpp

   class MyModel : public nn::Module {
    public:
     MyModel(const std::string& name) : nn::Module(name) {
       // Register sub-modules with names
       encoder_ = reg<EncoderModule>("encoder", config);
       decoder_ = reg<DecoderModule>("decoder", config);

       // Register layers with names
       linear1_ = reg<nn::Linear>("fc1", 768, 3072, false);
       linear2_ = reg<nn::Linear>("fc2", 3072, 768, false);
     }

    private:
     EncoderModule encoder_;  // Absolute name: "model.encoder"
     DecoderModule decoder_;  // Absolute name: "model.decoder"
     nn::Linear linear1_;     // Absolute name: "model.fc1"
     nn::Linear linear2_;     // Absolute name: "model.fc2"
   };

**Key Features:**

* **Automatic Name Hierarchy**: Constructs fully-qualified names (e.g., ``"model.encoder.layer0.attention"``)
* **Parameter Mapping**: Links module names to parameter files for loading/saving
* **Device Management**: Enables selective device placement by module name
* **Type Safety**: Template-based registration with compile-time type checking

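A typical caller only touches these few methods. The sketch below shows the intended flow under stated assumptions: the ``ParameterFile::load()`` factory, the header path, and the input shape are placeholders inferred from the snippets in this document, not verified signatures.

.. code-block:: cpp

   // Hypothetical usage sketch of the Module API quoted above.
   #include "mllm/nn/Module.hpp"   // assumed header layout

   int main() {
     // "model" becomes the root of every absolute name built by reg().
     MyModel model("model");

     // Parameter entries such as "model.fc1.weight" are resolved against
     // the names produced by reg(). ParameterFile::load() is assumed here.
     auto params = ParameterFile::load("my_model.mllm");
     model.load(params);

     // Place the whole module tree on the CPU backend.
     model.to(DeviceTypes::kCPU);

     // Run one forward pass with a single input tensor.
     std::vector<Tensor> inputs = {Tensor::empty({1, 768}, kFloat32, kCPU).alloc()};
     auto outputs = model.forward(inputs, /*args=*/{});
     return 0;
   }
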
**Comparison with PyTorch:**

.. code-block:: python

   # PyTorch
   class MyModel(nn.Module):
       def __init__(self):
           super().__init__()
           self.encoder = EncoderModule()  # Automatically named "encoder"
           self.decoder = DecoderModule()  # Automatically named "decoder"

   # Print all named modules
   model = MyModel()
   for name, module in model.named_modules():
       print(f"{name}: {module}")

.. code-block:: cpp

   // MLLM Framework
   class MyModel : public nn::Module {
    public:
     MyModel(const std::string& name) : nn::Module(name) {
       encoder_ = reg<EncoderModule>("encoder");  // Explicitly named "encoder"
       decoder_ = reg<DecoderModule>("decoder");  // Explicitly named "decoder"
     }
   };

   // Names are automatically constructed: "model.encoder", "model.decoder"
   // Used for parameter loading: params->load("model.encoder.weight")

The ``reg()`` method bridges the gap between Python's dynamic attribute naming and C++'s static type system, providing a clean API for building hierarchical neural networks.

Layer Abstraction
~~~~~~~~~~~~~~~~~

The ``Layer`` class represents an individual operation or layer within a module:

* **Operation Encapsulation**: Wraps backend-specific operations (``BaseOp``)
* **Device Abstraction**: Handles operation instantiation for different backends
* **Task Creation**: Creates execution tasks for the dispatcher system

**Key Methods:**

.. code-block:: cpp

   class Layer {
     std::vector<Tensor> __main(const std::vector<Tensor>& inputs);
     Layer& to(DeviceTypes device_type);
     OpTypes opType() const;
   };

Dispatcher System
~~~~~~~~~~~~~~~~~

The dispatcher system provides backend-specific execution engines:

**CPUDispatcher**
   Handles CPU-based operation execution with the full operation lifecycle:

   * ``reshape()``: Tensor shape computation
   * ``setup()``: Operation initialization
   * ``forward()``: Actual computation

**IRTraceDispatcher**
   Captures execution traces for IR generation:

   * Records operation calls and tensor flows
   * Enables graph optimization and analysis
   * Supports compilation workflows

**QNNDispatcher**
   Manages QNN backend execution:

   * Specialized for QNN graph execution
   * Handles module-level execution for QNN graphs
   * Selective operation execution (X2X, Embedding ops)

Execution Workflows
-------------------

Op Execution Workflow
~~~~~~~~~~~~~~~~~~~~~

The standard execution path for neural network inference:

.. code-block:: text

   Module::forward()
     │
     ├─── Module::__main()
     │      │
     │      ├─── Task::createExecuteModuleTask()
     │      │
     │      └─── DispatcherManager::submit()
     │             │
     │             └─── [CPU|QNN]Dispatcher::receive()
     │                    │
     │                    └─── [CPU|QNN]Dispatcher::process()
     │
     └─── Layer::__main()
            │
            ├─── Task::createExecuteOpTask()
            │
            └─── DispatcherManager::submit()
                   │
                   └─── [CPU|QNN]Dispatcher::receive()
                          │
                          └─── [CPU|QNN]Dispatcher::process()
                                 │
                                 ├─── Op::reshape()
                                 ├─── Op::setup()
                                 └─── Op::forward()

**Execution Flow Details:**

1. **Module Entry**: ``Module::forward()`` is called with input tensors
2. **Task Creation**: Creates ``kExecuteModule`` or ``kExecuteOp`` tasks
3. **Dispatcher Selection**: Routes to the appropriate backend dispatcher based on device type
4. **Backend Processing**: The dispatcher executes the operation using backend-specific logic
5. **Result Return**: Output tensors are returned through the task system

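To make the per-operation path concrete, the sketch below shows what a minimal synchronous dispatcher could look like, following the reshape/setup/forward lifecycle described for ``CPUDispatcher`` above. The override set and field names are assumptions drawn from the snippets in this document; the asynchronous entry point is omitted here and discussed later.

.. code-block:: cpp

   // Hypothetical sketch of a synchronous backend dispatcher.
   #include <cassert>

   class MyCPUDispatcher : public Dispatcher {
    public:
     void receive(const Task::ptr_t& task) override {
       // Synchronous mode: block until the task has been processed.
       process(task);
     }

     void process(const Task::ptr_t& task) override {
       assert(task->type == TaskTypes::kExecuteOp);
       // 1. Derive output shapes from the input shapes.
       task->op->reshape(task->inputs, task->outputs);
       // 2. One-time initialization (buffer allocation, weight packing, ...).
       task->op->setup(task->inputs, task->outputs);
       // 3. Run the actual computation.
       task->op->forward(task->inputs, task->outputs);
     }

     void syncWait() override {
       // Nothing to wait for: every task completed inside receive().
     }
   };
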
IR Execution Workflow
~~~~~~~~~~~~~~~~~~~~~

When trace mode is enabled, the framework captures an intermediate representation:

.. code-block:: text

   Module::forward()  [trace_mode=true]
     │
     ├─── Module::__trace()
     │      │
     │      ├─── IRContext::create()   (SubGraphOp / CallGraphOp nodes)
     │      ├─── IRContext::create()
     │      │
     │      ├─── Module::forward()  [recursive]
     │      │
     │      └─── IRContext::create()   (ReturnOp finalizes the subgraph)
     │
     └─── Layer::__main()  [trace_mode=true]
            │
            ├─── Task::createExecuteOpTask()
            │      └─── task->custom_context_ptr = ir_context
            │
            └─── IRTraceDispatcher::receive()
                   │
                   └─── IRTraceDispatcher::process()
                          │
                          ├─── Op::reshape()
                          └─── Op::trace()

**IR Workflow Details:**

1. **Trace Initialization**: ``Context::thisThread()->trace_mode`` enables IR capture
2. **Graph Construction**: Creates IR graph nodes (``CallGraphOp``, ``SubGraphOp``)
3. **Operation Tracing**: Each operation call is recorded in the IR graph
4. **Graph Completion**: ``ReturnOp`` finalizes the subgraph structure
5. **IR Output**: The complete computational graph is available for optimization/compilation

For more details on IR tracing and compilation, refer to the :doc:`MLLM IR <../compile/ir>` section.

Synchronous vs Asynchronous Execution
-------------------------------------

Synchronous Execution
~~~~~~~~~~~~~~~~~~~~~

Currently, the primary execution mode uses synchronous task processing:

.. code-block:: cpp

   // In Dispatcher::receive()
   void CPUDispatcher::receive(const Task::ptr_t& task) {
     process(task);  // Blocks until completion
   }

**Characteristics:**

* **Blocking Operation**: Each task completes before returning
* **Simple Flow Control**: Sequential execution guarantees
* **Immediate Results**: Output tensors are available immediately after task submission

Asynchronous Execution (Future Enhancement)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The framework includes infrastructure for asynchronous execution:

.. code-block:: cpp

   // In Dispatcher::asyncReceive()
   TaskResult::sender_t CPUDispatcher::asyncReceive(const Task::ptr_t& task) {
     auto scheduler = thread_pool_.get_scheduler();
     return stdexec::schedule(scheduler) |
            stdexec::then([this, task] { process(task); });
   }

**Design Features:**

* **Non-blocking Submission**: Tasks return immediately with a sender/future
* **Thread Pool Integration**: Uses ``exec::static_thread_pool`` for parallel execution
* **Sender/Receiver Pattern**: Based on the C++26 sender/receiver async model
* **Pipeline Capability**: Enables operation pipelining and overlapping

**Current Status:**

The asynchronous execution path is implemented but not fully integrated:

* ``IRTraceDispatcher::asyncReceive()`` returns an error
* Most dispatchers have placeholder async implementations
* Synchronization points (``syncWait()``) are not fully implemented

Task System Architecture
------------------------

The task system provides a unified interface for operation execution:

**Task Types:**

* ``kExecuteOp``: Single operation execution
* ``kExecuteModule``: Module-level execution (for QNN graphs)

**Task Structure:**

.. code-block:: cpp

   struct Task {
     TaskTypes type;
     BaseOp::ptr_t op;             // Operation to execute
     std::vector<Tensor> inputs;   // Input tensors
     std::vector<Tensor> outputs;  // Output tensors
     std::vector<...> args;        // Additional arguments
     void* custom_context_ptr;     // Backend-specific context
   };

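Tasks flow through dispatchers either synchronously or, in the future, asynchronously. The sketch below is a hedged illustration of how the asynchronous path could be driven once it is fully integrated; only ``asyncReceive()``, ``syncWait()``, and ``stdexec::sync_wait()`` are taken from APIs mentioned in this document or the stdexec reference implementation.

.. code-block:: cpp

   // Hypothetical sketch: driving the asynchronous path with stdexec.
   #include <stdexec/execution.hpp>

   void runAsync(Dispatcher& dispatcher, const Task::ptr_t& task) {
     // Submit without blocking; the dispatcher returns a sender.
     auto work = dispatcher.asyncReceive(task);

     // Compose with further senders, or block here for the result.
     stdexec::sync_wait(std::move(work));

     // Once per-dispatcher synchronization is implemented, a coarse
     // barrier would look like this instead:
     // dispatcher.syncWait();
   }
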
**Dispatcher Interface:**

.. code-block:: cpp

   class Dispatcher {
     virtual void receive(const Task::ptr_t& task) = 0;
     virtual TaskResult::sender_t asyncReceive(const Task::ptr_t& task) = 0;
     virtual void process(const Task::ptr_t& task) = 0;
     virtual void syncWait() = 0;
   };

Backend Integration
-------------------

The framework supports multiple execution backends through the dispatcher pattern:

**CPU Backend**
   * Full operation support with the reshape/setup/forward lifecycle
   * Direct tensor computation on CPU
   * Perfetto tracing integration for performance analysis

**QNN Backend**
   * Optimized execution for Qualcomm Neural Processing Units
   * Graph-level execution for improved performance
   * Selective operation fallback to CPU when needed

**IR Tracing Backend**
   * Captures computational graphs for analysis and optimization
   * Enables ahead-of-time compilation workflows
   * Supports graph transformation and optimization passes

This architecture provides a flexible foundation for deploying neural networks across diverse hardware platforms while maintaining a consistent programming interface.
\ No newline at end of file
diff --git a/docs/arch/index.rst b/docs/arch/index.rst
index 3cea39e0c..619bc019c 100644
--- a/docs/arch/index.rst
+++ b/docs/arch/index.rst
@@ -4,5 +4,7 @@ Architectures
 .. toctree::
    :maxdepth: 2
 
+   arch
    tensor
    support_ops
+   op_plugin_system
diff --git a/docs/arch/op_plugin_system.rst b/docs/arch/op_plugin_system.rst
new file mode 100644
index 000000000..fedf53592
--- /dev/null
+++ b/docs/arch/op_plugin_system.rst
@@ -0,0 +1,2 @@
Op Plugin System
================
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
index 8135b933f..ddbd668a0 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -346,6 +346,11 @@ Documents
 
    cpu_backend/index
 
+.. toctree::
+   :maxdepth: 2
+
+   qnn_backend/index
+
 .. toctree::
    :maxdepth: 2
 
diff --git a/docs/qnn_backend/core_design.rst b/docs/qnn_backend/core_design.rst
new file mode 100644
index 000000000..1822b7d0c
--- /dev/null
+++ b/docs/qnn_backend/core_design.rst
@@ -0,0 +1,785 @@
QNN Backend Design
==================

Overview
--------

The QNN (Qualcomm Neural Network) backend provides optimized execution of neural network models through Qualcomm's AI Engine Direct SDK (formerly the QNN SDK, the successor to SNPE). This backend enables efficient deployment on Qualcomm-powered devices, including smartphones, embedded systems, and edge AI platforms.

**Key Features:**

* **Hardware Acceleration**: Leverages Qualcomm's Hexagon DSP and HTP (Hexagon Tensor Processor)
* **Graph-Level Optimization**: Executes entire subgraphs as optimized QNN graphs
* **Mixed Precision Support**: INT8/INT16 quantization with dynamic scale propagation
* **Context Caching**: Serializes compiled graphs to binary format for fast loading
* **Custom Operations**: Extensible custom op support through QNN op packages

.. figure:: ../_static/img/qnn-trace-execute-seq.png
   :width: 90%
   :alt: Overview
   :align: center

   Figure 1: QNN Backend Execution Sequence.


Architecture Components
-----------------------

The QNN backend architecture consists of several key components working together:

.. code-block:: text

   ┌──────────────────────────────────────────────────────────────┐
   │                        MLLM Framework                        │
   │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
   │   │    Module    │    │    Layer     │    │  Dispatcher  │   │
   │   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘   │
   └──────────┼───────────────────┼───────────────────┼───────────┘
              │                   │                   │
              └───────────────────┴───────────────────┘
                                  │
   ┌──────────────────────────────▼───────────────────────────────┐
   │                  QNN Backend Infrastructure                   │
   │                                                               │
   │   ┌────────────────────────────────────────────────────────┐ │
   │   │              QNNBackend (Core Manager)                 │ │
   │   │   - Runtime Management      - Context Management       │ │
   │   │   - Graph Registry          - Tensor Management        │ │
   │   └─────────┬──────────────────────────────────────────────┘ │
   │             │                                                 │
   │    ┌────────┴─────────┬───────────────┬──────────────────┐   │
   │    ▼                  ▼               ▼                  ▼   │
   │  QNNRuntime        QNNModel     QNNDispatcher   QNNGraphBuildPass
   │  (SDK Interface)  (Graph Mgmt)  (Execution)     (Compilation) │
   │                                                               │
   └──────────────────────────────┬────────────────────────────────┘
                                  │
   ┌──────────────────────────────▼────────────────────────────────┐
   │                       Qualcomm QNN SDK                         │
   │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
   │   │ QNN Interface│    │ QNN Context  │    │  QNN Graph   │     │
   │   └──────────────┘    └──────────────┘    └──────────────┘     │
   │                                                                 │
   │   ┌─────────────────────────────────────────────────────────┐  │
   │   │             Hardware Backends (HTP/DSP)                 │  │
   │   └─────────────────────────────────────────────────────────┘  │
   └─────────────────────────────────────────────────────────────────┘

QNNBackend: Core Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The ``QNNBackend`` class serves as the central orchestrator for QNN operations:

**Responsibilities:**

* **Runtime Initialization**: Manages QNN SDK initialization and device configuration
* **Context Management**: Creates and maintains QNN execution contexts
* **Graph Registry**: Maps graph names to ``QNNModel`` instances
* **Tensor Management**: Handles tensor creation, quantization, and data transfer
* **Performance Tuning**: Configures power profiles and performance settings

**Key Methods:**

.. code-block:: cpp

   class QNNBackend : public Backend {
     // Graph lifecycle management
     std::shared_ptr<QNNModel> createQnnGraph(const std::string& graphName);
     bool graphFinalize(const std::string& graphName);
     void graphExecute(const std::string& graphName,
                       std::vector<Tensor>& inputs,
                       std::vector<Tensor>& outputs);

     // Tensor management
     bool addTensor(const std::string& graphName,
                    const std::string& tensorName,
                    Qnn_TensorType_t type,
                    const Tensor& tensor,
                    Qnn_QuantizeParams_t quantize);

     // Component access
     const QNN_INTERFACE_VER_TYPE& qnnInterface() const;
     Qnn_BackendHandle_t backendHandle() const;
     Qnn_ContextHandle_t context() const;
   };

QNNRuntime: SDK Interface Layer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Manages low-level QNN SDK initialization and resource lifecycle:

**Components:**

* **Interface Loading**: Dynamically loads QNN library symbols
* **Backend Selection**: Initializes the appropriate backend (HTP/DSP/GPU)
* **Device Management**: Configures device-specific settings
* **Logging & Profiling**: Optional debug and performance profiling

**Initialization Flow:**

.. code-block:: cpp

   // Create runtime with profiling
   auto runtime = QNNRuntime::create(
       ProfilingLevel::BASIC,   // Enable profiling
       QNN_LOG_LEVEL_WARN       // Set log level
   );

   // Create execution context
   Qnn_ContextHandle_t context;
   runtime->createContext(context);

QNNModel: Graph Management
~~~~~~~~~~~~~~~~~~~~~~~~~~

Represents a single QNN computational graph with complete lifecycle management:

**Graph Lifecycle:**

1. **Initialization**: Create the graph with a name and configuration
2. **Tensor Addition**: Register input/output/intermediate tensors
3. **Node Addition**: Add QNN operations with parameters
4. **Finalization**: Compile and optimize the graph
5. **Execution**: Run inference with input data

**Key Operations:**

.. code-block:: cpp

   class QNNModel {
     // Initialization
     ModelError_t initialize(const Qnn_ContextHandle_t& context,
                             const char* graphName,
                             bool debug);

     // Tensor management
     ModelError_t addTensor(const std::string& tensorName,
                            Qnn_TensorType_t type,
                            const Tensor& tensor,
                            Qnn_QuantizeParams_t quantize);

     ModelError_t addStaticTensor(const std::string& tensorName,
                                  const Tensor& tensor,
                                  Qnn_QuantizeParams_t quantize);

     std::shared_ptr<QNNTensorWrapper> getTensorWrapper(
         const std::string& tensorName);

     // Node management
     ModelError_t addNode(Qnn_OpConfigVersion_t version,
                          const std::string& name,
                          const std::string& packageName,
                          const std::string& type,
                          const std::vector<...>& tensorParams,
                          const std::vector<...>& scalarParams,
                          const std::vector<std::string>& inputNames,
                          const std::vector<std::string>& outputNames);

     // Finalization and execution
     ModelError_t finalizeGraph(Qnn_ProfileHandle_t profileHandle,
                                Qnn_SignalHandle_t signalHandle);

     bool isGraphFinalized() const;
   };

**Tensor Wrappers:**

The backend uses C++ RAII wrappers to manage QNN's C-style resources:

* ``QNNTensorWrapper``: Manages tensor metadata and data buffers
* ``QNNParamTensorWrapper``: Wraps constant tensor parameters
* ``QNNParamScalarWrapper``: Wraps scalar parameters

QNNDispatcher: Execution Engine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Handles task execution routing between CPU and QNN:

**Execution Strategy:**

.. code-block:: cpp

   void QNNDispatcher::process(const Task::ptr_t& task) {
     switch (task->type) {
       case TaskTypes::kExecuteOp: {
         // Selective execution: only X2X and Embedding run on QNN
         task->op->reshape(task->inputs, task->outputs);
         if (task->op->getOpType() == OpTypes::kX2X ||
             task->op->getOpType() == OpTypes::kEmbedding) {
           task->op->setup(task->inputs, task->outputs);
           task->op->forward(task->inputs, task->outputs);
         }
         break;
       }
       case TaskTypes::kExecuteModule: {
         // Full module execution on QNN
         // (the module is recovered from task->custom_context_ptr)
         auto qnnBackend = getBackend(kQNN);
         auto moduleName = getModuleName(task);

         // Forward pass to populate outputs
         task->outputs = module->forward(task->inputs, task->args);

         // Execute the QNN graph
         qnnBackend->graphExecute(moduleName,
                                  task->inputs,
                                  task->outputs);
         break;
       }
     }
   }

**Execution Modes:**

* **Op-Level**: Individual operations (X2X, Embedding) executed separately
* **Module-Level**: Entire subgraphs executed as optimized QNN graphs

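Which of these two modes applies is decided by where a module is placed. The sketch below assumes a hypothetical module class ``MyQnnBlock`` built with ``reg()`` as in the architecture document; the class name, config type, and tensor shapes are placeholders, not framework APIs.

.. code-block:: cpp

   // Hypothetical sketch: routing one submodule through the QNN dispatcher.
   std::vector<Tensor> runQnnBlock(MyQnnBlock& block, const Tensor& hidden_states) {
     // Placing the module on kQNN means its forward() is wrapped in a single
     // kExecuteModule task and executed through QNNBackend::graphExecute().
     block.to(DeviceTypes::kQNN);

     // Ops left on kCPU (for example KV-cache updates) keep going through
     // kExecuteOp tasks on the CPU dispatcher.
     return block.forward({hidden_states}, /*args=*/{});
   }
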
QNNGraphBuildPass: Compilation Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Transforms MLLM IR into executable QNN graphs through pattern matching:

**Compilation Flow:**

1. **IR Traversal**: Iterate through ``SubGraphOp`` nodes marked for QNN
2. **Pattern Matching**: Match MLLM operations to QNN operation patterns
3. **Graph Construction**: Build the QNN graph with nodes and tensors
4. **Optimization**: Apply QNN SDK optimizations
5. **Finalization**: Compile the graph for the target hardware

**Pattern Registration:**

.. code-block:: cpp

   class QNNGraphBuildPass : public Pass {
     QNNGraphBuildPass() {
       // Register operation patterns
       regPattern();

       // Register custom ops
       patterns_.emplace(
           customOpId("DequantizeAdd"),
           std::make_shared<...>()
       );
     }
   };

**Pattern Example:**

.. code-block:: cpp

   class QNNLinearPattern : public QNNOpPattern {
     bool addNode(const std::string& graphName,
                  const ir::linalg::LinalgIROp::ptr_t& op,
                  const std::vector<...>& inputs,
                  const std::vector<...>& outputs) override {
       // Add input tensors
       addTensor(graphName, inputs[0], QNN_TENSOR_TYPE_NATIVE);
       addTensor(graphName, inputs[1], QNN_TENSOR_TYPE_STATIC);

       // Add output tensor
       addTensor(graphName, outputs[0], QNN_TENSOR_TYPE_NATIVE);

       // Create QNN FullyConnected node
       backend->graphAddNode(
           graphName,
           op->name(),
           "FullyConnected",
           {inputs[0]->name(), inputs[1]->name()},
           {outputs[0]->name()},
           {},   // tensor params
           {}    // scalar params
       );

       return true;
     }
   };

Execution Workflows
-------------------

Compilation Workflow
~~~~~~~~~~~~~~~~~~~~

The QNN backend compilation workflow transforms traced IR into executable graphs:

.. code-block:: text

   User Model (Python/C++)
     │
     ├─── Model::trace()  [trace_mode=true]
     │      └─── Creates IR representation
     │
     ▼
   IR Module (mllm::ir::ModuleOp)
     │
     ├─── Contains SubGraphOp(s) marked as DeviceTypes::kQNN
     │
     ▼
   QNNGraphBuildPass::run()
     │
     ├─── For each QNN SubGraphOp:
     │      │
     │      ├─── backend->createQnnGraph(graphName)
     │      │      └─── Creates QNNModel instance
     │      │
     │      ├─── Add graph input tensors
     │      │      └─── qnnModel->addTensor(..., QNN_TENSOR_TYPE_APP_WRITE)
     │      │
     │      ├─── Traverse IR operations
     │      │      │
     │      │      ├─── Match to QNN patterns
     │      │      │      └─── pattern->addNode(graphName, op, inputs, outputs)
     │      │      │
     │      │      └─── Create QNN ops with:
     │      │             ├─── Tensor parameters (weights, constants)
     │      │             ├─── Scalar parameters (hyperparameters)
     │      │             ├─── Input tensor names
     │      │             └─── Output tensor names
     │      │
     │      └─── backend->graphFinalize(graphName)
     │             │
     │             ├─── qnnModel->finalizeGraph(...)
     │             │      └─── Calls qnnInterface.graphFinalize()
     │             │             └─── QNN SDK optimizes and compiles the graph
     │             │
     │             └─── Graph ready for execution
     │
     ▼
   Compiled QNN Graphs (ready for inference)

**Code Example:**

.. code-block:: cpp

   // In QNNGraphBuildPass::buildQnnGraph()
   void QNNGraphBuildPass::buildQnnGraph(
       const ir::graph::SubGraphOp::ptr_t& sub_graph_op) {

     auto qnn_backend = getQNNBackend();
     std::string graph_name = sub_graph_op->getSymbolAttr()->str();

     // Create QNN model
     auto qnn_model = qnn_backend->createQnnGraph(graph_name);

     // Add graph inputs
     for (auto& input : sub_graph_op->inputs()) {
       auto input_tensor = input->cast_<...>();
       auto quantize_param = createQuantizeParams(input_tensor->tensor_);
       qnn_model->addTensor(input_tensor->name(),
                            QNN_TENSOR_TYPE_APP_WRITE,
                            input_tensor->tensor_,
                            quantize_param);
     }

     // Process operations
     for (auto& region_op : graph_region->ops()) {
       if (auto linalg_op = cast<ir::linalg::LinalgIROp>(region_op)) {
         auto op_types = linalg_op->getAOpTypes();
         if (patterns_.contains(op_types)) {
           patterns_[op_types]->addNode(
               graph_name, linalg_op,
               op_inputs, op_outputs
           );
         }
       }
     }

     // Finalize graph
     qnn_backend->graphFinalize(graph_name);
   }

Runtime Execution Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard inference execution through the dispatcher system:

.. code-block:: text

   Application::forward()
     │
     ├─── Module::forward()  [DeviceTypes::kQNN]
     │      │
     │      ├─── Module::__main()
     │      │      │
     │      │      ├─── Task::createExecuteModuleTask()
     │      │      │      └─── task->custom_context_ptr = module
     │      │      │
     │      │      └─── DispatcherManager::submit(qnn_dispatcher_id, task)
     │      │
     │      ▼
     │    QNNDispatcher::receive(task)
     │      │
     │      └─── QNNDispatcher::process(task)
     │             │
     │             ├─── case kExecuteModule:
     │             │      │
     │             │      ├─── Extract module name
     │             │      │
     │             │      ├─── Call module->forward() to set up outputs
     │             │      │      └─── Creates output tensor shapes
     │             │      │
     │             │      └─── qnnBackend->graphExecute(moduleName, inputs, outputs)
     │             │             │
     │             │             ├─── Look up QNNModel by name
     │             │             │
     │             │             ├─── Copy input data to QNN tensors
     │             │             │      └─── Handles quantization if needed
     │             │             │
     │             │             ├─── qnnInterface.graphExecute()
     │             │             │      └─── QNN SDK executes on HTP/DSP
     │             │             │
     │             │             └─── Copy output data from QNN tensors
     │             │                    └─── Handles dequantization if needed
     │             │
     │             └─── case kExecuteOp:
     │                    └─── Execute X2X/Embedding ops individually
     │
     ▼
   Output Tensors (returned to the application)


Quantization Support
--------------------

The QNN backend provides comprehensive quantization support for efficient inference:

Quantization Metadata Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Quantization scales are attached to tensors as metadata:

.. code-block:: cpp

   // Set quantization scale
   inline void setQuantScale(Tensor& tensor, float scale) {
     auto scale_view = std::make_shared<Tensor>(
         Tensor::empty({1}, kFloat32, kCPU).alloc()
     );
     scale_view->ptr<float>()[0] = scale;
     tensor.attachedViews()[QNN_QUANT_SCALE_NAME] = scale_view;
   }

   // Get quantization scale
   inline float getQuantScale(Tensor& tensor) {
     if (!tensor.attachedViews().contains(QNN_QUANT_SCALE_NAME)) {
       return 1.0f;  // Default scale
     }
     return tensor.attachedViews()[QNN_QUANT_SCALE_NAME]->ptr<float>()[0];
   }

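To make the metadata concrete: under the zero-offset scale/offset encoding used below in ``createQuantizeParams()``, a quantized element ``q`` with scale ``s`` and offset ``z`` represents the real value ``s * (q - z)``. The helpers in the following sketch are illustrative only and not part of the backend API; the calibration value is made up.

.. code-block:: cpp

   // Illustrative only: numeric meaning of an attached scale.
   inline float dequantizeScaleOffset(int8_t q, float scale, int32_t offset = 0) {
     return scale * (static_cast<float>(q) - static_cast<float>(offset));
   }

   // Typical flow when tagging an int8 tensor before it reaches the graph builder.
   void tagActivation(Tensor& activation) {
     if (activation.dtype() == kInt8) {
       setQuantScale(activation, 0.0123f);   // attach calibrated scale (example value)
     }
     float s = getQuantScale(activation);    // falls back to 1.0f if nothing attached
     (void)dequantizeScaleOffset(/*q=*/-42, s);
   }
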
QNN Quantization Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Convert MLLM quantization metadata to the QNN format:

.. code-block:: cpp

   Qnn_QuantizeParams_t createQuantizeParams(const Tensor& tensor) {
     if (tensor.dtype() == kInt8 || tensor.dtype() == kInt16) {
       float scale = getQuantScale(tensor);
       return Qnn_QuantizeParams_t{
           QNN_DEFINITION_DEFINED,
           QNN_QUANTIZATION_ENCODING_SCALE_OFFSET,
           {.scaleOffsetEncoding = {
               .scale = scale,
               .offset = 0   // Zero-point offset
           }}
       };
     }
     // Undefined quantization for float tensors
     return DEFAULT_QUANTIZE_PARAMS;
   }

Scale Propagation
~~~~~~~~~~~~~~~~~

Quantization scales propagate through reshape operations:

.. code-block:: cpp

   void propagateQuantScale(const Tensor& input, Tensor& output) {
     if (input.dtype() == kInt8 || input.dtype() == kInt16) {
       float scale = getQuantScale(input);
       setQuantScale(output, scale);
     }
   }

Custom Operations
-----------------

The QNN backend supports custom operations through the QNN op package mechanism:

DequantizeAdd Custom Op
~~~~~~~~~~~~~~~~~~~~~~~

A custom fused operation combining dequantization and addition:

**Purpose:**

* Fuse int8 dequantization with element-wise addition
* Improve accuracy for quantized models

**Usage Example:**

.. code-block:: cpp

   // In QwenAttentionProjNPU
   class QwenAttentionProjNPU : public nn::Module {
     nn::qnn::DequantizeAdd q_proj_dequantize_add_;
     nn::qnn::DequantizeAdd k_proj_dequantize_add_;
     nn::qnn::DequantizeAdd v_proj_dequantize_add_;

     QwenAttentionProjNPU(const std::string& name, const QwenNPUConfig& cfg)
         : nn::Module(name) {
       // Register custom ops
       q_proj_dequantize_add_ = reg<nn::qnn::DequantizeAdd>(
           "self_attn.q_proj_dequantize_add"
       );
       // ...
     }
   };

**Pattern Registration:**

.. code-block:: cpp

   // In QNNGraphBuildPass constructor
   patterns_.emplace(
       Context::instance().lookupCustomizedOpId(kQNN, "DequantizeAdd"),
       std::make_shared<...>()
   );

Performance Optimization
------------------------

Power Configuration
~~~~~~~~~~~~~~~~~~~

The QNN backend provides power profile management:

.. code-block:: cpp

   class QNNPerf {
     void setPowerConfigBurst() {
       // High performance mode
       // - Maximum clock frequencies
       // - Higher power consumption
       // - Lower latency
     }

     void setPowerConfigBalanced() {
       // Balanced mode
       // - Moderate clock frequencies
       // - Balanced power/performance
       // - Medium latency
     }

     void setRpcLatencyAndPolling() {
       // Configure RPC latency for HTP communication
     }
   };

Profiling Support (TODO)
~~~~~~~~~~~~~~~~~~~~~~~~

.. note:: This is not yet implemented. Better profiling info printing should be added.

Enable detailed profiling for performance analysis:

.. code-block:: cpp

   enum class ProfilingLevel {
     OFF,        // No profiling
     BASIC,      // Basic timing information
     DETAILED,   // Detailed layer-wise profiling
     INVALID
   };

   // Create runtime with profiling
   auto runtime = QNNRuntime::create(
       ProfilingLevel::DETAILED,
       QNN_LOG_LEVEL_INFO
   );

Context Serialization
~~~~~~~~~~~~~~~~~~~~~

.. note:: TODO: Context retrieval should support lookup by file name and dynamic switching.

Serialize compiled graphs to avoid recompilation:

.. code-block:: cpp

   // Save context to a binary file
   qnn_backend->saveContext("qnn_context.bin");

   // Load a pre-compiled context
   Qnn_ContextHandle_t context;
   std::vector<std::shared_ptr<QNNModel>> models;
   runtime->retrieveContext(context, models);

Best Practices
--------------

Graph Partitioning
~~~~~~~~~~~~~~~~~~

For optimal performance, partition your model strategically:

**Guidelines:**

* **QNN Subgraphs**: Place compute-intensive operations (Linear, Conv, Attention) on QNN
* **CPU Operations**: Keep dynamic operations (KVCache, RoPE) on CPU
* **Minimize Data Transfer**: Reduce tensor copies between QNN and CPU

**Example Partitioning:**

.. code-block:: cpp

   class QwenDecoder : public Module {
     // QNN: Attention projections
     QwenAttentionProjNPU self_attn_proj_;     // -> kQNN

     // CPU: KV cache and RoPE
     QwenAttentionMatmul self_attn_matmul_;    // -> kCPU

     // QNN: Output projection and MLP
     QwenOutProjAndMLP self_attn_out_mlp_;     // -> kQNN
   };

Quantization Strategy
~~~~~~~~~~~~~~~~~~~~~

**Recommendations:**

1. **Per-Tensor Quantization**: Attach scales to input/output tensors
2. **Scale Initialization**: Set scales during model loading
3. **Dynamic Range**: Use calibration data to determine optimal scales
4. **Precision**: INT8 for most operations, INT16 for critical layers

.. code-block:: cpp

   // During model loading
   void loadQuantizedModel(const ParameterFile::ptr_t& params) {
     for (auto& [name, tensor] : *params) {
       if (tensor.dtype() == kInt8) {
         // Scale stored in the parameter file
         float scale = params->getScale(name);
         setQuantScale(tensor, scale);
       }
     }
   }

Error Handling
~~~~~~~~~~~~~~

Always check return codes from QNN operations:

.. code-block:: cpp

   #define CALL_QNN(apiCall) do {                                    \
       int errorCode = ((apiCall) & 0xFFFF);                         \
       if (errorCode != QNN_SUCCESS) {                               \
         MLLM_ERROR("QNN Error in {}, line {}: error code {}",       \
                    __FILE__, __LINE__, errorCode);                  \
         assert(errorCode == QNN_SUCCESS);                           \
       }                                                             \
     } while (0)

   // Usage
   CALL_QNN(qnnInterface.graphFinalize(graph, nullptr, nullptr));

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Issue: Graph finalization fails**

* **Cause**: Incompatible tensor dimensions or unsupported operations
* **Solution**: Check the QNN SDK documentation for supported ops and constraints

**Issue: Incorrect output values**

* **Cause**: Quantization scale mismatch or missing scale propagation
* **Solution**: Verify quantization scales are correctly set and propagated

**Issue: Performance degradation**

* **Cause**: Excessive CPU-QNN data transfers or suboptimal partitioning
* **Solution**: Profile with Perfetto and optimize graph boundaries

Debug Logging
~~~~~~~~~~~~~

Enable verbose QNN logging:

.. code-block:: cpp

   auto runtime = QNNRuntime::create(
       ProfilingLevel::DETAILED,
       QNN_LOG_LEVEL_VERBOSE   // Maximum verbosity
   );

API Reference
-------------

QNNBackend API
~~~~~~~~~~~~~~

.. code-block:: cpp

   class QNNBackend : public Backend {
    public:
     // Graph lifecycle
     std::shared_ptr<QNNModel> createQnnGraph(const std::string& graphName);
     bool graphFinalize(const std::string& graphName);
     void graphExecute(const std::string& graphName,
                       std::vector<Tensor>& inputs,
                       std::vector<Tensor>& outputs);

     // Tensor management
     bool addTensor(const std::string& graphName,
                    const std::string& tensorName,
                    Qnn_TensorType_t type,
                    const Tensor& tensor,
                    Qnn_QuantizeParams_t quantize = DEFAULT_QUANTIZE_PARAMS);

     bool addStaticTensor(const std::string& graphName,
                          const std::string& tensorName,
                          const Tensor& tensor,
                          Qnn_QuantizeParams_t quantize = DEFAULT_QUANTIZE_PARAMS);

     std::shared_ptr<QNNTensorWrapper> getTensorWrapper(
         const std::string& graphName,
         const std::string& tensorName);

     // Node management
     void graphAddNode(const std::string& graphName,
                       const std::string& nodeName,
                       const std::string& nodeType,
                       const std::vector<std::string>& inputTensorNames,
                       const std::vector<std::string>& outputTensorNames,
                       const std::vector<std::shared_ptr<QNNParamTensorWrapper>>& tensorParams,
                       const std::vector<std::shared_ptr<QNNParamScalarWrapper>>& scalarParams,
                       const std::string& packageName = "qti.aisw");

     // Properties
     bool isWeightOnDevice() override;
     const QNN_INTERFACE_VER_TYPE& qnnInterface() const;
     Qnn_BackendHandle_t backendHandle() const;
     Qnn_ContextHandle_t context() const;
   };

For more information on the overall framework architecture, see :doc:`../arch/arch`.

diff --git a/docs/qnn_backend/index.rst b/docs/qnn_backend/index.rst
new file mode 100644
index 000000000..b7092f938
--- /dev/null
+++ b/docs/qnn_backend/index.rst
@@ -0,0 +1,9 @@
QNN Backend
===========

.. toctree::
   :maxdepth: 2

   setup_env
   core_design
   qnn_model_convert
diff --git a/docs/qnn_backend/qnn_model_convert.rst b/docs/qnn_backend/qnn_model_convert.rst
new file mode 100644
index 000000000..f157c7f2a
--- /dev/null
+++ b/docs/qnn_backend/qnn_model_convert.rst
@@ -0,0 +1,2 @@
QNN Model Conversion
====================
\ No newline at end of file
diff --git a/docs/qnn_backend/setup_env.rst b/docs/qnn_backend/setup_env.rst
new file mode 100644
index 000000000..8b5554533
--- /dev/null
+++ b/docs/qnn_backend/setup_env.rst
@@ -0,0 +1,136 @@
QNN Environment Setup
=====================

Overview
--------

This section describes how to set up the QNN development environment, following the official QNN documentation. For more details, see the `QNN Linux Setup `_ guide.

Prerequisites
-------------

The QNN backend relies on two main SDKs:

- **Qualcomm QNN SDK**: Required for QNN backend compilation
- **Hexagon SDK**: Required for QNN custom operator (LLaMAOpPackage in mllm) compilation

Version Requirements
~~~~~~~~~~~~~~~~~~~~

- **QNN**: Linux v2.34+
- **Hexagon SDK**: Linux 5.x

.. warning::
   Some accounts may not have permission to access the Hexagon SDK and may need to contact Qualcomm for support.

SDK Download and Installation
-----------------------------

QNN SDK Installation
~~~~~~~~~~~~~~~~~~~~

1. Download the QNN SDK from the `official Qualcomm website `_
2. Unzip the downloaded file
3. Set the environment variable ``QNN_SDK_ROOT`` to point to the unzipped directory

Hexagon SDK Installation
~~~~~~~~~~~~~~~~~~~~~~~~

The `Hexagon SDK `_ is Qualcomm's official development environment for programming and optimizing applications on the Hexagon DSP, the core processor architecture used in Snapdragon chips for efficient, low-power computation.

By installing and sourcing the Hexagon SDK, developers can build the `custom op package `_ (the LLaMAOpPackage in this project), enabling HVX capabilities.

To install the Hexagon SDK, follow these steps:

1. Download the Hexagon SDK using `QPM `_ (Qualcomm Package Manager)
2. Install the SDK following the QPM instructions

Environment Setup
-----------------

After downloading and installing both SDKs, set up the environment by running the following commands:

.. code-block:: bash

   # Set up QNN SDK environment
   source <path-to-qnn-sdk>/bin/envsetup.sh

   # Set up Hexagon SDK environment
   source <path-to-hexagon-sdk>/setup_sdk_env.source

Environment Variables Verification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After setting up the environment, verify that the following environment variables are correctly set:

.. code-block:: bash

   echo $QNN_SDK_ROOT      # Should point to /path/to/your/qnn/sdk
   echo $HEXAGON_SDK_ROOT  # Should point to /path/to/your/hexagon/sdk

.. note::
   These environment variables are essential for the QNN op package compilation process.

Op Package Compilation
----------------------

To use QNN offload, both the CPU and HTP QNN op packages are required. The following steps build the QNN op packages needed by the project.

Prerequisites for Compilation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure the following environment variables are set:

- ``QNN_SDK_ROOT``
- ``HEXAGON_SDK_ROOT``
- ``ANDROID_NDK_ROOT``

Compilation Commands
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cd mllm/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/
   make htp_aarch64 && make htp_v75

This builds the necessary QNN op packages for both the AArch64 and HVX v75 targets.

Development Tips
----------------

LSP Configuration for HVX Development
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To enable Language Server Protocol (LSP) support for HVX development, configure clangd to use the Hexagon toolchain:

1. Create or edit ``.vscode/settings.json`` in your project root
2. Add the following configuration:

.. code-block:: json

   {
     "clangd.path": "$HEXAGON_SDK_ROOT/tools/HEXAGON_Tools/8.7.06/Tools/bin/hexagon-clangd"
   }

Generating Compilation Database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To generate the ``compile_commands.json`` file for the op package:

.. code-block:: bash

   cd mllm/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/
   compiledb make htp_v75 -C .

This compilation database is useful for IDE features like code completion and error highlighting.

Next Steps
----------

After completing the environment setup, you can proceed to:

- Model conversion and quantization
- Building the project with the QNN backend
- Running QNN-accelerated models

For detailed instructions on these steps, refer to the respective documentation sections.