Binary file added docs/_static/img/arch.png
Binary file added docs/_static/img/qnn-trace-execute-seq.png
346 changes: 346 additions & 0 deletions docs/arch/arch.rst
@@ -0,0 +1,346 @@
MLLM Framework Core Architecture
===================================

Overview
--------

The MLLM framework employs a hierarchical execution model with three main components:

* **Module**: High-level abstraction for neural network modules
* **Layer**: Abstraction for individual operations/layers
* **Dispatcher**: Execution engine for different backends

This architecture supports both regular operation execution and intermediate representation (IR) tracing workflows, enabling flexible deployment across multiple hardware backends including CPU, QNN (Qualcomm AI Engine Direct, also known as QAIRT), and custom accelerators.

.. figure:: ../_static/img/arch.png
:width: 100%
:alt: Overview
:align: center

Figure 1: MLLM Framework Core Architecture.

Core Components
---------------

Module
~~~~~~~

The ``Module`` class serves as the top-level container for neural network components. Key responsibilities include:

* **Hierarchical Organization**: Modules can contain other modules and layers, forming a tree structure
* **Parameter Management**: Loading and saving model parameters from/to files
* **Device Management**: Moving modules and their components across different devices
* **Forward Execution**: Orchestrating the execution flow through child components

**Key Methods:**

.. code-block:: cpp

class Module {
std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
const std::vector<AnyValue>& args);
void to(DeviceTypes device_type);
void load(const ParameterFile::ptr_t& param_file);

// Named module registration (similar to PyTorch's named_modules)
template<typename T, typename... Args>
auto reg(const std::string& name, Args&&... args);
};

**Named Module Registration:**

The ``reg()`` method provides functionality similar to PyTorch's ``named_modules()``, enabling hierarchical module organization with automatic name management in C++:

.. code-block:: cpp

class MyModel : public nn::Module {
public:
MyModel(const std::string& name) : nn::Module(name) {
// Register sub-modules with names
encoder_ = reg<EncoderModule>("encoder", config);
decoder_ = reg<DecoderModule>("decoder", config);

// Register layers with names
linear1_ = reg<nn::Linear>("fc1", 768, 3072, false);
linear2_ = reg<nn::Linear>("fc2", 3072, 768, false);
}

private:
EncoderModule encoder_; // Absolute name: "model.encoder"
DecoderModule decoder_; // Absolute name: "model.decoder"
nn::Linear linear1_; // Absolute name: "model.fc1"
nn::Linear linear2_; // Absolute name: "model.fc2"
};

**Key Features:**

* **Automatic Name Hierarchy**: Constructs fully-qualified names (e.g., ``"model.encoder.layer0.attention"``)
* **Parameter Mapping**: Links module names to parameter files for loading/saving
* **Device Management**: Enables selective device placement by module name
* **Type Safety**: Template-based registration with compile-time type checking

**Comparison with PyTorch:**

.. code-block:: python

# PyTorch
class MyModel(nn.Module):
def __init__(self):
super().__init__()
self.encoder = EncoderModule() # Automatically named "encoder"
self.decoder = DecoderModule() # Automatically named "decoder"

# Print all named modules
for name, module in model.named_modules():
print(f"{name}: {module}")

.. code-block:: cpp

// MLLM Framework
class MyModel : public nn::Module {
public:
MyModel(const std::string& name) : nn::Module(name) {
encoder_ = reg<EncoderModule>("encoder"); // Explicitly named "encoder"
decoder_ = reg<DecoderModule>("decoder"); // Explicitly named "decoder"
}
};

// Names are automatically constructed: "model.encoder", "model.decoder"
// Used for parameter loading: params->load("model.encoder.weight")

The ``reg()`` method bridges the gap between Python's dynamic attribute naming and C++'s static type system, providing a clean API for building hierarchical neural networks.

Layer Abstraction
~~~~~~~~~~~~~~~~~

The ``Layer`` class represents individual operations or layers within a module:

* **Operation Encapsulation**: Wraps backend-specific operations (BaseOp)
* **Device Abstraction**: Handles operation instantiation for different backends
* **Task Creation**: Creates execution tasks for the dispatcher system

**Key Methods:**

.. code-block:: cpp

class Layer {
std::vector<Tensor> __main(const std::vector<Tensor>& inputs);
Layer& to(DeviceTypes device_type);
OpTypes opType() const;
};
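
As a usage illustration, a layer can be placed on a device and invoked directly. The following minimal sketch assumes the constructor arguments and the ``kCPU`` enum value; in practice layers are created through ``Module::reg()`` rather than standalone:

.. code-block:: cpp

    // Standalone construction shown for illustration only; Module::reg()
    // is the normal creation path and also assigns the hierarchical name.
    nn::Linear fc("fc", 768, 3072, false);     // constructor args assumed
    fc.to(kCPU);                               // instantiate the op for CPU
    std::vector<Tensor> out = fc.__main({x});  // wraps the call in a task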

Dispatcher System
~~~~~~~~~~~~~~~~~

The dispatcher system provides backend-specific execution engines:

**CPUDispatcher**
Handles CPU-based operation execution with full operation lifecycle:

* ``reshape()``: Tensor shape computation
* ``setup()``: Operation initialization
* ``forward()``: Actual computation
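
A minimal sketch of how ``process()`` might drive this lifecycle (the per-phase argument lists are assumptions):

.. code-block:: cpp

    // Hedged sketch of the three-phase CPU lifecycle, not the verbatim code.
    void CPUDispatcher::process(const Task::ptr_t& task) {
        auto& op = task->op;
        op->reshape(task->inputs, task->outputs);  // compute output shapes
        op->setup(task->inputs, task->outputs);    // allocate and initialize
        op->forward(task->inputs, task->outputs);  // run the computation
    }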

**IRTraceDispatcher**
Captures execution traces for IR generation:

* Records operation calls and tensor flows
* Enables graph optimization and analysis
* Supports compilation workflows

**QNNDispatcher**
Manages QNN backend execution:

* Specialized for QNN graph execution
* Handles module-level execution for QNN graphs
* Selective operation execution (X2X, Embedding ops)

Execution Workflows
-------------------

Op Execution Workflow
~~~~~~~~~~~~~~~~~~~~~~

The standard execution path for neural network inference:

.. code-block:: text

Module::forward()
├─── Module::__main()
│ │
│ ├─── Task::createExecuteModuleTask()
│ │
│ └─── DispatcherManager::submit()
│ │
│ └─── [CPU|QNN]Dispatcher::receive()
│ │
│ └─── [CPU|QNN]Dispatcher::process()
└─── Layer::__main()
├─── Task::createExecuteOpTask()
└─── DispatcherManager::submit()
└─── [CPU|QNN]Dispatcher::receive()
└─── [CPU|QNN]Dispatcher::process()
├─── Op::reshape()
├─── Op::setup()
└─── Op::forward()

**Execution Flow Details:**

1. **Module Entry**: ``Module::forward()`` is called with input tensors
2. **Task Creation**: Creates ``kExecuteModule`` or ``kExecuteOp`` tasks
3. **Dispatcher Selection**: Routes to appropriate backend dispatcher based on device type
4. **Backend Processing**: Dispatcher executes the operation using backend-specific logic
5. **Result Return**: Output tensors are returned through the task system
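
From the user's perspective, this flow is triggered by ordinary layer calls inside ``forward()``. A hedged sketch, assuming ``Layer`` exposes a call operator that forwards to ``__main()``:

.. code-block:: cpp

    std::vector<Tensor> MyModel::forward(const std::vector<Tensor>& inputs,
                                         const std::vector<AnyValue>& args) {
        // Each layer call creates a kExecuteOp task and submits it to the
        // dispatcher selected by the layer's device type.
        auto x = linear1_(inputs[0]);  // call-operator sugar is an assumption
        x = linear2_(x);
        return {x};
    }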

IR Execution Workflow
~~~~~~~~~~~~~~~~~~~~~~

When trace mode is enabled, the framework captures an intermediate representation:

.. code-block:: text

Module::forward() [trace_mode=true]
├─── Module::__trace()
│ │
│ ├─── IRContext::create<CallGraphOp>()
│ ├─── IRContext::create<SubGraphOp>()
│ │
│ ├─── Module::forward() [recursive]
│ │
│ └─── IRContext::create<ReturnOp>()
└─── Layer::__main() [trace_mode=true]
├─── Task::createExecuteOpTask()
│ └─── task->custom_context_ptr = ir_context
└─── IRTraceDispatcher::receive()
└─── IRTraceDispatcher::process()
├─── Op::reshape()
└─── Op::trace()

**IR Workflow Details:**

1. **Trace Initialization**: ``Context::thisThread()->trace_mode`` enables IR capture
2. **Graph Construction**: Creates IR graph nodes (``CallGraphOp``, ``SubGraphOp``)
3. **Operation Tracing**: Each operation call is recorded in the IR graph
4. **Graph Completion**: ``ReturnOp`` finalizes the subgraph structure
5. **IR Output**: Complete computational graph is available for optimization/compilation
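
As a hedged sketch, capturing a trace could look like the following (the setup around ``IRContext`` is an assumption):

.. code-block:: cpp

    // Flip trace_mode so layer calls are routed to IRTraceDispatcher
    // instead of being executed on a compute backend.
    Context::thisThread()->trace_mode = true;
    auto outputs = model.forward({input}, {});  // records the IR graph
    Context::thisThread()->trace_mode = false;
    // The IRContext now holds the CallGraphOp/SubGraphOp/ReturnOp nodes
    // for optimization or compilation.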

For more details on IR tracing and compilation, refer to the :doc:`MLLM IR <../compile/ir>` section.

Synchronous vs Asynchronous Execution
--------------------------------------

Synchronous Execution
~~~~~~~~~~~~~~~~~~~~~

Currently, the primary execution mode uses synchronous task processing:

.. code-block:: cpp

// In Dispatcher::receive()
void CPUDispatcher::receive(const Task::ptr_t& task) {
process(task); // Blocks until completion
}

**Characteristics:**

* **Blocking Operation**: Each task completes before returning
* **Simple Flow Control**: Sequential execution guarantees
* **Immediate Results**: Output tensors available immediately after task submission

Asynchronous Execution (Future Enhancement)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The framework includes infrastructure for asynchronous execution:

.. code-block:: cpp

// In Dispatcher::asyncReceive()
TaskResult::sender_t CPUDispatcher::asyncReceive(const Task::ptr_t& task) {
auto scheduler = thread_pool_.get_scheduler();
return stdexec::schedule(scheduler) |
stdexec::then([this, task] { process(task); });
}

**Design Features:**

* **Non-blocking Submission**: Tasks return immediately with a sender/future
* **Thread Pool Integration**: Uses ``exec::static_thread_pool`` for parallel execution
* **Sender/Receiver Pattern**: Based on the C++26 ``std::execution`` sender/receiver model
* **Pipeline Capability**: Enables operation pipelining and overlapping
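
A hedged usage sketch of this path, driving the returned sender with ``stdexec::sync_wait()`` (the ``dispatcher`` and ``task`` variables are assumed to exist):

.. code-block:: cpp

    #include <stdexec/execution.hpp>

    // Submit asynchronously, overlap other work, then block for completion.
    auto sender = dispatcher->asyncReceive(task);  // non-blocking
    // ... submit further tasks here to overlap execution ...
    stdexec::sync_wait(std::move(sender));         // synchronization point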

**Current Status:**

The asynchronous execution path is implemented but not fully integrated:

* ``IRTraceDispatcher::asyncReceive()`` returns an error
* Most dispatchers have placeholder async implementations
* Synchronization points (``syncWait()``) are not fully implemented

Task System Architecture
-------------------------

The task system provides a unified interface for operation execution:

**Task Types:**

* ``kExecuteOp``: Single operation execution
* ``kExecuteModule``: Module-level execution (for QNN graphs)

**Task Structure:**

.. code-block:: cpp

struct Task {
TaskTypes type;
BaseOp::ptr_t op; // Operation to execute
std::vector<Tensor> inputs; // Input tensors
std::vector<Tensor> outputs; // Output tensors
std::vector<AnyValue> args; // Additional arguments
void* custom_context_ptr; // Backend-specific context
};
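
Tasks are normally created inside ``Layer::__main()`` and ``Module::__main()``; the equivalent manual flow might look like this hedged sketch (factory and submit signatures are assumptions based on the fields above):

.. code-block:: cpp

    // Argument lists are assumptions derived from the Task structure.
    auto task = Task::createExecuteOpTask(op, inputs, /*args=*/{});
    DispatcherManager::submit(task);  // routes by the op's device type
    auto& outputs = task->outputs;    // populated by the dispatcher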

**Dispatcher Interface:**

.. code-block:: cpp

class Dispatcher {
virtual void receive(const Task::ptr_t& task) = 0;
virtual TaskResult::sender_t asyncReceive(const Task::ptr_t& task) = 0;
virtual void process(const Task::ptr_t& task) = 0;
virtual void syncWait() = 0;
};
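
This interface is the extension point for new backends. A hedged sketch of a custom dispatcher, mirroring the async pattern shown earlier (class name and thread-pool details are assumptions):

.. code-block:: cpp

    #include <exec/static_thread_pool.hpp>
    #include <stdexec/execution.hpp>

    class MyAccelDispatcher final : public Dispatcher {
    public:
        void receive(const Task::ptr_t& task) override { process(task); }

        TaskResult::sender_t asyncReceive(const Task::ptr_t& task) override {
            auto scheduler = thread_pool_.get_scheduler();
            return stdexec::schedule(scheduler) |
                   stdexec::then([this, task] { process(task); });
        }

        void process(const Task::ptr_t& task) override {
            // Translate task->op into accelerator commands here.
        }

        void syncWait() override { /* drain the accelerator queue */ }

    private:
        exec::static_thread_pool thread_pool_{1};
    };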

Backend Integration
-------------------

The framework supports multiple execution backends through the dispatcher pattern:

**CPU Backend**
* Full operation support with reshape/setup/forward lifecycle
* Direct tensor computation on CPU
* Perfetto tracing integration for performance analysis

**QNN Backend**
* Optimized execution for Qualcomm Neural Processing Units
* Graph-level execution for improved performance
* Selective operation fallback to CPU when needed

**IR Tracing Backend**
* Captures computational graphs for analysis and optimization
* Enables ahead-of-time compilation workflows
* Supports graph transformation and optimization passes
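
Putting the pieces together, a hedged end-to-end sketch (the ``kQNN`` enum value and the parameter-file setup are assumptions):

.. code-block:: cpp

    MyModel model("model");
    model.load(params);  // ParameterFile::ptr_t obtained elsewhere
    model.to(kQNN);      // re-instantiate each layer's op for the QNN backend
    auto out = model.forward({input}, {});  // executes via QNNDispatcher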

This architecture provides a flexible foundation for deploying neural networks across diverse hardware platforms while maintaining a consistent programming interface.
2 changes: 2 additions & 0 deletions docs/arch/index.rst
@@ -4,5 +4,7 @@ Architectures
.. toctree::
:maxdepth: 2

arch
tensor
support_ops
op_plugin_system
2 changes: 2 additions & 0 deletions docs/arch/op_plugin_system.rst
@@ -0,0 +1,2 @@
Op Plugin System
=================
5 changes: 5 additions & 0 deletions docs/index.rst
@@ -346,6 +346,11 @@ Documents

cpu_backend/index

.. toctree::
:maxdepth: 2

qnn_backend/index

.. toctree::
:maxdepth: 2
