diff --git a/docs/source/_static/img/runtime-overview-high-level.png b/docs/source/_static/img/runtime-overview-high-level.png
new file mode 100644
index 00000000000..de2a57d8e7a
Binary files /dev/null and b/docs/source/_static/img/runtime-overview-high-level.png differ
diff --git a/docs/source/runtime-overview.md b/docs/source/runtime-overview.md
index 0f0ff2c594b..4f870d3c3b0 100644
--- a/docs/source/runtime-overview.md
+++ b/docs/source/runtime-overview.md
@@ -1,3 +1,167 @@
-# Runtime Overview
+# ExecuTorch Runtime Overview
-TBA
+This document discusses the design of the ExecuTorch runtime, which executes
+ExecuTorch program files on edge devices like smartphones, wearables, and
+embedded devices. The code for the main execution API is under
+[`executorch/runtime/executor/`](https://github.com/pytorch/executorch/tree/main/runtime/executor).
+
+Before reading this document, we recommend that you read [How Does ExecuTorch
+Work](intro-how-it-works.md).
+
+At the highest level, the ExecuTorch runtime is responsible for:
+
+* Loading binary `.pte` program files that were generated by the
+  `to_executorch()` step of the model-lowering process.
+* Executing the series of instructions that implement a lowered model.
+
+This diagram shows the high-level flow of, and the components involved in,
+exporting and executing an ExecuTorch program:
+
+![High-level diagram of the ExecuTorch
+Runtime](/_static/img/runtime-overview-high-level.png)
+
+The runtime is also responsible for:
+
+* Managing the memory used during load and execution, potentially across
+  multiple memory banks like SRAM and DRAM.
+* Mapping symbolic operator names like `"aten::add.out"` to concrete C++
+  functions or [_kernels_](kernel-library-overview.md) that implement the
+  semantics of those operators.
+* Dispatching predetermined sections of the model to [backend
+  delegates](compiler-delegate-and-partitioner.md) for acceleration.
+* Optionally gathering [profiling data](sdk-profiling.md) during load and
+  execution.
+
+## Design Goals
+
+The ExecuTorch runtime was designed to run on a wide variety of edge devices,
+from modern smartphone CPUs to resource-constrained microcontrollers and DSPs.
+It has first-class support for
+[delegating](compiler-delegate-and-partitioner.md) execution to one or more
+backends to take advantage of architecture-specific optimizations and modern
+heterogeneous architectures. It is small and portable enough to run directly in
+bare-metal embedded environments with no operating system, dynamic memory, or
+threads.
+
+### Low Execution Overhead
+
+#### Memory
+
+* The core runtime library is less than 50kB when built without kernels or
+  backends.
+* Constant tensors point directly into the `.pte` file data, avoiding copies of
+  that data. The alignment of these data chunks can be adjusted at `.pte`
+  creation time.
+* Backend delegates can choose to unload their precompiled data after model
+  initialization, reducing peak memory usage.
+* Mutable tensor memory layout is planned ahead of time and packed into a small
+  set of user-allocated buffers, providing fine-grained control over memory
+  location. This is especially useful on systems with heterogeneous memory
+  hierarchies, allowing placement onto (e.g.) SRAM or DRAM close to the core
+  that will operate on the data.
+
+#### CPU
+
+* Model execution is a simple loop over an array of instructions, most of which
+  are function pointers to kernels and backend delegates. This keeps the
+  execution overhead small, on the order of nanoseconds to microseconds per
+  operation.
+* The implementation of an operation (like "add" or "conv3d") can be fully
+  customized for a particular target system without needing to modify the
+  original model or generated `.pte` file.
+
+### Familiar PyTorch Semantics
+
+ExecuTorch is a first-class component of the PyTorch stack, and reuses APIs and
+semantics whenever possible.
+
+* The C++ types used by ExecuTorch are source-compatible with the corresponding
+  types from core PyTorch's `c10::` and `at::` libraries, and ExecuTorch
+  provides
+  [`aten_bridge`](https://github.com/pytorch/executorch/blob/main/extension/aten_util/aten_bridge.h)
+  to convert between the two. This can be helpful for projects that already use
+  PyTorch C++ types.
+* The semantics of operators like `aten::add` and `aten::sigmoid` are identical
+  between ExecuTorch and core PyTorch. ExecuTorch provides a testing framework
+  to ensure this, and to help test future implementations of these operators.
+
+### Portable Code and Architecture
+
+The ExecuTorch runtime is implemented with portability in mind, so that users
+can build it for a wide variety of target systems.
+
+#### C++ Language Considerations
+
+* The code is C++11-compatible to work with older toolchains.
+* The runtime does not use exceptions or RTTI, although it is not antagonistic
+  to them.
+* The code is compatible with GCC and Clang, and has also been built with
+  several proprietary embedded toolchains.
+* The repo provides both CMake and buck2 build systems to make integration
+  easier.
+
+#### Operating System Considerations
+
+The runtime makes no direct system calls. All access to memory, files, logging,
+and clocks is abstracted through the [_Runtime Platform Abstraction Layer
+(PAL)_](runtime-platform-abstraction-layer.md) and injected interfaces like
+`DataLoader` and `MemoryAllocator`. [TODO: link these types to their generated
+docs]
+
+Applications can control all memory allocation through the `MemoryManager`,
+`MemoryAllocator`, `HierarchicalAllocator`, and `DataLoader` classes. The core
+runtime makes no direct calls to `malloc()` or `new`, or to types like
+`std::vector` that allocate under the hood. This makes it possible to:
+
+* Run in environments without a heap, but still use the heap if desired.
+* Avoid synchronization on the heap during model load and execution.
+* Control which memory region to use for different types of data. For example,
+  one set of mutable tensors could live in SRAM while another set lived in DRAM.
+* Easily monitor how much memory the runtime uses.
+
+However, please note that specific kernel or backend implementations may use
+arbitrary runtime or operating system features. Users should double-check the
+docs for the kernel and backend libraries that they use.
+
+#### Threading Considerations
+
+The core runtime does no threading or locking, and does not use thread-local
+variables. However, it is designed to play well with higher-level
+synchronization.
+
+* Each `Program` instance is immutable and therefore _[fully
+  thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#thread-safe)_.
+  Multiple threads may concurrently access a single `Program` instance.
+* Each `Method` instance is mutable but self-contained, and therefore
+  _[conditionally
+  thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#conditionally-thread-safe)_.
+  Multiple threads can concurrently access and execute independent `Method`
+  instances, but access and execution of a single instance must be serialized.
+
+However, please note:
+
+* There are two global tables that may be read during `Program::load_method()`:
+  the kernel registration table and the backend registration table.
+  * In practice, these tables are only modified at process/system load time,
+    and are effectively frozen before the first `Program` is loaded. But some
+    applications may need to be aware of these tables, especially if they
+    manually mutate them after process/system load time.
+* Specific kernel or backend implementations may have their own threading
+  restrictions. Users should double-check the docs for the kernel and backend
+  libraries that they use.
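To make the execution model described above concrete, here is a deliberately simplified, self-contained C++ sketch of an instruction loop dispatching through function pointers over caller-provided memory. Every name in it (`Value`, `Kernel`, `Instruction`, `execute_plan`, `add_kernel`, `mul_kernel`) is invented for this illustration and is not the real ExecuTorch runtime API; the real runtime dispatches kernels that operate on tensors and `EValue`s rather than scalar floats.

```cpp
// Illustrative sketch only -- NOT the ExecuTorch API. It mirrors three ideas
// from this document: execution is a simple loop over an instruction array,
// kernels are resolved to function pointers before execution begins, and all
// mutable memory lives in buffers the caller allocates (no heap, no locking).
#include <cstddef>

using Value = float;  // stand-in for the runtime's much richer value type

// A kernel is a plain function pointer operating on pre-planned value slots.
using Kernel = void (*)(Value* values, const int* args);

// One "instruction": which kernel to call and which slots it reads/writes.
struct Instruction {
  Kernel kernel;
  const int* args;  // e.g. {out_slot, lhs_slot, rhs_slot}
};

// Example kernels, bound at "load" time rather than looked up per call.
inline void add_kernel(Value* v, const int* a) { v[a[0]] = v[a[1]] + v[a[2]]; }
inline void mul_kernel(Value* v, const int* a) { v[a[0]] = v[a[1]] * v[a[2]]; }

// The whole "execution" step: a loop of direct function-pointer calls, with
// no allocation, no synchronization, and no name lookup on the hot path.
inline void execute_plan(const Instruction* plan, std::size_t n,
                         Value* values) {
  for (std::size_t i = 0; i < n; ++i) {
    plan[i].kernel(values, plan[i].args);
  }
}
```

For example, computing `(a + b) * a` needs only a caller-supplied four-slot buffer and a two-instruction plan; the buffer plays the role of the planned, user-allocated tensor memory described under "Memory" above, and re-running the plan with a different buffer is what makes independent executions easy to keep thread-safe.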
+
+## Further Reading
+
+For more details about the ExecuTorch runtime, please see:
+
+* The
+  [`executor_runner`](https://github.com/pytorch/executorch/blob/main/examples/executor_runner/executor_runner.cpp)
+  example tool
+* [Runtime API](runtime-api.md)
+* [Runtime Build and Cross Compilation](runtime-build-and-cross-compilation.md)
+* [Runtime Platform Abstraction Layer](runtime-platform-abstraction-layer.md)
+* [Custom Memory Allocation](runtime-custom-memory-allocator.md)
+* [Runtime Error Handling](runtime-error-handling.md)
+* [Runtime Profiling](sdk-profiling.md)
+* [Backends and Delegates](compiler-delegate-and-partitioner.md)
+* [Backend Delegate Implementation](runtime-backend-delegate-implementation-and-linking.md)
+* [Kernel Library Overview](kernel-library-overview.md)