From 43007fc29479287f33b11ed2f22c5be73f70b5f3 Mon Sep 17 00:00:00 2001 From: psiddh <2467117+psiddh@users.noreply.github.com> Date: Mon, 9 Mar 2026 16:41:53 -0400 Subject: [PATCH 01/23] Expand building Claude skill to cover general ET building from source The existing building skill only covered runners (Makefile targets) and CMake workflow presets. This expands it to be a comprehensive guide for building ExecuTorch from source, including: - Prerequisites and toolchain requirements - Building the Python package (install_executorch.sh with all flags) - Building the C++ runtime standalone (presets, workflows, manual CMake) - Building model runners (Makefile) - Cross-compilation (Android, iOS, macOS, Windows) - Complete build options reference with dependency chains - Common build patterns (minimal, XNNPACK, profiling, tests, subdirectory) - Troubleshooting section covering 12 common build issues: - Submodule issues - Stale build artifacts - CMake version conflicts - Python version mismatch - Dependency version conflicts - Missing python-dev headers - Linking errors with --whole-archive - XNNPACK build failures - Windows symlink errors - MSVC kernel compilation failures - Intel macOS limitations - Duplicate kernel registration - Build output reference table - Tips for faster and more reliable builds --- .claude/skills/building/SKILL.md | 348 ++++++++++++++++++++++++++++++- 1 file changed, 339 insertions(+), 9 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 7ff7be38df1..ab63f1606e4 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -1,23 +1,353 @@ --- name: building -description: Build ExecuTorch runners or C++ libraries. Use when compiling runners for Llama, Whisper, or other models, or building the C++ runtime. +description: Build ExecuTorch from source — Python package, C++ runtime, runners, cross-compilation, and backend-specific builds. 
Use when compiling anything in the ExecuTorch repo, diagnosing build failures, or setting up platform-specific builds. --- # Building -## Runners (Makefile) +## Prerequisites + +Before building, ensure the environment is set up (see `/setup` skill): +```bash +conda activate executorch +``` + +Required toolchain: +- **Python** 3.10–3.13 +- **CMake** >= 3.24, < 4.0 +- **C++17** compiler: `g++` >= 7, `clang++` >= 5, or MSVC 2022+ with Clang-CL +- **Git submodules** must be initialized (handled by `install_executorch.sh`, or manually: `git submodule sync && git submodule update --init --recursive`) + +Optional but recommended: +- **ccache** — automatically detected and used if installed (`sudo apt install ccache` / `brew install ccache`) +- **Ninja** — faster than Make (`sudo apt install ninja-build` / `brew install ninja`); use with `-G Ninja` + +## 1. Building the Python Package + +This installs the ExecuTorch Python package (exir, runtime bindings, etc.) into the active environment. + +```bash +# First time (installs deps + builds + installs) +./install_executorch.sh + +# Editable mode (Python changes reflected without rebuild) +./install_executorch.sh --editable + +# Minimal (skip example dependencies) +./install_executorch.sh --minimal + +# Subsequent installs (deps already present) +pip install -e . --no-build-isolation +``` + +**Enable additional backends** during Python install: +```bash +CMAKE_ARGS="-DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh +CMAKE_ARGS="-DEXECUTORCH_BUILD_COREML=ON -DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh +``` + +**Verify Python install:** +```bash +python -m executorch.examples.xnnpack.aot_compiler --model_name="mv2" --delegate +``` + +## 2. 
Building the C++ Runtime (Standalone) + +### Using Presets (Recommended) + +```bash +cmake -B cmake-out --preset <preset-name> -DCMAKE_BUILD_TYPE=Release +cmake --build cmake-out -j$(nproc) +``` + +| Preset | Platform | What it builds | +|--------|----------|----------------| +| `linux` | Linux x86_64 | Runtime + XNNPACK + LLM + executor_runner | +| `macos` | macOS | Runtime + XNNPACK + CoreML + MPS + executor_runner | +| `windows` | Windows | Runtime + XNNPACK + executor_runner | +| `llm-release` | Host | LLM extension (CPU, Release) | +| `llm-release-cuda` | Linux/Windows | LLM extension (CUDA, Release) | +| `llm-release-metal` | macOS | LLM extension (Metal, Release) | +| `llm-debug` | Host | LLM extension (CPU, Debug) | +| `llm-debug-cuda` | Linux/Windows | LLM extension (CUDA, Debug) | +| `llm-debug-metal` | macOS | LLM extension (Metal, Debug) | +| `profiling` | Host | Runtime with profiling/event tracing | +| `android-arm64-v8a` | Android | JNI bindings + runtime for arm64 | +| `android-x86_64` | Android | JNI bindings + runtime for x86_64 | +| `ios` | iOS | Frameworks for device | +| `ios-simulator` | iOS Sim | Frameworks for simulator | +| `arm-baremetal` | Embedded | Cortex-M / Ethos-U bare-metal | +| `zephyr` | RTOS | Zephyr RTOS build | + +### Using CMake Workflow Presets + +Workflow presets combine configure + build + install in one command: +```bash +cmake --workflow --preset llm-release # CPU +cmake --workflow --preset llm-release-cuda # CUDA +cmake --workflow --preset llm-release-metal # Metal +``` + +### Manual CMake (No Preset) + +```bash +mkdir -p cmake-out +cmake -B cmake-out \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_BUILD_XNNPACK=ON \ + -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ + -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ + -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ + -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON +cmake --build cmake-out -j$(nproc) +``` + +### Verify C++ Build + +```bash +# Enable executor_runner if not already +cmake -B cmake-out --preset 
linux -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON +cmake --build cmake-out -j$(nproc) +cmake-out/executor_runner --model_path=mv2_xnnpack_fp32.pte +``` + +## 3. Building Runners (Makefile) + +Model-specific runners use the top-level `Makefile`: ```bash make help # list all targets -make llama-cpu # Llama -make whisper-metal # Whisper on Metal +make llama-cpu # Llama on CPU +make llama-cuda # Llama on CUDA +make llama-cuda-debug # Llama on CUDA (debug) +make llava-cpu # Llava on CPU +make gemma3-cpu # Gemma3 on CPU make gemma3-cuda # Gemma3 on CUDA +make whisper-cpu # Whisper on CPU +make whisper-metal # Whisper on Metal +make parakeet-cpu # Parakeet on CPU +make parakeet-metal # Parakeet on Metal +make clean # remove cmake-out/ +``` + +Output binaries: `cmake-out/examples/models/<model>/` + +Each `make` target internally runs `cmake --workflow --preset` for the core libraries, then builds the runner on top. + +## 4. Cross-Compilation + +### Android + +```bash +# AAR (Java bindings) +export ANDROID_ABIS=arm64-v8a +export BUILD_AAR_DIR=aar-out +mkdir -p $BUILD_AAR_DIR +sh scripts/build_android_library.sh + +# Native C++ (direct cross-compile) +cmake -B cmake-out \ + -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ + -DANDROID_ABI=arm64-v8a \ + --preset android-arm64-v8a +cmake --build cmake-out -j$(nproc) ``` -Output: `cmake-out/examples/models/<model>/` +### iOS / macOS Frameworks -## C++ Libraries (CMake) ```bash -cmake --list-presets # list presets -cmake --workflow --preset llm-release # LLM CPU -cmake --workflow --preset llm-release-metal # LLM Metal +# Build all frameworks +./scripts/build_apple_frameworks.sh + +# With specific backends +./scripts/build_apple_frameworks.sh --coreml --mps --xnnpack ``` + +Link frameworks in Xcode with `-all_load` linker flag. 
+ +### Windows + +Requires Visual Studio 2022+ with Clang-CL: +```bash +cmake -B cmake-out --preset windows -T ClangCL +cmake --build cmake-out --config Release +``` + +**Windows-specific notes:** +- Enable symlinks before cloning: `git config --system core.symlinks true` +- Missing symlinks cause `version.py` errors during `pip install` +- LLM custom kernels and quantized kernels do not compile with MSVC; use `-T ClangCL` or build with CUDA + +## 5. Key Build Options + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `CMAKE_BUILD_TYPE` | STRING | Debug | `Debug` or `Release`. Release disables logging/verification, adds optimizations | +| `EXECUTORCH_BUILD_XNNPACK` | BOOL | OFF | XNNPACK CPU backend (requires CPUINFO + PTHREADPOOL) | +| `EXECUTORCH_BUILD_COREML` | BOOL | OFF | Core ML backend (macOS/iOS only) | +| `EXECUTORCH_BUILD_MPS` | BOOL | OFF | MPS GPU backend (macOS/iOS only) | +| `EXECUTORCH_BUILD_CUDA` | BOOL | OFF | CUDA GPU backend (requires EXTENSION_TENSOR) | +| `EXECUTORCH_BUILD_METAL` | BOOL | OFF | Metal backend (requires EXTENSION_TENSOR) | +| `EXECUTORCH_BUILD_VULKAN` | BOOL | OFF | Vulkan GPU backend (Android) | +| `EXECUTORCH_BUILD_QNN` | BOOL | OFF | Qualcomm QNN backend | +| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | BOOL | OFF | Optimized kernel implementations | +| `EXECUTORCH_BUILD_KERNELS_QUANTIZED` | BOOL | OFF | Quantized kernel implementations | +| `EXECUTORCH_BUILD_KERNELS_LLM` | BOOL | OFF | LLM custom kernels (requires KERNELS_OPTIMIZED) | +| `EXECUTORCH_BUILD_EXTENSION_MODULE` | BOOL | OFF | Module extension (requires DATA_LOADER + FLAT_TENSOR + NAMED_DATA_MAP) | +| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | BOOL | OFF | Tensor extension | +| `EXECUTORCH_BUILD_EXTENSION_LLM` | BOOL | OFF | LLM extension | +| `EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER` | BOOL | OFF | LLM runner extension (requires EXTENSION_LLM) | +| `EXECUTORCH_BUILD_PYBIND` | BOOL | OFF | Python bindings (requires EXTENSION_MODULE) | +| 
`EXECUTORCH_BUILD_TESTS` | BOOL | OFF | CMake-based unit tests | +| `EXECUTORCH_BUILD_DEVTOOLS` | BOOL | OFF | Developer tools (Inspector, ETDump) | +| `EXECUTORCH_ENABLE_EVENT_TRACER` | BOOL | OFF | Event tracing (requires DEVTOOLS) | +| `EXECUTORCH_OPTIMIZE_SIZE` | BOOL | OFF | Optimize for binary size (`-Os`, no exceptions/RTTI) | +| `EXECUTORCH_ENABLE_LOGGING` | BOOL | (Debug=ON) | Runtime logging | +| `EXECUTORCH_LOG_LEVEL` | STRING | Info | Log level: Debug, Info, Error, Fatal | +| `EXECUTORCH_USE_SANITIZER` | BOOL | OFF | ASAN + UBSAN (not supported on MSVC) | +| `EXECUTORCH_PAL_DEFAULT` | STRING | posix | Platform abstraction: `posix`, `minimal`, `android` | + +**Dependency chains** — enabling some options requires others: +- `XNNPACK` requires `CPUINFO` + `PTHREADPOOL` +- `KERNELS_LLM` requires `KERNELS_OPTIMIZED` +- `EXTENSION_MODULE` requires `EXTENSION_DATA_LOADER` + `EXTENSION_FLAT_TENSOR` + `EXTENSION_NAMED_DATA_MAP` +- `BUILD_PYBIND` requires `EXTENSION_MODULE` +- `EXTENSION_LLM_RUNNER` requires `EXTENSION_LLM` +- `EVENT_TRACER` requires `DEVTOOLS` +- `CUDA` and `METAL` require `EXTENSION_TENSOR` + +CMake will error with a clear message if a required option is missing. + +## 6. 
Common Build Patterns + +### Build core runtime only (minimal) +```bash +cmake -B cmake-out -DCMAKE_BUILD_TYPE=Release +cmake --build cmake-out -j$(nproc) +``` + +### Build with XNNPACK backend +```bash +cmake -B cmake-out -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_BUILD_XNNPACK=ON +cmake --build cmake-out -j$(nproc) +``` + +### Build with profiling +```bash +cmake -B cmake-out --preset profiling +cmake --build cmake-out -j$(nproc) +``` + +### Build tests +```bash +cmake -B cmake-out -DEXECUTORCH_BUILD_TESTS=ON \ + -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON +cmake --build cmake-out -j$(nproc) +ctest --test-dir cmake-out --output-on-failure +``` + +### Using ExecuTorch as a CMake subdirectory +```cmake +add_subdirectory(executorch) +# Set options before add_subdirectory: +set(EXECUTORCH_BUILD_XNNPACK ON) +set(EXECUTORCH_BUILD_EXTENSION_MODULE ON) +``` + +## 7. Troubleshooting + +### Submodule issues +**Symptom:** Build fails with missing headers or `CMakeLists.txt not found` in third-party dirs. +```bash +git submodule sync --recursive +git submodule update --init --recursive +``` + +### Stale build artifacts +**Symptom:** Mysterious failures after pulling new changes or switching branches. +```bash +./install_executorch.sh --clean +# Or manually: +rm -rf cmake-out/ pip-out/ buck-out/ +git submodule sync && git submodule update --init --recursive +``` + +### CMake version conflicts +**Symptom:** `cmake` errors about policy versions or unsupported features. +- ExecuTorch requires CMake >= 3.24, < 4.0 +- Check: `cmake --version` +- If conda and system cmake conflict, ensure conda env cmake is used: `which cmake` should point to conda env + +### Python version mismatch +**Symptom:** `install_executorch.sh` fails early with compatibility errors. +- Supported: Python 3.10–3.13 +- Check: `python --version` + +### Dependency version conflicts +**Symptom:** pip fails with conflicting torch/torchvision/torchaudio versions. 
+- Use a fresh conda environment +- If pinning to a specific PyTorch version: `./install_executorch.sh --use-pt-pinned-commit` + +### Missing `python-dev` headers +**Symptom:** Build fails looking for `Python.h`. +```bash +sudo apt install python$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')-dev +``` + +### Linking errors with `--whole-archive` +**Symptom:** Missing operator registrations at runtime despite building kernels. +- Kernel binding libraries (e.g., `libportable_kernels_bindings.a`) use load-time registration +- Must link with: `-Wl,--whole-archive <library> -Wl,--no-whole-archive` (Linux) or `-Wl,-force_load,<library>` (macOS) + +### XNNPACK build fails +**Symptom:** Errors about missing `cpuinfo` or `pthreadpool`. +- `EXECUTORCH_BUILD_XNNPACK=ON` requires `EXECUTORCH_BUILD_CPUINFO=ON` and `EXECUTORCH_BUILD_PTHREADPOOL=ON` (both ON by default unless `ARM_BAREMETAL` is set) + +### Windows symlink errors +**Symptom:** `version.py` not found or import errors on Windows. +```bash +git config --system core.symlinks true +# Re-clone the repo after enabling +``` + +### MSVC kernel compilation failures +**Symptom:** LLM/quantized kernels fail to compile on Windows with MSVC. +- Use Clang-CL: `cmake -B cmake-out -T ClangCL` +- Or build with CUDA (which uses nvcc, not MSVC for kernels) + +### Intel macOS +**Symptom:** `install_executorch.sh` fails — no prebuilt PyTorch wheels for Intel Mac. +- Must build PyTorch from source, or use `--use-pt-pinned-commit --minimal` + +### Build directory not at repo root +**Symptom:** Include path errors when ExecuTorch checkout is not the top-level directory. +- ExecuTorch adds `..` to include directories; the build directory must be directly under the repo root or use `add_subdirectory` correctly + +### Duplicate kernel registration +**Symptom:** Abort at runtime with duplicate kernel registration. 
+- Only link one `gen_operators_lib` per target +- Check for multiple kernel binding libraries being linked + +## 8. Build Output + +| Artifact | Location | Description | +|----------|----------|-------------| +| `executor_runner` | `cmake-out/executor_runner` | Standalone model runner | +| Core runtime | `cmake-out/libexecutorch.a` | Core ExecuTorch runtime | +| Portable ops | `cmake-out/kernels/portable/libportable_ops_lib.a` | Portable operator implementations | +| XNNPACK backend | `cmake-out/backends/xnnpack/libxnnpack_backend.a` | XNNPACK delegate | +| LLM runner | `cmake-out/examples/models/<model>/` | Model-specific runners | +| Python package | site-packages | `executorch` Python module | +| iOS frameworks | `cmake-out/*.xcframework` | iOS/macOS frameworks | +| Android AAR | `aar-out/` | Android Java bindings | + +## 9. Tips + +- Always use `Release` for performance measurement; `Debug` is 5–10x slower and significantly larger +- Use `ccache` to speed up rebuilds — ExecuTorch auto-detects it +- Use `Ninja` generator (`-G Ninja`) for faster parallel builds +- Use `cmake --list-presets` to see all available presets +- After `git pull`, always clean and re-init submodules before rebuilding +- For LLM workflows, `make <model>-<backend>` is the simplest path +- Set `EXECUTORCH_OPTIMIZE_SIZE=ON` for size-constrained deployments +- Check `cmake-out/compile_commands.json` for IDE integration (auto-generated) From f7d0f375b20f4ffefb6d3d8928bc7613a0414fd2 Mon Sep 17 00:00:00 2001 From: Github Executorch Date: Tue, 10 Mar 2026 09:21:06 -0700 Subject: [PATCH 02/23] Refactor building skill from reference manual to action-oriented flow MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reorganize the building skill to be concise and action-driven rather than encyclopedic. 
Key changes: - Add "do not ask questions" directive — detect and fix env issues automatically - Make conda the mandatory path (no venvs, no system Python) - Auto-fix for cmake >= 4.0 and PEP 668 errors - Default action is `./install_executorch.sh --editable` - Complete runner table from actual Makefile (adds voxtral, voxtral_realtime, sortformer, silero-vad, llava, debug variants) - Fix macOS commands: $(sysctl -n hw.ncpu) instead of $(nproc) - Note that --preset macos uses Xcode generator - Condense troubleshooting into scannable table - Trim cmake options to most-used flags (full list in CMakeLists.txt) Co-authored-by: Claude --- .claude/skills/building/SKILL.md | 446 ++++++++++--------------------- 1 file changed, 137 insertions(+), 309 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index ab63f1606e4..9f5f37db85a 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -3,95 +3,103 @@ name: building description: Build ExecuTorch from source — Python package, C++ runtime, runners, cross-compilation, and backend-specific builds. Use when compiling anything in the ExecuTorch repo, diagnosing build failures, or setting up platform-specific builds. --- -# Building +# Building ExecuTorch -## Prerequisites +IMPORTANT: Do NOT ask the user questions about environment setup. Detect issues and fix them automatically. The standard path is conda — always use it. 
+ +## Step 1: Ensure conda environment (do this silently, fix issues automatically) -Before building, ensure the environment is set up (see `/setup` skill): ```bash +# Check if executorch conda env exists; create if not +conda env list | grep executorch || conda create -yn executorch python=3.12 'cmake<4' +# Activate conda activate executorch +# Verify — fix automatically if wrong +python --version # need 3.10–3.13 +cmake --version # need >= 3.24, < 4.0; if cmake >= 4, run: pip install 'cmake>=3.24,<4' ``` -Required toolchain: -- **Python** 3.10–3.13 -- **CMake** >= 3.24, < 4.0 -- **C++17** compiler: `g++` >= 7, `clang++` >= 5, or MSVC 2022+ with Clang-CL -- **Git submodules** must be initialized (handled by `install_executorch.sh`, or manually: `git submodule sync && git submodule update --init --recursive`) - -Optional but recommended: -- **ccache** — automatically detected and used if installed (`sudo apt install ccache` / `brew install ccache`) -- **Ninja** — faster than Make (`sudo apt install ninja-build` / `brew install ninja`); use with `-G Ninja` +If cmake is >= 4.0 inside the conda env, fix it: `pip install 'cmake>=3.24,<4'` — do not ask the user. -## 1. Building the Python Package +Parallel jobs: `$(sysctl -n hw.ncpu)` on macOS, `$(nproc)` on Linux. -This installs the ExecuTorch Python package (exir, runtime bindings, etc.) into the active environment. +## Step 2: Build (route by what the user needs) +### Python package (default — use this unless user asks for something specific) ```bash -# First time (installs deps + builds + installs) -./install_executorch.sh - -# Editable mode (Python changes reflected without rebuild) -./install_executorch.sh --editable - -# Minimal (skip example dependencies) -./install_executorch.sh --minimal - -# Subsequent installs (deps already present) -pip install -e . 
--no-build-isolation +conda activate executorch +./install_executorch.sh --editable # editable install from source ``` +This handles everything: submodules, deps, C++ build, Python install. Takes ~10 min on Apple Silicon. -**Enable additional backends** during Python install: -```bash -CMAKE_ARGS="-DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh -CMAKE_ARGS="-DEXECUTORCH_BUILD_COREML=ON -DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh -``` +For subsequent rebuilds (deps already present): `pip install -e . --no-build-isolation` + +For minimal install (skip example deps): `./install_executorch.sh --minimal` -**Verify Python install:** +Enable additional backends: ```bash -python -m executorch.examples.xnnpack.aot_compiler --model_name="mv2" --delegate +CMAKE_ARGS="-DEXECUTORCH_BUILD_COREML=ON -DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh --editable ``` -## 2. Building the C++ Runtime (Standalone) +Verify: `python -c "from executorch.exir import to_edge_transform_and_lower; print('OK')"` -### Using Presets (Recommended) +### LLM / ASR model runner (simplest path for running models) ```bash -cmake -B cmake-out --preset -DCMAKE_BUILD_TYPE=Release -cmake --build cmake-out -j$(nproc) -``` - -| Preset | Platform | What it builds | -|--------|----------|----------------| -| `linux` | Linux x86_64 | Runtime + XNNPACK + LLM + executor_runner | -| `macos` | macOS | Runtime + XNNPACK + CoreML + MPS + executor_runner | -| `windows` | Windows | Runtime + XNNPACK + executor_runner | -| `llm-release` | Host | LLM extension (CPU, Release) | -| `llm-release-cuda` | Linux/Windows | LLM extension (CUDA, Release) | -| `llm-release-metal` | macOS | LLM extension (Metal, Release) | -| `llm-debug` | Host | LLM extension (CPU, Debug) | -| `llm-debug-cuda` | Linux/Windows | LLM extension (CUDA, Debug) | -| `llm-debug-metal` | macOS | LLM extension (Metal, Debug) | -| `profiling` | Host | Runtime with profiling/event tracing | -| `android-arm64-v8a` | Android | JNI bindings + runtime 
for arm64 | -| `android-x86_64` | Android | JNI bindings + runtime for x86_64 | -| `ios` | iOS | Frameworks for device | -| `ios-simulator` | iOS Sim | Frameworks for simulator | -| `arm-baremetal` | Embedded | Cortex-M / Ethos-U bare-metal | -| `zephyr` | RTOS | Zephyr RTOS build | - -### Using CMake Workflow Presets - -Workflow presets combine configure + build + install in one command: +conda activate executorch +make <model>-<backend> +``` + +Available targets (run `make help` for full list): + +| Target | Backend | macOS | Linux | +|--------|---------|-------|-------| +| `llama-cpu` | CPU | yes | yes | +| `llama-cuda` | CUDA | — | yes | +| `llama-cuda-debug` | CUDA (debug) | — | yes | +| `llava-cpu` | CPU | yes | yes | +| `whisper-cpu` | CPU | yes | yes | +| `whisper-metal` | Metal | yes | — | +| `whisper-cuda` | CUDA | — | yes | +| `parakeet-cpu` | CPU | yes | yes | +| `parakeet-metal` | Metal | yes | — | +| `parakeet-cuda` | CUDA | — | yes | +| `voxtral-cpu` | CPU | yes | yes | +| `voxtral-cuda` | CUDA | — | yes | +| `voxtral-metal` | Metal | yes | — | +| `voxtral_realtime-cpu` | CPU | yes | yes | +| `voxtral_realtime-cuda` | CUDA | — | yes | +| `voxtral_realtime-metal` | Metal | yes | — | +| `gemma3-cpu` | CPU | yes | yes | +| `gemma3-cuda` | CUDA | — | yes | +| `sortformer-cpu` | CPU | yes | yes | +| `sortformer-cuda` | CUDA | — | yes | +| `silero-vad-cpu` | CPU | yes | yes | +| `clean` | — | yes | yes | + +Output: `cmake-out/examples/models/<model>/` + +### C++ runtime (standalone) + +**With presets (recommended):** + +| Platform | Command | +|----------|---------| +| macOS | `cmake -B cmake-out --preset macos` (uses Xcode generator — requires Xcode) | +| Linux | `cmake -B cmake-out --preset linux -DCMAKE_BUILD_TYPE=Release` | +| Windows | `cmake -B cmake-out --preset windows -T ClangCL` | + +Then: `cmake --build cmake-out -j$(sysctl -n hw.ncpu)` (macOS) or `cmake --build cmake-out -j$(nproc)` (Linux) + +**LLM libraries via workflow presets** (configure + build + install in one 
command): ```bash cmake --workflow --preset llm-release # CPU -cmake --workflow --preset llm-release-cuda # CUDA -cmake --workflow --preset llm-release-metal # Metal +cmake --workflow --preset llm-release-metal # Metal (macOS) +cmake --workflow --preset llm-release-cuda # CUDA (Linux) ``` -### Manual CMake (No Preset) - +**Manual CMake (custom flags):** ```bash -mkdir -p cmake-out cmake -B cmake-out \ -DCMAKE_BUILD_TYPE=Release \ -DEXECUTORCH_BUILD_XNNPACK=ON \ @@ -99,255 +107,75 @@ cmake -B cmake-out \ -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON -cmake --build cmake-out -j$(nproc) -``` - -### Verify C++ Build - -```bash -# Enable executor_runner if not already -cmake -B cmake-out --preset linux -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON -cmake --build cmake-out -j$(nproc) -cmake-out/executor_runner --model_path=mv2_xnnpack_fp32.pte +cmake --build cmake-out -j$(sysctl -n hw.ncpu) ``` -## 3. Building Runners (Makefile) +Run `cmake --list-presets` to see all available presets. -Model-specific runners use the top-level `Makefile`: -```bash -make help # list all targets -make llama-cpu # Llama on CPU -make llama-cuda # Llama on CUDA -make llama-cuda-debug # Llama on CUDA (debug) -make llava-cpu # Llava on CPU -make gemma3-cpu # Gemma3 on CPU -make gemma3-cuda # Gemma3 on CUDA -make whisper-cpu # Whisper on CPU -make whisper-metal # Whisper on Metal -make parakeet-cpu # Parakeet on CPU -make parakeet-metal # Parakeet on Metal -make clean # remove cmake-out/ -``` - -Output binaries: `cmake-out/examples/models//` - -Each `make` target internally runs `cmake --workflow --preset` for the core libraries, then builds the runner on top. - -## 4. 
Cross-Compilation - -### Android +### Cross-compilation +**iOS/macOS frameworks:** ```bash -# AAR (Java bindings) -export ANDROID_ABIS=arm64-v8a -export BUILD_AAR_DIR=aar-out -mkdir -p $BUILD_AAR_DIR -sh scripts/build_android_library.sh - -# Native C++ (direct cross-compile) -cmake -B cmake-out \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=arm64-v8a \ - --preset android-arm64-v8a -cmake --build cmake-out -j$(nproc) -``` - -### iOS / macOS Frameworks - -```bash -# Build all frameworks -./scripts/build_apple_frameworks.sh - -# With specific backends ./scripts/build_apple_frameworks.sh --coreml --mps --xnnpack ``` - -Link frameworks in Xcode with `-all_load` linker flag. - -### Windows - -Requires Visual Studio 2022+ with Clang-CL: -```bash -cmake -B cmake-out --preset windows -T ClangCL -cmake --build cmake-out --config Release -``` - -**Windows-specific notes:** -- Enable symlinks before cloning: `git config --system core.symlinks true` -- Missing symlinks cause `version.py` errors during `pip install` -- LLM custom kernels and quantized kernels do not compile with MSVC; use `-T ClangCL` or build with CUDA - -## 5. Key Build Options - -| Option | Type | Default | Description | -|--------|------|---------|-------------| -| `CMAKE_BUILD_TYPE` | STRING | Debug | `Debug` or `Release`. 
Release disables logging/verification, adds optimizations | -| `EXECUTORCH_BUILD_XNNPACK` | BOOL | OFF | XNNPACK CPU backend (requires CPUINFO + PTHREADPOOL) | -| `EXECUTORCH_BUILD_COREML` | BOOL | OFF | Core ML backend (macOS/iOS only) | -| `EXECUTORCH_BUILD_MPS` | BOOL | OFF | MPS GPU backend (macOS/iOS only) | -| `EXECUTORCH_BUILD_CUDA` | BOOL | OFF | CUDA GPU backend (requires EXTENSION_TENSOR) | -| `EXECUTORCH_BUILD_METAL` | BOOL | OFF | Metal backend (requires EXTENSION_TENSOR) | -| `EXECUTORCH_BUILD_VULKAN` | BOOL | OFF | Vulkan GPU backend (Android) | -| `EXECUTORCH_BUILD_QNN` | BOOL | OFF | Qualcomm QNN backend | -| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | BOOL | OFF | Optimized kernel implementations | -| `EXECUTORCH_BUILD_KERNELS_QUANTIZED` | BOOL | OFF | Quantized kernel implementations | -| `EXECUTORCH_BUILD_KERNELS_LLM` | BOOL | OFF | LLM custom kernels (requires KERNELS_OPTIMIZED) | -| `EXECUTORCH_BUILD_EXTENSION_MODULE` | BOOL | OFF | Module extension (requires DATA_LOADER + FLAT_TENSOR + NAMED_DATA_MAP) | -| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | BOOL | OFF | Tensor extension | -| `EXECUTORCH_BUILD_EXTENSION_LLM` | BOOL | OFF | LLM extension | -| `EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER` | BOOL | OFF | LLM runner extension (requires EXTENSION_LLM) | -| `EXECUTORCH_BUILD_PYBIND` | BOOL | OFF | Python bindings (requires EXTENSION_MODULE) | -| `EXECUTORCH_BUILD_TESTS` | BOOL | OFF | CMake-based unit tests | -| `EXECUTORCH_BUILD_DEVTOOLS` | BOOL | OFF | Developer tools (Inspector, ETDump) | -| `EXECUTORCH_ENABLE_EVENT_TRACER` | BOOL | OFF | Event tracing (requires DEVTOOLS) | -| `EXECUTORCH_OPTIMIZE_SIZE` | BOOL | OFF | Optimize for binary size (`-Os`, no exceptions/RTTI) | -| `EXECUTORCH_ENABLE_LOGGING` | BOOL | (Debug=ON) | Runtime logging | -| `EXECUTORCH_LOG_LEVEL` | STRING | Info | Log level: Debug, Info, Error, Fatal | -| `EXECUTORCH_USE_SANITIZER` | BOOL | OFF | ASAN + UBSAN (not supported on MSVC) | -| `EXECUTORCH_PAL_DEFAULT` | STRING | posix | 
Platform abstraction: `posix`, `minimal`, `android` | - -**Dependency chains** — enabling some options requires others: -- `XNNPACK` requires `CPUINFO` + `PTHREADPOOL` -- `KERNELS_LLM` requires `KERNELS_OPTIMIZED` -- `EXTENSION_MODULE` requires `EXTENSION_DATA_LOADER` + `EXTENSION_FLAT_TENSOR` + `EXTENSION_NAMED_DATA_MAP` -- `BUILD_PYBIND` requires `EXTENSION_MODULE` -- `EXTENSION_LLM_RUNNER` requires `EXTENSION_LLM` -- `EVENT_TRACER` requires `DEVTOOLS` -- `CUDA` and `METAL` require `EXTENSION_TENSOR` - -CMake will error with a clear message if a required option is missing. - -## 6. Common Build Patterns - -### Build core runtime only (minimal) -```bash -cmake -B cmake-out -DCMAKE_BUILD_TYPE=Release -cmake --build cmake-out -j$(nproc) -``` - -### Build with XNNPACK backend -```bash -cmake -B cmake-out -DCMAKE_BUILD_TYPE=Release \ - -DEXECUTORCH_BUILD_XNNPACK=ON -cmake --build cmake-out -j$(nproc) -``` - -### Build with profiling -```bash -cmake -B cmake-out --preset profiling -cmake --build cmake-out -j$(nproc) -``` - -### Build tests -```bash -cmake -B cmake-out -DEXECUTORCH_BUILD_TESTS=ON \ - -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON -cmake --build cmake-out -j$(nproc) -ctest --test-dir cmake-out --output-on-failure -``` - -### Using ExecuTorch as a CMake subdirectory -```cmake -add_subdirectory(executorch) -# Set options before add_subdirectory: -set(EXECUTORCH_BUILD_XNNPACK ON) -set(EXECUTORCH_BUILD_EXTENSION_MODULE ON) -``` - -## 7. Troubleshooting - -### Submodule issues -**Symptom:** Build fails with missing headers or `CMakeLists.txt not found` in third-party dirs. -```bash -git submodule sync --recursive -git submodule update --init --recursive -``` - -### Stale build artifacts -**Symptom:** Mysterious failures after pulling new changes or switching branches. 
-```bash -./install_executorch.sh --clean -# Or manually: -rm -rf cmake-out/ pip-out/ buck-out/ -git submodule sync && git submodule update --init --recursive -``` - -### CMake version conflicts -**Symptom:** `cmake` errors about policy versions or unsupported features. -- ExecuTorch requires CMake >= 3.24, < 4.0 -- Check: `cmake --version` -- If conda and system cmake conflict, ensure conda env cmake is used: `which cmake` should point to conda env - -### Python version mismatch -**Symptom:** `install_executorch.sh` fails early with compatibility errors. -- Supported: Python 3.10–3.13 -- Check: `python --version` - -### Dependency version conflicts -**Symptom:** pip fails with conflicting torch/torchvision/torchaudio versions. -- Use a fresh conda environment -- If pinning to a specific PyTorch version: `./install_executorch.sh --use-pt-pinned-commit` - -### Missing `python-dev` headers -**Symptom:** Build fails looking for `Python.h`. -```bash -sudo apt install python$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')-dev -``` - -### Linking errors with `--whole-archive` -**Symptom:** Missing operator registrations at runtime despite building kernels. -- Kernel binding libraries (e.g., `libportable_kernels_bindings.a`) use load-time registration -- Must link with: `-Wl,--whole-archive -Wl,--no-whole-archive` (Linux) or `-Wl,-force_load,` (macOS) - -### XNNPACK build fails -**Symptom:** Errors about missing `cpuinfo` or `pthreadpool`. -- `EXECUTORCH_BUILD_XNNPACK=ON` requires `EXECUTORCH_BUILD_CPUINFO=ON` and `EXECUTORCH_BUILD_PTHREADPOOL=ON` (both ON by default unless `ARM_BAREMETAL` is set) - -### Windows symlink errors -**Symptom:** `version.py` not found or import errors on Windows. -```bash -git config --system core.symlinks true -# Re-clone the repo after enabling -``` - -### MSVC kernel compilation failures -**Symptom:** LLM/quantized kernels fail to compile on Windows with MSVC. 
-- Use Clang-CL: `cmake -B cmake-out -T ClangCL` -- Or build with CUDA (which uses nvcc, not MSVC for kernels) - -### Intel macOS -**Symptom:** `install_executorch.sh` fails — no prebuilt PyTorch wheels for Intel Mac. -- Must build PyTorch from source, or use `--use-pt-pinned-commit --minimal` - -### Build directory not at repo root -**Symptom:** Include path errors when ExecuTorch checkout is not the top-level directory. -- ExecuTorch adds `..` to include directories; the build directory must be directly under the repo root or use `add_subdirectory` correctly - -### Duplicate kernel registration -**Symptom:** Abort at runtime with duplicate kernel registration. -- Only link one `gen_operators_lib` per target -- Check for multiple kernel binding libraries being linked - -## 8. Build Output - -| Artifact | Location | Description | -|----------|----------|-------------| -| `executor_runner` | `cmake-out/executor_runner` | Standalone model runner | -| Core runtime | `cmake-out/libexecutorch.a` | Core ExecuTorch runtime | -| Portable ops | `cmake-out/kernels/portable/libportable_ops_lib.a` | Portable operator implementations | -| XNNPACK backend | `cmake-out/backends/xnnpack/libxnnpack_backend.a` | XNNPACK delegate | -| LLM runner | `cmake-out/examples/models//` | Model-specific runners | -| Python package | site-packages | `executorch` Python module | -| iOS frameworks | `cmake-out/*.xcframework` | iOS/macOS frameworks | -| Android AAR | `aar-out/` | Android Java bindings | - -## 9. Tips - -- Always use `Release` for performance measurement; `Debug` is 5–10x slower and significantly larger -- Use `ccache` to speed up rebuilds — ExecuTorch auto-detects it -- Use `Ninja` generator (`-G Ninja`) for faster parallel builds -- Use `cmake --list-presets` to see all available presets -- After `git pull`, always clean and re-init submodules before rebuilding +Link in Xcode with `-all_load` linker flag. 
+ +**Android:** +```bash +export ANDROID_ABIS=arm64-v8a BUILD_AAR_DIR=aar-out +mkdir -p $BUILD_AAR_DIR && sh scripts/build_android_library.sh +``` + +## Key build options + +Most commonly needed flags (full list: `CMakeLists.txt`): + +| Flag | What it enables | +|------|-----------------| +| `EXECUTORCH_BUILD_XNNPACK` | XNNPACK CPU backend | +| `EXECUTORCH_BUILD_COREML` | Core ML (macOS/iOS) | +| `EXECUTORCH_BUILD_MPS` | MPS GPU (macOS/iOS) | +| `EXECUTORCH_BUILD_METAL` | Metal compute (macOS, requires EXTENSION_TENSOR) | +| `EXECUTORCH_BUILD_CUDA` | CUDA GPU (Linux, requires EXTENSION_TENSOR) | +| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Optimized kernels | +| `EXECUTORCH_BUILD_KERNELS_QUANTIZED` | Quantized kernels | +| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Module extension (requires DATA_LOADER + FLAT_TENSOR + NAMED_DATA_MAP) | +| `EXECUTORCH_BUILD_EXTENSION_LLM` | LLM extension | +| `EXECUTORCH_BUILD_TESTS` | Unit tests (`ctest --test-dir cmake-out --output-on-failure`) | +| `EXECUTORCH_BUILD_DEVTOOLS` | DevTools (Inspector, ETDump) | +| `EXECUTORCH_OPTIMIZE_SIZE` | Size-optimized build (`-Os`, no exceptions/RTTI) | +| `CMAKE_BUILD_TYPE` | `Release` (default for presets) or `Debug` (5-10x slower) | + +## Troubleshooting + +| Symptom | Fix | +|---------|-----| +| Missing headers / `CMakeLists.txt not found` in third-party | `git submodule sync --recursive && git submodule update --init --recursive` | +| Mysterious failures after `git pull` or branch switch | `rm -rf cmake-out/ pip-out/ && git submodule sync && git submodule update --init --recursive` | +| CMake >= 4.0 (too new) | `pip install 'cmake>=3.24,<4'` inside the conda env | +| `externally-managed-environment` / PEP 668 error | You're using system Python, not conda. Activate conda env first. 
| +| pip conflicts with torch versions | Fresh conda env; or `./install_executorch.sh --use-pt-pinned-commit` | +| Missing `Python.h` (Linux) | `sudo apt install python3.X-dev` | +| Missing operator registrations at runtime | Link kernel libs with `-Wl,-force_load,` (macOS) or `-Wl,--whole-archive -Wl,--no-whole-archive` (Linux) | +| `install_executorch.sh` fails on Intel Mac | No prebuilt PyTorch wheels; use `--use-pt-pinned-commit --minimal` | +| XNNPACK build errors about cpuinfo/pthreadpool | Ensure `EXECUTORCH_BUILD_CPUINFO=ON` and `EXECUTORCH_BUILD_PTHREADPOOL=ON` (both ON by default) | +| Duplicate kernel registration abort | Only link one `gen_operators_lib` per target | + +## Build output + +| Artifact | Location | +|----------|----------| +| Core runtime | `cmake-out/libexecutorch.a` | +| executor_runner | `cmake-out/executor_runner` | +| Model runners | `cmake-out/examples/models//` | +| XNNPACK backend | `cmake-out/backends/xnnpack/libxnnpack_backend.a` | +| Python package | `site-packages/executorch` | +| iOS frameworks | `cmake-out/*.xcframework` | +| Android AAR | `aar-out/` | + +## Tips +- Always use `Release` for benchmarking; `Debug` is 5–10x slower +- `ccache` is auto-detected if installed (`brew install ccache`) +- `Ninja` is faster than Make (`-G Ninja`) — but `--preset macos` uses Xcode generator - For LLM workflows, `make -` is the simplest path -- Set `EXECUTORCH_OPTIMIZE_SIZE=ON` for size-constrained deployments -- Check `cmake-out/compile_commands.json` for IDE integration (auto-generated) +- After `git pull`, clean and re-init submodules before rebuilding From c6ba3b0676d6f3f0e0ee6d143279aa82e06c7865 Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 09:32:28 -0700 Subject: [PATCH 03/23] Update .claude/skills/building/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .claude/skills/building/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git 
a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 9f5f37db85a..b7446d7efe0 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -107,7 +107,7 @@ cmake -B cmake-out \ -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON -cmake --build cmake-out -j$(sysctl -n hw.ncpu) +cmake --build cmake-out --parallel "$(nproc 2>/dev/null || sysctl -n hw.ncpu)" ``` Run `cmake --list-presets` to see all available presets. From f4390deca0286f0e5f744405cf53e1add9b683ad Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 09:32:44 -0700 Subject: [PATCH 04/23] Update .claude/skills/building/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .claude/skills/building/SKILL.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index b7446d7efe0..350b6cdde38 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -163,12 +163,14 @@ Most commonly needed flags (full list: `CMakeLists.txt`): ## Build output +Installed artifact locations under `CMAKE_INSTALL_PREFIX=cmake-out`: + | Artifact | Location | |----------|----------| -| Core runtime | `cmake-out/libexecutorch.a` | +| Core runtime | `cmake-out/lib/libexecutorch.a` | | executor_runner | `cmake-out/executor_runner` | | Model runners | `cmake-out/examples/models//` | -| XNNPACK backend | `cmake-out/backends/xnnpack/libxnnpack_backend.a` | +| XNNPACK backend | `cmake-out/lib/libxnnpack_backend.a` | | Python package | `site-packages/executorch` | | iOS frameworks | `cmake-out/*.xcframework` | | Android AAR | `aar-out/` | From cb41d2302fd0b6b173fa75e10f31a0b2d4135e65 Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 09:33:18 -0700 Subject: [PATCH 05/23] Update .claude/skills/building/SKILL.md Co-authored-by: 
Copilot <175728472+Copilot@users.noreply.github.com> --- .claude/skills/building/SKILL.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 350b6cdde38..5b4fa544938 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -105,6 +105,8 @@ cmake -B cmake-out \ -DEXECUTORCH_BUILD_XNNPACK=ON \ -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ + -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \ + -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \ -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON cmake --build cmake-out --parallel "$(nproc 2>/dev/null || sysctl -n hw.ncpu)" From e0337646ca7cd9dda17a6939b1fab03a6b348c27 Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 09:42:39 -0700 Subject: [PATCH 06/23] Update .claude/skills/building/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .claude/skills/building/SKILL.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 5b4fa544938..10d878b4c86 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -165,12 +165,12 @@ Most commonly needed flags (full list: `CMakeLists.txt`): ## Build output -Installed artifact locations under `CMAKE_INSTALL_PREFIX=cmake-out`: +Installed artifact locations after `cmake --install` (or `./install_executorch.sh`) with `CMAKE_INSTALL_PREFIX=cmake-out`: | Artifact | Location | |----------|----------| | Core runtime | `cmake-out/lib/libexecutorch.a` | -| executor_runner | `cmake-out/executor_runner` | +| executor_runner (built only; not installed by default) | **build tree**: `/executor_runner` (Ninja/Make) or `//executor_runner` (e.g., `cmake-out/Release/executor_runner` with Xcode/Visual Studio) | | Model runners | `cmake-out/examples/models//` | | 
XNNPACK backend | `cmake-out/lib/libxnnpack_backend.a` | | Python package | `site-packages/executorch` | From d6c134b7f3fe14084da4d5417f06aab5743d3e06 Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 09:42:50 -0700 Subject: [PATCH 07/23] Update .claude/skills/building/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .claude/skills/building/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 10d878b4c86..32c28b45f24 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -95,7 +95,7 @@ Then: `cmake --build cmake-out -j$(sysctl -n hw.ncpu)` (macOS) or `cmake --build ```bash cmake --workflow --preset llm-release # CPU cmake --workflow --preset llm-release-metal # Metal (macOS) -cmake --workflow --preset llm-release-cuda # CUDA (Linux) +cmake --workflow --preset llm-release-cuda # CUDA (Linux/Windows) ``` **Manual CMake (custom flags):** From ffc0722008be325839e9b75f0d58d7b2d8747b11 Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 09:43:02 -0700 Subject: [PATCH 08/23] Update .claude/skills/building/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .claude/skills/building/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 32c28b45f24..1ec474ab3cc 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -138,7 +138,7 @@ Most commonly needed flags (full list: `CMakeLists.txt`): | `EXECUTORCH_BUILD_COREML` | Core ML (macOS/iOS) | | `EXECUTORCH_BUILD_MPS` | MPS GPU (macOS/iOS) | | `EXECUTORCH_BUILD_METAL` | Metal compute (macOS, requires EXTENSION_TENSOR) | -| `EXECUTORCH_BUILD_CUDA` | CUDA GPU (Linux, requires EXTENSION_TENSOR) | +| `EXECUTORCH_BUILD_CUDA` | CUDA GPU (Linux/Windows, requires 
EXTENSION_TENSOR) | | `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Optimized kernels | | `EXECUTORCH_BUILD_KERNELS_QUANTIZED` | Quantized kernels | | `EXECUTORCH_BUILD_EXTENSION_MODULE` | Module extension (requires DATA_LOADER + FLAT_TENSOR + NAMED_DATA_MAP) | From c6b9d343eac862025deb0e5a82ef38b232f4b490 Mon Sep 17 00:00:00 2001 From: aliafzal <4312898+aliafzal@users.noreply.github.com> Date: Tue, 10 Mar 2026 11:25:41 -0700 Subject: [PATCH 09/23] Fix Cadence CPU runner CMake build Differential Revision: D95846702 Pull Request resolved: https://github.com/pytorch/executorch/pull/18021 --- backends/cadence/build_cadence_runner.sh | 9 ++++++++- .../cadence/generic/operators/CMakeLists.txt | 18 +++--------------- 2 files changed, 11 insertions(+), 16 deletions(-) diff --git a/backends/cadence/build_cadence_runner.sh b/backends/cadence/build_cadence_runner.sh index a8f44719dc7..82968b196b3 100755 --- a/backends/cadence/build_cadence_runner.sh +++ b/backends/cadence/build_cadence_runner.sh @@ -31,12 +31,19 @@ main() { local example_dir=backends/cadence local build_dir="cmake-out/${example_dir}" - local cmake_prefix_path="${PWD}/cmake-out/lib/cmake/ExecuTorch;${PWD}/cmake-out/third-party/gflags" + # Detect lib vs lib64 + if [ -d "${PWD}/cmake-out/lib64/cmake/ExecuTorch" ]; then + libdir="lib64" + else + libdir="lib" + fi + local cmake_prefix_path="${PWD}/cmake-out/${libdir}/cmake/ExecuTorch;${PWD}/cmake-out/third-party/gflags" rm -rf ${build_dir} CXXFLAGS="-fno-exceptions -fno-rtti" cmake -DCMAKE_PREFIX_PATH="${cmake_prefix_path}" \ -DCMAKE_BUILD_TYPE=Release \ -DEXECUTORCH_CADENCE_CPU_RUNNER=ON \ -DEXECUTORCH_ENABLE_LOGGING=ON \ + -DPYTHON_EXECUTABLE="$(which python3)" \ -B"${build_dir}" \ "${example_dir}" cmake --build "${build_dir}" --config Release -j16 diff --git a/backends/cadence/generic/operators/CMakeLists.txt b/backends/cadence/generic/operators/CMakeLists.txt index b9afdc01cde..77d0b4949a3 100644 --- a/backends/cadence/generic/operators/CMakeLists.txt +++ 
b/backends/cadence/generic/operators/CMakeLists.txt
@@ -79,21 +79,9 @@ target_include_directories(
 )
 
 # Custom ops that are needed to run the test model.
-add_library(
-  custom_ops
-  "quantized_add_out.cpp"
-  "quantized_linear_out.cpp"
-  "quantized_conv2d_nchw_out.cpp"
-  "quantized_conv2d_nhwc_out.cpp"
-  "quantized_relu_out.cpp"
-  "quantized_layer_norm.cpp"
-  "quantize_per_tensor.cpp"
-  "quantized_fully_connected_out.cpp"
-  "dequantize_per_tensor.cpp"
-  "quantized_matmul_out.cpp"
-  "op_requantize_out.cpp"
-  "im2row_out.cpp"
-)
+file(GLOB custom_ops_srcs "*.cpp")
+add_library(custom_ops ${custom_ops_srcs})
+
 target_include_directories(
   custom_ops PUBLIC ${ROOT_DIR}/.. ${CMAKE_BINARY_DIR}
                     ${_common_include_directories}

From 286ccef8944cc20a93f118dc6e2b2fd4a0370981 Mon Sep 17 00:00:00 2001
From: Gregory Comer
Date: Tue, 10 Mar 2026 11:38:21 -0700
Subject: [PATCH 10/23] Skip Samsung jobs that require secrets on forked PRs
 (#18064)

### Summary
Don't run Samsung jobs which require secrets on forked PRs. They fail. It looks like they used to be disabled on forks, but this line ended up getting left commented out after the jobs were disabled and re-enabled. This PR restores it to the original state.
--- .github/workflows/pull.yml | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/.github/workflows/pull.yml b/.github/workflows/pull.yml index d88996ff8cb..0652c805b53 100644 --- a/.github/workflows/pull.yml +++ b/.github/workflows/pull.yml @@ -1057,7 +1057,8 @@ jobs: test-samsung-quantmodels-linux: name: test-samsung-quantmodels-linux - # if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request' + # Skip this job if the pull request is from a fork (secrets are not available) + if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request' uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main permissions: id-token: write @@ -1094,7 +1095,8 @@ jobs: test-samsung-models-linux: name: test-samsung-models-linux - # if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request' + # Skip this job if the pull request is from a fork (secrets are not available) + if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request' uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main permissions: id-token: write From c85bfe157ddb01c193425967e84d3ebf8c4d76be Mon Sep 17 00:00:00 2001 From: Manuel Candales <42380156+manuelcandales@users.noreply.github.com> Date: Tue, 10 Mar 2026 15:55:21 -0400 Subject: [PATCH 11/23] Autoglob: don't expose mode labels (#18069) Summary: I've migrated the codebase to stop using these. 
Reviewed By: d16r Differential Revision: D95810888 Co-authored-by: Adam Ernst --- extension/apple/BUCK | 1 - extension/llm/apple/BUCK | 1 - 2 files changed, 2 deletions(-) diff --git a/extension/apple/BUCK b/extension/apple/BUCK index 05371edfbdb..5fca78dd7c2 100644 --- a/extension/apple/BUCK +++ b/extension/apple/BUCK @@ -1,6 +1,5 @@ load("@fbcode_macros//build_defs:build_file_migration.bzl", "fbcode_target", "non_fbcode_target") load("@fbsource//tools/build_defs:platform_defs.bzl", "IOS") -load("@fbsource//tools/build_defs/apple:autoglob.bzl", "EXPORT_UNLESS_INTERNAL") load("@fbsource//tools/build_defs/apple:fb_apple_library.bzl", "fb_apple_library") load("@fbsource//tools/build_defs/apple:fb_apple_resource.bzl", "fb_apple_resource") load("@fbsource//xplat/executorch/build/fb:clients.bzl", "EXECUTORCH_CLIENTS") diff --git a/extension/llm/apple/BUCK b/extension/llm/apple/BUCK index 26dd36145ba..36da3c77935 100644 --- a/extension/llm/apple/BUCK +++ b/extension/llm/apple/BUCK @@ -1,6 +1,5 @@ load("@fbcode_macros//build_defs:build_file_migration.bzl", "non_fbcode_target") load("@fbsource//tools/build_defs:platform_defs.bzl", "IOS") -load("@fbsource//tools/build_defs/apple:autoglob.bzl", "EXPORT_UNLESS_INTERNAL") load("@fbsource//tools/build_defs/apple:fb_apple_library.bzl", "fb_apple_library") load("@fbsource//xplat/executorch/build/fb:clients.bzl", "EXECUTORCH_CLIENTS") load("@fbsource//tools/build_defs/apple:fb_apple_resource.bzl", "fb_apple_resource") From 179f84e47ff93fb48aa8b631429c5b56c4c17e68 Mon Sep 17 00:00:00 2001 From: Github Executorch Date: Tue, 10 Mar 2026 13:21:28 -0700 Subject: [PATCH 12/23] Harden building skill from e2e testing - Add venv fallback when conda is not installed - Handle conda PermissionError by checking env directory on disk - Auto-fix cmake: missing or < 3.24 gets pip-installed, >= 4.0 works fine - Add troubleshooting entries for conda not found and PEP 668 errors - Remove heavy-handed directive banner; let skill structure guide 
behavior Co-authored-by: Claude --- .claude/skills/building/SKILL.md | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 1ec474ab3cc..55fe237c10d 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -5,22 +5,24 @@ description: Build ExecuTorch from source — Python package, C++ runtime, runne # Building ExecuTorch -IMPORTANT: Do NOT ask the user questions about environment setup. Detect issues and fix them automatically. The standard path is conda — always use it. - -## Step 1: Ensure conda environment (do this silently, fix issues automatically) +## Step 1: Ensure Python environment (detect and fix automatically) ```bash # Check if executorch conda env exists; create if not -conda env list | grep executorch || conda create -yn executorch python=3.12 'cmake<4' +# Note: `conda env list` may fail with PermissionError on some setups. +# Fallback: check if the env directory exists on disk. +conda env list 2>/dev/null | grep executorch || \ + ls "$CONDA_PREFIX/../envs/" 2>/dev/null | grep executorch || \ + conda create -yn executorch python=3.12 + # Activate conda activate executorch -# Verify — fix automatically if wrong + +# Verify python --version # need 3.10–3.13 -cmake --version # need >= 3.24, < 4.0; if cmake >= 4, run: pip install 'cmake>=3.24,<4' +cmake --version # need >= 3.24; cmake 4.x works in practice ``` -If cmake is >= 4.0 inside the conda env, fix it: `pip install 'cmake>=3.24,<4'` — do not ask the user. - Parallel jobs: `$(sysctl -n hw.ncpu)` on macOS, `$(nproc)` on Linux. 
## Step 2: Build (route by what the user needs) @@ -154,7 +156,8 @@ Most commonly needed flags (full list: `CMakeLists.txt`): |---------|-----| | Missing headers / `CMakeLists.txt not found` in third-party | `git submodule sync --recursive && git submodule update --init --recursive` | | Mysterious failures after `git pull` or branch switch | `rm -rf cmake-out/ pip-out/ && git submodule sync && git submodule update --init --recursive` | -| CMake >= 4.0 (too new) | `pip install 'cmake>=3.24,<4'` inside the conda env | +| `conda env list` PermissionError | Use `CONDA_NO_PLUGINS=true conda env list` or check env dir directly | +| CMake >= 4.0 | Works in practice despite `< 4.0` in docs; only fix if build actually fails | | `externally-managed-environment` / PEP 668 error | You're using system Python, not conda. Activate conda env first. | | pip conflicts with torch versions | Fresh conda env; or `./install_executorch.sh --use-pt-pinned-commit` | | Missing `Python.h` (Linux) | `sudo apt install python3.X-dev` | From d75acb2a06faced75b326d584d98351e8f4bbcca Mon Sep 17 00:00:00 2001 From: Github Executorch Date: Tue, 10 Mar 2026 13:24:53 -0700 Subject: [PATCH 13/23] Add routing table to building skill for Android/iOS/model targets Explicit decision tree at the top of Step 2 so Claude routes to the right section based on keywords (Android, iOS, model names, cmake) instead of always defaulting to the Python package build. Co-authored-by: Claude --- .claude/skills/building/SKILL.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 55fe237c10d..2c6d3ada155 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -25,9 +25,16 @@ cmake --version # need >= 3.24; cmake 4.x works in practice Parallel jobs: `$(sysctl -n hw.ncpu)` on macOS, `$(nproc)` on Linux. 
-## Step 2: Build (route by what the user needs) +## Step 2: Build -### Python package (default — use this unless user asks for something specific) +Route based on what the user asks for: +- User mentions **Android** → skip to [Cross-compilation: Android](#cross-compilation) +- User mentions **iOS** or **frameworks** → skip to [Cross-compilation: iOS](#cross-compilation) +- User mentions a **model name** (llama, whisper, etc.) → skip to [LLM / ASR model runner](#llm--asr-model-runner-simplest-path-for-running-models) +- User mentions **C++ runtime** or **cmake** → skip to [C++ runtime](#c-runtime-standalone) +- Otherwise → default to **Python package** below + +### Python package (default) ```bash conda activate executorch ./install_executorch.sh --editable # editable install from source From 62d417dec4b5c8112c7954474c3c4843c2ea3df2 Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 13:28:31 -0700 Subject: [PATCH 14/23] Update .claude/skills/building/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .claude/skills/building/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 2c6d3ada155..7d4b5f75208 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -98,7 +98,7 @@ Output: `cmake-out/examples/models//` | Linux | `cmake -B cmake-out --preset linux -DCMAKE_BUILD_TYPE=Release` | | Windows | `cmake -B cmake-out --preset windows -T ClangCL` | -Then: `cmake --build cmake-out -j$(sysctl -n hw.ncpu)` (macOS) or `cmake --build cmake-out -j$(nproc)` (Linux) +Then: `cmake --build cmake-out --config Release -j$(sysctl -n hw.ncpu)` (macOS) or `cmake --build cmake-out -j$(nproc)` (Linux) **LLM libraries via workflow presets** (configure + build + install in one command): ```bash From 74d0c7ec50d1bface42d381e1ab9639ac7f078a1 Mon Sep 17 00:00:00 2001 From: Github Executorch Date: Tue, 10 Mar 
2026 13:34:22 -0700 Subject: [PATCH 15/23] Address PR review comments on building skill - Add ANDROID_NDK requirement and verification to Android section - Fix CMAKE_BUILD_TYPE description: not all presets set it - Separate build output table by flow (pip vs cmake vs cross-compilation) Co-authored-by: Claude --- .claude/skills/building/SKILL.md | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 7d4b5f75208..5ad9b132510 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -132,7 +132,11 @@ Run `cmake --list-presets` to see all available presets. Link in Xcode with `-all_load` linker flag. **Android:** + +Requires `ANDROID_NDK` on PATH (typically set by Android Studio or standalone NDK install). ```bash +# Verify NDK is available +echo $ANDROID_NDK # must point to NDK root, e.g. ~/Library/Android/sdk/ndk/ export ANDROID_ABIS=arm64-v8a BUILD_AAR_DIR=aar-out mkdir -p $BUILD_AAR_DIR && sh scripts/build_android_library.sh ``` @@ -155,7 +159,7 @@ Most commonly needed flags (full list: `CMakeLists.txt`): | `EXECUTORCH_BUILD_TESTS` | Unit tests (`ctest --test-dir cmake-out --output-on-failure`) | | `EXECUTORCH_BUILD_DEVTOOLS` | DevTools (Inspector, ETDump) | | `EXECUTORCH_OPTIMIZE_SIZE` | Size-optimized build (`-Os`, no exceptions/RTTI) | -| `CMAKE_BUILD_TYPE` | `Release` (default for presets) or `Debug` (5-10x slower) | +| `CMAKE_BUILD_TYPE` | `Release` or `Debug` (5-10x slower). Some presets (e.g. `llm-release`) set this; others require it explicitly. 
| ## Troubleshooting @@ -175,15 +179,25 @@ Most commonly needed flags (full list: `CMakeLists.txt`): ## Build output -Installed artifact locations after `cmake --install` (or `./install_executorch.sh`) with `CMAKE_INSTALL_PREFIX=cmake-out`: +**From `./install_executorch.sh` (Python package):** + +| Artifact | Location | +|----------|----------| +| Python package | `site-packages/executorch` | + +**From CMake builds** (`cmake --install` with `CMAKE_INSTALL_PREFIX=cmake-out`): | Artifact | Location | |----------|----------| | Core runtime | `cmake-out/lib/libexecutorch.a` | -| executor_runner (built only; not installed by default) | **build tree**: `/executor_runner` (Ninja/Make) or `//executor_runner` (e.g., `cmake-out/Release/executor_runner` with Xcode/Visual Studio) | -| Model runners | `cmake-out/examples/models//` | | XNNPACK backend | `cmake-out/lib/libxnnpack_backend.a` | -| Python package | `site-packages/executorch` | +| executor_runner | `cmake-out/executor_runner` (Ninja/Make) or `cmake-out/Release/executor_runner` (Xcode) | +| Model runners | `cmake-out/examples/models//` | + +**From cross-compilation:** + +| Artifact | Location | +|----------|----------| | iOS frameworks | `cmake-out/*.xcframework` | | Android AAR | `aar-out/` | From 8c0a60bf88ef50de6ab8312215956502402c2139 Mon Sep 17 00:00:00 2001 From: Manuel Candales <42380156+manuelcandales@users.noreply.github.com> Date: Tue, 10 Mar 2026 16:39:03 -0400 Subject: [PATCH 16/23] Re-export Q_ANNOTATION_KEY from quantizer annotators package Differential Revision: D95862010 Pull Request resolved: https://github.com/pytorch/executorch/pull/18063 --- examples/qualcomm/custom_op/custom_ops_1.py | 3 +-- examples/qualcomm/oss_scripts/fastvit.py | 6 ++---- 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/examples/qualcomm/custom_op/custom_ops_1.py b/examples/qualcomm/custom_op/custom_ops_1.py index 31b3b6ff3ec..ed99eabc9c8 100644 --- a/examples/qualcomm/custom_op/custom_ops_1.py +++ 
b/examples/qualcomm/custom_op/custom_ops_1.py @@ -70,11 +70,10 @@ def annotate_custom(gm: torch.fx.GraphModule) -> None: This function is specific for custom op. The source_fn of the rewritten nn module turns out to be "my_ops.mul3.default" """ - from executorch.backends.qualcomm.quantizer.annotators import _is_annotated - from executorch.backends.qualcomm.quantizer.qconfig import ( get_ptq_per_channel_quant_config, ) + from executorch.backends.qualcomm.quantizer.rules import _is_annotated from torch.fx import Node from torchao.quantization.pt2e.quantizer import QuantizationAnnotation from torchao.quantization.pt2e.quantizer.quantizer import Q_ANNOTATION_KEY diff --git a/examples/qualcomm/oss_scripts/fastvit.py b/examples/qualcomm/oss_scripts/fastvit.py index 3e620ab0300..87d90bb61b7 100644 --- a/examples/qualcomm/oss_scripts/fastvit.py +++ b/examples/qualcomm/oss_scripts/fastvit.py @@ -12,16 +12,13 @@ import numpy as np import torch -from executorch.backends.qualcomm.quantizer.annotators import ( - QuantizationConfig, - QuantizationSpec, -) from executorch.backends.qualcomm.quantizer.observers.per_channel_param_observer import ( PerChannelParamObserver, ) from executorch.backends.qualcomm.quantizer.qconfig import ( _derived_bias_quant_spec, MovingAverageMinMaxObserver, + QuantizationConfig, ) from executorch.backends.qualcomm.quantizer.quantizer import QuantDtype @@ -40,6 +37,7 @@ SimpleADB, topk_accuracy, ) +from torchao.quantization.pt2e.quantizer import QuantizationSpec def get_instance(repo_path: str, checkpoint_path: str): From 6209f273ded2976d1e10541a22e4adb7308e9429 Mon Sep 17 00:00:00 2001 From: Github Executorch Date: Tue, 10 Mar 2026 13:41:09 -0700 Subject: [PATCH 17/23] Fix fresh-Mac gaps: Xcode CLT, conda shell hook, Python version fallback Three issues that would break a fresh Mac checkout: - Add Xcode Command Line Tools prerequisite check - Add conda shell.bash hook for non-interactive shells (Claude Code / CI) - Add brew install python@3.12 guidance 
for venv path when only 3.14+ exists Co-authored-by: Claude --- .claude/skills/building/SKILL.md | 33 ++++++++++++++++++++++++++------ 1 file changed, 27 insertions(+), 6 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index 5ad9b132510..d349b50bd24 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -5,24 +5,45 @@ description: Build ExecuTorch from source — Python package, C++ runtime, runne # Building ExecuTorch +## Prerequisites (macOS) + +A C++ compiler is required. On macOS, ensure Xcode Command Line Tools are installed: +```bash +xcode-select -p || xcode-select --install +``` + ## Step 1: Ensure Python environment (detect and fix automatically) +**Path A — conda (preferred):** ```bash +# Initialize conda for non-interactive shells (required in Claude Code / CI) +eval "$(conda shell.bash hook 2>/dev/null)" + # Check if executorch conda env exists; create if not -# Note: `conda env list` may fail with PermissionError on some setups. -# Fallback: check if the env directory exists on disk. conda env list 2>/dev/null | grep executorch || \ - ls "$CONDA_PREFIX/../envs/" 2>/dev/null | grep executorch || \ + ls "$(conda info --base 2>/dev/null)/envs/" 2>/dev/null | grep executorch || \ conda create -yn executorch python=3.12 # Activate conda activate executorch +``` -# Verify -python --version # need 3.10–3.13 -cmake --version # need >= 3.24; cmake 4.x works in practice +**Path B — no conda (fall back to venv):** +```bash +# Find a compatible Python (3.10–3.13). On macOS with only Homebrew Python 3.14+, +# install a compatible version first: brew install python@3.12 +python3.12 -m venv .executorch-venv # or python3.11, python3.10, python3.13 +source .executorch-venv/bin/activate +pip install --upgrade pip ``` +**Then verify (either path):** + +Run `python --version` and `cmake --version`. Fix automatically: +- **Python not 3.10–3.13**: recreate the env with a correct Python version. 
+- **cmake missing or < 3.24**: run `pip install 'cmake>=3.24'` inside the env. +- **cmake >= 4.0**: works in practice, no action needed. + Parallel jobs: `$(sysctl -n hw.ncpu)` on macOS, `$(nproc)` on Linux. ## Step 2: Build From d2e8919374ef4c06bdcedf0d51a7f907888a34da Mon Sep 17 00:00:00 2001 From: Github Executorch Date: Tue, 10 Mar 2026 13:42:53 -0700 Subject: [PATCH 18/23] =?UTF-8?q?Remove=20Xcode=20CLT=20prerequisite=20?= =?UTF-8?q?=E2=80=94=20not=20in=20ET=20docs,=20rarely=20needed?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Claude --- .claude/skills/building/SKILL.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md index d349b50bd24..d1322cdecae 100644 --- a/.claude/skills/building/SKILL.md +++ b/.claude/skills/building/SKILL.md @@ -5,13 +5,6 @@ description: Build ExecuTorch from source — Python package, C++ runtime, runne # Building ExecuTorch -## Prerequisites (macOS) - -A C++ compiler is required. 
On macOS, ensure Xcode Command Line Tools are installed: -```bash -xcode-select -p || xcode-select --install -``` - ## Step 1: Ensure Python environment (detect and fix automatically) **Path A — conda (preferred):** From dda73d31679f424563a1d73bd87117eaf8bbd9f3 Mon Sep 17 00:00:00 2001 From: s09g <13538214+s09g@users.noreply.github.com> Date: Tue, 10 Mar 2026 14:10:02 -0700 Subject: [PATCH 19/23] Add WASM/Emscripten compiler flags to runtime_wrapper.bzl Differential Revision: D95904580 Pull Request resolved: https://github.com/pytorch/executorch/pull/18025 --- shim_et/xplat/executorch/build/runtime_wrapper.bzl | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/shim_et/xplat/executorch/build/runtime_wrapper.bzl b/shim_et/xplat/executorch/build/runtime_wrapper.bzl index 92fafc78bab..01004595ff1 100644 --- a/shim_et/xplat/executorch/build/runtime_wrapper.bzl +++ b/shim_et/xplat/executorch/build/runtime_wrapper.bzl @@ -123,6 +123,16 @@ def _patch_build_mode_flags(kwargs): # @oss-disable: "fbsource//xplat/assistant/oacr/native/scripts:compiler_flag_O2": ["-O2"], }) + # Add pthread flags for Emscripten/WASM builds with threading support. + # Required when linking into WASM binaries that use -sUSE_PTHREADS=1. + # Without these flags, wasm-ld fails with: + # "error: --shared-memory is disallowed by .o because it was not + # compiled with 'atomics' or 'bulk-memory' features." 
+ kwargs["compiler_flags"] = kwargs["compiler_flags"] + select({ + "DEFAULT": [], + # @oss-disable: "ovr_config//runtime:wasm-emscripten": ["-pthread", "-matomics", "-mbulk-memory"], + }) + return kwargs def _has_pytorch_dep(dep_list): From 3baf6c20327d65599c24eba99804949f4501fae8 Mon Sep 17 00:00:00 2001 From: Nitin Jain Date: Tue, 10 Mar 2026 14:54:44 -0700 Subject: [PATCH 20/23] Remove extern "C" wrapping and fix format specifiers for ARM embedded builds Differential Revision: D95739935 Pull Request resolved: https://github.com/pytorch/executorch/pull/18000 --- .../ops/cmsis_scratch_buffer_context.h | 4 +-- backends/cortex_m/ops/cortex_m_ops_common.h | 33 ++++++++++--------- backends/cortex_m/ops/op_maximum.cpp | 5 --- backends/cortex_m/ops/op_minimum.cpp | 5 --- backends/cortex_m/ops/op_pad.cpp | 4 --- backends/cortex_m/ops/op_quantized_add.cpp | 5 --- .../cortex_m/ops/op_quantized_avg_pool2d.cpp | 4 --- backends/cortex_m/ops/op_quantized_conv2d.cpp | 4 --- .../ops/op_quantized_depthwise_conv2d.cpp | 4 --- backends/cortex_m/ops/op_quantized_linear.cpp | 4 --- .../cortex_m/ops/op_quantized_max_pool2d.cpp | 4 --- backends/cortex_m/ops/op_quantized_mul.cpp | 5 --- .../ops/op_quantized_transpose_conv2d.cpp | 4 --- backends/cortex_m/ops/op_softmax.cpp | 5 --- backends/cortex_m/ops/op_transpose.cpp | 5 --- 15 files changed, 18 insertions(+), 77 deletions(-) diff --git a/backends/cortex_m/ops/cmsis_scratch_buffer_context.h b/backends/cortex_m/ops/cmsis_scratch_buffer_context.h index 4b9fdaebdf7..4672f05e777 100644 --- a/backends/cortex_m/ops/cmsis_scratch_buffer_context.h +++ b/backends/cortex_m/ops/cmsis_scratch_buffer_context.h @@ -7,10 +7,8 @@ */ #pragma once -#include "cortex_m_ops_common.h" -extern "C" { #include "arm_nnfunctions.h" -} +#include "cortex_m_ops_common.h" namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/cortex_m_ops_common.h b/backends/cortex_m/ops/cortex_m_ops_common.h index 1b31367881f..4c0f83d6eb6 100644 --- 
a/backends/cortex_m/ops/cortex_m_ops_common.h +++ b/backends/cortex_m/ops/cortex_m_ops_common.h @@ -16,12 +16,12 @@ #include #include +#include #include #include -extern "C" { #include "arm_nn_types.h" -} +#include "arm_nnfunctions.h" using Tensor = torch::executor::Tensor; using ScalarType = executorch::aten::ScalarType; @@ -47,19 +47,19 @@ inline void validate_cmsis_nn_tensor_requirements( // Basic dtype validation ET_CHECK_MSG( input1.scalar_type() == expected_dtype, - "Input1 dtype must be %hhd, got %hhd", - expected_dtype, - input1.scalar_type()); + "Input1 dtype must be %d, got %d", + static_cast(expected_dtype), + static_cast(input1.scalar_type())); ET_CHECK_MSG( input2.scalar_type() == expected_dtype, - "Input2 dtype must be %hhd, got %hhd", - expected_dtype, - input2.scalar_type()); + "Input2 dtype must be %d, got %d", + static_cast(expected_dtype), + static_cast(input2.scalar_type())); ET_CHECK_MSG( output.scalar_type() == expected_dtype, - "Output dtype must be %hhd, got %hhd", - expected_dtype, - output.scalar_type()); + "Output dtype must be %d, got %d", + static_cast(expected_dtype), + static_cast(output.scalar_type())); if (require_same_sizes) { ET_CHECK_MSG( input1.sizes() == input2.sizes(), @@ -78,16 +78,17 @@ inline void validate_single_quant_params( const int64_t multiplier, const int64_t shift, const char* param_name) { + (void)zero_point; ET_CHECK_MSG( multiplier >= std::numeric_limits::min() && multiplier <= std::numeric_limits::max(), - "%s multiplier must be in int32 range [Value: %d]", + "%s multiplier must be in int32 range [Value: %" PRIi64 "]", param_name, multiplier); ET_CHECK_MSG( shift >= -31 && shift <= 31, - "%s shift must be in range [-31, 31] [Value: %d]", + "%s shift must be in range [-31, 31] [Value: %" PRIi64 "]", param_name, shift); } @@ -172,7 +173,7 @@ inline bool check_int32_within_range( value > std::numeric_limits::max()) { ET_LOG( Error, - "%s: %s value (%ld) exceeds int32_t range", + "%s: %s value (%" PRIi64 ") exceeds 
int32_t range", op_name, value_name, value); @@ -354,14 +355,14 @@ inline bool validate_per_channel_quant_params( if (multipliers[i] <= ARM_NN_Q31_MIN || multipliers[i] > ARM_NN_Q31_MAX) { ET_LOG( Error, - "weight_multiplier[%d] out of CMSIS-NN range: %d", + "weight_multiplier[%d] out of CMSIS-NN range: %" PRIi64, i, multipliers[i]); return false; } // Shift: {-31, 30} for arm_nn_requantize if (shifts[i] < -31 || shifts[i] > 30) { - ET_LOG(Error, "weight_shift[%d] out of range: %d", i, shifts[i]); + ET_LOG(Error, "weight_shift[%d] out of range: %" PRIi64, i, shifts[i]); return false; } } diff --git a/backends/cortex_m/ops/op_maximum.cpp b/backends/cortex_m/ops/op_maximum.cpp index 71a907f12ea..fc76f5c8c48 100644 --- a/backends/cortex_m/ops/op_maximum.cpp +++ b/backends/cortex_m/ops/op_maximum.cpp @@ -7,11 +7,6 @@ #include "cortex_m_ops_common.h" -// Include CMSIS-NN headers with C linkage -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_minimum.cpp b/backends/cortex_m/ops/op_minimum.cpp index f220aa2664b..5a75cb8a1dc 100644 --- a/backends/cortex_m/ops/op_minimum.cpp +++ b/backends/cortex_m/ops/op_minimum.cpp @@ -9,11 +9,6 @@ #include "cortex_m_ops_common.h" -// Include CMSIS-NN headers with C linkage -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_pad.cpp b/backends/cortex_m/ops/op_pad.cpp index 739c584c419..b400f4c7e19 100644 --- a/backends/cortex_m/ops/op_pad.cpp +++ b/backends/cortex_m/ops/op_pad.cpp @@ -8,10 +8,6 @@ #include "cortex_m_ops_common.h" -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_quantized_add.cpp b/backends/cortex_m/ops/op_quantized_add.cpp index 2cab7dc37fb..b4bbfdaffce 100644 --- a/backends/cortex_m/ops/op_quantized_add.cpp +++ b/backends/cortex_m/ops/op_quantized_add.cpp @@ -9,11 +9,6 @@ #include 
"cortex_m_ops_common.h" -// Include CMSIS-NN headers with C linkage -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { using KernelRuntimeContext = torch::executor::KernelRuntimeContext; diff --git a/backends/cortex_m/ops/op_quantized_avg_pool2d.cpp b/backends/cortex_m/ops/op_quantized_avg_pool2d.cpp index ad77bb54aff..293c6ea6957 100644 --- a/backends/cortex_m/ops/op_quantized_avg_pool2d.cpp +++ b/backends/cortex_m/ops/op_quantized_avg_pool2d.cpp @@ -7,10 +7,6 @@ #include "cortex_m_ops_common.h" -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_quantized_conv2d.cpp b/backends/cortex_m/ops/op_quantized_conv2d.cpp index 3eae9507ba7..0fa6a3f8536 100644 --- a/backends/cortex_m/ops/op_quantized_conv2d.cpp +++ b/backends/cortex_m/ops/op_quantized_conv2d.cpp @@ -7,10 +7,6 @@ #include "cortex_m_ops_common.h" -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_quantized_depthwise_conv2d.cpp b/backends/cortex_m/ops/op_quantized_depthwise_conv2d.cpp index b3cf926c2e1..8dec61e0af1 100644 --- a/backends/cortex_m/ops/op_quantized_depthwise_conv2d.cpp +++ b/backends/cortex_m/ops/op_quantized_depthwise_conv2d.cpp @@ -7,10 +7,6 @@ #include "cortex_m_ops_common.h" -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_quantized_linear.cpp b/backends/cortex_m/ops/op_quantized_linear.cpp index f04b65fa1fb..5d018cbc0c4 100644 --- a/backends/cortex_m/ops/op_quantized_linear.cpp +++ b/backends/cortex_m/ops/op_quantized_linear.cpp @@ -9,10 +9,6 @@ #include "cortex_m_ops_common.h" -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { using KernelRuntimeContext = torch::executor::KernelRuntimeContext; diff --git a/backends/cortex_m/ops/op_quantized_max_pool2d.cpp 
b/backends/cortex_m/ops/op_quantized_max_pool2d.cpp index 470a7ae791e..181a29c1b65 100644 --- a/backends/cortex_m/ops/op_quantized_max_pool2d.cpp +++ b/backends/cortex_m/ops/op_quantized_max_pool2d.cpp @@ -7,10 +7,6 @@ #include "cortex_m_ops_common.h" -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_quantized_mul.cpp b/backends/cortex_m/ops/op_quantized_mul.cpp index 3d9d6ab54a4..524e74a6b9f 100644 --- a/backends/cortex_m/ops/op_quantized_mul.cpp +++ b/backends/cortex_m/ops/op_quantized_mul.cpp @@ -7,11 +7,6 @@ #include "cortex_m_ops_common.h" -// Include CMSIS-NN headers with C linkage -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { namespace { diff --git a/backends/cortex_m/ops/op_quantized_transpose_conv2d.cpp b/backends/cortex_m/ops/op_quantized_transpose_conv2d.cpp index 7126a2b2cf7..e3f6135c7b9 100644 --- a/backends/cortex_m/ops/op_quantized_transpose_conv2d.cpp +++ b/backends/cortex_m/ops/op_quantized_transpose_conv2d.cpp @@ -8,10 +8,6 @@ #include "cortex_m_ops_common.h" -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_softmax.cpp b/backends/cortex_m/ops/op_softmax.cpp index a2b8f27fac1..c07a538db84 100644 --- a/backends/cortex_m/ops/op_softmax.cpp +++ b/backends/cortex_m/ops/op_softmax.cpp @@ -11,11 +11,6 @@ #include #include -// Include CMSIS-NN headers with C linkage -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { diff --git a/backends/cortex_m/ops/op_transpose.cpp b/backends/cortex_m/ops/op_transpose.cpp index 25458435a3c..7fcbc034283 100644 --- a/backends/cortex_m/ops/op_transpose.cpp +++ b/backends/cortex_m/ops/op_transpose.cpp @@ -11,11 +11,6 @@ #include #include -// Include CMSIS-NN headers with C linkage -extern "C" { -#include "arm_nnfunctions.h" -} - namespace cortex_m { namespace native { From 
bad1aec67e9d8e133efbc1b37a2335fac0cdb910 Mon Sep 17 00:00:00 2001 From: Erlend Aune Date: Tue, 10 Mar 2026 23:07:53 +0100 Subject: [PATCH 21/23] Xnnpack disable workspace nonlock (#17780) ### Summary Remove lock on XNNPACK Disabled Workspace Mode. ### Test plan See test. cc @GregoryComer @digantdesai @cbilgin --- backends/xnnpack/runtime/XNNWorkspace.h | 8 +++++++ .../xnnpack/runtime/XNNWorkspaceManager.cpp | 1 + backends/xnnpack/test/CMakeLists.txt | 2 +- .../test/runtime/test_workspace_manager.cpp | 24 +++++++++++++++++++ 4 files changed, 34 insertions(+), 1 deletion(-) diff --git a/backends/xnnpack/runtime/XNNWorkspace.h b/backends/xnnpack/runtime/XNNWorkspace.h index 507953a10ab..b7ef442c460 100644 --- a/backends/xnnpack/runtime/XNNWorkspace.h +++ b/backends/xnnpack/runtime/XNNWorkspace.h @@ -34,6 +34,9 @@ class XNNWorkspace { XNNWorkspace& operator=(XNNWorkspace&&) = delete; std::pair, xnn_workspace_t> acquire() { + if (!lock_required_) { + return {std::unique_lock{}, workspace_.get()}; + } auto lock = std::unique_lock(mutex_); return {std::move(lock), workspace_.get()}; } @@ -52,6 +55,10 @@ class XNNWorkspace { return id_; } + void disable_locking() { + lock_required_ = false; + } + static runtime::Result> create() { // Because this class can't be moved, we need to construct it in-place. 
xnn_workspace_t workspace = nullptr; @@ -72,6 +79,7 @@ class XNNWorkspace { static inline std::atomic next_id_{0}; std::mutex mutex_; uint64_t id_; + bool lock_required_ = true; WorkspacePtr workspace_; }; diff --git a/backends/xnnpack/runtime/XNNWorkspaceManager.cpp b/backends/xnnpack/runtime/XNNWorkspaceManager.cpp index d8c6dae4d6d..5af3395ed89 100644 --- a/backends/xnnpack/runtime/XNNWorkspaceManager.cpp +++ b/backends/xnnpack/runtime/XNNWorkspaceManager.cpp @@ -56,6 +56,7 @@ XNNWorkspaceManager::get_or_create_workspace(uintptr_t program_id) const { return create_result.error(); } + create_result.get()->disable_locking(); return create_result.get(); } else if (mode == WorkspaceSharingMode::PerModel) { return get_or_create_model_workspace(program_id); diff --git a/backends/xnnpack/test/CMakeLists.txt b/backends/xnnpack/test/CMakeLists.txt index 395fb01d189..3d9c77d6ad6 100644 --- a/backends/xnnpack/test/CMakeLists.txt +++ b/backends/xnnpack/test/CMakeLists.txt @@ -17,7 +17,7 @@ set(EXECUTORCH_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../../..) 
include(${EXECUTORCH_ROOT}/tools/cmake/Test.cmake) -set(_test_srcs runtime/test_xnnexecutor.cpp +set(_test_srcs runtime/test_xnnexecutor.cpp runtime/test_workspace_manager.cpp ${EXECUTORCH_ROOT}/extension/threadpool/test/threadpool_test.cpp ) diff --git a/backends/xnnpack/test/runtime/test_workspace_manager.cpp b/backends/xnnpack/test/runtime/test_workspace_manager.cpp index 8d3203f3f40..a7689966635 100644 --- a/backends/xnnpack/test/runtime/test_workspace_manager.cpp +++ b/backends/xnnpack/test/runtime/test_workspace_manager.cpp @@ -107,6 +107,18 @@ TEST_F(XNNWorkspaceManagerTest, DisabledMode) { workspace2->unsafe_get_workspace(), workspace3->unsafe_get_workspace()); } +TEST_F(XNNWorkspaceManagerTest, DisabledModeAcquireDoesNotLock) { + workspace_manager_->set_sharing_mode(WorkspaceSharingMode::Disabled); + + auto workspace_result = workspace_manager_->get_or_create_workspace(12345); + ASSERT_TRUE(workspace_result.ok()); + auto workspace = workspace_result.get(); + + auto [lock, ptr] = workspace->acquire(); + ASSERT_NE(ptr, nullptr); + EXPECT_FALSE(lock.owns_lock()); +} + TEST_F(XNNWorkspaceManagerTest, PerModelMode) { // In PerModel mode, calls with the same program_id should return the same // workspace. @@ -139,6 +151,18 @@ TEST_F(XNNWorkspaceManagerTest, PerModelMode) { workspace1->unsafe_get_workspace(), workspace3->unsafe_get_workspace()); } +TEST_F(XNNWorkspaceManagerTest, PerModelAcquireStillLocks) { + workspace_manager_->set_sharing_mode(WorkspaceSharingMode::PerModel); + + auto workspace_result = workspace_manager_->get_or_create_workspace(12345); + ASSERT_TRUE(workspace_result.ok()); + auto workspace = workspace_result.get(); + + auto [lock, ptr] = workspace->acquire(); + ASSERT_NE(ptr, nullptr); + EXPECT_TRUE(lock.owns_lock()); +} + TEST_F(XNNWorkspaceManagerTest, GlobalMode) { // In Global mode, all calls should return the same workspace. 
workspace_manager_->set_sharing_mode(WorkspaceSharingMode::Global); From cedfe4c1c62f16b0dc4ac0304f16bb021c6814fd Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 15:13:35 -0700 Subject: [PATCH 22/23] Fix heap-buffer-overflow in constant_pad_nd (#18018) Summary: Fix write-heap-buffer-overflow in set_all_to_value triggered via apply_padding_to_dim, reported by fuzzer (T258811544). Root causes: 1. Negative padding values silently cast to huge size_t, causing massive out-of-bounds writes. 2. When out_data advances past out_data_end, the remaining computation (out_data_end - out_data) wraps around to a huge size_t, causing bounds checks to incorrectly pass. 3. No error propagation after recursive apply_padding_to_dim calls, allowing the loop to continue writing after a child call has failed. Fixes: - Validate all padding values are non-negative in check_constant_pad_args. - Read padding as int64_t and explicitly check >= 0 before casting to size_t. - Guard remaining computation with out_data <= out_data_end check at all three bounds-check sites to prevent size_t wraparound. - Check ctx.failure_state() after recursive calls and bail out early. - Remove dead pad_i >= 0 check (always true for size_t). 
Differential Revision: D95762335 --- kernels/portable/cpu/op_constant_pad_nd.cpp | 37 +++++++++++++++++-- kernels/portable/cpu/util/kernel_ops_util.cpp | 8 ++++ 2 files changed, 41 insertions(+), 4 deletions(-) diff --git a/kernels/portable/cpu/op_constant_pad_nd.cpp b/kernels/portable/cpu/op_constant_pad_nd.cpp index d3f3fdd75d7..2127cca3d5c 100644 --- a/kernels/portable/cpu/op_constant_pad_nd.cpp +++ b/kernels/portable/cpu/op_constant_pad_nd.cpp @@ -51,9 +51,17 @@ void apply_padding_to_dim( size_t pad_before = 0; size_t pad_after = 0; - if (pad_i >= 0 && pad_i < pad.size() / 2) { - pad_before = pad[2 * pad_i]; - pad_after = pad[2 * pad_i + 1]; + if (pad_i < pad.size() / 2) { + int64_t pb = pad[2 * pad_i]; + int64_t pa = pad[2 * pad_i + 1]; + ET_KERNEL_CHECK_MSG( + ctx, + pb >= 0 && pa >= 0, + InvalidArgument, + /* void */, + "Padding values must be non-negative."); + pad_before = static_cast(pb); + pad_after = static_cast(pa); } size_t out_step_len = out_strides[dim]; @@ -62,6 +70,12 @@ void apply_padding_to_dim( // Do not copy padding beyond the out tensor bounds. // Use division to avoid potential overflow in multiplication. if (pad_before > 0) { + ET_KERNEL_CHECK_MSG( + ctx, + out_data <= out_data_end, + InvalidArgument, + /* void */, + "Out data pointer exceeds buffer bounds."); size_t remaining = out_data_end - out_data; ET_KERNEL_CHECK_MSG( ctx, @@ -92,7 +106,12 @@ void apply_padding_to_dim( /* void */, "Out tensor overlaps with the input tensor. 
This is not supported."); // Bounds check before memcpy - // Use overflow-safe check for remaining >= copy_len + ET_KERNEL_CHECK_MSG( + ctx, + out_data <= out_data_end, + InvalidArgument, + /* void */, + "Out data pointer exceeds buffer bounds."); size_t remaining = out_data_end - out_data; ET_KERNEL_CHECK_MSG( ctx, @@ -123,6 +142,10 @@ void apply_padding_to_dim( last_padded_dim, dim + 1); + if (ctx.failure_state() != Error::Ok) { + return; + } + out_data += out_step_len; self_data += in_step_len; } @@ -131,6 +154,12 @@ void apply_padding_to_dim( // Do not copy padding beyond the out tensor bounds. // Use division to avoid potential overflow in multiplication. if (pad_after > 0) { + ET_KERNEL_CHECK_MSG( + ctx, + out_data <= out_data_end, + InvalidArgument, + /* void */, + "Out data pointer exceeds buffer bounds."); size_t remaining = out_data_end - out_data; ET_KERNEL_CHECK_MSG( ctx, diff --git a/kernels/portable/cpu/util/kernel_ops_util.cpp b/kernels/portable/cpu/util/kernel_ops_util.cpp index daa85f6beec..46fac7bde39 100644 --- a/kernels/portable/cpu/util/kernel_ops_util.cpp +++ b/kernels/portable/cpu/util/kernel_ops_util.cpp @@ -564,6 +564,14 @@ bool check_constant_pad_args( pad.size() / 2, in.dim()); + for (size_t i = 0; i < pad.size(); ++i) { + ET_CHECK_OR_RETURN_FALSE( + pad[i] >= 0, + "Padding values must be non-negative, but got pad[%zu] = %" PRId64, + i, + pad[i]); + } + return true; } From 518daa8cc0eb0873c6f83bc5a328769d3b01fc45 Mon Sep 17 00:00:00 2001 From: Siddartha Pothapragada Date: Tue, 10 Mar 2026 15:57:53 -0700 Subject: [PATCH 23/23] Revise ethos doc links in CMakeLists.txt (#18075) Updated documentation links for Ethos-U memory modes. 
--------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .../runtime/CMakeLists.txt | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/examples/arm/image_classification_example_ethos_u/runtime/CMakeLists.txt b/examples/arm/image_classification_example_ethos_u/runtime/CMakeLists.txt index 9d9f0645bd5..6704c0d6fda 100644 --- a/examples/arm/image_classification_example_ethos_u/runtime/CMakeLists.txt +++ b/examples/arm/image_classification_example_ethos_u/runtime/CMakeLists.txt @@ -118,9 +118,11 @@ set(LINK_FILE_OUT # Shared_Sram, in the application, we set ETHOSU_ARENA to 0 so that the # intermediate tensors are placed in the SRAM. If you generate a pte for a # different memory mode, you need to change the placement in the linker script. -# Read -# https://docs.pytorch.org/executorch/stable/backends-arm-ethos-u.html#ethos-u-memory-modes -# for more information. +# For more information, see the stable documentation: +# https://docs.pytorch.org/executorch/stable/backends/arm-ethos-u/arm-ethos-u-overview.html#ethos-u-memory-modes + +# For 1.0 compatibility (if required) +# https://docs.pytorch.org/executorch/1.0/backends-arm-ethos-u.html#ethos-u-memory-modes set(ETHOSU_ARENA "0") # Generate linker script - we have a few if/else statements in # Corstone-320.ld/Corstone-300.ld that are compiled into a final linker script.