[aoti-et] Enable multimodal runner for Voxtral on CUDA #14980
larryliu0820 merged 10 commits into main
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14980
Note: links to docs will display an error until the docs builds have completed. As of commit afc2159 with merge base 66c3dea: ❌ 6 new failures, 4 pending.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Gasoonjia
left a comment
Thanks for your great work!
The size/stride change seems pretty strange to me: I can't imagine a case where the tensor pointer stays the same while its size/stride changes.
@swolchok take another look?
swolchok
left a comment
objections withdrawn. I think with some work you can further simplify the sizes()/strides() update stuff, up to you how much of it you want to do right now
```cpp
AOTITorchError aoti_torch_get_strides(Tensor* tensor, int64_t** ret_strides) {
  auto it = internal::tensor_to_strides.find(tensor);
  bool needs_update = false;
```
Can you make the docblock something like this?

```cpp
// CRITICAL: Multimodal models reuse tensors with different shapes across
// executions (e.g., variable-length audio). We MUST validate cached metadata
// matches current tensor state, or CUDA kernels will receive incorrect shapes
// leading to memory corruption and segfaults.
```
```cpp
// Need to re-register all the symbols from the so_handle hosted by this
// CudaBackend instance. The reason is that these symbols are
// static/singleton across the whole process. When we share multiple methods
// (meaning multiple so_handle) in the same process, we need to re-register
// the symbols from the so_handle that is being used in this execution.
ET_CHECK_OK_OR_RETURN_ERROR(
    register_shared_library_functions(handle->so_handle));
```
If we're loading the model once and doing execute/inference multiple times, it will register multiple times, no?
Can you do something like this?
```cpp
void* last_registered_handle = nullptr;
if (handle->so_handle != last_registered_handle) {
  ET_CHECK_OK_OR_RETURN_ERROR(
      register_shared_library_functions(handle->so_handle));
  last_registered_handle = handle->so_handle;
}
```
So the so_handle won't change. It's just that we are mapping the symbols differently, especially AOTInductorModelContainerRun. Say we do the following:
- load(token_embedding)
- load(audio_encoder)
- load(text_decoder)
- run(audio_encoder) <-- here
`AOTInductorModelContainerRun` maps to the symbol in text_decoder.so, so we need to remap the symbol to audio_encoder.so.
Can you store the AOTInductorModelContainerRunFunc inside AOTIDelegateHandle?
```cpp
struct AOTIDelegateHandle {
  void* so_handle;
  std::string so_path;
  AOTInductorModelContainerHandle container_handle;
  void* cuda_stream;
  AOTInductorModelContainerRunFunc run_func;
  // ... etc for all symbols
};

Result<DelegateHandle*> init(...) const override {
  AOTIDelegateHandle* handle = new AOTIDelegateHandle();
  handle->so_handle = so_handle;
  // Load symbols into THIS handle's struct (not global)
  handle->run_func = reinterpret_cast<AOTInductorModelContainerRunFunc>(
      dlsym(so_handle, "AOTInductorModelContainerRun"));
  // ... etc
  ET_CHECK_OR_RETURN_ERROR(
      handle->run_func != nullptr,
      AccessFailed,
      "Failed to load AOTInductorModelContainerRun");
  return (DelegateHandle*)handle;
}

Error execute(..., DelegateHandle* handle_, ...) const override {
  AOTIDelegateHandle* handle = (AOTIDelegateHandle*)handle_;
  // NO re-registration, use the handle's local symbols
  AOTIRuntimeError error = handle->run_func(...);
  // ... rest of execution ...
}
```
Also, can you update https://github.com/pytorch/executorch/blob/main/examples/models/voxtral/README.md to include the additional CUDA instructions too?
This pull request introduces changes to the CUDA workflow, model artifact handling, and multimodal runner logic. The main changes include restructuring the GitHub Actions workflow to separate model export, benchmarking, and end-to-end testing for the Voxtral CUDA pipeline, improving artifact management and reproducibility. Additionally, the multimodal runner now supports automatic conversion of audio tensors to bfloat16, ensuring compatibility with expected input types. There are also enhancements to caching and symbol registration in the CUDA backend, and build system updates to support linking the CUDA backend.

**Workflow and Artifact Management Improvements:**

* Refactored `.github/workflows/cuda.yml` to split the Voxtral CUDA pipeline into three jobs: `export-voxtral-cuda-artifact` (exports and stores model artifacts), `benchmark-voxtral-cuda` (benchmarks using exported artifacts), and `test-voxtral-cuda-e2e` (runs full end-to-end tests with artifact download and audio input). Improved artifact handling, reproducibility, and added explicit checks for required files. [[1]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89L90-R91) [[2]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R107) [[3]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R134-R185) [[4]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R196-R267) [[5]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R122)

**Multimodal Runner Logic:**

* Added automatic conversion of audio tensors to bfloat16 in `MultimodalPrefiller::prefill` and implemented a helper function `convert_to_bfloat16` in `util.h` to support this. This ensures that audio inputs match the expected dtype for the encoder, improving robustness for multimodal inference. [[1]](diffhunk://#diff-ad4fcb32ffc5f1f7b4f87b5ee58927cb948a8c0976295befd10e3de445913ae4L96-R136) [[2]](diffhunk://#diff-db4801445eaa3bb4f1370fe41d3a00ae2e3ef354a23ad4d5ace141ecc3c6f413R144-R180)

**CUDA Backend and Caching Enhancements:**

* Improved caching logic in `common_shims.cpp` for tensor strides and sizes by validating cached values and updating them when necessary. This prevents stale cache issues and ensures correct tensor metadata. [[1]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R54-R81) [[2]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R104-R130)
* Added dynamic symbol re-registration in `CudaBackend` to handle multiple shared objects in the same process, ensuring correct execution when switching between models.
* Removed redundant logging statements in the CUDA backend for cleaner output. [[1]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L226) [[2]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L256)

**Build System Updates:**

* Updated `CMakeLists.txt` and `executorch-config.cmake` to include and link the CUDA backend (`aoti_cuda`) when building Voxtral and other components, improving build flexibility and CUDA support. [[1]](diffhunk://#diff-606feb24310595f592d98d021a2c90618346977d94decb80b35b7e26ed8ccc1eR89-R95) [[2]](diffhunk://#diff-6a78a155992483ff6f35d595ff6cef63b477d1c853f6482e77acae6ef443f0e4R56)

**Debugging and Tuning Options:**

* Added support for enabling debug compilation in `cuda_backend.py` via the `DEBUG` environment variable, allowing easier troubleshooting and development.