[aoti-et] Enable multimodal runner for Voxtral on CUDA #14980
larryliu0820 merged 10 commits into main
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14980
Note: links to docs will display an error until the docs builds have completed. As of commit afc2159 with merge base 66c3dea: ❌ 6 new failures, 4 pending.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Gasoonjia
left a comment
Thanks for your great work!
The size/stride change seems pretty strange to me: I can't imagine a case where the tensor pointer stays the same while its size/stride changes.
@swolchok take another look?
swolchok
left a comment
objections withdrawn. I think with some work you can further simplify the sizes()/strides() update stuff, up to you how much of it you want to do right now
```cpp
AOTITorchError aoti_torch_get_strides(Tensor* tensor, int64_t** ret_strides) {
  auto it = internal::tensor_to_strides.find(tensor);
  bool needs_update = false;
```
Can you make the docblock something like this?

```cpp
// CRITICAL: Multimodal models reuse tensors with different shapes across
// executions (e.g., variable-length audio). We MUST validate cached metadata
// matches current tensor state, or CUDA kernels will receive incorrect shapes
// leading to memory corruption and segfaults.
```
```cpp
// Need to re-register all the symbols from the so_handle hosted by this
// CudaBackend instance. The reason is that these symbols are
// static/singleton across the whole process. When we share multiple methods
// (meaning multiple so_handle) in the same process, we need to re-register
// the symbols from the so_handle that is being used in this execution.
ET_CHECK_OK_OR_RETURN_ERROR(
    register_shared_library_functions(handle->so_handle));
```
If we're loading the model once and doing execute/inference multiple times, it will register multiple times, no?
Can you do something like this?
```cpp
void* last_registered_handle = nullptr;
if (handle->so_handle != last_registered_handle) {
  ET_CHECK_OK_OR_RETURN_ERROR(
      register_shared_library_functions(handle->so_handle));
  last_registered_handle = handle->so_handle;
}
```
So the so_handle won't change. It's just that we are mapping the symbols differently, especially AOTInductorModelContainerRun. Say we do the following:
- load(token_embedding)
- load(audio_encoder)
- load(text_decoder)
- run(audio_encoder) <-- here
`AOTInductorModelContainerRun` maps to the symbol in text_decoder.so, so we need to remap the symbol to audio_encoder.so.
Can you store the AOTInductorModelContainerRunFunc inside AOTIDelegateHandle?
```cpp
struct AOTIDelegateHandle {
  void* so_handle;
  std::string so_path;
  AOTInductorModelContainerHandle container_handle;
  void* cuda_stream;
  AOTInductorModelContainerRunFunc run_func;
  // ... etc for all symbols
};

Result<DelegateHandle*> init(...) const override {
  AOTIDelegateHandle* handle = new AOTIDelegateHandle();
  handle->so_handle = so_handle;
  // Load symbols into THIS handle's struct (not global)
  handle->run_func = reinterpret_cast<AOTInductorModelContainerRunFunc>(
      dlsym(so_handle, "AOTInductorModelContainerRun"));
  // ... etc
  ET_CHECK_OR_RETURN_ERROR(
      handle->run_func != nullptr,
      AccessFailed,
      "Failed to load AOTInductorModelContainerRun");
  return (DelegateHandle*)handle;
}

Error execute(..., DelegateHandle* handle_, ...) const override {
  AOTIDelegateHandle* handle = (AOTIDelegateHandle*)handle_;
  // NO re-registration, use the handle's local symbols
  AOTIRuntimeError error = handle->run_func(...);
  // ... rest of execution ...
}
```
Also, can you update https://github.com/pytorch/executorch/blob/main/examples/models/voxtral/README.md to include the additional CUDA instructions too?
This pull request introduces changes to the CUDA workflow, model artifact handling, and multimodal runner logic. The main changes include restructuring the GitHub Actions workflow to separate model export, benchmarking, and end-to-end testing for the Voxtral CUDA pipeline, improving artifact management and reproducibility. Additionally, the multimodal runner now supports automatic conversion of audio tensors to bfloat16, ensuring compatibility with expected input types. There are also enhancements to caching and symbol registration in the CUDA backend, and build system updates to support linking the CUDA backend.

**Workflow and Artifact Management Improvements:**

* Refactored `.github/workflows/cuda.yml` to split the Voxtral CUDA pipeline into three jobs: `export-voxtral-cuda-artifact` (exports and stores model artifacts), `benchmark-voxtral-cuda` (benchmarks using exported artifacts), and `test-voxtral-cuda-e2e` (runs full end-to-end tests with artifact download and audio input). Improved artifact handling, reproducibility, and added explicit checks for required files. [[1]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89L90-R91) [[2]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R107) [[3]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R134-R185) [[4]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R196-R267) [[5]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R122)

**Multimodal Runner Logic:**

* Added automatic conversion of audio tensors to bfloat16 in `MultimodalPrefiller::prefill` and implemented a helper function `convert_to_bfloat16` in `util.h` to support this. This ensures that audio inputs match the expected dtype for the encoder, improving robustness for multimodal inference. [[1]](diffhunk://#diff-ad4fcb32ffc5f1f7b4f87b5ee58927cb948a8c0976295befd10e3de445913ae4L96-R136) [[2]](diffhunk://#diff-db4801445eaa3bb4f1370fe41d3a00ae2e3ef354a23ad4d5ace141ecc3c6f413R144-R180)

**CUDA Backend and Caching Enhancements:**

* Improved caching logic in `common_shims.cpp` for tensor strides and sizes by validating cached values and updating them when necessary. This prevents stale cache issues and ensures correct tensor metadata. [[1]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R54-R81) [[2]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R104-R130)
* Added dynamic symbol re-registration in `CudaBackend` to handle multiple shared objects in the same process, ensuring correct execution when switching between models.
* Removed redundant logging statements in the CUDA backend for cleaner output. [[1]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L226) [[2]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L256)

**Build System Updates:**

* Updated `CMakeLists.txt` and `executorch-config.cmake` to include and link the CUDA backend (`aoti_cuda`) when building Voxtral and other components, improving build flexibility and CUDA support. [[1]](diffhunk://#diff-606feb24310595f592d98d021a2c90618346977d94decb80b35b7e26ed8ccc1eR89-R95) [[2]](diffhunk://#diff-6a78a155992483ff6f35d595ff6cef63b477d1c853f6482e77acae6ef443f0e4R56)

**Debugging and Tuning Options:**

* Added support for enabling debug compilation in `cuda_backend.py` via the `DEBUG` environment variable, allowing easier troubleshooting and development.