Add pybindings for multimodal LLM runner #14285
Conversation
extension/llm/runner/__init__.py (outdated)

        ValueError: If the image format is not supported
        FileNotFoundError: If the image file doesn't exist
        """
    if isinstance(image, (str, Path)):
Shouldn't you use the CV preprocessing utils function?
Yeah let me fix. Recent updates made sure it works with Gemma3, exported using optimum-et.
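For reference, a minimal sketch of what routing through the exposed preprocessing utilities could look like. The PR description names `load_image_from_file` and `preprocess_image` as exported helpers, but their exact signatures are assumptions here, not confirmed by this PR:

```python
# Sketch only: assumes load_image_from_file(path) and preprocess_image(array)
# return a ready-to-use image input; signatures are not confirmed by this PR.
from pathlib import Path

import numpy as np

from executorch.extension.llm.runner import load_image_from_file, preprocess_image


def to_image_input(image):
    # Dispatch on the supported formats: file path or NumPy array.
    if isinstance(image, (str, Path)):
        path = Path(image)
        if not path.exists():
            raise FileNotFoundError(f"Image file doesn't exist: {path}")
        return load_image_from_file(str(path))
    if isinstance(image, np.ndarray):
        return preprocess_image(image)
    raise ValueError(f"Unsupported image format: {type(image).__name__}")
```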
| .def("is_audio", &MultimodalInput::is_audio) | ||
| .def("is_raw_audio", &MultimodalInput::is_raw_audio) | ||
| .def( | ||
| "get_text", |
Not totally convinced all these getter impls are correct.
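One way to pin down that concern: a correct binding should round-trip construction and getters. A minimal sketch, assuming `make_text_input` and the bound `MultimodalInput` methods behave as the PR description and this diff imply:

```python
# Sketch of a round-trip check for the bound getters; uses only the
# constructors and type checks shown elsewhere in this PR.
from executorch.extension.llm.runner import make_text_input

inp = make_text_input("hello")
assert inp.is_text()
assert not inp.is_audio()
assert inp.get_text() == "hello"  # an incorrect getter impl would fail here
```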
| print(f"Image: {image.width}x{image.height}x{image.channels}") | ||
|
|
||
| # Check input types safely | ||
| if text_input.is_text(): |
Would a user ever need to do this?
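One plausible case: inspecting a mixed batch before calling `generate()`. A minimal sketch, reusing the constructors from the PR description; the placeholder tensor shape is illustrative only:

```python
# Sketch: iterate a mixed input list and branch on the bound type checks.
import torch

from executorch.extension.llm.runner import make_image_input, make_text_input

pixel_values = torch.zeros(1, 3, 224, 224)  # placeholder; real values come from a processor
inputs = [
    make_text_input("Describe this image:"),
    make_image_input(pixel_values),
]
for i, inp in enumerate(inputs):
    if inp.is_text():
        print(f"input {i}: text -> {inp.get_text()!r}")
    else:
        print(f"input {i}: non-text (image/audio)")
```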
| """Reset the conversation state""" | ||
| self.runner.reset() | ||
|
|
||
| # Usage |
Should this section just be a demo.py?
Yeah, would be good to have a notebook, but I'll leave it here for now.
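In the meantime, a minimal sketch of what such a `demo.py` could look like, using only calls shown in this PR; the model and tokenizer paths are placeholders:

```python
# demo.py — sketch only; model/tokenizer paths are placeholders.
from executorch.extension.llm.runner import (
    GenerationConfig,
    MultimodalRunner,
    make_text_input,
)

runner = MultimodalRunner("model.pte", "tokenizer.model", None)
config = GenerationConfig()
config.max_new_tokens = 100

runner.generate([make_text_input("Hello, who are you?")], config)
runner.reset()  # clear conversation state between unrelated prompts
runner.generate([make_text_input("Summarize ExecuTorch in one line.")], config)
```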
🎉
This pull request introduces Python bindings for the ExecuTorch MultimodalRunner, enabling Python users to run multimodal LLM inference (supporting text, image, and audio inputs) and generate text outputs. The changes include new build system integration, a detailed implementation plan and documentation, and a high-level Python API with robust input handling and error management.

**Python Bindings Implementation:**

* Added a new high-level Python API in `__init__.py` for the MultimodalRunner, providing user-friendly methods for text and image input creation, text generation (with or without streaming callbacks), and resource management. The API includes comprehensive input validation, support for multiple image formats (file path, NumPy array, PIL), and fallback mechanisms if dependencies are missing.
* Implemented robust error handling: if the C++ extension is not built, placeholder classes and functions raise informative exceptions, guiding users to rebuild with Python bindings enabled.

**Build System Integration:**

* Updated `CMakeLists.txt` to add a `pybind11`-based Python extension module (`_llm_runner`) when `EXECUTORCH_BUILD_PYBIND` is set, linking all necessary dependencies and setting up include paths.

**Documentation and Planning:**

* Added a Python API section to `README.md`.

**Utility and Extensibility:**

* Exposed utility functions (`load_image_from_file`, `preprocess_image`, `create_generation_config`) for easier input preprocessing and configuration from Python.

**Testing and Examples (Planned):**

* Added `test_runner_pybindings.py`.

**Code Snippet of How to Use:**

```python
from executorch.extension.llm.runner import (
    MultimodalRunner,
    GenerationConfig,
    make_image_input,
    make_text_input,
)
from transformers import AutoProcessor

model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
image_url = "https://llava-vl.github.io/static/images/view.jpg"
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {
                "type": "text",
                "text": "What are the things I should be cautious about when I visit here?",
            },
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

inputs_combined = [
    make_text_input("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n"),
    make_image_input(inputs["pixel_values"]),
    make_text_input("What are the things I should be cautious about when I visit here?<end_of_turn>\n"),
]

runner = MultimodalRunner(
    "/Volumes/larryliu/work/optimum-executorch/model/model.pte",
    "/Volumes/larryliu/work/optimum-executorch/model/tokenizer.model",
    None,
)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)
```

Output from console:

```
[multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:109] Prefilling input 0/3, type: text
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:87] Image tensor dim: 4, dtype: Float
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 2/3, type: text
[util.h:125] second_input_sizes[0] = 1023
What are the things I should be cautious about when I visit here?<end_of_turn>
You'[multimodal_runner.cpp:127] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:139] Max new tokens resolved: 100, pos_ 669, max_context_len 2048
re absolutely right to focus on the weather – it's the key factor here! Let’s delve deeper into what you should be cautious about when visiting this location, and how to prepare.

**1. Weather & Terrain – Expanded:**

* **Snow & Ice:** As we discussed, there’s a significant risk of heavy snowfall and ice formation. This can make trails treacherous, and create hazardous conditions on the pier itself.
* **Terrain Stability:** The
PyTorchObserver {"prompt_tokens":669,"generated_tokens":99,"model_load_start_ms":1758178599491,"model_load_end_ms":1758178601788,"inference_start_ms":1758178629348,"inference_end_ms":1758178649749,"prompt_eval_end_ms":1758178642009,"first_token_ms":1758178642009,"aggregate_sampling_time_ms":117,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[stats.h:108] Prompt Tokens: 669 Generated Tokens: 99
[stats.h:114] Model Load Time: 2.297000 (seconds)
[stats.h:124] Total inference time: 20.401000 (seconds) Rate: 4.852703 (tokens/second)
[stats.h:132] Prompt evaluation: 12.661000 (seconds) Rate: 52.839428 (tokens/second)
[stats.h:143] Generated 99 tokens: 7.740000 (seconds) Rate: 12.790698 (tokens/second)
[stats.h:151] Time to first generated token: 12.661000 (seconds)
[stats.h:158] Sampling time over 768 tokens: 0.117000 (seconds)
```

cc @mergennachin @cccclai @helunwencser @jackzhxng
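The description also mentions generation with streaming callbacks, which the snippet above doesn't show. A minimal sketch continuing from that snippet; the `token_callback` keyword is an assumption, not confirmed by this PR:

```python
# Sketch only: streams tokens as they are generated. The `token_callback`
# parameter name is hypothetical; check the bound signature of generate().
def on_token(token: str) -> None:
    print(token, end="", flush=True)

runner.generate(inputs_combined, config, token_callback=on_token)
```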