Add pybindings for multimodal LLM runner #14285
Conversation
extension/llm/runner/__init__.py (outdated)

        ValueError: If the image format is not supported
        FileNotFoundError: If the image file doesn't exist
        """
    if isinstance(image, (str, Path)):
Shouldn't you use the CV preprocessing utils function?
Yeah let me fix. Recent updates made sure it works with Gemma3, exported using optimum-et.
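For reference, a minimal sketch of what routing through the exposed preprocessing utilities could look like. The PR description names `load_image_from_file` and `preprocess_image` as exported helpers, but their exact signatures are assumptions here, not confirmed by this PR:

```python
# Sketch only: assumes load_image_from_file(path) and preprocess_image(array)
# return a ready-to-use image input; signatures are not confirmed by this PR.
from pathlib import Path

import numpy as np

from executorch.extension.llm.runner import load_image_from_file, preprocess_image


def to_image_input(image):
    # Dispatch on the supported formats: file path or NumPy array.
    if isinstance(image, (str, Path)):
        path = Path(image)
        if not path.exists():
            raise FileNotFoundError(f"Image file doesn't exist: {path}")
        return load_image_from_file(str(path))
    if isinstance(image, np.ndarray):
        return preprocess_image(image)
    raise ValueError(f"Unsupported image format: {type(image).__name__}")
```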
| .def("is_audio", &MultimodalInput::is_audio) | ||
| .def("is_raw_audio", &MultimodalInput::is_raw_audio) | ||
| .def( | ||
| "get_text", |
Not totally convinced all these getter impls are correct.
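One way to pin down that concern: a correct binding should round-trip construction and getters. A minimal sketch, assuming `make_text_input` and the bound `MultimodalInput` methods behave as the PR description and this diff imply:

```python
# Sketch of a round-trip check for the bound getters; uses only the
# constructors and type checks shown elsewhere in this PR.
from executorch.extension.llm.runner import make_text_input

inp = make_text_input("hello")
assert inp.is_text()
assert not inp.is_audio()
assert inp.get_text() == "hello"  # an incorrect getter impl would fail here
```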
| print(f"Image: {image.width}x{image.height}x{image.channels}") | ||
|
|
||
| # Check input types safely | ||
| if text_input.is_text(): |
Would a user ever need to do this?
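One plausible case: inspecting a mixed batch before calling `generate()`. A minimal sketch, reusing the constructors from the PR description; the placeholder tensor shape is illustrative only:

```python
# Sketch: iterate a mixed input list and branch on the bound type checks.
import torch

from executorch.extension.llm.runner import make_image_input, make_text_input

pixel_values = torch.zeros(1, 3, 224, 224)  # placeholder; real values come from a processor
inputs = [
    make_text_input("Describe this image:"),
    make_image_input(pixel_values),
]
for i, inp in enumerate(inputs):
    if inp.is_text():
        print(f"input {i}: text -> {inp.get_text()!r}")
    else:
        print(f"input {i}: non-text (image/audio)")
```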
| """Reset the conversation state""" | ||
| self.runner.reset() | ||
|
|
||
| # Usage |
Should this section just be a demo.py?
Yeah, would be good to have a notebook, but I'll leave it here for now.
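In the meantime, a minimal sketch of what such a `demo.py` could look like, using only calls shown in this PR; the model and tokenizer paths are placeholders:

```python
# demo.py — sketch only; model/tokenizer paths are placeholders.
from executorch.extension.llm.runner import (
    GenerationConfig,
    MultimodalRunner,
    make_text_input,
)

runner = MultimodalRunner("model.pte", "tokenizer.model", None)
config = GenerationConfig()
config.max_new_tokens = 100

runner.generate([make_text_input("Hello, who are you?")], config)
runner.reset()  # clear conversation state between unrelated prompts
runner.generate([make_text_input("Summarize ExecuTorch in one line.")], config)
```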
🎉
This pull request introduces Python bindings for the ExecuTorch MultimodalRunner, enabling Python users to run multimodal LLM inference (supporting text, image, and audio inputs) and generate text outputs. The changes include new build system integration, a detailed implementation plan and documentation, and a high-level Python API with robust input handling and error management.

**Python Bindings Implementation:**

* Added a new high-level Python API in `__init__.py` for the MultimodalRunner, providing user-friendly methods for text and image input creation, text generation (with or without streaming callbacks), and resource management. The API includes comprehensive input validation, support for multiple image formats (file path, NumPy array, PIL), and fallback mechanisms if dependencies are missing.
* Implemented robust error handling: if the C++ extension is not built, placeholder classes and functions raise informative exceptions, guiding users to rebuild with Python bindings enabled.

**Build System Integration:**

* Updated `CMakeLists.txt` to add a `pybind11`-based Python extension module (`_llm_runner`) when `EXECUTORCH_BUILD_PYBIND` is set, linking all necessary dependencies and setting up include paths.

**Documentation and Planning:**

* Added a Python API section to `README.md`.

**Utility and Extensibility:**

* Exposed utility functions (`load_image_from_file`, `preprocess_image`, `create_generation_config`) for easier input preprocessing and configuration from Python.

**Testing and Examples (Planned):**

* Added `test_runner_pybindings.py`.

**Code Snippet of How to Use:**

```python
from executorch.extension.llm.runner import (
    MultimodalRunner,
    GenerationConfig,
    make_image_input,
    make_text_input,
)
from transformers import AutoProcessor

model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
image_url = "https://llava-vl.github.io/static/images/view.jpg"
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {
                "type": "text",
                "text": "What are the things I should be cautious about when I visit here?",
            },
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

inputs_combined = [
    make_text_input("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n"),
    make_image_input(inputs["pixel_values"]),
    make_text_input("What are the things I should be cautious about when I visit here?<end_of_turn>\n"),
]

runner = MultimodalRunner(
    "/Volumes/larryliu/work/optimum-executorch/model/model.pte",
    "/Volumes/larryliu/work/optimum-executorch/model/tokenizer.model",
    None,
)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)
```

Output from console:

```
[multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:109] Prefilling input 0/3, type: text
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:87] Image tensor dim: 4, dtype: Float
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 2/3, type: text
[util.h:125] second_input_sizes[0] = 1023
What are the things I should be cautious about when I visit here?<end_of_turn>
You'[multimodal_runner.cpp:127] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:139] Max new tokens resolved: 100, pos_ 669, max_context_len 2048
re absolutely right to focus on the weather – it's the key factor here! Let’s delve deeper into what you should be cautious about when visiting this location, and how to prepare.

**1. Weather & Terrain – Expanded:**

* **Snow & Ice:** As we discussed, there’s a significant risk of heavy snowfall and ice formation. This can make trails treacherous, and create hazardous conditions on the pier itself.
* **Terrain Stability:** The
PyTorchObserver {"prompt_tokens":669,"generated_tokens":99,"model_load_start_ms":1758178599491,"model_load_end_ms":1758178601788,"inference_start_ms":1758178629348,"inference_end_ms":1758178649749,"prompt_eval_end_ms":1758178642009,"first_token_ms":1758178642009,"aggregate_sampling_time_ms":117,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[stats.h:108] Prompt Tokens: 669 Generated Tokens: 99
[stats.h:114] Model Load Time: 2.297000 (seconds)
[stats.h:124] Total inference time: 20.401000 (seconds) Rate: 4.852703 (tokens/second)
[stats.h:132] Prompt evaluation: 12.661000 (seconds) Rate: 52.839428 (tokens/second)
[stats.h:143] Generated 99 tokens: 7.740000 (seconds) Rate: 12.790698 (tokens/second)
[stats.h:151] Time to first generated token: 12.661000 (seconds)
[stats.h:158] Sampling time over 768 tokens: 0.117000 (seconds)
```

cc @mergennachin @cccclai @helunwencser @jackzhxng
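The description also mentions generation with streaming callbacks, which the snippet above doesn't show. A minimal sketch continuing from that snippet; the `token_callback` keyword is an assumption, not confirmed by this PR:

```python
# Sketch only: streams tokens as they are generated. The `token_callback`
# parameter name is hypothetical; check the bound signature of generate().
def on_token(token: str) -> None:
    print(token, end="", flush=True)

runner.generate(inputs_combined, config, token_callback=on_token)
```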