perceptron: Isaac-0.1 implementation #40962
Open: AkshatSh wants to merge 100 commits into `huggingface:main` from `perceptron-ai-inc:main`.
Changes from all commits (100 commits):
All commits are authored by philippguevorguian except the first (`63cd4a0`), authored by AkshatSh.

- `63cd4a0` initial isaac implementation
- `63d1b1b` style: fixing assorted PR notes
- `d72311d` fix: get modular convert utility working
- `d6ed844` feat: modular convert utility outputs
- `7f4944f` Merge pull request #1 from perceptron-ai-inc/pg/update_isaac
- `c3cc42d` chore: port updates
- `965215c` fix: update imports
- `4c4f1c9` fix: adjust typing to get modular convert script working
- `021a1ae` feat: modular convert utility outputs
- `6311fd2` Merge pull request #2 from perceptron-ai-inc/pg/update_isaac
- `3d8b786` feat: port updates to isaac
- `92f56b8` fix: changes to enable modular convert
- `70bcc77` chore: modular convert script artifacts
- `5656b83` style: remove redundant registration
- `963f8c1` style: organize auto file entries
- `74f9f3b` style: lints
- `f56f064` fix: processor typing
- `4c5c19d` fix: allow image processor typing
- `8a95e64` style: | for unions
- `25523ba` fix: don't alias siglip
- `d1dc712` fix: rename vision config to config to be consistent with base class
- `d80c9f6` fix: additional remakes
- `58d7311` chore: convert artifacts
- `ffb3b9f` style: make style changes
- `92c36d4` refactor: bespoke isaac config
- `79eb96b` style: ruff organize imports
- `1c6479a` chore: convert configuration artifact
- `2899216` fix: get imports in
- `aec7721` style: string typing of qwen2
- `c0b10b6` fix: remove image processor and tokenizer typing
- `302374d` fix: enable qwen_2_5_vl import
- `de9dc80` style: remove unnecessary copy text
- `107ecde` fix: fix copies
- `fd5e399` style: pass kwargs and docstrings
- `206b82a` chore: artifact
- `510eb05` style: revert UP045 typing for autodocstring to work
- `887ff82` Merge branch 'main' into main
- `e8d8b76` Merge branch 'main' into pg/update_isaac
- `5da4056` fix: latest transformers changes
- `4a97889` chore: new transformers convert
- `0d55395` again
- `287a461` fix: export pretrained model
- `c43cb5d` test: add placeholder tests
- `c84df28` docs: add seed documentation
- `bf432bc` docs: point to isaac model checkpoint
- `080f22d` fix: set config fields in model
- `0764c2c` docs: add dates stamp
- `43f8b81` Update isaac.md
- `665665e` Merge pull request #3 from perceptron-ai-inc/pg/isaac_passes_make_fixup
- `8c722b1` Merge branch 'main' into main
- `0590025` Isaact e2e tests + passing make fixup (#4)
- `3a6e1c6` Merge branch 'main' into main
- `0463099` fix: update TensorType import for latest changes in transformers main…
- `762032c` Merge branch 'main' into main
- `95296b7` fix: updates for v5 standards (#6)
- `0bd5ac0` Merge branch 'main' into main
- `1cb3c4b` feat: guard perceptron imports (#7)
- `d439313` fix: guard PIL import (#8)
- `e2fe9f9` fix: guard perceptron PIL and torch imports for CI (#9)
- `257f47c` review revisions (#10)
- `f826763` Merge branch 'main' into main
- `03ca8c7` Merge branch 'main' into main
- `aa31c36` transformers attention interface + modeling test suite (#11)
- `9226a9c` Update src/transformers/models/isaac/modular_isaac.py
- `a1892a5` Update src/transformers/models/isaac/modular_isaac.py
- `82f25d6` Update src/transformers/models/isaac/modular_isaac.py
- `5422d9d` style: review revisions (#12)
- `f4a6374` review changes (#13): separate projector class, removed redundant cas…
- `f86ba81` Merge branch 'main' into main
- `abba38b` Squash merge pg/refactor_remove_tensorstream into main
- `8de326e` Merge branch 'main' into main
- `2b69698` feat: batched inference + rope refactor
- `2884211` Update src/transformers/models/isaac/modular_isaac.py
- `fdbd633` Squash merge into main
- `6ba2fdb` style: alias norm to communicate scope
- `33fcd57` Merge branch 'main' into main
- `57cbd79` refactor: no packed batch inference (#14)
- `f2491e8` Merge branch 'main' into main
- `bf501dd` feat: rely on qwen3 backbone, flatten vision components, misc style c…
- `22ce167` Merge branch 'main' into main
- `c5514ac` Merge branch 'main' into main
- `778d8c5` feat: config updates, image processor backend, assorted changes/tests…
- `231aa23` Merge branch 'main' into main
- `67ae690` style: cleanup (#17)
- `bbd8289` style: unify image attention mask + import update (#18)
- `ed8fc0a` style: further mask threading simplification + processing docstring (…
- `ef3c6f7` Merge branch 'main' into main
- `caf377c` test: update tests
- `048094d` Merge branch 'main' into main
- `748c82b` Merge branch 'main' into main
- `7c6ca57` Merge branch 'main' into main
- `8b96e5f` Squash merge pg/additional_cleanup into main
- `81206db` check repo fixes
- `86235d4` add correct date
- `e99bbc1` fix: make the pointing types belong to processor class
- `0325565` Merge branch 'main' into main
- `24af778` style: pre final review (#20)
- `3d9e55d` lint
- `251210f` fix: map isaac_vision to isaac module
- `bbadef8` fix: specify required backend
New documentation file (143 added lines):

<!--Copyright 2026 Perceptron, Inc. and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

*This model was released on {release_date} and added to Hugging Face Transformers on 2026-04-13.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# Isaac

## Overview

Isaac is Perceptron's vision-language model (VLM) that pairs a SigLIP2 vision encoder with a Qwen3 decoder-only stack. The Transformers implementation supports text-only and image-conditioned generation, including prompts with multiple interleaved images. Isaac uses variable-resolution image preprocessing and can optionally reduce spatial tokens with pixel shuffle to keep long multimodal prompts manageable. For more information, refer to the [technical report](https://github.com/perceptron-ai-inc/perceptron/blob/main/papers/isaac_01.pdf).

Isaac checkpoints are distributed under Perceptron's Non-Production license; please review the license that ships with the weights before using them in commercial settings.
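The pixel-shuffle token reduction mentioned in the overview can be sketched in plain Python. This is an illustrative sketch of the general space-to-depth idea only, not code from this PR: the function name and the list-of-lists token layout are invented for the example, and the real model operates on tensors.

```python
# Sketch of pixel-shuffle token reduction (space-to-depth): merge each r x r
# block of vision tokens into one token whose feature vector is the
# concatenation of the block's features, shrinking the token count by r**2.
def pixel_shuffle_tokens(grid, r):
    # grid: H x W list of feature lists (hypothetical layout for illustration)
    H, W = len(grid), len(grid[0])
    out = []
    for i in range(0, H, r):
        row = []
        for j in range(0, W, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out

# 4x4 grid of 1-dim "features" numbered 0..15
grid = [[[i * 4 + j] for j in range(4)] for i in range(4)]
out = pixel_shuffle_tokens(grid, 2)
print(len(out) * len(out[0]))  # 4: 16 tokens reduced to 4
print(out[0][0])               # [0, 1, 4, 5]
```

With `r=2`, a long multimodal prompt spends a quarter as many sequence positions per image, which is the trade-off the overview alludes to.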
## Usage tips

- Batched inputs can mix text-only and multimodal samples. For direct processor/model batching, pass images as a nested list such as `[[], [image_a], [image_b, image_c]]`.
- `image_grid_thw[batch_idx, image_slot] == (0, 0, 0)` marks a padded empty slot. Real image slots have `(T=1, H>0, W>0)`.
- If truncation is enabled, the processor keeps the rightmost part of the multimodal prompt and updates the slot-local `image_metadata[..., 0]` and `image_metadata[..., 1]` values automatically.
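As a minimal illustration of the padded-slot convention described in the tips above (assumed data layout for the example, not the processor's actual code), a `(0, 0, 0)` entry can be masked out like so:

```python
# Hypothetical image_grid_thw contents for a batch of three samples with up to
# two image slots each; (0, 0, 0) marks a padded, empty slot.
image_grid_thw = [
    [(0, 0, 0), (0, 0, 0)],      # text-only sample: both slots padded
    [(1, 24, 32), (0, 0, 0)],    # one real image, one padded slot
    [(1, 16, 16), (1, 48, 64)],  # two real images
]

# True where a slot holds a real image (T=1, H>0, W>0)
real_slots = [[thw != (0, 0, 0) for thw in sample] for sample in image_grid_thw]
print(real_slots)  # [[False, False], [True, False], [True, True]]
```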
## Usage example

Isaac uses explicit image placeholders in the rendered prompt. Every occurrence of `processor.image_token` (usually `<image>`) must have a matching image in the `images` argument.

```py
import torch
from transformers import AutoProcessor, IsaacForConditionalGeneration

model_id = "PerceptronAI/Isaac-0.1"
processor = AutoProcessor.from_pretrained(model_id)
model = IsaacForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the two figures and explain what changed."},
            {"type": "image", "path": "first_image.png"},
            {"type": "image", "path": "second_image.png"},
        ],
    },
]

# With `tokenize=True` and `return_dict=True`, `apply_chat_template` already
# returns the full model inputs, including the processed images.
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

### Post-processing grounded outputs

Isaac can generate grounded points and boxes in tagged text spans. Use `post_process_generation()` to strip the tags and recover structured annotations.

```py
clean_text, annotations = processor.post_process_generation(response, expected="box")
print(clean_text)
print(annotations)
```

Set `expected="point"` to extract point annotations, or leave `expected=None` to collect both points and boxes.
## IsaacVisionConfig

[[autodoc]] IsaacVisionConfig

## IsaacTextConfig

[[autodoc]] IsaacTextConfig

## IsaacConfig

[[autodoc]] IsaacConfig

## IsaacVisionModel

[[autodoc]] IsaacVisionModel

## IsaacTextModel

[[autodoc]] IsaacTextModel
    - forward

## IsaacModel

[[autodoc]] IsaacModel
    - forward

## IsaacForConditionalGeneration

[[autodoc]] IsaacForConditionalGeneration
    - forward

## IsaacProcessor

[[autodoc]] IsaacProcessor

## IsaacImageProcessor

[[autodoc]] IsaacImageProcessor
Changes to `EmbeddingAccessMixin`:

```diff
@@ -1003,6 +1003,33 @@ class EmbeddingAccessMixin:

     _input_embed_layer = "embed_tokens"  # default layer that holds input embeddings.

+    def _resolve_input_embed_layer(self) -> tuple[nn.Module | None, str]:
+        """
+        Returns the parent module and leaf attribute for `_input_embed_layer`.
+
+        Supports both a simple attribute name such as `embed_tokens` and a dotted path such as
+        `text_model.embed_tokens`.
+        """
+        name = getattr(self, "_input_embed_layer", "embed_tokens")
+        if "." not in name:
+            return None, name
+
+        module_path, _, attribute_name = name.rpartition(".")
+        try:
+            module = self.get_submodule(module_path)
+        except AttributeError as error:
+            raise NotImplementedError(
+                f"`_input_embed_layer={name}` could not be resolved for {self.__class__.__name__}."
+            ) from error
+
+        if not hasattr(module, attribute_name):
+            raise NotImplementedError(
+                f"`_input_embed_layer={name}` could not be resolved for {self.__class__.__name__}."
+            )
+
+        return module, attribute_name
+
     def get_input_embeddings(self) -> nn.Module:
         """
         Returns the model's input embeddings.
```

> **Member** (on `_resolve_input_embed_layer`): kinda unrelated, but I see what you mean. Let's not add it here, model code is already huge. We can keep old-format

```diff
@@ -1011,7 +1038,9 @@ def get_input_embeddings(self) -> nn.Module:
             `nn.Module`: A torch module mapping vocabulary to hidden states.
         """
-        name = getattr(self, "_input_embed_layer", "embed_tokens")
+        module, name = self._resolve_input_embed_layer()
+        if module is not None:
+            return getattr(module, name)

         # 1) Direct attribute (most NLP models).
         if (default_embedding := getattr(self, name, None)) is not None:
```

```diff
@@ -1044,7 +1073,11 @@ def set_input_embeddings(self, value: nn.Module):
         should) override for exotic layouts.
         """
-        name = getattr(self, "_input_embed_layer", "embed_tokens")
+        module, name = self._resolve_input_embed_layer()
+        if module is not None:
+            setattr(module, name, value)
+            return

         # 1) Direct attribute (most NLP models)
         if hasattr(self, name):
             setattr(self, name, value)
```
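For illustration, the dotted-path resolution added in the hunk above can be mimicked in plain Python without torch. This is a simplified sketch, not the transformers implementation; `get_submodule` here is a hypothetical stand-in for `torch.nn.Module.get_submodule`, and the class names are invented:

```python
# Sketch of resolving "text_model.embed_tokens" into (parent module, leaf name),
# mirroring the rpartition + get_submodule logic of _resolve_input_embed_layer.
class Embedding:
    pass

class TextModel:
    def __init__(self):
        self.embed_tokens = Embedding()

class Model:
    _input_embed_layer = "text_model.embed_tokens"

    def __init__(self):
        self.text_model = TextModel()

    def get_submodule(self, path):
        # simplified stand-in for torch.nn.Module.get_submodule
        obj = self
        for part in path.split("."):
            obj = getattr(obj, part)
        return obj

    def resolve_input_embed_layer(self):
        name = self._input_embed_layer
        if "." not in name:
            return None, name  # plain attribute directly on self
        module_path, _, attribute_name = name.rpartition(".")
        return self.get_submodule(module_path), attribute_name

model = Model()
parent, attr = model.resolve_input_embed_layer()
print(type(parent).__name__, attr)  # TextModel embed_tokens
```

The `(None, name)` branch preserves the old behavior for simple attribute names, so only dotted paths take the submodule route.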
New file (28 added lines):

```py
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_isaac import *
    from .modeling_isaac import *
    from .processing_isaac import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
```
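The `_LazyModule` registration defers submodule imports until first attribute access. A generic sketch of that pattern follows; this is NOT transformers' actual `_LazyModule`, just the underlying idea demonstrated on the stdlib `os` package:

```python
# Lazy-module sketch: submodules listed up front are imported only on first
# attribute access, then cached on the module object so __getattr__ is not
# triggered again.
import importlib
import types

class LazyModule(types.ModuleType):
    def __init__(self, name, submodules):
        super().__init__(name)
        self._submodules = set(submodules)

    def __getattr__(self, item):
        # Only reached when normal attribute lookup fails.
        if item in self._submodules:
            value = importlib.import_module(f"{self.__name__}.{item}")
            setattr(self, item, value)  # cache for subsequent accesses
            return value
        raise AttributeError(f"module {self.__name__!r} has no attribute {item!r}")

# Demo: `os.path` is resolved lazily on first access.
lazy_os = LazyModule("os", {"path"})
print(lazy_os.path.join("a", "b"))
```

Transformers' real `_LazyModule` additionally derives the submodule list from the file itself via `define_import_structure(_file)` and replaces the package in `sys.modules`, which is why importing `transformers.models.isaac` stays cheap until a class is actually used.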
> **Review comment** (on the usage example): the output from `apply_chat_template` is already a dict of `inputs`