100 commits
63cd4a0
initial isaac implementation
AkshatSh Sep 18, 2025
63d1b1b
style: fixing assorted PR notes
philippguevorguian Oct 10, 2025
d72311d
fix: get modular convert utility working
philippguevorguian Oct 10, 2025
d6ed844
feat: modular convert utility outputs
philippguevorguian Oct 10, 2025
7f4944f
Merge pull request #1 from perceptron-ai-inc/pg/update_isaac
philippguevorguian Oct 10, 2025
c3cc42d
chore: port updates
philippguevorguian Oct 15, 2025
965215c
fix: update imports
philippguevorguian Oct 15, 2025
4c4f1c9
fix: adjust typing to get modular convert script working
philippguevorguian Oct 15, 2025
021a1ae
feat: modular convert utility outputs
philippguevorguian Oct 15, 2025
6311fd2
Merge pull request #2 from perceptron-ai-inc/pg/update_isaac
philippguevorguian Oct 15, 2025
3d8b786
feat: port updates to isaac
philippguevorguian Nov 10, 2025
92f56b8
fix: changes to enable modular convert
philippguevorguian Nov 10, 2025
70bcc77
chore: modular convert script artifacts
philippguevorguian Nov 10, 2025
5656b83
style: remove redundant registration
philippguevorguian Nov 10, 2025
963f8c1
style: organize auto file entries
philippguevorguian Nov 10, 2025
74f9f3b
style: lints
philippguevorguian Nov 10, 2025
f56f064
fix: processor typing
philippguevorguian Nov 10, 2025
4c5c19d
fix: allow image processor typing
philippguevorguian Nov 10, 2025
8a95e64
style: | for unions
philippguevorguian Nov 10, 2025
25523ba
fix: don't alias siglip
philippguevorguian Nov 10, 2025
d1dc712
fix: rename vision config to config to be consistent with base class
philippguevorguian Nov 10, 2025
d80c9f6
fix: additional remakes
philippguevorguian Nov 10, 2025
58d7311
chore: convert artifacts
philippguevorguian Nov 10, 2025
ffb3b9f
style: make style changes
philippguevorguian Nov 10, 2025
92c36d4
refactor: bespoke isaac config
philippguevorguian Nov 10, 2025
79eb96b
style: ruff organize imports
philippguevorguian Nov 10, 2025
1c6479a
chore: convert configuration artifact
philippguevorguian Nov 10, 2025
2899216
fix: get imports in
philippguevorguian Nov 10, 2025
aec7721
style: string typing of qwen2
philippguevorguian Nov 10, 2025
c0b10b6
fix: remove image processor and tokenizer typing
philippguevorguian Nov 10, 2025
302374d
fix: enable qwen_2_5_vl import
philippguevorguian Nov 10, 2025
de9dc80
style: remove unnecessary copy text
philippguevorguian Nov 10, 2025
107ecde
fix: fix copies
philippguevorguian Nov 10, 2025
fd5e399
style: pass kwargs and docstrings
philippguevorguian Nov 10, 2025
206b82a
chore: artifact
philippguevorguian Nov 10, 2025
510eb05
style: revert UP045 typing for autodocstring to work
philippguevorguian Nov 10, 2025
887ff82
Merge branch 'main' into main
philippguevorguian Nov 10, 2025
e8d8b76
Merge branch 'main' into pg/update_isaac
philippguevorguian Nov 10, 2025
5da4056
fix: latest transformers changes
philippguevorguian Nov 10, 2025
4a97889
chore: new transformers convert
philippguevorguian Nov 10, 2025
0d55395
again
philippguevorguian Nov 10, 2025
287a461
fix: export pretrained model
philippguevorguian Nov 10, 2025
c43cb5d
test: add placeholder tests
philippguevorguian Nov 10, 2025
c84df28
docs: add seed documentation
philippguevorguian Nov 10, 2025
bf432bc
docs: point to isaac model checkpoint
philippguevorguian Nov 10, 2025
080f22d
fix: set config fields in model
philippguevorguian Nov 10, 2025
0764c2c
docs: add dates stamp
philippguevorguian Nov 10, 2025
43f8b81
Update isaac.md
philippguevorguian Nov 14, 2025
665665e
Merge pull request #3 from perceptron-ai-inc/pg/isaac_passes_make_fixup
philippguevorguian Nov 14, 2025
8c722b1
Merge branch 'main' into main
philippguevorguian Nov 17, 2025
0590025
Isaact e2e tests + passing make fixup (#4)
philippguevorguian Nov 17, 2025
3a6e1c6
Merge branch 'main' into main
philippguevorguian Dec 3, 2025
0463099
fix: update TensorType import for latest changes in transformers main…
philippguevorguian Dec 3, 2025
762032c
Merge branch 'main' into main
philippguevorguian Dec 4, 2025
95296b7
fix: updates for v5 standards (#6)
philippguevorguian Dec 4, 2025
0bd5ac0
Merge branch 'main' into main
philippguevorguian Dec 9, 2025
1cb3c4b
feat: guard perceptron imports (#7)
philippguevorguian Dec 9, 2025
d439313
fix: guard PIL import (#8)
philippguevorguian Dec 9, 2025
e2fe9f9
fix: guard perceptron PIL and torch imports for CI (#9)
philippguevorguian Dec 9, 2025
257f47c
review revisions (#10)
philippguevorguian Dec 12, 2025
f826763
Merge branch 'main' into main
philippguevorguian Dec 17, 2025
03ca8c7
Merge branch 'main' into main
philippguevorguian Dec 17, 2025
aa31c36
transformers attention interface + modeling test suite (#11)
philippguevorguian Dec 17, 2025
9226a9c
Update src/transformers/models/isaac/modular_isaac.py
philippguevorguian Dec 17, 2025
a1892a5
Update src/transformers/models/isaac/modular_isaac.py
philippguevorguian Dec 17, 2025
82f25d6
Update src/transformers/models/isaac/modular_isaac.py
philippguevorguian Dec 17, 2025
5422d9d
style: review revisions (#12)
philippguevorguian Dec 18, 2025
f4a6374
review changes (#13): separate projector class, removed redundant cas…
philippguevorguian Dec 19, 2025
f86ba81
Merge branch 'main' into main
philippguevorguian Dec 23, 2025
abba38b
Squash merge pg/refactor_remove_tensorstream into main
philippguevorguian Dec 24, 2025
8de326e
Merge branch 'main' into main
philippguevorguian Dec 30, 2025
2b69698
feat: batched inference + rope refactor
philippguevorguian Dec 30, 2025
2884211
Update src/transformers/models/isaac/modular_isaac.py
philippguevorguian Jan 9, 2026
fdbd633
Squash merge into main
philippguevorguian Jan 14, 2026
6ba2fdb
style: alias norm to communicate scope
philippguevorguian Jan 14, 2026
33fcd57
Merge branch 'main' into main
philippguevorguian Mar 3, 2026
57cbd79
refactor: no packed batch inference (#14)
philippguevorguian Mar 4, 2026
f2491e8
Merge branch 'main' into main
philippguevorguian Mar 13, 2026
bf501dd
feat: rely on qwen3 backbone, flatten vision components, misc style c…
philippguevorguian Mar 19, 2026
22ce167
Merge branch 'main' into main
philippguevorguian Mar 23, 2026
c5514ac
Merge branch 'main' into main
philippguevorguian Mar 24, 2026
778d8c5
feat: config updates, image processor backend, assorted changes/tests…
philippguevorguian Mar 24, 2026
231aa23
Merge branch 'main' into main
philippguevorguian Mar 24, 2026
67ae690
style: cleanup (#17)
philippguevorguian Mar 24, 2026
bbd8289
style: unify image attention mask + import update (#18)
philippguevorguian Mar 24, 2026
ed8fc0a
style: further mask threading simplification + processing docstring (…
philippguevorguian Mar 24, 2026
ef3c6f7
Merge branch 'main' into main
philippguevorguian Mar 24, 2026
caf377c
test: update tests
philippguevorguian Mar 24, 2026
048094d
Merge branch 'main' into main
philippguevorguian Mar 25, 2026
748c82b
Merge branch 'main' into main
philippguevorguian Mar 30, 2026
7c6ca57
Merge branch 'main' into main
philippguevorguian Mar 31, 2026
8b96e5f
Squash merge pg/additional_cleanup into main
philippguevorguian Mar 31, 2026
81206db
check repo fixes
philippguevorguian Mar 31, 2026
86235d4
add correct date
philippguevorguian Mar 31, 2026
e99bbc1
fix: make the pointing types belong to processor class
philippguevorguian Mar 31, 2026
0325565
Merge branch 'main' into main
philippguevorguian Apr 13, 2026
24af778
style: pre final review (#20)
philippguevorguian Apr 13, 2026
3d9e55d
lint
philippguevorguian Apr 13, 2026
251210f
fix: map isaac_vision to isaac module
philippguevorguian Apr 13, 2026
bbadef8
fix: specify required backend
philippguevorguian Apr 13, 2026
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1237,6 +1237,8 @@
  title: InstructBlipVideo
- local: model_doc/internvl
  title: InternVL
- local: model_doc/isaac
  title: Isaac
- local: model_doc/janus
  title: Janus
- local: model_doc/kosmos-2
143 changes: 143 additions & 0 deletions docs/source/en/model_doc/isaac.md
@@ -0,0 +1,143 @@
<!--Copyright 2026 Perceptron, Inc. and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2026-04-13.*

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

# Isaac

## Overview

Isaac is Perceptron's vision-language model (VLM) that pairs a SigLIP2 vision encoder with a Qwen3 decoder-only stack. The
Transformers implementation supports text-only and image-conditioned generation, including prompts with multiple interleaved
images. Isaac uses variable-resolution image preprocessing and can optionally reduce spatial tokens with pixel shuffle to keep
long multimodal prompts manageable. For more information, refer to the [technical report](https://github.com/perceptron-ai-inc/perceptron/blob/main/papers/isaac_01.pdf).

Isaac checkpoints are distributed under Perceptron's Non-Production license; please review the license that ships with the
weights before using them in commercial settings.

## Usage tips

- Batched inputs can mix text-only and multimodal samples. For direct processor/model batching, pass images as a nested
list such as `[[], [image_a], [image_b, image_c]]`.
- `image_grid_thw[batch_idx, image_slot] == (0, 0, 0)` marks a padded empty slot. Real image slots have
`(T=1, H>0, W>0)`.
- If truncation is enabled, the processor keeps the rightmost part of the multimodal prompt and updates the slot-local
`image_metadata[..., 0]` and `image_metadata[..., 1]` values automatically.
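
The empty-slot convention from the tips above can be checked with a small sketch. The grid values below are hypothetical and stand in for the `image_grid_thw` tensor the processor returns:

```python
# Hypothetical image_grid_thw for a batch of 3 samples, padded to 2 image slots each.
# A (0, 0, 0) triple marks a padded empty slot; real slots have (T=1, H>0, W>0).
image_grid_thw = [
    [(0, 0, 0), (0, 0, 0)],     # text-only sample: no images
    [(1, 16, 24), (0, 0, 0)],   # one image, one padded slot
    [(1, 32, 32), (1, 8, 12)],  # two images
]

# A slot is real exactly when T * H * W > 0.
real_slots = [[t * h * w > 0 for (t, h, w) in sample] for sample in image_grid_thw]
print(real_slots)  # [[False, False], [True, False], [True, True]]
```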

## Usage example

Isaac uses explicit image placeholders in the rendered prompt. Every occurrence of `processor.image_token` (usually `<image>`) must have a matching image in the `images` argument.

```py
import torch
from transformers import AutoProcessor, IsaacForConditionalGeneration

model_id = "PerceptronAI/Isaac-0.1"
processor = AutoProcessor.from_pretrained(model_id)
model = IsaacForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the two figures and explain what changed."},
            {"type": "image", "path": "first_image.png"},
            {"type": "image", "path": "second_image.png"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
# Review note: the output from apply_chat_template is already a dict of inputs

generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

### Post-processing grounded outputs

Isaac can generate grounded points and boxes in tagged text spans. Use `post_process_generation()` to strip the tags and
recover structured annotations.

```py
clean_text, annotations = processor.post_process_generation(response, expected="box")
print(clean_text)
print(annotations)
```

Set `expected="point"` to extract point annotations, or leave `expected=None` to collect both points and boxes.

## IsaacVisionConfig

[[autodoc]] IsaacVisionConfig

## IsaacTextConfig

[[autodoc]] IsaacTextConfig

## IsaacConfig

[[autodoc]] IsaacConfig

## IsaacVisionModel

[[autodoc]] IsaacVisionModel

## IsaacTextModel

[[autodoc]] IsaacTextModel
- forward

## IsaacModel

[[autodoc]] IsaacModel
- forward

## IsaacForConditionalGeneration

[[autodoc]] IsaacForConditionalGeneration
- forward

## IsaacProcessor

[[autodoc]] IsaacProcessor

## IsaacImageProcessor

[[autodoc]] IsaacImageProcessor
4 changes: 4 additions & 0 deletions src/transformers/conversion_mapping.py
@@ -133,6 +133,10 @@ def _build_checkpoint_conversion_mapping():
    ),
    WeightRenaming(source_patterns=r"^visual", target_patterns="model.visual"),
],
"isaac": [
    WeightRenaming(source_patterns=r"text_model", target_patterns="language_model"),
    WeightRenaming(source_patterns=r"vision_tower", target_patterns="visual"),
],
"colqwen2": [
    WeightRenaming(source_patterns=r"vlm.model", target_patterns="vlm"),
    WeightRenaming(source_patterns=r"vlm(?!\.(language_model|visual))", target_patterns="vlm.language_model"),
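
Illustratively, the two `WeightRenaming` entries added for `isaac` behave like plain `re.sub` rewrites over checkpoint key names. This is a sketch of the mapping semantics, not the actual Transformers conversion machinery:

```python
import re

# The two Isaac source -> target patterns from the mapping above.
patterns = [(r"text_model", "language_model"), (r"vision_tower", "visual")]

def rename(key: str) -> str:
    """Apply each renaming pattern in order to a checkpoint key."""
    for src, tgt in patterns:
        key = re.sub(src, tgt, key)
    return key

print(rename("text_model.layers.0.self_attn.q_proj.weight"))
# language_model.layers.0.self_attn.q_proj.weight
print(rename("vision_tower.embeddings.patch_embedding.weight"))
# visual.embeddings.patch_embedding.weight
```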
37 changes: 35 additions & 2 deletions src/transformers/modeling_utils.py
@@ -1003,6 +1003,33 @@ class EmbeddingAccessMixin:

_input_embed_layer = "embed_tokens" # default layer that holds input embeddings.

def _resolve_input_embed_layer(self) -> tuple[nn.Module | None, str]:
Review comment (Member): kinda unrelated, but I see what you mean. Let's not add it here, model code is already huge. We can keep old-format get_inputs_embeddings and you can submit a new PR to improve it, very much welcome :)

"""
Returns the parent module and leaf attribute for `_input_embed_layer`.

Supports both a simple attribute name such as `embed_tokens` and a dotted path such as
`text_model.embed_tokens`.
"""

name = getattr(self, "_input_embed_layer", "embed_tokens")
if "." not in name:
return None, name

module_path, _, attribute_name = name.rpartition(".")
try:
module = self.get_submodule(module_path)
except AttributeError as error:
raise NotImplementedError(
f"`_input_embed_layer={name}` could not be resolved for {self.__class__.__name__}."
) from error

if not hasattr(module, attribute_name):
raise NotImplementedError(
f"`_input_embed_layer={name}` could not be resolved for {self.__class__.__name__}."
)

return module, attribute_name

def get_input_embeddings(self) -> nn.Module:
"""
Returns the model's input embeddings.
@@ -1011,7 +1038,9 @@ def get_input_embeddings(self) -> nn.Module:
`nn.Module`: A torch module mapping vocabulary to hidden states.
"""

name = getattr(self, "_input_embed_layer", "embed_tokens")
module, name = self._resolve_input_embed_layer()
if module is not None:
return getattr(module, name)

# 1) Direct attribute (most NLP models).
if (default_embedding := getattr(self, name, None)) is not None:
Expand Down Expand Up @@ -1044,7 +1073,11 @@ def set_input_embeddings(self, value: nn.Module):
should) override for exotic layouts.
"""

name = getattr(self, "_input_embed_layer", "embed_tokens")
module, name = self._resolve_input_embed_layer()
if module is not None:
setattr(module, name, value)
return

# 1) Direct attribute (most NLP models)
if hasattr(self, name):
setattr(self, name, value)
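
The dotted-path resolution that `_resolve_input_embed_layer` adds can be sketched standalone in plain Python (no torch needed); the `Root`/`Leaf` classes below are hypothetical stand-ins for a model with a nested `text_model.embed_tokens` layout:

```python
class Leaf:
    embed_tokens = "EMB"  # stand-in for the actual nn.Embedding

class Root:
    text_model = Leaf()

    def get_submodule(self, path):
        # Minimal stand-in for nn.Module.get_submodule: walk the dotted path.
        obj = self
        for part in path.split("."):
            obj = getattr(obj, part)
        return obj

# The same split the mixin performs: parent path vs. leaf attribute name.
name = "text_model.embed_tokens"
module_path, _, attribute_name = name.rpartition(".")

root = Root()
module = root.get_submodule(module_path)
print(getattr(module, attribute_name))  # EMB
```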
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -201,6 +201,7 @@
from .instructblip import *
from .instructblipvideo import *
from .internvl import *
from .isaac import *
from .jais2 import *
from .jamba import *
from .janus import *
5 changes: 5 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -238,6 +238,8 @@
("instructblipvideo", "InstructBlipVideoConfig"),
("internvl", "InternVLConfig"),
("internvl_vision", "InternVLVisionConfig"),
("isaac", "IsaacConfig"),
("isaac_vision", "IsaacVisionConfig"),
("jais2", "Jais2Config"),
("jamba", "JambaConfig"),
("janus", "JanusConfig"),
@@ -758,6 +760,8 @@
("instructblipvideo", "InstructBlipVideo"),
("internvl", "InternVL"),
("internvl_vision", "InternVLVision"),
("isaac", "Isaac"),
("isaac_vision", "IsaacVision"),
("jais2", "Jais2"),
("jamba", "Jamba"),
("janus", "Janus"),
@@ -1109,6 +1113,7 @@
("gemma4_audio", "gemma4"),
("gemma4_text", "gemma4"),
("gemma4_vision", "gemma4"),
("isaac_vision", "isaac"),
("glm4v_vision", "glm4v"),
("glm4v_moe_vision", "glm4v_moe"),
("glm4v_text", "glm4v"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -145,6 +145,7 @@
("imagegpt", {"torchvision": "ImageGPTImageProcessor", "pil": "ImageGPTImageProcessorPil"}),
("instructblip", {"torchvision": "BlipImageProcessor", "pil": "BlipImageProcessorPil"}),
("internvl", {"torchvision": "GotOcr2ImageProcessor", "pil": "GotOcr2ImageProcessorPil"}),
("isaac", {"torchvision": "IsaacImageProcessor"}),
("janus", {"torchvision": "JanusImageProcessor", "pil": "JanusImageProcessorPil"}),
("kosmos-2", {"torchvision": "CLIPImageProcessor", "pil": "CLIPImageProcessorPil"}),
("kosmos-2.5", {"torchvision": "Kosmos2_5ImageProcessor", "pil": "Kosmos2_5ImageProcessorPil"}),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -235,6 +235,8 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("instructblipvideo", "InstructBlipVideoModel"),
("internvl", "InternVLModel"),
("internvl_vision", "InternVLVisionModel"),
("isaac", "IsaacModel"),
("isaac_vision", "IsaacVisionModel"),
("jais2", "Jais2Model"),
("jamba", "JambaModel"),
("janus", "JanusModel"),
@@ -990,6 +992,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("instructblip", "InstructBlipForConditionalGeneration"),
("instructblipvideo", "InstructBlipVideoForConditionalGeneration"),
("internvl", "InternVLForConditionalGeneration"),
("isaac", "IsaacForConditionalGeneration"),
("janus", "JanusForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -100,6 +100,7 @@
("instructblip", "InstructBlipProcessor"),
("instructblipvideo", "InstructBlipVideoProcessor"),
("internvl", "InternVLProcessor"),
("isaac", "IsaacProcessor"),
("janus", "JanusProcessor"),
("kosmos-2", "Kosmos2Processor"),
("kosmos-2.5", "Kosmos2_5Processor"),
28 changes: 28 additions & 0 deletions src/transformers/models/isaac/__init__.py
@@ -0,0 +1,28 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_isaac import *
    from .modeling_isaac import *
    from .processing_isaac import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)