perceptron: Isaac-0.1 implementation #40962

Open

AkshatSh wants to merge 100 commits into huggingface:main from perceptron-ai-inc:main

Conversation

@AkshatSh

Perceptron Isaac Implementation

Perceptron released Isaac-0.1 and Isaac-0.1-Base, open-weight 2B dense models for perception.

@AkshatSh AkshatSh marked this pull request as draft September 18, 2025 07:05
@zucchini-nlp
Member

Nice, lmk if you need any help or a quick review 🤗

@AkshatSh
Author

@zucchini-nlp - we are just wrapping this up; feel free to give us a quick review while we finish testing!

@zucchini-nlp
Member

amazing, reviewing on Monday :)

@zucchini-nlp (Member) left a comment

Nice job. I did not review everything; I mainly went over it to check whether the PR follows the transformers API. The most important comments are:

  1. We need to add an image processor class and wrap the image transforms inside it
  2. Is the "tensorstream" required to get the model running? AFAIU we can discard it, which is probably what we'll need to do. Transformers does not support streaming of multimodal inputs yet
  3. The model structure has to follow the standards of other VLMs, with correct class names

LMK if you have any questions 🤗

Comment on lines +42 to +48
from perceptron.tensorstream.ops import (
compute_mrope_pos_tensor,
modality_mask,
reconstruct_tensor_stream_from_compact_dict,
slice as ts_slice,
tensor_stream_token_view,
)
Member

hmm, is this for streaming video data? I am not sure we can add it as is; we need to either safe-import it or find a way to integrate it in self.video_processor

Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
logger = logging.get_logger(__name__)


class PixelShuffleSiglip2VisionConfig(Siglip2VisionConfig):
Member

we need to call it PerceptronVisionConfig with the model's name

Comment on lines +102 to +107
# Prepare positional embeddings grid: (1, embed_dim, h, w)
positional_embeddings = (
self.position_embedding.weight.reshape(self.position_embedding_size, self.position_embedding_size, -1)
.permute(2, 0, 1)
.unsqueeze(0)
)
Member

we can prepare it once when the model is initialized, no?
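
A minimal sketch of that suggestion (illustrative, not the PR's actual code): cache the grid the first time it is needed so the reshape/permute does not run on every forward pass.

```python
def _positional_embedding_grid(self) -> torch.Tensor:
    # Build the (1, embed_dim, h, w) grid once and cache it; `position_embedding`
    # and `position_embedding_size` are the attributes from the snippet above.
    if getattr(self, "_cached_pos_grid", None) is None:
        self._cached_pos_grid = (
            self.position_embedding.weight.reshape(
                self.position_embedding_size, self.position_embedding_size, -1
            )
            .permute(2, 0, 1)
            .unsqueeze(0)
        )
    return self._cached_pos_grid
```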

Comment on lines +111 to +113
mode = "bilinear"
align_corners = False
antialias = True
Member

used only once; no need to save these in vars, we can pass them directly to the interpolate call
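
For illustration, the inlined form would read roughly like this (assuming `F` is `torch.nn.functional`, with `pos_grid`/`target_h`/`target_w` standing in for the surrounding variables):

```python
import torch.nn.functional as F

resized = F.interpolate(
    pos_grid,                   # (1, embed_dim, h, w) positional grid
    size=(target_h, target_w),  # hypothetical target resolution
    mode="bilinear",
    align_corners=False,
    antialias=True,
)
```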

Comment on lines +782 to +789
class RopeScaling(TypedDict, total=False):
rope_type: str
factor: float
mrope_section: list[int]
mrope_interleaved: bool
low_freq_factor: float
high_freq_factor: float
original_max_position_embeddings: int
Member

let's have the rope params in config.text_config.rope_scaling. We are planning a huge rope refactor, and until then it will be easier to follow the existing format
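
For illustration, the existing format keeps the rope parameters as a plain dict on the text config (values below are placeholders, not Isaac's real settings):

```python
from transformers import Qwen3Config

# Illustrative only: rope parameters live under text_config.rope_scaling
# as a plain dict rather than a model-specific TypedDict.
text_config = Qwen3Config(rope_scaling={"rope_type": "linear", "factor": 2.0})
print(text_config.rope_scaling)
```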

Member

the comment is still relevant. We will hopefully be merging the refactor next week, and we will add a TypedDict type hint there. For now let's follow the standard way and type-hint it as a plain dict

Comment on lines +903 to +912
def apply_chat_template(
self,
messages: list[dict[str, Any]],
tokenize: bool = False,
add_generation_prompt: bool = False,
**kwargs,
) -> Any:
return self.tokenizer.apply_chat_template(
messages, tokenize=tokenize, add_generation_prompt=add_generation_prompt, **kwargs
)
Member

not needed to override

def __init__(
self,
tokenizer: AutoTokenizer,
config: IsaacConfig,
Member

the processor class does not expect a config and needs to be initialized with tokenizer, image_processor, and kwargs
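
A minimal sketch of the expected shape, following the usual ProcessorMixin conventions (class names and attributes here are illustrative):

```python
from transformers.processing_utils import ProcessorMixin


class IsaacProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        # No model config here: the processor is assembled from its
        # sub-processors plus keyword arguments.
        super().__init__(image_processor, tokenizer, **kwargs)
```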

Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment on lines +1125 to +1129
super().__init__(config)
self.layers = torch.nn.ModuleList(
[Qwen3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self.rotary_emb = IsaacRotaryEmbedding(config, device=self.device)
Member

if the text model is identical to Qwen3, we need to init it with AutoModel.from_config(text_config). Otherwise we need to add an IsaacTextModel by copying it with modular
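
For illustration, the AutoModel route looks roughly like this (a sketch assuming config.text_config is a standard Qwen3Config):

```python
from transformers import AutoModel, Qwen3Config


def build_text_backbone(text_config: Qwen3Config):
    # Instantiate the language model from its sub-config instead of
    # re-declaring the Qwen3 decoder layers inside the Isaac modeling file.
    return AutoModel.from_config(text_config)
```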

@philippguevorguian

@zucchini-nlp just pushed a round of updates addressing your feedback; give it another look when you get a moment

@zucchini-nlp (Member) left a comment

Thanks for iterating! I gave a quick look at the first ~1500 lines.

We are currently insisting on certain standards before merging models, as it causes many issues otherwise. Overall the code looks good, but I feel like the model still needs more standardization. So I left some comments, and I will review the rest on Monday. Not sure if we can keep the tensorstream; let me see in detail how it is used

Comment thread src/transformers/models/isaac/modular_isaac.py
Comment on lines +16 to +19
try:
from torchvision.transforms.v2 import functional as TVF
except ImportError:
TVF = None
Member

this is usually imported with is_torchvision_available() via `from ...utils import is_torchvision_available`. If it is used only in the fast image processing file, then we don't even need to safe-import it :)
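
The usual pattern looks like this (sketched with an absolute import; inside transformers it would be the relative `from ...utils import ...`):

```python
from transformers.utils import is_torchvision_available

if is_torchvision_available():
    from torchvision.transforms.v2 import functional as TVF
else:
    TVF = None  # callers must check availability before using TVF
```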

Comment on lines +74 to +77
# Vision preprocessing constants
VISION_MEAN = (0.5, 0.5, 0.5)
VISION_STD = (0.5, 0.5, 0.5)
VISION_SCALE = 1 / 255
Member

the mean and std are available as IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD in image_utils. The other one can just be assigned to a var where needed. Let's not have global vars
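
For reference, the constants in question:

```python
from transformers.image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD

print(IMAGENET_STANDARD_MEAN)  # [0.5, 0.5, 0.5]
print(IMAGENET_STANDARD_STD)   # [0.5, 0.5, 0.5]
```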

Comment on lines +144 to +149
do_rescale: bool | None
rescale_factor: float | None
do_normalize: bool | None
image_mean: float | Sequence[float] | None
image_std: float | Sequence[float] | None
do_convert_rgb: bool | None
Member

these are already defined in the base class, no need to duplicate

Comment on lines +155 to +160
slow_image_processor_class = "IsaacImageProcessor"

resample = PILImageResampling.BILINEAR
model_input_names = ["patches", "token_grids"]
valid_kwargs = IsaacImageProcessorKwargs
unused_kwargs = ["size", "do_center_crop", "crop_size"]
Member

the class attrs here should be all the image args with their default values, i.e. do_resize, size, image_mean, etc. The attr slow_image_processor_class is not needed
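
A sketch of what that typically looks like on a fast image processor (default values are illustrative, not Isaac's real ones):

```python
from transformers.image_processing_utils_fast import BaseImageProcessorFast
from transformers.image_utils import (
    IMAGENET_STANDARD_MEAN,
    IMAGENET_STANDARD_STD,
    PILImageResampling,
)


class IsaacImageProcessorFast(BaseImageProcessorFast):
    # All image args with their defaults live directly on the class.
    resample = PILImageResampling.BILINEAR
    image_mean = IMAGENET_STANDARD_MEAN
    image_std = IMAGENET_STANDARD_STD
    do_resize = True
    do_rescale = True
    rescale_factor = 1 / 255
    do_normalize = True
    model_input_names = ["patches", "token_grids"]
```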

# Configuration
# ============================================================================

MAX_PIXELS = 60_000_000 # 60‑megapixel ceiling ≈ 8200 × 7300 px
Member

same, global vars are better assigned where they are used. In this case it's probably a cls attr of the fast image processor

Comment on lines +782 to +789
class RopeScaling(TypedDict, total=False):
rope_type: str
factor: float
mrope_section: list[int]
mrope_interleaved: bool
low_freq_factor: float
high_freq_factor: float
original_max_position_embeddings: int
Member

the comment is still relevant. We will hopefully be merging the refactor next week, and we will add a TypedDict type hint there. For now let's follow the standard way and type-hint it as a plain dict

Comment on lines +1220 to +1233
# EventStreamProcessor parameters (for backward compatibility)
self.video_patch_size = vision_patch_size
self.vision_max_num_patches = vision_max_num_patches
self.vision_min_num_patches = vision_min_num_patches
self.pixel_shuffle_scale = pixel_shuffle_scale

# Vision normalization parameters
self.vision_rescale_factor = float(vision_rescale_factor)
self.vision_mean = _normalize_rgb_values(vision_mean, name="vision_mean")
self.vision_std = _normalize_rgb_values(vision_std, name="vision_std")

# Processing parameters
self.max_sequence_length = max_sequence_length
self.vision_token = vision_token
Member

these are params from processing. Do we really need them in the model code? I did not see where they are used, and usually modeling does not need to know how processing happened

# Processing parameters
self.max_sequence_length = max_sequence_length
self.vision_token = vision_token
self.vision_attn_implementation = vision_attn_implementation
Member

not needed, the vision config has its own attn implementation attribute

Comment on lines +1236 to +1241
def get_text_config(self, *_, **kwargs) -> Qwen3Config:
# Accept optional decoder/encoder flags to align with HF composite configs
kwargs.pop("decoder", None)
kwargs.pop("encoder", None)
return self.text_config

Member

i think the super get_text_config works well, or is it raising errors?

@philippguevorguian

@zucchini-nlp just pushed another round of refinements based on the latest feedback. Would love for you to give it another pass when you get a chance

@philippguevorguian

Hi @zucchini-nlp, gentle ping on this PR when you get a chance. Thanks for your time!

@philippguevorguian

Hey @zucchini-nlp - just wanted to follow up again on this PR. It’s been a little while since the last round, and your review would help us move things forward. Appreciate it!

@molbap molbap self-requested a review October 29, 2025 09:33
@@ -0,0 +1,2278 @@
# Perceptron, Inc. Non-Production License
Contributor

Missing date and usual format header

Comment on lines +94 to +111
from genesis.public.tensorstream.tensor_stream import (
Event,
Stream,
TensorStream,
TextType,
VisionType,
create_stream,
group_streams,
)
from genesis.public.tensorstream.tensor_stream_utils import (
compute_mrope_pos_tensor,
modality_mask,
reconstruct_tensor_stream_from_compact_dict,
tensor_stream_token_view,
)
from genesis.public.tensorstream.tensor_stream_utils import (
slice as ts_slice,
)
Contributor

This is the biggest hurdle for now: if I understand correctly, this is an if/else path for TensorStream. We don't want to add a new dependency to transformers (and here the import will simply fail).

Comment on lines +180 to +186
@property
def attn_implementation(self) -> str | None:
return self._attn_implementation

@attn_implementation.setter
def attn_implementation(self, value: str | None) -> None:
self._attn_implementation = value
Contributor

should not be needed

Comment on lines +2251 to +2255
AutoImageProcessor.register(
IsaacConfig,
fast_image_processor_class=IsaacImageProcessorFast,
exist_ok=True,
)
Contributor

This is not needed: mappings in processing_auto are enough

@philippguevorguian

Addressing a comment and question from the quick review: is it absolutely necessary for the core functionality of the model to depend on tensorstream?

TensorStream is a core primitive for us; it underpins how we handle multimodal computation, and future open-source releases will include incremental improvements to its ergonomics and performance aligned with model updates. We see it as the right abstraction boundary for multimodal modelling.

That said, we’ll be guarding the imports here to avoid issues on that front. As a heads up, we'll also be following up with changes here that address feedback around pattern-reuse standards in transformers; feedback there is much appreciated.

@molbap
Contributor

molbap commented Oct 31, 2025

Thanks @philippguevorguian, then I will see if we can integrate it. I can't say for now whether we'll be able to, as our principle for modeling files is to not abstract too much and to keep the models "hackable": a user should be able to intervene at any given point of a model's forward, adding a module, modifying the processing, etc. But if the rest is more transformers-aligned, and since the model is relevant in general, it'll be easier.

In particular, attention classes should rely on the existing machinery (for instance integrations/flash_attention.py) for FA. To explain: the policy is to have one "naive" path, eager_attention_forward, a Callable explicitly defined in the modeling code that serves as the baseline for attention computation. The optimized paths (sdpa, fa, flex, etc.) and their associated masks are handled through config keys that swap this Callable for one wrapping an efficient kernel.
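
A condensed sketch of that policy as it appears across transformers models (simplified: real modeling files also handle GQA key/value repetition, dropout, etc.):

```python
import torch
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def eager_attention_forward(module, query, key, value, attention_mask, scaling, **kwargs):
    # The "naive" baseline path: explicit matmul + softmax, easy to hack on.
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
    attn_weights = torch.nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights


# In the attention module's forward, a config key swaps the Callable for an
# optimized implementation (sdpa, flash attention, flex, ...):
# attention_interface = eager_attention_forward
# if self.config._attn_implementation != "eager":
#     attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
```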

For cross-document masking, combinations of existing utils should be sufficient, like masking_utils.py which defines and/or mask operators, rather than utils specific to this model. If these two "reuse" points are addressed it'll be excellent
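
For illustration, composing an existing causal mask with a packing constraint might look like this (a sketch assuming the `and_masks` and `causal_mask_function` helpers from masking_utils.py; the document-id primitive is hypothetical):

```python
import torch
from transformers.masking_utils import and_masks, causal_mask_function

document_ids = torch.tensor([[0, 0, 0, 1, 1]])  # toy packed batch with two documents


def make_same_document_mask(doc_ids):
    # Hypothetical primitive: allow attention only within the same packed document.
    def inner(batch_idx, head_idx, q_idx, kv_idx):
        return doc_ids[batch_idx, q_idx] == doc_ids[batch_idx, kv_idx]

    return inner


# Combine the causal constraint with the document boundary, instead of
# writing a model-specific masking utility.
mask_fn = and_masks(causal_mask_function, make_same_document_mask(document_ids))
```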

@zucchini-nlp (Member) left a comment

Huuge work cleaning it up. It already looks much much better 🤩

We need to rebase on main since there were a few big refactors merged recently. I commented below on what needs to change for them. The main question I have is about padding, since we should try to delegate pad/truncate to the tokenizer's __call__ method. The rest is mostly nits about styling

Comment thread docs/source/en/model_doc/isaac.md Outdated
@@ -0,0 +1,128 @@
<!--Copyright 2025 Perceptron, Inc. and The HuggingFace Inc. team. All rights reserved.
Member

nit: 2026 in all files


Oops, missed this on the prior pass; resolved

Comment thread docs/source/en/model_doc/isaac.md Outdated
Comment on lines +36 to +42
Key implementation notes:

- **Packed vision attention** – `IsaacVisionEncoder` keeps track of per-image patch lengths and uses specialized attention
kernels with custom `AttentionMaskConverter` utilities so the decoder only applies attention to real patches while supporting
both FlashAttention and SDPA.
- **TensorStream-first pipeline** – `IsaacProcessor` converts chat templates into multimodal streams where every image gets a
dedicated event with spatial metadata. `IsaacModel` can embed that stream directly (using `embed_stream`) and automatically
Member

maybe the doc is a bit outdated with all the changes. Let's update it before merging


Updated the doc here with all of the changes

("imagegpt", ("ImageGPTImageProcessor", "ImageGPTImageProcessorFast")),
("instructblip", ("BlipImageProcessor", "BlipImageProcessorFast")),
("internvl", ("GotOcr2ImageProcessor", "GotOcr2ImageProcessorFast")),
("isaac", (None, "IsaacImageProcessorFast")),
Member

oops, this needs a rebase on main, we merged a huge refactor yesterday. The mapping is now a `dict[str, dict[str, str]]`


Made the update with the TorchVisionBackend

Comment on lines -45 to -51
if TYPE_CHECKING:
# This significantly improves completion suggestion performance when
# the transformers package is used with Microsoft's Pylance language server.
PROCESSOR_MAPPING_NAMES: OrderedDict[str, str | None] = OrderedDict()
else:
PROCESSOR_MAPPING_NAMES = OrderedDict(
[
Member

accidentally removed TYPE_CHECKING?


Yes, merge mistake. Reverted

@@ -0,0 +1,330 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
Member

this file's contents need to be moved to image_processing_isaac.py to align with the refactor. I think rebase and modular_convert will do everything for you


Yep, after updating the processing related code this was changed

Comment on lines +1368 to +1369
else:
mm_token_type_ids = mm_token_type_ids.to(device=inputs_embeds.device, dtype=torch.long)
Member

same comment about moving to devices


Removed device handling logic here

Comment on lines +1390 to +1391
if isinstance(attention_mask, dict):
attention_mask = attention_mask.get("full_attention", next(iter(attention_mask.values())))
Member

do we want to get non-full attention? That would probably not work well for building position ids


Made this more opinionated to be all full attention

Comment on lines +1533 to +1535
vision_patch_attention_mask = (
image_patch_attention_mask if vision_patch_attention_mask is None else vision_patch_attention_mask
)
Member

there are two args that look suspiciously similar, are they the same thing with different names?


Standardized this on image_patch_attention_mask

Comment on lines +1537 to +1549
if position_ids is None or position_ids.ndim == 2:
position_ids = self._prepare_position_ids_for_generation(
input_ids,
{
"input_ids": input_ids,
"attention_mask": attention_mask,
"past_key_values": past_key_values,
"mm_token_type_ids": mm_token_type_ids,
"vision_token_grids": vision_token_grids,
"vision_token_offsets": vision_token_offsets,
"vision_token_lengths": vision_token_lengths,
},
)
Member

this already happens when we call generate(), so we shouldn't need to call it manually at each decode step


thanks, dropped this

Comment on lines +1626 to +1628
def get_input_embeddings(self) -> nn.Module:
return self.model.get_input_embeddings()

Member

not needed, base class can work it out when it is inside model


Removed!

@zucchini-nlp
Member

Also would be great to fix the CI before we request core maintainer's review (the last step before merging)

philippguevorguian and others added 9 commits March 23, 2026 20:25
…#16)

* fix: use torchvisionbackend

* fix: import IsaacImageProcessor

* fix: resample not interpolation

* style: organize import

* chore: auto processing auto from main

* feat: register isaac image processor according to new convention

* fix: update to new config style

* fix: correct pix2struct import

* docs: initial doc update

* feat: re-register isaac processor to auto

* refactor: move max_position_embeddings to isaac config

* TEMP pop!

* docs: update date

* style: remove removed attr

* style: add config attr for completeness

* style: drop redundant merge_with_config_defaults

* style: remove redundant positions ids handling

* refactor: rely on base class for setting embeddings

* fix: always use full attention

* style: clarify padding logic

* chore: remove stale artifact

* fix: kwargs name!

* refactor: isolate custom padding to image processor pad method

* feat: no device movement

* style: align with transformers standard for loading rope params

* refactor: drop unneeded arg filter

* feat: compile check image presence instead

* docs: add clarifying comment for keeping empty tensors

* refactor: move broadcasting to forward WIP

* style: use new layer validation functionality

* feat: update embedding access mixin to support nested paths!

* chore: convert artifacts

* refactor: inline build batch

* style: drop duplicate test

* test: polygons

* feat: polygon extraction

* test: polygon generation test

* style: align with new config implementation style

* style: add date

* chore: remove all isaac image processor fast

* test: new image processor test setup

* feat: special path for uint8 interp

* test: update image processor test for torchvision backend

* fix: don't mutate nested outputs; copy image index

* style: drop redundant copy

* chore: make fix-repo

* chore: convert artifacts

* chore: fix docstring

* style: update imports

* refactor: unify vision attention mask

test: unified vision attention mask

fix: correct arg name

* fix: standardize on the image attention mask name

* style: remove fast import from modeling test

* refactor: drop image_attention_mask from external interface

* style: drop rescale factor from overall processor attrs; no longer used
@zucchini-nlp zucchini-nlp marked this pull request as ready for review March 25, 2026 09:26
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment on lines +290 to +296
if all(len(sample_images) == 0 for sample_images in images):
tensors = {
"vision_patches": torch.zeros((batch_size, 0, 0, 0), dtype=torch.float32),
"vision_patch_attention_mask": torch.zeros((batch_size, 0, 0), dtype=torch.long),
"vision_token_grids": torch.zeros((batch_size, 0, 2), dtype=torch.long),
}
return BatchFeature(data=tensors, tensor_type=return_tensors)
Member

weird, inside processor we shouldn't be routing empty lists. I'll go check processing code

Comment on lines +788 to +790
for key in ("use_cache", "rope_theta", "max_position_embeddings"):
kwargs.pop(key, None)

Member

I guess this comes from saved checkpoints, which we don't want to re-save?

Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated


@auto_docstring
class IsaacModel(Qwen3PreTrainedModel):
Member

let's instead copy from qwen-VL so the identical parts can be reused. As mentioned, I am seeing a few similar parts such as get_placeholder_mask and compute3d_position_ids. Other parts like forward need new args, so they can be copied and adapted

IMO things could be simplified


This now inherits from Qwen3VL after the big refactor. After trying a few approaches here, delegating the model-helper logic to the qwen3vl stack generally leads to increased complexity, as the Isaac image truncation logic differs from qwen's policy

philippguevorguian and others added 5 commits April 13, 2026 15:41
* clean up and rearrange code

* fix: allow cutting through any image span

* test: cropping middle of arbitrary image

* test: drop stale / redundant tests

* test: drop flash attn debug test

* test: drop stale helpers

* fix: restore isaac processor compatibility

* test: use public api in isaac integration tests

* fix: restore isaac generation outputs

* simplify

* style: simplify 2

* test: drop redundant tests

* test: drop more low level image processing tests

* test: no plan to define 4 channel numpy processing

* test: drop image processor properties test

* test: focus image processing tests

* test: drop unneeded input trimming helper, chat template now omits the newline by default

* tests: enable Isaac tokenizer defaults coverage

* isaac: support assistant mask chat template tests

* tests: cover Isaac image placeholder expansion

* tests: patch Isaac chat template for assistant masks

* tests: use Isaac default assistant mask template

* tests: align Isaac image batching coverage

* tests: drop unneeded utilities / low-level tests

* style: isaacvisionmodel not isaacvisiontransformer

* tests: clean up imports

* wip 1

* style: drop now unneeded check_argument_for_proper_class override

* test: don't skip where assisted decoding works

* style: inherit from closer base class

* style: lint

* chore: convert artifacts

---------

Co-authored-by: raushan <raushan@huggingface.co>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, isaac

@zucchini-nlp (Member) left a comment

Okay, I have a few questions on processing and 2D RoPE based on the previous suggestions. I left questions below; it would be great to incorporate the earlier suggestions if possible, or add a small comment in the code noting that the main diff is [....]

Also looked at the test file, and I'm seeing a PR ckpt used for integration tests. Do you know what the plan will be after merging the model? (Do we merge the hub changes, or do we create a new hub repo, so we can clean up the testing ckpt before a final core review?)

size: SizeDict,
**kwargs,
) -> torch.Tensor:
if image.dtype == torch.uint8:
Member

a small explanation comment would be nice on uint8

Comment on lines +77 to +85
prompt = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
add_generation_prompt=True,
return_tensors="pt",
)

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
Member

the output from apply_chat_template is already a dict of inputs
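
In other words, with `tokenize=True` and `return_dict=True` the template call already returns model-ready tensors, so the second processor call is redundant. A sketch of the intended usage:

```python
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
```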

f"vision_config must be a dict or an IsaacVisionConfig instance, got {type(self.vision_config).__name__}."
)

self.vision_rescale_factor = float(self.vision_rescale_factor)
Member

nit: per the type annotation it will already be a float (or error out otherwise). So either we annotate it as (int | float) or we can just delete this line?

Comment on lines +370 to +371
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
Member

these two will be auto-filled by decorators, can delete now

Comment on lines +469 to +486
if image_metadata is None:
pixel_shuffle_scale = self.config.vision_config.pixel_shuffle_scale_factor
downsampled_height = flat_image_grid_thw[:, 1].div(pixel_shuffle_scale, rounding_mode="floor")
downsampled_width = flat_image_grid_thw[:, 2].div(pixel_shuffle_scale, rounding_mode="floor")
lengths = downsampled_height * downsampled_width
offsets = torch.zeros_like(lengths)
else:
torch_compilable_check(
image_metadata.shape[:2] == image_grid_thw.shape[:2],
"IsaacModel.get_image_features expects batch-major metadata aligned with `image_grid_thw`.",
)
offsets = image_metadata[active_slot_mask][:, 0]
lengths = image_metadata[active_slot_mask][:, 1]

image_features = tuple(
projected_features[image_idx, offset : offset + length]
for image_idx, (offset, length) in enumerate(zip(offsets.tolist(), lengths.tolist(), strict=True))
)
Member

maybe shorter?

if image_metadata is not None:
    offsets = [...]
    projected_features = tuple([... for offset in offsets])

return pooler_output=projected_features

generated_ids = outputs.sequences[:, inputs["input_ids"].shape[1] :]
generated_text = self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
expected_fragment = "The image is a close-up photograph of a red cross symbol."
assert expected_fragment in generated_text
Member

self.assertListEqual and self.assertEqual are much better for us when checking CI outputs, can you fix this everywhere?

"sum": 92510.4578057677,
"l2_norm": 3490.2146142251,
}
assert logit_stats == expected_logit_stats
Member

torch.allclose would be better, we always get numerical precision diffs due to hardware or package versions
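
For illustration, the tolerant form of the check (tolerances are arbitrary placeholders):

```python
import torch

# `logit_stats` is the dict computed above; compare with a tolerance so small
# numerical drift across hardware or package versions does not fail the test.
actual = torch.tensor([logit_stats["sum"], logit_stats["l2_norm"]])
expected = torch.tensor([92510.4578057677, 3490.2146142251])
self.assertTrue(torch.allclose(actual, expected, rtol=1e-4, atol=1e-2))
```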

Comment on lines +595 to +600
sample_lengths = [single_input["input_ids"].squeeze(0).shape[0] for single_input in single_inputs]
for i, (single_input, batch_ids, single_len) in enumerate(zip(single_inputs, batch_input_ids, sample_lengths)):
single_ids = single_input["input_ids"].squeeze(0)
torch.testing.assert_close(batch_ids[-single_len:], single_ids)

batch_modality_row = batch_inputs["mm_token_type_ids"][i]
Member

This is testing processor outputs, which imo belongs in test_processor_isaac.

Let's move it if we need it; not sure if it is already being tested there. If it's a duplicate or already covered by a similar unit test, we can simply delete it 😅

max_new_tokens = 256
dtype = torch.bfloat16

def setUp(self):
Member

same comments on this IntegrationTest

Comment on lines +47 to +49
BASE_MODEL_ID = os.environ.get("ISAAC_TEST_MODEL_ID", "PerceptronAI/Isaac-0.1-Base")
BASE_MODEL_REVISION = os.environ.get("ISAAC_TEST_MODEL_REVISION", "refs/pr/3") or None
LOCAL_CHECKPOINT = os.environ.get("ISAAC_TEST_MODEL_PATH")
Member

same here
