perceptron: Isaac-0.1 implementation #40962

Open

AkshatSh wants to merge 100 commits into huggingface:main from perceptron-ai-inc:main

Conversation

@AkshatSh

Perceptron Isaac Implementation

Perceptron released Isaac-0.1 and Isaac-0.1-Base, open-weight 2B dense models for perception.

@AkshatSh AkshatSh marked this pull request as draft September 18, 2025 07:05
@zucchini-nlp
Member

Nice, lmk if you need any help or a quick review 🤗

@AkshatSh
Author

@zucchini-nlp - we are just wrapping this up; feel free to give us a quick review while we finish testing!

@zucchini-nlp
Member

amazing, reviewing on Monday :)

@zucchini-nlp (Member) left a comment

Nice job. I did not review everything; I mainly went over it to check whether the PR follows the transformers API. The most important comments are:

  1. We need to add an image processor class and wrap the image transforms inside it
  2. Is the "tensorstream" required to get the model running? AFAIU we can discard it, which is probably what we'll need to do. Transformers does not support streaming of multimodal inputs yet
  3. The model structure has to follow the standards of other VLMs, with correct class names

LMK if you have any questions 🤗

Comment on lines +42 to +48
from perceptron.tensorstream.ops import (
compute_mrope_pos_tensor,
modality_mask,
reconstruct_tensor_stream_from_compact_dict,
slice as ts_slice,
tensor_stream_token_view,
)
Member

hmm, is this for streaming video data? I am not sure we can add it as is; we need to either safe-import it or find a way to integrate it in self.video_processor

Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
logger = logging.get_logger(__name__)


class PixelShuffleSiglip2VisionConfig(Siglip2VisionConfig):
Member

we need to call it PerceptronVisionConfig with the model's name

Comment on lines +102 to +107
# Prepare positional embeddings grid: (1, embed_dim, h, w)
positional_embeddings = (
self.position_embedding.weight.reshape(self.position_embedding_size, self.position_embedding_size, -1)
.permute(2, 0, 1)
.unsqueeze(0)
)
Member

we can prepare it once when the model is initialized, no?
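
A minimal sketch of that suggestion (illustrative, not the PR's actual code): cache the grid the first time it is needed so the reshape/permute does not run on every forward pass.

```python
def _positional_embedding_grid(self) -> torch.Tensor:
    # Build the (1, embed_dim, h, w) grid once and cache it; `position_embedding`
    # and `position_embedding_size` are the attributes from the snippet above.
    if getattr(self, "_cached_pos_grid", None) is None:
        self._cached_pos_grid = (
            self.position_embedding.weight.reshape(
                self.position_embedding_size, self.position_embedding_size, -1
            )
            .permute(2, 0, 1)
            .unsqueeze(0)
        )
    return self._cached_pos_grid
```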

Comment on lines +111 to +113
mode = "bilinear"
align_corners = False
antialias = True
Member

used only once; no need to save these in vars, we can pass them directly to the interpolate call
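
For illustration, the inlined form would read roughly like this (assuming `F` is `torch.nn.functional`, with `pos_grid`/`target_h`/`target_w` standing in for the surrounding variables):

```python
import torch.nn.functional as F

resized = F.interpolate(
    pos_grid,                   # (1, embed_dim, h, w) positional grid
    size=(target_h, target_w),  # hypothetical target resolution
    mode="bilinear",
    align_corners=False,
    antialias=True,
)
```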

Comment on lines +782 to +789
class RopeScaling(TypedDict, total=False):
rope_type: str
factor: float
mrope_section: list[int]
mrope_interleaved: bool
low_freq_factor: float
high_freq_factor: float
original_max_position_embeddings: int
Member

let's have the rope params in config.text_config.rope_scaling. We are planning a huge rope refactor, and until then it will be easier to follow the existing format
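
For illustration, the existing format keeps the rope parameters as a plain dict on the text config (values below are placeholders, not Isaac's real settings):

```python
from transformers import Qwen3Config

# Illustrative only: rope parameters live under text_config.rope_scaling
# as a plain dict rather than a model-specific TypedDict.
text_config = Qwen3Config(rope_scaling={"rope_type": "linear", "factor": 2.0})
print(text_config.rope_scaling)
```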

Member

the comment is still relevant. We will hopefully be merging the refactor next week, and we will add a TypedDict type hint there. For now let's follow the standard way and type-hint it as a plain dict

Comment on lines +903 to +912
def apply_chat_template(
self,
messages: list[dict[str, Any]],
tokenize: bool = False,
add_generation_prompt: bool = False,
**kwargs,
) -> Any:
return self.tokenizer.apply_chat_template(
messages, tokenize=tokenize, add_generation_prompt=add_generation_prompt, **kwargs
)
Member

not needed to override

def __init__(
self,
tokenizer: AutoTokenizer,
config: IsaacConfig,
Member

the processor class does not expect a config and needs to be initialized with tokenizer, image_processor, and kwargs
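
A minimal sketch of the expected shape, following the usual ProcessorMixin conventions (class names and attributes here are illustrative):

```python
from transformers.processing_utils import ProcessorMixin


class IsaacProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        # No model config here: the processor is assembled from its
        # sub-processors plus keyword arguments.
        super().__init__(image_processor, tokenizer, **kwargs)
```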

Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment on lines +1125 to +1129
super().__init__(config)
self.layers = torch.nn.ModuleList(
[Qwen3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self.rotary_emb = IsaacRotaryEmbedding(config, device=self.device)
Member

if the text model is identical to Qwen3, we need to init it with AutoModel.from_config(text_config). Otherwise we need to add an IsaacTextModel by copying it with modular
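
For illustration, the AutoModel route looks roughly like this (a sketch assuming config.text_config is a standard Qwen3Config):

```python
from transformers import AutoModel, Qwen3Config


def build_text_backbone(text_config: Qwen3Config):
    # Instantiate the language model from its sub-config instead of
    # re-declaring the Qwen3 decoder layers inside the Isaac modeling file.
    return AutoModel.from_config(text_config)
```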

@philippguevorguian

@zucchini-nlp just pushed a round of updates addressing your feedback; give it another look when you get a moment

@zucchini-nlp (Member) left a comment

Thanks for iterating! I gave a quick look at the first ~1500 lines.

We are currently insisting on certain standards before merging models, as it causes many issues otherwise. Overall the code looks good, but I feel like the model still needs more standardization. So I left some comments, and I will review the rest on Monday. Not sure if we can keep the tensorstream; let me see in detail how it is used

Comment thread src/transformers/models/isaac/modular_isaac.py
Comment on lines +16 to +19
try:
from torchvision.transforms.v2 import functional as TVF
except ImportError:
TVF = None
Member

this is usually imported with is_torchvision_available() via `from ...utils import is_torchvision_available`. If it is used only in the fast image processing file, then we don't even need to safe-import it :)
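
The usual pattern looks like this (sketched with an absolute import; inside transformers it would be the relative `from ...utils import ...`):

```python
from transformers.utils import is_torchvision_available

if is_torchvision_available():
    from torchvision.transforms.v2 import functional as TVF
else:
    TVF = None  # callers must check availability before using TVF
```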

Comment on lines +74 to +77
# Vision preprocessing constants
VISION_MEAN = (0.5, 0.5, 0.5)
VISION_STD = (0.5, 0.5, 0.5)
VISION_SCALE = 1 / 255
Member

the mean and std are available as IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD in image_utils. The other one can just be assigned to a var where needed. Let's not have global vars
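
For reference, the constants in question:

```python
from transformers.image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD

print(IMAGENET_STANDARD_MEAN)  # [0.5, 0.5, 0.5]
print(IMAGENET_STANDARD_STD)   # [0.5, 0.5, 0.5]
```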

Comment on lines +144 to +149
do_rescale: bool | None
rescale_factor: float | None
do_normalize: bool | None
image_mean: float | Sequence[float] | None
image_std: float | Sequence[float] | None
do_convert_rgb: bool | None
Member

these are already defined in the base class, no need to duplicate

Comment on lines +155 to +160
slow_image_processor_class = "IsaacImageProcessor"

resample = PILImageResampling.BILINEAR
model_input_names = ["patches", "token_grids"]
valid_kwargs = IsaacImageProcessorKwargs
unused_kwargs = ["size", "do_center_crop", "crop_size"]
Member

the class attrs here should be all the image args with their default values, i.e. do_resize, size, image_mean, etc. The attr slow_image_processor_class is not needed
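
A sketch of what that typically looks like on a fast image processor (default values are illustrative, not Isaac's real ones):

```python
from transformers.image_processing_utils_fast import BaseImageProcessorFast
from transformers.image_utils import (
    IMAGENET_STANDARD_MEAN,
    IMAGENET_STANDARD_STD,
    PILImageResampling,
)


class IsaacImageProcessorFast(BaseImageProcessorFast):
    # All image args with their defaults live directly on the class.
    resample = PILImageResampling.BILINEAR
    image_mean = IMAGENET_STANDARD_MEAN
    image_std = IMAGENET_STANDARD_STD
    do_resize = True
    do_rescale = True
    rescale_factor = 1 / 255
    do_normalize = True
    model_input_names = ["patches", "token_grids"]
```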

# Configuration
# ============================================================================

MAX_PIXELS = 60_000_000 # 60‑megapixel ceiling ≈ 8200 × 7300 px
Member

same, global vars are better assigned where they are used. In this case it's probably a cls attr of the fast image processor

Comment on lines +782 to +789
class RopeScaling(TypedDict, total=False):
rope_type: str
factor: float
mrope_section: list[int]
mrope_interleaved: bool
low_freq_factor: float
high_freq_factor: float
original_max_position_embeddings: int
Member

the comment is still relevant. We will hopefully be merging the refactor next week, and we will add a TypedDict type hint there. For now let's follow the standard way and type-hint it as a plain dict

Comment on lines +1220 to +1233
# EventStreamProcessor parameters (for backward compatibility)
self.video_patch_size = vision_patch_size
self.vision_max_num_patches = vision_max_num_patches
self.vision_min_num_patches = vision_min_num_patches
self.pixel_shuffle_scale = pixel_shuffle_scale

# Vision normalization parameters
self.vision_rescale_factor = float(vision_rescale_factor)
self.vision_mean = _normalize_rgb_values(vision_mean, name="vision_mean")
self.vision_std = _normalize_rgb_values(vision_std, name="vision_std")

# Processing parameters
self.max_sequence_length = max_sequence_length
self.vision_token = vision_token
Member

these are params from processing. Do we really need them in the model code? I did not see where they are used, and usually modeling does not need to know how processing happened

# Processing parameters
self.max_sequence_length = max_sequence_length
self.vision_token = vision_token
self.vision_attn_implementation = vision_attn_implementation
Member

not needed, the vision config has its own attn implementation attribute

Comment on lines +1236 to +1241
def get_text_config(self, *_, **kwargs) -> Qwen3Config:
# Accept optional decoder/encoder flags to align with HF composite configs
kwargs.pop("decoder", None)
kwargs.pop("encoder", None)
return self.text_config

Member

i think the super get_text_config works well, or is it raising errors?

@philippguevorguian

@zucchini-nlp just pushed another round of refinements based on the latest feedback. Would love for you to give it another pass when you get a chance

@philippguevorguian

Hi @zucchini-nlp, gentle ping on this PR when you get a chance. Thanks for your time!

@philippguevorguian

Hey @zucchini-nlp - just wanted to follow up again on this PR. It’s been a little while since the last round, and your review would help us move things forward. Appreciate it!

@molbap molbap self-requested a review October 29, 2025 09:33
@@ -0,0 +1,2278 @@
# Perceptron, Inc. Non-Production License
Contributor

Missing date and usual format header

Comment on lines +94 to +111
from genesis.public.tensorstream.tensor_stream import (
Event,
Stream,
TensorStream,
TextType,
VisionType,
create_stream,
group_streams,
)
from genesis.public.tensorstream.tensor_stream_utils import (
compute_mrope_pos_tensor,
modality_mask,
reconstruct_tensor_stream_from_compact_dict,
tensor_stream_token_view,
)
from genesis.public.tensorstream.tensor_stream_utils import (
slice as ts_slice,
)
Contributor

This is the biggest hurdle for now: if I understand correctly, this is an if/else path for TensorStream. We don't want to add a new dependency to transformers (and here the import will simply fail).

Comment on lines +180 to +186
@property
def attn_implementation(self) -> str | None:
return self._attn_implementation

@attn_implementation.setter
def attn_implementation(self, value: str | None) -> None:
self._attn_implementation = value
Contributor

should not be needed

Comment on lines +2251 to +2255
AutoImageProcessor.register(
IsaacConfig,
fast_image_processor_class=IsaacImageProcessorFast,
exist_ok=True,
)
Contributor

This is not needed: mappings in processing_auto are enough

@philippguevorguian

Addressing a comment and question from the quick review: is it absolutely necessary for the core functionality of the model to depend on tensorstream?

TensorStream is a core primitive for us; it underpins how we handle multimodal computation, and future open-source releases will include incremental improvements to its ergonomics and performance aligned with model updates. We see it as the right abstraction boundary for multimodal modelling.

That said, we’ll be guarding the imports here to avoid issues on that front. As a heads up, we'll also be following up with changes here that address feedback around pattern-reuse standards in transformers; feedback there is much appreciated.

@molbap
Contributor

molbap commented Oct 31, 2025

Thanks @philippguevorguian, then I will see if we can integrate it. I can't say for now whether we'll be able to, as our principle for modeling files is to not abstract too much and to keep the models "hackable": a user should be able to intervene at any given point of a model's forward, adding a module, modifying the processing, etc. But if the rest is more transformers-aligned, and since the model is relevant in general, it'll be easier.

In particular, attention classes should rely on the existing machinery (for instance integrations/flash_attention.py) for FA. To explain: the policy is to have one "naive" path, eager_attention_forward, a Callable explicitly defined in the modeling code that serves as the baseline for attention computation. The optimized paths (sdpa, fa, flex, etc.) and their associated masks are handled through config keys that swap this Callable for one wrapping an efficient kernel.
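
A condensed sketch of that policy as it appears across transformers models (simplified: real modeling files also handle GQA key/value repetition, dropout, etc.):

```python
import torch
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def eager_attention_forward(module, query, key, value, attention_mask, scaling, **kwargs):
    # The "naive" baseline path: explicit matmul + softmax, easy to hack on.
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
    attn_weights = torch.nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights


# In the attention module's forward, a config key swaps the Callable for an
# optimized implementation (sdpa, flash attention, flex, ...):
# attention_interface = eager_attention_forward
# if self.config._attn_implementation != "eager":
#     attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
```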

For cross-document masking, combinations of existing utils should be sufficient, like masking_utils.py which defines and/or mask operators, rather than utils specific to this model. If these two "reuse" points are addressed it'll be excellent
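
For illustration, composing an existing causal mask with a packing constraint might look like this (a sketch assuming the `and_masks` and `causal_mask_function` helpers from masking_utils.py; the document-id primitive is hypothetical):

```python
import torch
from transformers.masking_utils import and_masks, causal_mask_function

document_ids = torch.tensor([[0, 0, 0, 1, 1]])  # toy packed batch with two documents


def make_same_document_mask(doc_ids):
    # Hypothetical primitive: allow attention only within the same packed document.
    def inner(batch_idx, head_idx, q_idx, kv_idx):
        return doc_ids[batch_idx, q_idx] == doc_ids[batch_idx, kv_idx]

    return inner


# Combine the causal constraint with the document boundary, instead of
# writing a model-specific masking utility.
mask_fn = and_masks(causal_mask_function, make_same_document_mask(document_ids))
```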

@zucchini-nlp (Member) left a comment

Huuge work cleaning it up. It already looks much much better 🤩

We need to rebase on main since there were a few big refactors merged recently. I commented below on what needs to change for them. The main question I have is about padding, since we should try to delegate pad/truncate to the tokenizer's __call__ method. The rest is mostly nits about styling

Comment thread docs/source/en/model_doc/isaac.md Outdated
@@ -0,0 +1,128 @@
<!--Copyright 2025 Perceptron, Inc. and The HuggingFace Inc. team. All rights reserved.
Member

nit: 2026 in all files


Oops, missed this on the prior pass; resolved

Comment thread docs/source/en/model_doc/isaac.md Outdated
Comment on lines +36 to +42
Key implementation notes:

- **Packed vision attention** – `IsaacVisionEncoder` keeps track of per-image patch lengths and uses specialized attention
kernels with custom `AttentionMaskConverter` utilities so the decoder only applies attention to real patches while supporting
both FlashAttention and SDPA.
- **TensorStream-first pipeline** – `IsaacProcessor` converts chat templates into multimodal streams where every image gets a
dedicated event with spatial metadata. `IsaacModel` can embed that stream directly (using `embed_stream`) and automatically
Member

maybe the doc is a bit outdated with all the changes. Let's update it before merging


Updated the doc here with all of the changes

("imagegpt", ("ImageGPTImageProcessor", "ImageGPTImageProcessorFast")),
("instructblip", ("BlipImageProcessor", "BlipImageProcessorFast")),
("internvl", ("GotOcr2ImageProcessor", "GotOcr2ImageProcessorFast")),
("isaac", (None, "IsaacImageProcessorFast")),
Member

oops, this needs a rebase on main, we merged a huge refactor yesterday. The mapping is now a `dict[str, dict[str, str]]`


Made the update with the TorchVisionBackend

Comment on lines -45 to -51
if TYPE_CHECKING:
# This significantly improves completion suggestion performance when
# the transformers package is used with Microsoft's Pylance language server.
PROCESSOR_MAPPING_NAMES: OrderedDict[str, str | None] = OrderedDict()
else:
PROCESSOR_MAPPING_NAMES = OrderedDict(
[
Member

accidentally removed TYPE_CHECKING?


Yes, merge mistake. Reverted

@@ -0,0 +1,330 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
Member

this file's contents need to be moved to image_processing_isaac.py to align with the refactor. I think rebase and modular_convert will do everything for you


Yep, after updating the processing related code this was changed

Comment on lines +1368 to +1369
else:
mm_token_type_ids = mm_token_type_ids.to(device=inputs_embeds.device, dtype=torch.long)
Member

same comment about moving to devices


Removed device handling logic here

Comment on lines +1390 to +1391
if isinstance(attention_mask, dict):
attention_mask = attention_mask.get("full_attention", next(iter(attention_mask.values())))
Member

do we want to get non-full attention? That would probably not work well for building position ids


Made this more opinionated to be all full attention

Comment on lines +1533 to +1535
vision_patch_attention_mask = (
image_patch_attention_mask if vision_patch_attention_mask is None else vision_patch_attention_mask
)
Member

there are two args that look suspiciously similar, are they the same thing with different names?


Standardized this on image_patch_attention_mask

Comment on lines +1537 to +1549
if position_ids is None or position_ids.ndim == 2:
position_ids = self._prepare_position_ids_for_generation(
input_ids,
{
"input_ids": input_ids,
"attention_mask": attention_mask,
"past_key_values": past_key_values,
"mm_token_type_ids": mm_token_type_ids,
"vision_token_grids": vision_token_grids,
"vision_token_offsets": vision_token_offsets,
"vision_token_lengths": vision_token_lengths,
},
)
Member

this already happens when we call generate(), so we shouldn't need to call it manually at each decode step


thanks, dropped this

Comment on lines +1626 to +1628
def get_input_embeddings(self) -> nn.Module:
return self.model.get_input_embeddings()

Member

not needed, base class can work it out when it is inside model


Removed!

@zucchini-nlp
Member

Also would be great to fix the CI before we request core maintainer's review (the last step before merging)

philippguevorguian and others added 9 commits March 23, 2026 20:25
…#16)

* fix: use torchvisionbackend

* fix: import IsaacImageProcessor

* fix: resample not interpolation

* style: organize import

* chore: auto processing auto from main

* feat: register isaac image processor according to new convention

* fix: update to new config style

* fix: correct pix2struct import

* docs: initial doc update

* feat: re-register isaac processor to auto

* refactor: move max_position_embeddings to isaac config

* TEMP pop!

* docs: update date

* style: remove removed attr

* style: add config attr for completeness

* style: drop redundant merge_with_config_defaults

* style: remove redundant positions ids handling

* refactor: rely on base class for setting embeddings

* fix: always use full attention

* style: clarify padding logic

* chore: remove stale artifact

* fix: kwargs name!

* refactor: isolate custom padding to image processor pad method

* feat: no device movement

* style: align with transformers standard for loading rope params

* refactor: drop unneeded arg filter

* feat: compile check image presence instead

* docs: add clarifying comment for keeping empty tensors

* refactor: move broadcasting to forward WIP

* style: use new layer validation functionality

* feat: update embedding access mixin to support nested paths!

* chore: convert artifacts

* refactor: inline build batch

* style: drop duplicate test

* test: polygons

* feat: polygon extraction

* test: polygon generation test

* style: align with new config implementation style

* style: add date

* chore: remove all isaac image processor fast

* test: new image processor test setup

* feat: special path for uint8 interp

* test: update image processor test for torchvision backend

* fix: don't mutate nested outputs; copy image index

* style: drop redundant copy

* chore: make fix-repo

* chore: convert artifacts

* chore: fix docstring

* style: update imports

* refactor: unify vision attention mask

test: unified vision attention mask

fix: correct arg name

* fix: standardize on the image attention mask name

* style: remove fast import from modeling test

* refactor: drop image_attention_mask from external interface

* style: drop rescale factor from overall processor attrs; no longer used
@zucchini-nlp zucchini-nlp marked this pull request as ready for review March 25, 2026 09:26
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment on lines +290 to +296
if all(len(sample_images) == 0 for sample_images in images):
tensors = {
"vision_patches": torch.zeros((batch_size, 0, 0, 0), dtype=torch.float32),
"vision_patch_attention_mask": torch.zeros((batch_size, 0, 0), dtype=torch.long),
"vision_token_grids": torch.zeros((batch_size, 0, 2), dtype=torch.long),
}
return BatchFeature(data=tensors, tensor_type=return_tensors)
Member

weird, inside processor we shouldn't be routing empty lists. I'll go check processing code

Comment on lines +788 to +790
for key in ("use_cache", "rope_theta", "max_position_embeddings"):
kwargs.pop(key, None)

Member

I guess this comes from saved checkpoints, which we don't want to re-save?

Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated
Comment thread src/transformers/models/isaac/modular_isaac.py Outdated


@auto_docstring
class IsaacModel(Qwen3PreTrainedModel):
Member

let's instead copy from qwen-VL so the identical parts can be reused. As mentioned, I am seeing a few similar parts such as get_placeholder_mask and compute3d_position_ids. Other parts like forward need new args, so they can be copied and adapted

IMO things could be simplified


This now inherits from Qwen3VL after the big refactor. After trying a few approaches here, delegating the model-helper logic to the qwen3vl stack generally leads to increased complexity, as the Isaac image truncation logic differs from qwen's policy

philippguevorguian and others added 5 commits April 13, 2026 15:41
* clean up and rearrange code

* fix: allow cutting through any image span

* test: cropping middle of arbitrary image

* test: drop stale / redundant tests

* test: drop flash attn debug test

* test: drop stale helpers

* fix: restore isaac processor compatibility

* test: use public api in isaac integration tests

* fix: restore isaac generation outputs

* simplify

* style: simplify 2

* test: drop redundant tests

* test: drop more low level image processing tests

* test: no plan to define 4 channel numpy processing

* test: drop image processor properties test

* test: focus image processing tests

* test: drop unneeded input trimming helper, chat template now omits the newline by default

* tests: enable Isaac tokenizer defaults coverage

* isaac: support assistant mask chat template tests

* tests: cover Isaac image placeholder expansion

* tests: patch Isaac chat template for assistant masks

* tests: use Isaac default assistant mask template

* tests: align Isaac image batching coverage

* tests: drop unneeded utilities / low-level tests

* style: isaacvisionmodel not isaacvisiontransformer

* tests: clean up imports

* wip 1

* style: drop now unneeded check_argument_for_proper_class override

* test: don't skip where assisted decoding works

* style: inherit from closer base class

* style: lint

* chore: convert artifacts

---------

Co-authored-by: raushan <raushan@huggingface.co>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, isaac

@zucchini-nlp (Member) left a comment

Okay, I have a few questions on processing and 2D RoPE based on the previous suggestions. I left questions below; it would be great to incorporate the earlier suggestions if possible, or add a small comment in the code noting that the main diff is [....]

Also looked at the test file, and I'm seeing a PR ckpt used for integration tests. Do you know what the plan will be after merging the model? (Do we merge the hub changes, or do we create a new hub repo, so we can clean up the testing ckpt before a final core review?)

size: SizeDict,
**kwargs,
) -> torch.Tensor:
if image.dtype == torch.uint8:
Member

a small explanation comment would be nice on uint8

Comment on lines +77 to +85
prompt = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
add_generation_prompt=True,
return_tensors="pt",
)

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
Member

the output from apply_chat_template is already a dict of inputs
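
In other words, with `tokenize=True` and `return_dict=True` the template call already returns model-ready tensors, so the second processor call is redundant. A sketch of the intended usage:

```python
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
```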

f"vision_config must be a dict or an IsaacVisionConfig instance, got {type(self.vision_config).__name__}."
)

self.vision_rescale_factor = float(self.vision_rescale_factor)
Member

nit: per the type annotation it will already be a float (or error out otherwise). So either we annotate it as (int | float) or we can just delete this line?

Comment on lines +370 to +371
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
Member

these two will be auto-filled by decorators, can delete now

Comment on lines +469 to +486
if image_metadata is None:
pixel_shuffle_scale = self.config.vision_config.pixel_shuffle_scale_factor
downsampled_height = flat_image_grid_thw[:, 1].div(pixel_shuffle_scale, rounding_mode="floor")
downsampled_width = flat_image_grid_thw[:, 2].div(pixel_shuffle_scale, rounding_mode="floor")
lengths = downsampled_height * downsampled_width
offsets = torch.zeros_like(lengths)
else:
torch_compilable_check(
image_metadata.shape[:2] == image_grid_thw.shape[:2],
"IsaacModel.get_image_features expects batch-major metadata aligned with `image_grid_thw`.",
)
offsets = image_metadata[active_slot_mask][:, 0]
lengths = image_metadata[active_slot_mask][:, 1]

image_features = tuple(
projected_features[image_idx, offset : offset + length]
for image_idx, (offset, length) in enumerate(zip(offsets.tolist(), lengths.tolist(), strict=True))
)
Member

maybe shorter?

if image_metadata is not None:
    offsets = [...]
    projected_features = tuple([... for offset in offsets])

return pooler_output=projected_features

generated_ids = outputs.sequences[:, inputs["input_ids"].shape[1] :]
generated_text = self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
expected_fragment = "The image is a close-up photograph of a red cross symbol."
assert expected_fragment in generated_text
Member

self.assertListEqual and self.assertEqual are much better for us when checking CI outputs, can you fix this everywhere?

"sum": 92510.4578057677,
"l2_norm": 3490.2146142251,
}
assert logit_stats == expected_logit_stats
Member

torch.allclose would be better, we always get numerical precision diffs due to hardware or package versions
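
For illustration, the tolerant form of the check (tolerances are arbitrary placeholders):

```python
import torch

# `logit_stats` is the dict computed above; compare with a tolerance so small
# numerical drift across hardware or package versions does not fail the test.
actual = torch.tensor([logit_stats["sum"], logit_stats["l2_norm"]])
expected = torch.tensor([92510.4578057677, 3490.2146142251])
self.assertTrue(torch.allclose(actual, expected, rtol=1e-4, atol=1e-2))
```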

Comment on lines +595 to +600
sample_lengths = [single_input["input_ids"].squeeze(0).shape[0] for single_input in single_inputs]
for i, (single_input, batch_ids, single_len) in enumerate(zip(single_inputs, batch_input_ids, sample_lengths)):
single_ids = single_input["input_ids"].squeeze(0)
torch.testing.assert_close(batch_ids[-single_len:], single_ids)

batch_modality_row = batch_inputs["mm_token_type_ids"][i]
Member

This is testing processor outputs, which imo belongs in test_processor_isaac.

Let's move it if we need it; not sure if it is already being tested there. If it's a duplicate or already covered by a similar unit test, we can simply delete it 😅

max_new_tokens = 256
dtype = torch.bfloat16

def setUp(self):
Member

same comments on this IntegrationTest

Comment on lines +47 to +49
BASE_MODEL_ID = os.environ.get("ISAAC_TEST_MODEL_ID", "PerceptronAI/Isaac-0.1-Base")
BASE_MODEL_REVISION = os.environ.get("ISAAC_TEST_MODEL_REVISION", "refs/pr/3") or None
LOCAL_CHECKPOINT = os.environ.get("ISAAC_TEST_MODEL_PATH")
Member

same here
