
Add new model: Isaac #45186

Open
zucchini-nlp wants to merge 96 commits into huggingface:main from
zucchini-nlp:isaac-main

Conversation

@zucchini-nlp
Member

What does this PR do?

Same as #40962, but cleans up the code to match the transformers API. Couldn't test due to errors; the integration test is failing at the moment. The testing file still needs to be cleaned up.

Ideally this should work with the following snippet, so users don't need to load images or add image tokens themselves:

import torch
from transformers import AutoProcessor, IsaacForConditionalGeneration

model_id = "PerceptronAI/Isaac-0.1"
processor = AutoProcessor.from_pretrained(model_id, revision="refs/pr/3")
model = IsaacForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    revision="refs/pr/3",
)


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the two figures and explain what changed."},
            {"type": "image", "path": "first_image.png"},
            {"type": "image", "path": "second_image.png"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)

generated_ids = generated_ids[:, inputs["input_ids"].shape[1] :]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
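Several commit messages below refer to the "placeholder mask" standard and the "canonical approach for scattering image features". As a rough, self-contained sketch of that pattern (all shapes, token ids, and names here are illustrative, not the actual Isaac implementation): image features overwrite the text embeddings at the positions holding the special image token id.

```python
import torch

# Hypothetical sketch of the placeholder-mask scatter used by transformers VLMs:
# image features replace the embeddings at positions equal to the image token id.
hidden = torch.zeros(1, 6, 4)                      # (batch, seq_len, hidden_dim)
input_ids = torch.tensor([[5, 9, 9, 9, 7, 2]])     # 9 = hypothetical image token id
image_token_id = 9
image_features = torch.ones(3, 4)                  # one feature vector per placeholder

mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(hidden)
hidden = hidden.masked_scatter(mask, image_features.to(hidden.dtype))
# positions 1..3 now carry image features; all other positions keep their text embeddings
```

`masked_scatter` fills the masked elements in order from the source tensor, so the number of placeholder tokens must match the number of image feature vectors.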

philippguevorguian and others added 27 commits December 25, 2025 01:42
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix: update imports

* fix: replace removed check_model_inputs with merge_with_config_defaults and capture_outputs

* fix: no capture outputs within capture outputs

* refactor: move isaac vision internals to padded batched flow

* refactor: align isaac vision attention with standard mask interfaces

* refactor: remove packed_inputs from isaac model api and generation path

* chore: purge isaac packing internals and sync modular outputs

* refactor: remove isaac packing pipeline and align with transformers batched attention standards

* refactor: drop final isaac packed compatibility path

* refactor: use OutputRecorder for isaac hidden states

* refactor: remove manual output_attentions handling in isaac model

* refactor: rely on output recorder for isaac attentions

* fix: do not deepcopy text config

* style: remove overly defensive checks

* style: remove unneeded pops

* refactor: simplify pixshuf

* style: drop unused vision_model alias

* wip simplify

* wip simplify 2

* perf: remove device syncs

* test: add isaac pixel shuffle strict invariant characterization

* refactor: make isaac pixel shuffle tensor-only with strict invariants

* chore: regenerate isaac generated files after modular pixel shuffle refactor

* style: drop redundant check

* refactor: simplify config wiring

* refactor: unify multimodal check for input preparation

* refactor: drop now redundant init override

* style: drop unused attention mask flow through pixel shuffle

* style: collapse resize callsite for readability

* style: drop more redundant checks

* refactor: rely on siglip2 for vision attention

* refactor: enforce invariant

* refactor: simplify processor

* fix: add post init call to vision transformer
…hanges, processor post-processing, expanded tests (huggingface#15)


* style: update year

* style: drop redundant sdpa standard

* refactor: drop unneeded decorators when inheriting from pretrained model

* refactor: delegate to siglip init

* refactor: base class does rgb check

* style: no need for positional args

* docs: split up isaac vision transformer docstrings

* style: don't save self attributes unused in forward

* style: drop unneeded image processor identifier

* style: remove explicit setting of now auto-discovered config settings

* style: remove option for positional args

* style: remove self.config; is unused

* style: drop redundant vocab size handling

* style: drop redundant can generate decorator

* style: drop unneeded license

* refactor: rely on _tied_weight_keys attr

* refactor: chat template in init

* style: drop redundant fields

* fix: proper types

* refactor: rely on qwen3vl functionality

* fix: config

* test: rope ids

* fix: get rope working by adapting to qwen3vl properly

* style: drop unused current processor

* fix: remove need for config in processor init

* docs: move IsaacProcessor to auto_docstring

* refactor: WIP big refac

* refactor: explicitly pass vision token components

* style: drop unneeded siglip args

* fix: use is_first_iteration and use_cache to catch edge cases

* feat: remove cache position

* fix: assume inputs are on correct device

* refactor: remove properties

* refactor: move input validation

* refactor: WIP use get_image_features and placeholder_mask standard

* refactor: wip 2 get_image_features and placeholder_mask

* fix: drop unneeded set_input_embeddings

* fix: simplify vision inputs check

* test: multimodal test inputs

* fix: ignore isaac specific keys at text config init

* chore: generated files

* refactor: WIP isolated text model

* fix: wip drop double capture

* fix: return tuple to properly track outputs

* refactor: wip rely on base config

* fix: fix access pattern in projector

* refactor: rely on qwen3 config for num hidden layers

* refactor: stop mirroring hidden size

* test: drop useless test

* style: drop unneeded vocab size setting

* test: don't read from base config

* style: drop custom type

* refactor: move to canonical approach for scattering image features

* chore: post merge re-generation

* refactor: remove cache position

* refactor: drop extra check for generation phase

wip

* refactor: wip rope index change update

* fix: input ids is never None

* style: attention mask is never none

* refactor: inline helper logic

* refactor: delegate attention mask handling to backbone

* refactor: rely on transformers utilities and invariants to batch images

* refactor: derive token type ids from input ids at the very end

* refactor: move image level logic to isaacimageprocessorfast

* refactor: call on batch of images!

* refactor: all image logic in image processor

* fix: drop device handling for processors

* refactor: more image processor isolation

* refactor: operate on expanded text

wip 2

* refactor: reduce verbosity

wip 4

* feat: post processor

* WIP 1: get vision position ids

* refactor: drop needless posid handling

* refactor: drop unneeded error

* refactor: move position id handling to model in get_rope_index

wip 2

* refactor: deduplicate position id logic

* refactor: use tokenizer with left side padding

* refactor: drop virtual dims

* refactor: remove vision image attention mask

* test: post processing

* wip

* wip 2 simplify

* wip 3

* wip 4

* test: image processor tests

* test: use processing common standard for testing processor

* wip 5

* refactor: rely on config for special token attribute tracking

* style: remove rope index arbitrary posargs

* docs: no pixel_values arg exists

* refactor: rely on library standard assumptions

* docs: improve docstrings

* test: use processor utility directly in model test

* wip

* wip 2

* wip 3

* fix: set defaults

* refactor: remove high level vision embed class

* refactor: simplify position id computation branching logic

* chore: artifacts
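Several bullets above describe making pixel shuffle "tensor-only with strict invariants". As a hedged sketch of the common space-to-depth formulation used by other VLMs to shrink the vision token grid (the function name, signature, and shapes here are illustrative, not Isaac's actual code):

```python
import torch

# Illustrative space-to-depth "pixel shuffle": merge each r x r neighborhood of
# vision tokens into the channel dimension, shrinking the token grid by r**2.
def pixel_shuffle(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    b, h, w, c = x.shape                            # (batch, height, width, channels)
    # strict invariant: spatial dims must be divisible by the shuffle ratio
    assert h % ratio == 0 and w % ratio == 0
    x = x.reshape(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)                 # group each r x r neighborhood
    return x.reshape(b, h // ratio, w // ratio, c * ratio * ratio)

out = pixel_shuffle(torch.randn(1, 8, 8, 16), ratio=2)   # (1, 4, 4, 64)
```

Keeping this tensor-only (no Python-side loops or device syncs) is what makes it cheap and compile-friendly, which matches the "remove device syncs" bullets above.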
…huggingface#16)

* fix: use torchvisionbackend

* fix: import IsaacImageProcessor

* fix: resample not interpolation

* style: organize import

* chore: auto processing auto from main

* feat: register isaac image processor according to new convention

* fix: update to new config style

* fix: correct pix2struct import

* docs: initial doc update

* feat: re-register isaac processor to auto

* refactor: move max_position_embeddings to isaac config

* TEMP pop!

* docs: update date

* style: remove removed attr

* style: add config attr for completeness

* style: drop redundant merge_with_config_defaults

* style: remove redundant positions ids handling

* refactor: rely on base class for setting embeddings

* fix: always use full attention

* style: clarify padding logic

* chore: remove stale artifact

* fix: kwargs name!

* refactor: isolate custom padding to image processor pad method

* feat: no device movement

* style: align with transformers standard for loading rope params

* refactor: drop unneeded arg filter

* feat: compile check image presence instead

* docs: add clarifying comment for keeping empty tensors

* refactor: move broadcasting to forward WIP

* style: use new layer validation functionality

* feat: update embedding access mixin to support nested paths!

* chore: convert artifacts

* refactor: inline build batch

* style: drop duplicate test

* test: polygons

* feat: polygon extraction

* test: polygon generation test

* style: align with new config implementation style

* style: add date

* chore: remove all isaac image processor fast

* test: new image processor test setup

* feat: special path for uint8 interp

* test: update image processor test for torchvision backend

* fix: don't mutate nested outputs; copy image index

* style: drop redundant copy

* chore: make fix-repo

* chore: convert artifacts

* chore: fix docstring

* style: update imports

* refactor: unify vision attention mask

test: unified vision attention mask

fix: correct arg name

* fix: standardize on the image attention mask name
…uggingface#19)

* style: remove fast import from modeling test

* refactor: drop image_attention_mask from external interface

* style: drop rescale factor from overall processor attrs; no longer used
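On the "use tokenizer with left side padding" and attention-mask bullets in the commit list above: with left padding, position ids are typically derived from the attention mask so pad tokens don't shift real-token positions. A generic sketch of that common transformers pattern (not necessarily Isaac's exact get_rope_index logic):

```python
import torch

# Generic sketch: derive position ids from the attention mask under left
# padding, so real tokens start at position 0 regardless of pad count.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],     # row 0: two left pads
                               [1, 1, 1, 1, 1]])    # row 1: no padding
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 0)
# row 0 -> [0, 0, 0, 1, 2], row 1 -> [0, 1, 2, 3, 4]
```

Left padding keeps every prompt right-aligned, which is what decoder-only generation needs since new tokens are appended on the right.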
@github-actions
Contributor

github-actions Bot commented Apr 2, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, isaac

@github-actions
Contributor

github-actions Bot commented Apr 2, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45186&sha=a44952
