
Add new model: Isaac #45186

Open
zucchini-nlp wants to merge 96 commits into huggingface:main from
zucchini-nlp:isaac-main

Conversation

@zucchini-nlp
Member

What does this PR do?

Same as #40962, but cleans up the code to match the transformers API. Couldn't test due to errors; the integration test is failing at the moment. The testing file still needs to be cleaned up.

Ideally this should work with the following snippet, so users don't need to load images or add image tokens themselves:

import torch
from transformers import AutoProcessor, IsaacForConditionalGeneration

model_id = "PerceptronAI/Isaac-0.1"
processor = AutoProcessor.from_pretrained(model_id, revision="refs/pr/3")
model = IsaacForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    revision="refs/pr/3",
)


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the two figures and explain what changed."},
            {"type": "image", "path": "first_image.png"},
            {"type": "image", "path": "second_image.png"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)

generated_ids = generated_ids[:, inputs["input_ids"].shape[1] :]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
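Several commit messages below refer to the "placeholder mask" standard and the "canonical approach for scattering image features". As a rough, self-contained sketch of that pattern (all shapes, token ids, and names here are illustrative, not the actual Isaac implementation): image features overwrite the text embeddings at the positions holding the special image token id.

```python
import torch

# Hypothetical sketch of the placeholder-mask scatter used by transformers VLMs:
# image features replace the embeddings at positions equal to the image token id.
hidden = torch.zeros(1, 6, 4)                      # (batch, seq_len, hidden_dim)
input_ids = torch.tensor([[5, 9, 9, 9, 7, 2]])     # 9 = hypothetical image token id
image_token_id = 9
image_features = torch.ones(3, 4)                  # one feature vector per placeholder

mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(hidden)
hidden = hidden.masked_scatter(mask, image_features.to(hidden.dtype))
# positions 1..3 now carry image features; all other positions keep their text embeddings
```

`masked_scatter` fills the masked elements in order from the source tensor, so the number of placeholder tokens must match the number of image feature vectors.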

philippguevorguian and others added 27 commits December 25, 2025 01:42
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix: update imports

* fix: replace removed check_model_inputs with merge_with_config_defaults and capture_outputs

* fix: no capture outputs within capture outputs

* refactor: move isaac vision internals to padded batched flow

* refactor: align isaac vision attention with standard mask interfaces

* refactor: remove packed_inputs from isaac model api and generation path

* chore: purge isaac packing internals and sync modular outputs

* refactor: remove isaac packing pipeline and align with transformers batched attention standards

* refactor: drop final isaac packed compatibility path

* refactor: use OutputRecorder for isaac hidden states

* refactor: remove manual output_attentions handling in isaac model

* refactor: rely on output recorder for isaac attentions

* fix: do not deepcopy text config

* style: remove overly defensive checks

* style: remove unneeded pops

* refactor: simplify pixshuf

* style: drop unused vision_model alias

* wip simplify

* wip simplify 2

* perf: remove device syncs

* test: add isaac pixel shuffle strict invariant characterization

* refactor: make isaac pixel shuffle tensor-only with strict invariants

* chore: regenerate isaac generated files after modular pixel shuffle refactor

* style: drop redundant check

* refactor: simplify config wiring

* refactor: unify multimodal check for input preparation

* refactor: drop now redundant init override

* style: drop unused attention mask flow through pixel shuffle

* style: collapse resize callsite for readability

* style: drop more redundant checks

* refactor: rely on siglip2 for vision attention

* refactor: enforce invariant

* refactor: simplify processor

* fix: add post init call to vision transformer
…hanges, processor post-processing, expanded tests (huggingface#15)


* style: update year

* style: drop redundant sdpa standard

* refactor: drop unneeded decorators when inheriting from pretrained model

* refactor: delegate to siglip init

* refactor: base class does rgb check

* style: no need for positional args

* docs: split up isaac vision transformer docstrings

* style: don't save self attributes unused in forward

* style: drop unneeded image processor identifier

* style: remove explicit setting of now auto-discovered config settings

* style: remove option for positional args

* style: remove self.config; is unused

* style: drop redundant vocab size handling

* style: drop redundant can generate decorator

* style: drop unneeded license

* refactor: rely on _tied_weight_keys attr

* refactor: chat template in init

* style: drop redundant fields

* fix: proper types

* refactor: rely on qwen3vl functionality

* fix: config

* test: rope ids

* fix: get rope working by adapting to qwen3vl properly

* style: drop unused current processor

* fix: remove need for config in processor init

* docs: move IsaacProcessor to auto_docstring

* refactor: WIP big refac

* refactor: explicitly pass vision token components

* style: drop unneeded siglip args

* fix: use is_first_iteration and use_cache to catch edge cases

* feat: remove cache position

* fix: assume inputs are on correct device

* refactor: remove properties

* refactor: move input validation

* refactor: WIP use get_image_features and placeholder_mask standard

* refactor: wip 2 get_image_features and placeholder_mask

* fix: drop unneeded set_input_embeddings

* fix: simplify vision inputs check

* test: multimodal test inputs

* fix: ignore isaac specific keys at text config init

* chore: generated files

* refactor: WIP isolated text model

* fix: wip drop double capture

* fix: return tuple to properly track outputs

* refactor: wip rely on base config

* fix: fix access pattern in projector

* refactor: rely on qwen3 config for num hidden layers

* refactor: stop mirroring hidden size

* test: drop useless test

* style: drop unneeded vocab size setting

* test: don't read from base config

* style: drop custom type

* refactor: move to canonical approach for scattering image features

* chore: post merge re-generation

* refactor: remove cache position

* refactor: drop extra check for generation phase

wip

* refactor: wip rope index change update

* fix: input ids is never None

* style: attention mask is never none

* refactor: inline helper logic

* refactor: delegate attention mask handling to backbone

* refactor: rely on transformers utilities and invariants to batch images

* refactor: derive token type ids from input ids at the very end

* refactor: move image level logic to isaacimageprocessorfast

* refactor: call on batch of images!

* refactor: all image logic in image processor

* fix: drop device handling for processors

* refactor: more image processor isolation

* refactor: operate on expanded text

wip 2

* refactor: reduce verbosity

wip 4

* feat: post processor

* WIP 1: get vision position ids

* refactor: drop needless posid handling

* refactor: drop unneeded error

* refactor: move position id handling to model in get_rope_index

wip 2

* refactor: deduplicate position id logic

* refactor: use tokenizer with left side padding

* refactor: drop virtual dims

* refactor: remove vision image attention mask

* test: post processing

* wip

* wip 2 simplify

* wip 3

* wip 4

* test: image processor tests

* test: use processing common standard for testing processor

* wip 5

* refactor: rely on config for special token attribute tracking

* style: remove rope index arbitrary posargs

* docs: no pixel_values arg exists

* refactor: rely on library standard assumptions

* docs: improve docstrings

* test: use processor utility directly in model test

* wip

* wip 2

* wip 3

* fix: set defaults

* refactor: remove high level vision embed class

* refactor: simplify position id computation branching logic

* chore: artifacts
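Several bullets above describe making pixel shuffle "tensor-only with strict invariants". As a hedged sketch of the common space-to-depth formulation used by other VLMs to shrink the vision token grid (the function name, signature, and shapes here are illustrative, not Isaac's actual code):

```python
import torch

# Illustrative space-to-depth "pixel shuffle": merge each r x r neighborhood of
# vision tokens into the channel dimension, shrinking the token grid by r**2.
def pixel_shuffle(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    b, h, w, c = x.shape                            # (batch, height, width, channels)
    # strict invariant: spatial dims must be divisible by the shuffle ratio
    assert h % ratio == 0 and w % ratio == 0
    x = x.reshape(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)                 # group each r x r neighborhood
    return x.reshape(b, h // ratio, w // ratio, c * ratio * ratio)

out = pixel_shuffle(torch.randn(1, 8, 8, 16), ratio=2)   # (1, 4, 4, 64)
```

Keeping this tensor-only (no Python-side loops or device syncs) is what makes it cheap and compile-friendly, which matches the "remove device syncs" bullets above.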
…huggingface#16)

* fix: use torchvisionbackend

* fix: import IsaacImageProcessor

* fix: resample not interpolation

* style: organize import

* chore: auto processing auto from main

* feat: register isaac image processor according to new convention

* fix: update to new config style

* fix: correct pix2struct import

* docs: initial doc update

* feat: re-register isaac processor to auto

* refactor: move max_position_embeddings to isaac config

* TEMP pop!

* docs: update date

* style: remove removed attr

* style: add config attr for completeness

* style: drop redundant merge_with_config_defaults

* style: remove redundant positions ids handling

* refactor: rely on base class for setting embeddings

* fix: always use full attention

* style: clarify padding logic

* chore: remove stale artifact

* fix: kwargs name!

* refactor: isolate custom padding to image processor pad method

* feat: no device movement

* style: align with transformers standard for loading rope params

* refactor: drop unneeded arg filter

* feat: compile check image presence instead

* docs: add clarifying comment for keeping empty tensors

* refactor: move broadcasting to forward WIP

* style: use new layer validation functionality

* feat: update embedding access mixin to support nested paths!

* chore: convert artifacts

* refactor: inline build batch

* style: drop duplicate test

* test: polygons

* feat: polygon extraction

* test: polygon generation test

* style: align with new config implementation style

* style: add date

* chore: remove all isaac image processor fast

* test: new image processor test setup

* feat: special path for uint8 interp

* test: update image processor test for torchvision backend

* fix: don't mutate nested outputs; copy image index

* style: drop redundant copy

* chore: make fix-repo

* chore: convert artifacts

* chore: fix docstring

* style: update imports

* refactor: unify vision attention mask

test: unified vision attention mask

fix: correct arg name

* fix: standardize on the image attention mask name
…uggingface#19)

* style: remove fast import from modeling test

* refactor: drop image_attention_mask from external interface

* style: drop rescale factor from overall processor attrs; no longer used
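On the "use tokenizer with left side padding" and attention-mask bullets in the commit list above: with left padding, position ids are typically derived from the attention mask so pad tokens don't shift real-token positions. A generic sketch of that common transformers pattern (not necessarily Isaac's exact get_rope_index logic):

```python
import torch

# Generic sketch: derive position ids from the attention mask under left
# padding, so real tokens start at position 0 regardless of pad count.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],     # row 0: two left pads
                               [1, 1, 1, 1, 1]])    # row 1: no padding
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 0)
# row 0 -> [0, 0, 0, 1, 2], row 1 -> [0, 1, 2, 3, 4]
```

Left padding keeps every prompt right-aligned, which is what decoder-only generation needs since new tokens are appended on the right.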
@github-actions
Contributor

github-actions Bot commented Apr 2, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, isaac

@github-actions
Contributor

github-actions Bot commented Apr 2, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45186&sha=a44952
