
Add Molmo #43448

Open
SangbumChoi wants to merge 1169 commits into huggingface:main from SangbumChoi:molmo

Conversation

@SangbumChoi
Contributor

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

sarathc-cerebras and others added 30 commits January 23, 2026 23:09
* adds jais2 model support

* updates tests

* addresses review comment

* review comments addressed

* addresses test review comments

* fixes date

* format issue fix

* Update src/transformers/models/jais2/__init__.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/models/jais2/modular_jais2.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update tests/models/jais2/test_modeling_jais2.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/models/jais2/modular_jais2.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/models/jais2/modular_jais2.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/models/jais2/modular_jais2.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/models/jais2/modular_jais2.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/models/jais2/modular_jais2.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* fixes tests as per review comment

* updates layernorm setup

* Apply suggestions from code review

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* addressed review comments and updated tests as recommended

* fixup tests

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
audio models don't define LM as base model, overwrite explicitly!
* fix processing bugs + add more test cases

* add more image processor tests

* refactor expand_text_with_placeholders

* CI fix
This commit restores Qwen2/3 MoE + GGUF support in Transformers v5.

In this version, handling of MoE tensors is significantly changed, so
support for all MoE + GGUF models ... (okay, only Qwen2/3 MoE models)
from Transformers v4 is now broken.

This commit adopts the new tensor handling, along with an extended
`TensorProcessor` capable of handling not only tensor data
but also tensor mappings. In the process, the Qwen2/3 MoE-specific hack
is moved to `Qwen2MoeTensorProcessor`, making the main function look
more model-agnostic.

This is fully tested on Qwen2 MoE `Qwen1.5-MoE-A2.7B` and partially on
Qwen3 MoE `Qwen3-30B-A3B-Thinking-2507` (due to memory constraints).

Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
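A minimal sketch of the processor pattern described above; the class shape and method signature are assumptions for illustration (the fused expert tensor names follow llama.cpp's GGUF convention), not the library's actual API:

```python
import torch

class TensorProcessor:
    """Base hook: pass tensor data and the tensor-name mapping through unchanged."""
    def process(self, name: str, tensor: torch.Tensor, mapping: dict):
        return [(mapping.get(name, name), tensor)]

class Qwen2MoeTensorProcessor(TensorProcessor):
    """The Qwen2/3 MoE-specific hack lives here, keeping the main loop model-agnostic."""
    def process(self, name, tensor, mapping):
        # GGUF fuses all experts into a single tensor (leading dim = num experts);
        # split it into per-expert weights matching the transformers layout.
        if any(key in name for key in ("ffn_gate_exps", "ffn_up_exps", "ffn_down_exps")):
            prefix = mapping.get(name, name)
            return [(f"{prefix}.{i}.weight", t) for i, t in enumerate(tensor.unbind(0))]
        return super().process(name, tensor, mapping)
```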
…ace#42318)

* Fix: Propagate local_files_only to hub_kwargs and PEFT adapter loading

* Fix conflicts and code style

* Apply formatting fixes

* Apply style fixes

* Fix style: Apply minimal changes for local_files_only

* Style: Revert formatting and finalize local_files_only fix in __init__.py

* Apply style fixes

* Trigger tests

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
fix index

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Add Pixo

* Add Pixo

* Add test

* Add model_doc

* Add model_doc

* modularize

* modularize more

* Add Pixo

* Add Pixo

* Add test

* Add model_doc

* Add model_doc

* Use modular for Pixo

* missing backbone autodoc

* cleanup

* cleanup

* Revise converting

* rename

* rename

* cleanup

* small test update

* address core review comments

* also docs

* fix

* better with the toctree 👀

---------

Co-authored-by: Pablo Montalvo <pablo.montalvo.leroux@gmail.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* Fix tied weight keys sam2 video

* fix edgetam_video

* fix modular sam3_tracker_video
* begin Moe test tensor parallel

* create tiny moe model + fix test tensor parallel Moe

eaeaae

* create tiny moe model + fix test tensor parallel Moe

eaeaae

fix tensor parallel MoE test
fix tensor parallel MoE test

* fix backward pass test in tensor parallel for Dense model (huggingface#42811)

* fix

* linting

* use mixtral instead for testing

* fix dtensor and tensor mismatch

* linting

* checkout test tensor parallel to be like main

* avoid hack and create class instead

* fix loading ep

* add moe test

* now EP inference works again but pass still fails

* Add ColwiseParallelReplicate and RowwiseParallelReplicate classes for replicated layouts

* clean

* eaza

* aeaeaea

* eaeaa

* linting
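For context, a hedged sketch of the stock PyTorch tensor-parallel plan that tests like this exercise; the `ColwiseParallelReplicate`/`RowwiseParallelReplicate` classes added above are PR-specific variants for replicated layouts and are not reimplemented here:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Assumes launch via torchrun with an initialized process group and 8 GPUs.
mesh = init_device_mesh("cuda", (8,))
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
# Shard the up-projection column-wise and the down-projection row-wise, so the
# intermediate activation stays sharded and one all-reduce happens at the end.
model = parallelize_module(model, mesh, {"0": ColwiseParallel(), "2": RowwiseParallel()})
```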
* add wrapper

* fix style

* change the name to get_kernel

* nit
* fix dtype quantizer

* fix

* rm print

* fix

* style

* fix

* revert

* bitnet

* fix

* gogo

* Update src/transformers/modeling_utils.py

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

* warn instead

* fix

* fix

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
…t_embeddings (huggingface#42558)

* add embedding getter

* modify your own logic

* a common test

* some adapters are not PreTrainedModel s

* few fixes

* implement correct-ish fix?

* fixup

* this is needed likely

* woops

* solving some cross-imports issues here and there

* more cross-import issues

* finally

* revert changes

* fixups

* improve message

* add common tests for input_ids first

* increase test coverage

* bigger update for GC

* copies

* mlcd is getting on my nerves a bit

* ah yes

* for BC

* break a couple modelings

* simplify with base_model (see the sketch after this entry)

* fix copies for torch checkpointing

* simplify this model

* improve messages
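A hedged sketch of the base_model delegation pattern referenced in the entries above; attribute names are illustrative and the actual transformers implementation differs in details:

```python
import torch.nn as nn

class PreTrainedModelSketch(nn.Module):
    base_model_prefix = "model"

    @property
    def base_model(self) -> nn.Module:
        # The wrapped backbone if one exists under `base_model_prefix`, else self.
        return getattr(self, self.base_model_prefix, self)

    def get_input_embeddings(self) -> nn.Module:
        base_model = self.base_model
        if base_model is not self:
            # Head models (e.g. *ForCausalLM) delegate to their backbone.
            return base_model.get_input_embeddings()
        raise NotImplementedError(
            f"`get_input_embeddings` is not implemented for {self.__class__.__name__}; "
            "override it in your model."
        )
```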
* rm ipex and ccl on cpu training doc

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* Update docs/source/en/perf_train_cpu_many.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs: Squared ReLU paper fix

* fix: other papers with versioning in URL
* weight converter draft

* fix

* feedback

* update
* draft

* ep

* toctree

* warning
* shard size

* feedback

* add link

* clarify

* update link
* optimum

* assisted decoding

* fix

* feedback

* toctree
…uto-skip non array conversion) (huggingface#42884)

* improve .to() to work on lists or nested lists of tensors, automatically skipping conversion of non-array structures

* consider an empty list or nested list as array-like
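A minimal sketch of the described behavior, not the exact implementation: recurse into (possibly nested) lists, convert tensors, and pass everything else through untouched:

```python
import torch

def to_recursive(obj, *args, **kwargs):
    if isinstance(obj, torch.Tensor):
        return obj.to(*args, **kwargs)
    if isinstance(obj, (list, tuple)):
        # An empty list is still treated as array-like and rebuilt as-is.
        return type(obj)(to_recursive(x, *args, **kwargs) for x in obj)
    return obj  # non-array structure: skip conversion

batch = {"ids": [torch.ones(2), [torch.zeros(3)]], "meta": "keep-me"}
moved = {k: to_recursive(v, "cpu") for k, v in batch.items()}
```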
…ngface#42904)

* Document new default shard size + dropped unsafe serialization

* a

* Update MIGRATION_GUIDE_V5.md

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* this way better maybe?

* delete legacy from bart and mvp

* import not found

* fix some tests

* fix more tests

* revert smth to run tests again

* I thought I fixed it already, but there were more models

* commit and check tests, clean-up later

* assisted decoding should work now

* docs and whisper

* fix a few more tests

* no circular import errors pls

* wording

* add a test for defaults following TRL example

* nit

* Update src/transformers/configuration_utils.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/generation/candidate_generator.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/generation/utils.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* comments

* final fix tests

* more comments

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Allow block sharing in hybrid architectures

* nit and style

* Better docstring for mark_shareable_blocks_as_complete
refactor tests

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* clean

* int

* check

* better

* working

* remove unrelated stuff

* rm print

* torchao

* Fix

* added

* fix quanto

* revert

* reverted

* rm comment

* fix
…lity (huggingface#42431)

* rewrite to improve its usability

* rewrite to improve its usability

* Clarify comment about function parameter elements

* Update implementation of _process_parameter_type

* rewrite to improve its usability

* Clarify comment about function parameter elements

* reformat it a little bit

* reformat it a little bit

* used a wrong ruff version..... this one should be good

* update the string manipulation

* Trying for more consistency

* make fixup

* Try another approach

* Don't include "None" in the out_str when we're already setting optional

* Add some new-style types to GPT-J to see what happens

* Correct use of UnionType

* make fixup

* Add a little snarky comment about typing just because

* Correctly return the same strings as the old function

* Drop unnecessary args

* Remove redundant args information

* add one more elif statement to deal with the case when type hint is None

* add if statement to handle the parameter with default value

* Revert GPT-J changes

* Trigger tests

---------

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
* start

* all until clvp

* all until gpt2

* until lfm2_moe

* all until seamless

* finally all first batch

* style

* Copied from

* apply modulars

* small fixes

* add test

* name

* fix

* more

* fix typos

* more

* fix

* typo

* fix

* revert annoying dates auto change....

* fixes

* fix

* fix

* oupsi

* fixes

* start more fixes

* fix

* add norm buffers

* modular

* improve

* copies

* fixes

* fix advanced rope modules

* more and more

* improve error

* fix

* fixes

* fix

* fixes

* post rebase

* last fix

* really last fix

* stupid layoutlm2 with its external lib

* stupid layoutlmv2 finally....

* create functions
oesni and others added 26 commits January 23, 2026 23:09
* feat: implement solar-open-100b

* feat: update modeling_solar_open.py

* feat: update solar-open config

* chore: apply style

* feat: remove _tied_weights_keys

* feat: update modeling code

* chore: remove speech_to_text_2 in modeling

* docs: solar_open model

* test: solar open model

* chore: re-convert modular

* fix: remove require_read_token

* Apply suggestion from @vasqu

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* chore: update license year -> 2026

* feat: add solar_open to tokenizer mapping

* chore: update license year

* test: remove _torch_compile_train_cls

* docs: update solar_open doc

* refactor: simplify SolarOpenDecoderLayer

* refactor: inherit Glm4MoeConfig class

* fix: handle head_dim properly

* chore: apply style

* fix: default parameters

* test: use tiny dummy model

* update expectations and switch to eager moe (no fluctuations per grouped_mm / batched_mm)

* chore: remove trust_remote_code (suggestion from @vasqu)

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Update src/transformers/models/solar_open/modular_solar_open.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* chore: update config docstring

* chore: add partial_rotary_factor workaround comment

* test: check default config values in test_modeling_solar_open.py

* fix: config class interface

* docs: add SolarOpen to doctree

* docs: update dates

* Revert "feat: add solar_open to tokenizer mapping"

This reverts commit 038b1c1.

* feat: remove unnecessary configs

* test: update SolarOpenConfig tests

* fix: attention_dropout issue on training

* Revert "feat: remove unnecessary configs"

This reverts commit 9023688.

* Revert "fix: attention_dropout issue on training"

This reverts commit 3c275dc.

* Revert "Revert "feat: remove unnecessary configs""

This reverts commit e6adcd9.

* Revert "Revert "fix: attention_dropout issue on training""

This reverts commit 573fa9a.

* feat: inherit attention from Llama

* fix: remove del for attention_bias and attention_dropout

* chore: convert solar_open

* fix date

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
* change F to TVF

* use import instead of from

* Change TVF to tvF

* nit
…ingface#43205)

* support any capturing groups in reverse mapping

* define utils and fix test_reverse_loading_mapping

* Fix test_reverse_loading_mapping

* fix non deterministic behavior.

* nit
…ace#42317)

* make vlms export friendly

* seq2seq lms

* biogpt

* more vlms

* colqwen2

* vision models

* more vlms

* more vlms

* more vlms

* vectorized vision embedding

* fixup

* more vlms

* more vlms

* generate_masks_with_special_tokens_and_transfer_map

* custom torch_check

* use custom torch_check

* revert grounding dino changes

* fixup

* remove file

* undo

* undo

* testing

* fixes

* standard error message

* use torch._check_with to raise a ValueError instead of torch._check's RuntimeError (guard pattern sketched after this entry)

* fix recurrent gemma

* only itemize tensors

* use spatial shapes list instead of tensor

* fix udop use_cache default value

* use tracable condition for seq2seq lms

* make smolvlm exportable

* fix fastvlm and t5gemma2

* fix qwen2_audio and idefics

* remove script

* tbc

* skip mra model

* helper

* style and document

* fix

* set experts impl to batched

* make xmod exportable and efficient

* make more ssms exportable

* fix

* revert recurrent gemma

* skip models that use chunked attention or rope_index

* qwen3_next

* assert async

* tensorize (mm) grounding dino mask generation

* style

* fix repo

* address comments

* fix qwen2 audio and vits checks

* skip two models using kernels by default

* skip granite moe hybrid using custom kernels

* disable mamba kernels

* vits splinter and videomae
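A minimal sketch of the export-friendly guard pattern these entries refer to, assuming a hypothetical `take_valid` helper; `torch._check` is a private PyTorch API and the PR's actual call sites differ:

```python
import torch

def take_valid(scores: torch.Tensor, n_valid: torch.Tensor) -> torch.Tensor:
    n = n_valid.item()  # data-dependent value that would normally break tracing
    # torch._check records the bounds so torch.export can reason about the
    # unbacked size instead of failing on the data-dependent slice below.
    torch._check(n >= 0)
    torch._check(n <= scores.shape[-1])
    return scores[..., :n]
```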
* experts impl gpt oss

* no need to transpose dequantized experts

* skip test_reverse_loading_mapping

* fix custom gating

* revert transposition and simply support transposed experts to avoid modifying eager

* style

* don't rely on weight shapes as they can be square matrices

* no need to reload

* fallback to eager

* Update src/transformers/models/gpt_oss/modeling_gpt_oss.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix

* force 16-byte alignment during weight loading

* simplify logic

* quantization conversions should be applied first

* avoid baddbmm as it is less performant / less optimizable by max-autotune

* no need for logger

* add comment explaining limitation

* standardize operations and only reshape when needed

* fixup conversion and test

* Update src/transformers/conversion_mapping.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* force alignment docstring

* move default apply gate

* offsets

* add docs and make kernel_config optional

* use reshapes as they are equivalent to views when memory is contiguous (see the sketch after this entry)

* fix and better notes

* reshapes instead of views

* keep model saving and reloading in grouped_mm test to catch misalignment issues

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
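An illustration of the reshape-vs-view note above: `reshape()` returns a view when the input is contiguous (no copy) and silently copies when it is not, while `view()` would raise:

```python
import torch

x = torch.arange(12).reshape(3, 4)
v = x.reshape(4, 3)                # contiguous input: a view, shares storage
assert v.data_ptr() == x.data_ptr()

t = x.t()                          # transpose: non-contiguous
c = t.reshape(12)                  # t.view(12) would raise here; reshape() copies
assert c.data_ptr() != t.data_ptr()
```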
…uggingface#43281)

Signed-off-by: Tcc0403 <76503978+Tcc0403@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Commit 38e5987 from PR huggingface#43194 introduced a minor bug that caused a
deprecation warning to be issued regardless of whether the user was using the deprecated interface.

For example:
```
model.generate(**input_ids, do_sample=False)
```
would issue a deprecation warning about passing a `GenerationConfig` together with generation kwargs,
even though no `GenerationConfig` is being passed.

Along with the fix, this also introduces a unit test covering both the intended behavior and the
bug.

Co-authored-by: nemo <git@ningu.net>
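A hedged sketch of the fixed guard, with illustrative names: the warning should fire only when a `GenerationConfig` was actually passed alongside ad-hoc generation kwargs:

```python
import warnings
from transformers import GenerationConfig

def _warn_if_mixed(generation_config: GenerationConfig | None, generation_kwargs: dict) -> None:
    # Before the fix, the warning fired even when generation_config was None.
    if generation_config is not None and generation_kwargs:
        warnings.warn(
            "Passing a GenerationConfig together with generation kwargs is deprecated.",
            FutureWarning,
        )
```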
* Fix typo in docstring for `Sam3TrackerConfig`

* Fix `Sam2Config` typo in `configuration_sam2.py`

* Run `make fix-repo`
Fix failing ChameleonIntegrationTests
Fix failing recurrent_gemma tests
* Fix: adding pad_token_id in Qwen3VLTextConfig

* Fix: adding pad_token_id in Qwen3VLTextConfig

* updated the docstring with pad_token_id

* updated the docstring with pad_token_id

* added test nested in Qwen3VLModelTest for missing pad_token_id

* Updated pad_token_id config and removed the tests
* fix: cast memory attention inputs to inference session dtype

* chore: fix formatting

* add fix and tests

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
…3219)

* Fix tokenizer auto_map being ignored for custom models (huggingface#43202)

PR huggingface#42894 added an early exit to TokenizersBackend when tokenizer_class
doesn't match the registered tokenizer for a model_type. However, this
early exit was placed before the auto_map check, causing custom tokenizers
with trust_remote_code to be ignored.

This fix moves the auto_map extraction before the early-exit check and adds
a `tokenizer_auto_map is None` condition, so models with custom tokenizers
properly use the dynamic module loading path (sketched below).

* style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
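A control-flow-only sketch of the reordering described above (the real function has many more branches); names are illustrative: `auto_map` must be read before any early exit so that `trust_remote_code` tokenizers still reach the dynamic loading path:

```python
def resolve_tokenizer_class(config: dict, registered_class: str | None):
    tokenizer_class = config.get("tokenizer_class")
    # Fix: extract auto_map *before* the early exit, not after it.
    tokenizer_auto_map = config.get("auto_map", {}).get("AutoTokenizer")

    # Early exit only when there is no custom tokenizer to load remotely.
    if (
        registered_class is not None
        and tokenizer_class != registered_class
        and tokenizer_auto_map is None
    ):
        return registered_class
    if tokenizer_auto_map is not None:
        return f"dynamic:{tokenizer_auto_map}"  # dynamic module loading path
    return tokenizer_class
```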
…Bicubic) (huggingface#43017)

* Fix default interpolation for PVT (Fast and Python) to BICUBIC

* Polish code and trigger CI restart

* Fix docstring to match PILImageResampling.BICUBIC

* Remove Copied from tag and finalize BICUBIC fix

* Apply make style after removing comment

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* Fix Mamba2ForCausalLM weight tying

Add _tied_weights_keys mapping to enable proper weight tying when
tie_word_embeddings=True. This is the standard pattern used by
MambaForCausalLM, GPT2, LLaMA, and other models.

Fixes huggingface#43206

* Enable weight tying in Mamba2ModelTester for regression testing

* Add explicit regression test for Mamba2 weight tying

Replace ModelTester default with explicit test per reviewer feedback.
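A hedged sketch of the declaration this fix adds, not the exact transformers source; the attribute has been a plain list in v4 and a mapping in newer code, and the key names here just follow the usual Mamba2 module layout:

```python
import torch.nn as nn

class Mamba2ForCausalLMSketch(nn.Module):
    # Maps the tied parameter to its source, so tie_weights() and save/load
    # treat lm_head and the input embeddings as one parameter when
    # config.tie_word_embeddings is True.
    _tied_weights_keys = {"lm_head.weight": "backbone.embeddings.weight"}
```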
…gingface#43418)

* Fix in-place modification of inputs_embeds in Kosmos-2.5 forward

* Removed trailing space
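A sketch of the class of bug fixed here, with a hypothetical helper: writing image features into `inputs_embeds` in place mutates the caller's tensor, so the fix clones first:

```python
import torch

def merge_image_features(inputs_embeds, image_embeds, image_token_mask):
    inputs_embeds = inputs_embeds.clone()  # fix: was previously mutated in place
    # Scatter image features into the positions flagged by the boolean mask.
    inputs_embeds[image_token_mask] = image_embeds.to(inputs_embeds.dtype)
    return inputs_embeds
```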
…gface#43432)

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* ok ...

* ok ...

* ok ...

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…rn `BaseModelOutputWithPooling` (huggingface#42564)

* Add return_dict to get_text_features methods to allow returning 'BaseModelOutputWithPooling'

Added to all architectures except blip-2, which has a much different structure here. It uses 'Blip2TextModelWithProjection' to get these embeddings/features, but this class isn't as simple to use

* Add return_dict to get_image_features methods to allow returning 'BaseModelOutputWithPooling'

Well, the architectures supporting get_image_features are all extremely different, with wildly different outputs for the get_image_features methods. 2d outputs, 3d outputs, lists of 2d outputs (due to non-matching shapes), existing 'return_attentions' resulting in returning 2-tuple, existing 'return_dict' resulting in returning 3-tuples (???), high quality image embeddings, low quality image embeddings, deepstack image embeddings, etc. etc. etc.

And I only went through like 70-80% of all architectures with get_image_features before I gave up.

Standardisation of all of these sounds like a lost cause.

* make fixup

* Ignore discrepancies for pooler_output, focus on last_hidden_state

* Update get_image_features for the missing architectures

* Update all get_audio_features

* Update get_video_features, except instructblipvideo

Should be fine though, as that  'get_video_features' doesn't live on the AutoModel class, but the AutoModelForConditionalGeneration class

* Run ruff formatting

* Patch Glm4v VisionModel forward with BaseModelOutputWithPooling

* Patch instructblip, although backwards incompatibility stands

* Patch Kosmos2 and Ovis2

* Reformat Ovis2

* Avoid now-deprecated return_attentions

* Remove NumFrames

* Proposal to simplify get_..._features via TransformersKwargs & check_model_inputs

The changes in check_model_inputs aren't the clearest/prettiest, but they work well for now.

* Revert check_model_inputs, adopt can_return_tuple, accept BC on get_..._features methods

This commit updates all get_text_features methods, even blip_2, which was previously not yet attempted

* Fix typo: can_return_dict -> can_return_tuple

* Adopt can_return_tuple for many get_image_features

A handful of outliers that aren't updated yet, e.g. if there's 2+ ModelOutput classes that are viable, or the vq-based ones

For context, the other modeling file classes haven't been updated with the new get_..._features format, nor have the tests

* Update all get_audio_features, some edge cases handled (e.g. gemma3n)

* Update most get_video_features,  some edge case remain, e.g. instructblipvideo

* Patch Fuyu, just return BaseModelOutputWithPooling without pooler

The Fuyu architecture doesn't have an image encoder:
> Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder.

* Introduce ModelOutput subclass for Chameleon, patch get_image_features

* Update modeling files with new output formats for get_..._features

* Update fast_vlm modeling forward from modular llava to remove image_sizes

* Update colqwen2 its self.vlm.model.visual call to expect BaseModelOutput

* Replace prior return_dict with check_model_inputs on qwen2_5_vl its VisionTransformer

* Use BaseModelOutputWithProjectionAttentions for Kosmos2 to allow returning the projection attentions

* Update Emu akin to Chameleon

* Update the blip architectures with a naive fix

A better solution might be to remove the qformer etc. calls from the get_image/video_features and run those separately in the forward methods.

* Convert remaining modulars (emu3, janus), patch emu3

* Patch blip test

* Update deepseek_vl using a new BaseModelOutputWithHighResVisionEncodings

* Remove 'copied' for blip_2, instructblip and kosmos2 as they required custom changes

* Patch qwen3_vl and qwen3_vl_moe, where I used last_hidden_state instead of pooler_output

* Run repo-consistency

* Use kwargs["output_hidden_states"] = True to hardcode output_hidden_states where needed

* Update new GlmAsr get_audio_features on ForConditionalGeneration

* Run make style

* Try to add _can_record_outputs to florence2

* Override JanusVisionModel.forward to avoid bad q-former copy from Blip2

* Import missing BaseModelOutput

* Pop deprecated 'return_attentions', setting 'return_dict' won't be useful iiuc

* Reintroduce kwargs filtering in llava etc. for safety re. image_sizes

We also don't need to incorporate code cleanup etc. in this PR, we should keep it as minimal as possible and leave these kinds of lines intact.

* Use BaseModelOutputWithPooling superclass consistently for custom get_..._features outputs

* Update Blip-2 family and its BaseModelOutputWithVisionQformerOutputs

To use both a vision_outputs and qformer_outputs as keys in the BaseModelOutputWithPooling subclass, despite some duplication.

* Update glm4v _can_record_outputs

* Remove check_model_inputs in granite_speech

I could also use can_return_tuple, but this might be problematic if `return_dict=False` in the config

* Run make style

* Add _can_record_outputs to Ovis2VisionModel

* Update get_text_features/get_video_features from pe_video

* Update missing case on sam3

* Update get_text_features type hints to Union[tuple, BaseModelOutputWithPooling]

Blip-2 and Clvp are the only exceptions

* Add _can_record_inputs to qwen2_5_omni and qwen2_5_vl

* Update get_image_features and get_video_features on ernie4_5_vl_moe

Can we even use BaseModelOutputWithPooling for these? It's a MoE model

* Update get_image_features type hints to Union[tuple, BaseModelOutputWithPooling]

With a handful of exceptions

* Remove @auto_docstring from pe_video, it's seemingly not used on that arch

(or well documented)

* Update get_video_features type hints to Union[tuple, BaseModelOutputWithPooling]

Only exceptions for BaseModelOutputWithDeepstackFeatures

* Fix pe_video import issue

* Update forward, test, and docstring for sam3

* Update get_audio_features type hints to Union[tuple, BaseModelOutputWithPooling]

Also update BaseModelOutput to BaseModelOutputWithPooling in several places, leaving room for a potential pooled embedding to be computed by get_audio_features

* Add simple test case for get_text_features

Fails on CLIP, MetaCLIP, Siglip, Siglip2 as they use 'self.text_model = text_model.text_model', bypassing the TextModel that has `check_model_inputs` cc @zucchini-nlp related to huggingface#42564

* First attempt to get get_image_features under test, still 26 failures

* Resolve several test failures, progress still slow and inconsistent

* Split up get_..._features tests more, should be simpler to disable/customize specific parts per arch

* Fix emu3 tests, also track non-temporal ResNet in hidden_states

* Patch chameleon, emu3, ernie4_5, janus

* Skip output_attentions for FastVLM, timm doesn't accept it

But I'm not sure how to handle the output_hidden_states case

* Patch groupvit, instructblip, ovis2

plus style

* Patch paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, and skip test for perception_lm

perception_lm is still problematic with output_hidden_states, akin to fast_vlm

* Patch qwen3_omni_moe, sam family, edgetam

P.s. edgetam had incorrect _can_record_outputs

Now, all issues that remain with get_image_features are due to 1) CLIP family issue and 2) unclarity with expected output_hidden_states for timm-based models

* Kill now unused BaseModelOutputWithFeatureMaps

* Remove left-over return_dict from prior attempt

* Allow for output_hidden_states in theory, but skip impossible tests

The tests are failing as edgetam doesn't output hidden_states. It used to, because of a broken TimmWrapper in _can_return_outputs.

* Introduce tests for get_audio_features, fixed all architectures

* Introduce tests for get_video_features, only ernie4_5_vl_moe is failing

It's failing as the split_sizes gets made too small, such that the video_embeds doesn't sum to the split_sizes anymore. I'm not sure how to best tackle it.

I also removed the get_video_features from PaddleOCR_vl, as I don't think it's meant to be used with video

* Call post_init on GraniteSpeechCTCEncoder, which was given a PreTrainedModel subclass

* Update llava_onevision test suite, only create video pixel_values in new method

Instead of in the common one, as that negatively affects other tests (as there are no video tokens in the input_ids then)

* Create custom video input for ernie4_5_vl_moe

* Skip CLIP family tests; they don't support output_hidden_states/output_attentions due to bug

* Breaking: update Blip2Model.get_text_features to no longer output logits

* Satisfy test_num_layers_is_small test for align

* Test against last_hidden_state against batch_size and hidden_size

19 failures, mostly if architectures merge the first dimension with e.g. num_frames for videos, or swap dimensions from the norm with the hidden_state at index 1 in a 4d-tensor

I don't think it's reasonable to expect these to be 'fixed', they would require drastic changes in the architectures or somewhat arbitrary changes in the post-processing of the hidden states.

* Skip last_hidden_state shape tests for unusual cases

E.g. when batch_size is merged with num_frames or num_patches, or hidden_size is in index -3 instead of index -1

* Update docstrings via auto_docstring for all get_..._features methods

Also add to e.g. aria.md to ensure that get_..._features methods are documented

* Ensure all auto_doc arguments are documented

* Remove redundant docstrings

* Also patch the new glm_image for get_image_features/output_hidden_states

* Update modular files as per check_docstring rules ...

... to avoid modular/check_docstring conflicts. Modular would propagate changes from modular to modeling files, and then check_docstring would complain and update the modeling files only. This created an unstable state where one of the two scripts was unhappy. I resolved this by manually tracking down the check_docstring issues in the modular files.

* Update glm-image dates via fix-repo

* FloatTensor -> LongTensor for image_tokens

* Add simple last_hidden_state description, fix output typing of Gemma3nAudioEncoder.forward

* Add missing `-> tuple | BaseModel...` on check_model_inputs

Using ``check_model_inputs[^\n]*\n\s*def forward\([^\)]*\):``

* Ensure forward typing with check_model_inputs is `-> tuple | BaseModel...`

Using ``check_model_inputs[^\n]*\n\s*def forward\([^\)]+\) -> (?!tuple | )``

* Undo accidental rename of Ovis2VisionAttention

* Fix incorrect type hints for blip family

* Patch get_image_features for lighton_ocr

* Explicitly use Ovis2VisionAttention in Ovis2VisionEncoderLayer in modular

* Update use of get_image_features for lighton_ocr

Forgot to run tests to verify that it worked, oops

* Rerun python utils/add_dates.py

Not sure which script removed the date... :/

* Remove tie_last_hidden_states=False from check_model_inputs from ...

forward methods that previously did not return a BaseModelOutput

* Revert accidental metaclip import change

* Add missing return_dict=True in get_..._features methods

* Add `output_hidden_states=True` in InternVL get_image_features

Only if needed

* Add missing docstring for llava_next_video get_video_features

* Quick clean-up in _video_features_prepare_config_and_inputs test helper

* model.set_attn_implementation instead of config._attn_implementation

Note:  There's about ~10 other places that use config._attn_implementation in this test file alone

* Add simple docstring to some helper methods re. inputs.

It's not extremely useful I think, as it has to be somewhat generic due to the large differences in the architectures

* Explain why get_..._features test inputs are overridden

* Undo incorrect return_dict=True change in deepseek_vl_hybrid

I added return_dict to get_low_res_image_features and get_high_res_image_features calls, but these methods already set return_dict automatically

* Revert accidental metaclip import change

* Adopt **vision_outputs in instructblip, but mess remains

* Avoid kwargs["output_hidden_states"] = True in get_..._features methods

* Update check_model_inputs to default vision args based on config

* Unrelated but important: patch set_attn_implementation for Windows

idem with set_experts_implementation

* Revert output_hidden_states changes on InternVL

On this architecture, it seems cleaner to go the `kwargs["output_hidden_states"] = True` route, as a simple `output_hidden_states=vision_feature_layer != -1` prevents setting the `output_hidden_states` to True if requested for downstream use.

* Extend d9001cc (check_model_inputs); remove more vision_feature_layer defaulting

* Patch unusual bug: llava_next_video used self.vision_feature_layer

Doesn't seem like this was being used elsewhere, so I can just update it to use the local variant like elsewhere

* Add unused use_cache to TimmWrapperModel to patch FastVLM

FastVLM now forwards this argument due to the check_model_inputs, and TimmWrapper can't use it

* Update check_config_attributes to allow for vision attributes

And rerun fix-repo

* Add tests for config.return_dict=False

Also; siglip had "nested" check_model_inputs: the VisionModel and VisionTransformer (below it) both used `check_model_inputs`. This means that the VisionModel.forward eats the 'return_dict=True', and the lower VisionTransformer.forward its `check_model_inputs` uses the config.return_dict=False to turn the output to a tuple.

The siglip/clip/metaclip family is still broken due to the `text_model = text_model.text_model` bypassing the class with the `check_model_inputs`.

* permute and quantize separately for the comment

* Ditch shared custom_args for ernie4_5_vl_moe

* Move Ernie4_5_VL_MoeVisionAttention next to VisionBlock

* Add missing "attentions" from Florence2 _can_record_outputs

* Clarify kwargs.get("image_sizes") in modeling_llava

* Remove commented skip_test_image_features_output_shape in chameleon tests

* Add a migration guide under 'Library-wide changes with lesser impact'

* Parameterize get_..._features tests  with return_dict (True, False, None)

* Add comment re. TimmWrapper _can_record_outputs

* Shrink Gemma3nAudioEncoderModelOutput with auto_docstring & superclass

* Revert "Unrelated but important: patch set_attn_implementation for Windows"

This reverts commit 0923216.
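A usage sketch of the end state this series converges on: the `get_..._features` helpers can return a full `BaseModelOutputWithPooling` instead of a bare tensor. The model id is only an example, and per the notes above some families (CLIP/SigLIP) were still being fixed when the tests landed:

```python
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("kakaobrain/align-base")
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")

inputs = processor(text=["a photo of a cat"], return_tensors="pt")
out = model.get_text_features(**inputs, return_dict=True)  # BaseModelOutputWithPooling
pooled, hidden = out.pooler_output, out.last_hidden_state
```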
* setup, workflow and utils files

* src core files

* use get_session instead of httpx

* fix code quality - remove unused imports

* utils and prevous-http-used and cli

* style changes

* use httpx for non-HF images

* get_session for HF images

* change docstrings

* revert formatting - 1

* revert formatting - 2

* docstrings fixes

* revert formatting - 3

* docstring fixes

* fix failing test

* Update src/transformers/pipelines/image_to_image.py

* Update src/transformers/testing_utils.py

* fix modular generation check

* fix consistency

* Apply suggestions from code review

* Update src/transformers/models/idefics2/modeling_idefics2.py

* Update src/transformers/models/idefics3/modeling_idefics3.py

---------

Co-authored-by: Lucain <lucainp@gmail.com>
don't fail

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Patch set_attn_implementation for Windows

idem with set_experts_implementation

* Extend the fix to _can_set_experts_implementation
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, molmo

@merveenoyan
Contributor

@zucchini-nlp would you like to take over this one?
