Add Molmo2 by SangbumChoi · Pull Request #43451 · huggingface/transformers

SangbumChoi · 2026-01-23T14:47:55Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Adds AllenAI Molmo2 multimodal VLM to transformers, supporting: - Molmo2ForConditionalGeneration (image+video+text → text) - Molmo2TextModel / Molmo2TextForCausalLM (text-only) - Molmo2ImageProcessor and Molmo2VideoProcessor - Molmo2Processor Key implementation details: - Uses is_first_iteration (v5 API) for prepare_inputs_for_generation - Custom Molmo2Embedding with embedding + new_embedding parameters - Vision backbone with pooling adapter and multi-layer ViT features - Dynamic full cache support for generation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@strict

…odel_prefix - Replace einops.rearrange with native numpy reshape+transpose+reshape - Add @strict decorator to all 4 config classes (Molmo2VitConfig, Molmo2AdapterConfig, Molmo2TextConfig, Molmo2Config) to satisfy TRF010 - Set Molmo2Model.base_model_prefix = "model" (was empty, violating TRF002) - Fix image_mean/image_std mutable shared list (copy constants on init) - Fix test_image_processing: use image_processing_class instead of image_processor_list; skip CHW torch and 4-channel unsupported tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Re-sort _toctree.yml to place Molmo2 after mllama alphabetically - Add None guard in test_video_processor_from_dict_with_kwargs to skip when fast_video_processing_class is not defined Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Molmo2TextModel is an internal sub-component used by Molmo2Model and Molmo2ForConditionalGeneration and is tested implicitly through those. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

requests is not part of the standard library and caused ImportError in minimal environments (e.g. HuggingFace Jobs). Use urllib.request instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Molmo2's processor has several behaviors that are incompatible with the default ProcessorTesterMixin assumptions: - Chat template enforces strict user/assistant alternation (no system role) - Processor inserts BOS token, shifting sequence length by 1 - Image processor patchifies output, so rescale_factor passthrough fails - Video processor requires FPS metadata not provided by base tests - Hub processor_config.json contains auto_map not preserved in save/load Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add @auto_docstring(checkpoint="allenai/Molmo2-8B") decorator to Molmo2TextConfig and Molmo2Config with custom_args for documenting non-standard parameters. This fixes check_config_docstrings CI check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… date Add parameter docstrings to Molmo2TextConfig and Molmo2Config __init__ methods so @strict-wrapped classes pass config docstring CI checks. Update model doc date to 2026-03-28. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move top-level `import torch` and `import torchvision.transforms` behind `is_torch_available()` / `is_torchvision_available()` guards in both image and video processors to prevent ModuleNotFoundError when torchvision is not installed. Also skip test_kwargs_overrides_default_image_processor_kwargs since Molmo2's patchifying image processor doesn't support rescale_factor passthrough. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Convert all absolute imports (from transformers.xxx) to relative imports (from ...xxx) in image_processing, video_processing, and processing modules to match the convention used by all other in-library models. Remove register_for_auto_class() calls which are only needed for custom hub models and were causing dynamic_module_utils to incorrectly scan local files for relative imports during save_pretrained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…n_available The processor's top-level imports from image_processing_molmo2 and video_processing_molmo2 pull in PILImageResampling which requires PIL. Guard these imports with is_vision_available() so `from transformers import *` works when only torch is installed (no PIL/torchvision). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…L imports Move Molmo2ImagesKwargs and Molmo2VideosKwargs definitions directly into processing_molmo2.py instead of importing them from image/video processor modules which require PIL. Also remove Molmo2ImageProcessor/VideoProcessor type hints from __init__ to avoid NameError when vision is unavailable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SangbumChoi · 2026-03-29T01:49:56Z

@molbap Hi I am still working on it since I have to make some example visualizer for this and (most of the code is generated by Claude code). However, you can start review this with brief level of code review! cc. @merveenoyan

Add integration tests for Molmo2-8B covering: - Image generation with exact expected text verification - Video QA (penguin identification) - Video pointing (coordinate output) - Multi-image comparison All expected values derived from actual model inference on A10G. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zucchini-nlp

Hey @SangbumChoi

Great model to add to Transformers. After reviewing I see that using modular would be much better since a lot of part are copy-paste from different models. I left comments on each class about where it can be copied from. Apart from that, there are a few places where we need to clean up and align API with the rest of VLMs for consistency

If you have q, ping me on slack. I will unsubscribe myself from this PR to not get notif about each commit, so when you want another review ping me again by @

zucchini-nlp · 2026-04-02T12:40:23Z

+    r"""
+    This is the configuration class to store the configuration of a [`Molmo2VisionTransformer`].
+    It is used to instantiate a `Molmo2VisionTransformer` according to the specified arguments,
+    defining the model architecture.
+
+    Configuration objects inherit from [`PreTrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PreTrainedConfig`] for more information.
+
+    Example:
+    ```python
+    >>> from transformers import Molmo2VitConfig, Molmo2VisionTransformer
+


let's use @auto_doctring for configs

zucchini-nlp · 2026-04-02T14:33:34Z

+if is_torch_available():
+    pass


dummy import

zucchini-nlp · 2026-04-02T14:34:06Z

+    # =====================================================================
+    # Molmo2 chat template enforces strict user/assistant alternation and
+    # does not support the "system" role used by the base test harness.
+    # =====================================================================
+    def test_apply_chat_template_decoded_video_0(self):


we can override it when init a dummy processor in setup

zucchini-nlp · 2026-04-02T14:35:45Z

+    # =====================================================================
+    # Molmo2Processor.insert_bos() prepends a BOS token, so token count
+    # differs by 1 from raw tokenizer output. This is by design.
+    # =====================================================================
+    @unittest.skip("Molmo2 processor inserts BOS token, causing mismatch with raw tokenizer")


instead of skipping, lets override when needed.

zucchini-nlp · 2026-04-02T14:35:59Z

+    # =====================================================================
+    # Hub model has auto_map in processor_config.json which is not preserved
+    # through save/load cycle. Also use_single_crop_col_tokens default differs.
+    # =====================================================================
+    @unittest.skip("Molmo2 image processor patchifies output; rescale_factor passthrough not supported")
+    def test_image_processor_defaults_preserved_by_image_kwargs(self):
+        pass


same, instead of skipping, lets override when needed this and the rest as well

zucchini-nlp · 2026-04-02T14:37:34Z

+        if self.fast_video_processing_class is None:
+            self.skipTest("No fast video processor class defined")


why not, each tester should have a fast_video_processing_class as the only possible class

zucchini-nlp

Hey @SangbumChoi

Great model to add to Transformers. After reviewing I see that using modular would be much better since a lot of part are copy-paste from different models. I left comments on each class about where it can be copied from. Apart from that, there are a few places where we need to clean up and align API with the rest of VLMs for consistency

If you have q, ping me on slack. I will unsubscribe myself from this PR to not get notif about each commit, so when you want another review ping me again by @

- Remove unused _flash_attention_forward and flash_attn_supports_top_left_mask imports from modeling_molmo2.py (no longer needed after attention refactor) - Move Molmo2AdapterConfig, Molmo2TextConfig, Molmo2VitConfig imports from lazy in-function imports to top-level in test_modeling_molmo2.py per review Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix TRF013 modeling structure violation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SangbumChoi · 2026-04-06T23:26:32Z

+    rope_scaling: dict[str, Any] | None = None
+    rope_scaling_layers: list[int] | None = None
+    use_qk_norm: bool = False
+    qk_norm_type: str = "olmo"


Check what is olmo norm type

SangbumChoi · 2026-04-08T13:05:36Z

+def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
+    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
+    if len(array.shape) == 3:
+        n_crops, h, w = array.shape


well convert_images only treat single image isn't it?

Adopt auto_docstring on Molmo2Processor/__call__, simplify model_input_names to inherit tokenizer + image_processor keys plus token_type_ids, and drop deprecated frame_sample_mode/sampling_fps from Molmo2VideosKwargs and legacy attribute declarations. Override prepare_processor_dict in the processor test with a system-role-aware chat template, skip chat-template tests that assume batch-dim pixel_values (Molmo2 concatenates crops), and relax test_model_input_names to a subset check since video keys are absent in image-only runs. Drop the test_generate_with_past_key_values skip since image features are cached in the KV cache like other VLMs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nd processor init Fill in docstring entries for Molmo2ImagesKwargs, Molmo2VideosKwargs, and Molmo2VideoProcessorKwargs TypedDicts, and document the five custom init args of Molmo2Processor, so that make fix-repo / check_docstrings passes without placeholder stubs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…inference, add tie_word_embeddings - Add molmo2/molmo2_text to auto_mappings.py CONFIG_MAPPING_NAMES so AutoConfig.from_pretrained and check_repo.py doc-match checks work - Add molmo2 to HARDCODED_CONFIG_FOR_MODELS in auto_docstring.py to silence repeated 'Config not found' errors during repo checks - Add tie_word_embeddings: bool = False to Molmo2Config class and docstring to satisfy TRF015 modeling structure check - Pass input_data_format=ChannelDimension.LAST explicitly to all normalize() calls in image/video processors; fixes ValueError 'Unable to infer channel dimension format' when images have non-standard channel counts (e.g. RGBA) where infer_channel_ dimension_format's default num_channels=(1,3) can't match Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-18T06:56:52Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, molmo2

…ensor inputs resize_image() and build_overlapping_crops() assume HWC (channels-last) layout. When callers pass CHW numpy arrays or torch tensors (e.g. frames from torchvision / OpenCV→tensor pipelines at 960×540), the width was misinterpreted as the channel count, causing: ValueError: mean must have 960 elements if it is an iterable, got 3 Fix: after to_numpy_array(), infer the channel dimension and transpose to ChannelDimension.LAST before any spatial processing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

zucchini-nlp

Nice work @SangbumChoi !

The model arch is huge and more involved than vanilla VLMs, so I think we need a couple more iterations. My main comment is the usage of explicit device in text modules, and refactoring more image/video processors. I hope we can actually move the biiig build_image_inputs fn from model to processing, as it simply tries to split pixels per each input text. Requiring nested image inputs will help with that and we should be able to return already "built" pixel values from image processor

After that, it'd be great to add the basic helpers such as get_image_features, get_placeholder, etc. Some third party libraries started rely on it to get encoded images/videos

zucchini-nlp · 2026-04-21T15:09:32Z

+    torch_dtype=torch.bfloat16,
+    device_map="auto",


ultra nit: in v5 we load model in auto-dtype so we can skip passing torch_dtype

zucchini-nlp · 2026-04-21T15:10:41Z

            ),
            ("mobilevitv2", {"torchvision": "MobileViTImageProcessor", "pil": "MobileViTImageProcessorPil"}),
+            ("molmo2", {"torchvision": "Molmo2ImageProcessor"}),
+            ("nougat", {"torchvision": "NougatImageProcessor", "pil": "NougatImageProcessorPil"}),


nougat? Also I think Molmo2 would have been in auto_mapping.py, does it not get added after running python utils/check_auto.py --fix_and_overwrite 🤔 ?

zucchini-nlp · 2026-04-21T15:17:20Z

+def resize_image(
+    image: np.ndarray,
+    desired_output_size: list[int],
+    resample: PILImageResampling,
+) -> np.ndarray:
+    """Resize an image and rescale to [0, 1] float32."""
+    image = torch.permute(torch.from_numpy(image), [2, 0, 1])
+    resized = torchvision.transforms.Resize(
+        desired_output_size,
+        resample,


huh very interesting, fast processors have self.resize with same functionality, and I can't think of cases when we get a numpy image

I will comment below about possible reason, lmk if processor is starting with numpy even after fixing it

zucchini-nlp · 2026-04-21T15:20:13Z

+    patch_size = 14
+    pooling_size = [2, 2]
+
+    def __init__(self, **kwargs):


needs to type annotate kwargs with Unpack[Molmo2ImagesKwargs] to auto-docstring

zucchini-nlp · 2026-04-21T15:21:17Z

+    def preprocess(
+        self,
+        images: ImageInput,


here is the reason i am seeing, we need to override a private self._preprocess which gets a ready list of tensor images in CHW format each. It also doesn't need resolving args with x if x else self.x

same for docs, not needed as long as you add annotation with Unpack and decorate class with auto_docstring

zucchini-nlp · 2026-04-22T09:11:21Z

+    crop_arr = np.zeros([n_crops, crop_size, crop_size, 3], dtype=src.dtype)
+    patch_idx_arr = np.zeros([n_crops, crop_patch_h, crop_patch_w], dtype=np.int32)


and do all ops in torch/torchvision since it is TorchBackend

zucchini-nlp · 2026-04-22T09:12:12Z

+    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
+    if len(array.shape) == 3:
+        n_crops, h, w = array.shape
+        h_patches = h // patch_size


same q here, i believe videos have a fixed shape as well

zucchini-nlp · 2026-04-22T09:13:04Z

+        if size.height is None or size.width is None:
+            raise ValueError("size must contain 'height' and 'width' keys.")


it is in standardize_kwargs so not needed, kinda duplicate

zucchini-nlp · 2026-04-22T09:13:47Z

+            # Convert from torch (T, C, H, W) to numpy (T, H, W, C)
+            if isinstance(video, torch.Tensor):
+                video = video.permute(0, 2, 3, 1).numpy()


hm interesting, why since all inputs are in channel first format for image and for video processors

zucchini-nlp · 2026-04-22T09:14:27Z

+        batch_crops = []
+        batch_pooled_patches_idx = []
+
+        for video in videos:


would be great if we can use groupby_shapes here and in image processing, it greatly speeds up batch processing in some cases

tarekziade · 2026-04-29T07:00:22Z

+        is_image_block = (token_type_ids[batch_idx, q_idx] == 1) & (token_type_ids_at_kv_idx == 1)
+
+        # This is bidirectional attention whenever we are dealing with image tokens
+        return is_image_block & is_image_block


Redundant expression: is_image_block & is_image_block is equivalent to is_image_block

tarekziade · 2026-04-29T07:00:50Z

+
+            src_idx = np.tile(np.arange(S), (B, 1))  # [B, S]
+            valid_mask = src_idx >= first_valid_index[:, None]  # [B, S]
+            tgt_idx = src_idx + 1  # shit right


typo : shift

tarekziade · 2026-04-29T07:01:30Z


 -->
-*This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-05.*
+*This model was released on 2020-05-16 and added to Hugging Face Transformers on 2025-12-05.*


unrelated change, should be removed from this PR

tarekziade · 2026-04-29T07:01:37Z


 -->
-*This model was released on {release_date} and added to Hugging Face Transformers on 2026-02-08.*
+*This model was released on 2026-02-17 and added to Hugging Face Transformers on 2026-02-09.*


unrelated change, should be removed from this PR

tarekziade · 2026-04-29T07:10:36Z

+    size = {"height": 378, "width": 378}
+    image_mean = IMAGENET_STANDARD_MEAN
+    image_std = IMAGENET_STANDARD_STD
+    do_resize = True


I think most of those flags are dead e.g. unused in the molmo2 preprocess cc @zucchini-nlp

merveenoyan requested a review from molbap February 27, 2026 05:55

SangbumChoi force-pushed the molmo2 branch from b15107c to 3fee343 Compare March 26, 2026 23:28

SangbumChoi and others added 14 commits March 27, 2026 08:56

Merge branch 'main' into molmo2

bc04776

fix(molmo2): add Molmo2TextModel to IGNORE_NON_TESTED

e38b0a3

Molmo2TextModel is an internal sub-component used by Molmo2Model and Molmo2ForConditionalGeneration and is tested implicitly through those. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(molmo2): replace requests with stdlib urllib in video processor

cc06cbe

requests is not part of the standard library and caused ImportError in minimal environments (e.g. HuggingFace Jobs). Use urllib.request instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge branch 'main' into molmo2

f000173

Merge branch 'main' into molmo2

de8c268

zucchini-nlp self-requested a review March 30, 2026 15:23

zucchini-nlp reviewed Apr 2, 2026

View reviewed changes

SangbumChoi added 5 commits April 4, 2026 09:13

add comments fix

5d9811f

remove safe loading

3d6ec36

fix configuration with large refactor

2604a49

fix tests

e257e17

make formalize to standard way

d75fdbe

SangbumChoi force-pushed the molmo2 branch from 77a8c8c to 99c8582 Compare April 4, 2026 07:47

Merge branch 'main' into molmo2

91b6a78

SangbumChoi force-pushed the molmo2 branch from 99c8582 to 91b6a78 Compare April 4, 2026 07:51

SangbumChoi and others added 14 commits April 6, 2026 08:17

Merge branch 'main' into molmo2

e74f38a

add initilization

f275c15

add major change to modular

590ad75

Add post_init() call to Molmo2VisionModel.__init__

31550ee

Fix TRF013 modeling structure violation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add modular_molmo2 and regenerate modeling file

43c4853

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add **kwargs to Molmo2VisionModel.forward

a1b40d1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

remove unused float32_attention from Molmo2Vit/AdapterConfig

668776b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'main' into molmo2

26ea989

add Molmo2 checkpoint conversion mapping for weight key renaming

0632571

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

default vision attention to sdpa to avoid eager OOM

68e5285

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

force sdpa in molmo2 vision attention

53ac58e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix molmo2 conversion pattern reversal and add doc dates

553f267

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ruff format conversion_mapping.py

68a2e68

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SangbumChoi commented Apr 8, 2026

View reviewed changes

SangbumChoi and others added 4 commits April 11, 2026 10:23

fix video_preocessor.py

72c51df

simple date fix

c9f2a43

Merge branch 'main' into molmo2

a3fcaab

SangbumChoi requested a review from zucchini-nlp April 18, 2026 02:48

SangbumChoi and others added 3 commits April 18, 2026 11:54

Merge branch 'main' into molmo2

fa91bb0

zucchini-nlp reviewed Apr 22, 2026

View reviewed changes

tarekziade mentioned this pull request Apr 28, 2026

Add Molmo2 tarekziade/tarekziade-transformers-reviewer-test#12

Open

tarekziade reviewed Apr 29, 2026

View reviewed changes

This was referenced Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#43

Open

		if self.fast_video_processing_class is None:
		self.skipTest("No fast video processor class defined")

		crop_arr = np.zeros([n_crops, crop_size, crop_size, 3], dtype=src.dtype)
		patch_idx_arr = np.zeros([n_crops, crop_patch_h, crop_patch_w], dtype=np.int32)

		if size.height is None or size.width is None:
		raise ValueError("size must contain 'height' and 'width' keys.")

Conversation

SangbumChoi commented Jan 23, 2026

What does this PR do?

Before submitting

Who can review?

Uh oh!

SangbumChoi commented Mar 29, 2026

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels