Add ForSequenceClassification heads for the OLMo family #45551

Merged

Rocketknight1 merged 2 commits into huggingface:main from earino:add-olmo-family-classification on Apr 22, 2026

Conversation

@earino (Contributor) commented Apr 21, 2026

What does this PR do?

Adds `ForSequenceClassification` heads for Olmo, Olmo2, and Olmo3 so that `AutoModelForSequenceClassification.from_pretrained("allenai/OLMo-2-0425-1B")` (and the Olmo/Olmo3 equivalents) works.
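For reviewers, a minimal usage sketch of what this enables (the `num_labels` value and the example sentence are illustrative, not part of the PR):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "allenai/OLMo-2-0425-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The classification head is freshly initialized; num_labels=2 is just for illustration.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("OLMo is a fully open language model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)
```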

Structure follows the same pattern as Gemma2, Qwen3, and Glm4: one `OlmoForSequenceClassification(LlamaForSequenceClassification): pass` in `modular_olmo.py`, then Olmo2 and Olmo3 each subclass the previous one. After `make fix-repo`, the generated `modeling_*.py` files use the `GenericForSequenceClassification` mixin, same as Jamba, JetMoe, Ministral3, and Gemma3Text.
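Concretely, the entire hand-written surface is three one-line subclasses (imports omitted; the modular converter expands these into the generated `modeling_*.py` files):

```python
# modular_olmo.py
class OlmoForSequenceClassification(LlamaForSequenceClassification):
    pass

# modular_olmo2.py
class Olmo2ForSequenceClassification(OlmoForSequenceClassification):
    pass

# modular_olmo3.py
class Olmo3ForSequenceClassification(Olmo2ForSequenceClassification):
    pass
```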

Scope is the dense chain only. OlmoHybrid and the MoE branch (Olmoe, FlexOlmo) can come as follow-up PRs.

Fixes #45529

Code Agent Policy

This was drafted with Claude Code. @Rocketknight1 explicitly opted in for this specific change on the coordination issue:

> This is welcome! Sequence classification heads are often not included in the initial PR adding a new causal LM, but we're happy to add them. Your reason for needing it is good, and PRs like this are usually very easy to automate, so I'm happy for it to be mostly AI-written.

#45529 (comment)

I read every changed line before each commit, ran the local test suite on my machine, and did GPU validation on real checkpoints before opening this. I can defend each change.

  • I confirm that this is not a pure code agent PR.

Changes

  • `src/transformers/models/olmo/modular_olmo.py`: `class OlmoForSequenceClassification(LlamaForSequenceClassification): pass`
  • `src/transformers/models/olmo2/modular_olmo2.py`: subclass of `OlmoForSequenceClassification`
  • `src/transformers/models/olmo3/modular_olmo3.py`: subclass of `Olmo2ForSequenceClassification`
  • Regenerated `modeling_olmo.py`, `modeling_olmo2.py`, `modeling_olmo3.py` via `make fix-repo`
  • Registered all three in `MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES` in `models/auto/modeling_auto.py` (see the sketch after this list)
  • Added autodoc sections to `docs/source/en/model_doc/olmo{,2,3}.md`
  • Tests:
    • Olmo and Olmo2 use the older `ModelTesterMixin` pattern. Added the new class to `all_model_classes` and `text-classification`/`zero-shot` to `pipeline_model_mapping`.
    • Olmo3 uses the newer `CausalLMModelTester`. Set `sequence_classification_class = Olmo3ForSequenceClassification`, which auto-enables the three `test_sequence_classification_model*` tests from the base class.
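A sketch of the auto-mapping registration (neighboring entries elided; the real file imports `OrderedDict` and maps model type strings to class-name strings):

```python
# models/auto/modeling_auto.py (sketch; the mapping has hundreds of entries)
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [
        # ...
        ("olmo", "OlmoForSequenceClassification"),
        ("olmo2", "Olmo2ForSequenceClassification"),
        ("olmo3", "Olmo3ForSequenceClassification"),
        # ...
    ]
)
```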

Test plan

Local, on MPS:

  • `make style`, `make typing`, `make check-repo` all pass
  • `pytest tests/models/olmo tests/models/olmo2 tests/models/olmo3`: 413 passed, 0 failed; 4 `test_tp_*` tests deselected because they require CUDA multi-GPU
  • Olmo3's three `test_sequence_classification_model*` tests pass

GPU validation notebook with outputs: https://gist.github.com/earino/2bc6f246eef21a36c3c64d64150b9510

Ran on an NVIDIA RTX PRO 6000 Blackwell. For each of `allenai/OLMo-1B-hf`, `allenai/OLMo-2-0425-1B`, and `allenai/Olmo-3-7B-Instruct` (a condensed sketch of these checks follows the list):

  • `AutoModelForSequenceClassification.from_pretrained(...)` dispatches to the right class
  • Forward returns logits of shape `(batch, num_labels)`
  • Loss is finite at random init and sits near ln(num_labels), as expected when an untrained head predicts roughly uniformly
  • Backward produces finite gradients for every trainable parameter
  • The library's LOAD REPORT correctly shows `score.weight | MISSING` (new head) and `lm_head.weight | UNEXPECTED` (causal-LM head unused)
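Condensed, the per-checkpoint checks look roughly like this (the pad-token fallback and label values are illustrative; the real run is in the gist):

```python
import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def smoke_test(model_id: str, num_labels: int = 2) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=num_labels)
    # Decoder-only classifiers need a pad token to locate the last real token.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

    inputs = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
    out = model(**inputs, labels=torch.tensor([1, 0]))
    assert out.logits.shape == (2, num_labels)
    # A random-init head predicts ~uniformly, so loss should sit near ln(num_labels).
    assert math.isfinite(out.loss.item())
    out.loss.backward()
    assert all(
        torch.isfinite(p.grad).all() for p in model.parameters() if p.grad is not None
    )

for model_id in ("allenai/OLMo-1B-hf", "allenai/OLMo-2-0425-1B", "allenai/Olmo-3-7B-Instruct"):
    smoke_test(model_id)
```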

A LoRA fine-tune on IMDB with `allenai/OLMo-2-0425-1B` (4.2M trainable params, 250 steps) brings the mean loss from 1.48 over the first 20 steps down to 0.0005 over the last 20. I only ran the full training loop on one of the three because the classification head implementation is identical across them (same `GenericForSequenceClassification` mixin, different backbones and `*PreTrainedModel` bases), so the training-loop plumbing is shared; the smoke-test forward/backward covers the other two. Happy to extend the LoRA run to Olmo and Olmo3 if you'd prefer.
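For context, the shape of that setup with peft (the rank, alpha, and target modules here are assumptions, not copied from the notebook):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/OLMo-2-0425-1B", num_labels=2
)
# Hyperparameters are illustrative; see the gist for the actual run.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # on the order of a few million trainable params
```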

Who can review?

cc @Rocketknight1 as requested on #45529.

earino and others added 2 commits April 21, 2026 15:18
Adds sequence-classification heads to the OLMo family so
`AutoModelForSequenceClassification.from_pretrained("allenai/OLMo-2-0425-1B")`
(and the Olmo/Olmo3 equivalents) work out of the box.

Implementation follows the canonical modular-inheritance pattern used by
Gemma/Gemma2, Qwen2/Qwen3, and Glm/Glm4: a single hand-written subclass in
`modular_olmo.py` cascades trivially to Olmo2 and Olmo3 via the modular
tooling, which resolves to the `GenericForSequenceClassification` mixin.

Also registers the three classes in `MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES`
and adds autodoc entries to each model's doc page.

Coordination: huggingface#45529
Maintainer approval: @Rocketknight1 ("This is welcome! ... happy for it to
be mostly AI-written. Just ping me on the PR for review when it's ready!")

AI assistance: yes, per issue huggingface#45529.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For Olmo and Olmo2 (older `ModelTesterMixin` pattern), adds the new class
to `all_model_classes` and wires up `text-classification` + `zero-shot` in
`pipeline_model_mapping`, so standard forward/gradient tests run against
the classification head.

For Olmo3 (newer `CausalLMModelTester` pattern), sets
`sequence_classification_class = Olmo3ForSequenceClassification` on the
model tester, which auto-enables `test_sequence_classification_model`,
`test_sequence_classification_model_for_single_label`, and
`test_sequence_classification_model_for_multi_label` from the base class.

Local verification on MPS: 413 non-TP tests pass; Olmo3's three
classification tests pass specifically. TP tests (`test_tp_*`) are
deselected on MPS hardware — CUDA-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
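For reference, the Olmo3 tester wiring described in the second commit amounts to roughly this (the tester class name `Olmo3ModelTester` is an assumption; only the `sequence_classification_class` attribute and its value are from the PR):

```python
# tests/models/olmo3/test_modeling_olmo3.py (sketch; surrounding tester config elided)
class Olmo3ModelTester(CausalLMModelTester):
    ...
    # Opting in to the shared sequence-classification tests from the base tester.
    sequence_classification_class = Olmo3ForSequenceClassification
```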
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, olmo, olmo2, olmo3

@Rocketknight1 (Member)

run-slow: auto, olmo, olmo2, olmo3

@github-actions (Contributor)

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/auto", "models/olmo", "models/olmo2", "models/olmo3"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 (Member) left a comment

Yes, LGTM! I'll merge as soon as tests pass, if we're lucky we might get in before the branch cut for today's release.

@github-actions
Copy link
Copy Markdown
Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                    |
|---------|----------|--------------------------------|
| RUN     | 22f14e8d | workflow commit (merge commit) |
| PR      | 812cba00 | branch commit (from PR)        |
| main    | 8fb7c7e5 | base commit (on main)          |

✅ No failing test specific to this PR 🎉 👏 !

@Rocketknight1 Rocketknight1 added this pull request to the merge queue Apr 22, 2026
Merged via the queue into huggingface:main with commit 864db66 Apr 22, 2026
30 checks passed

Development

Successfully merging this pull request may close these issues.

Add Olmo2ForSequenceClassification (and ideally OlmoForSequenceClassification / Olmo3ForSequenceClassification)
