Add full GGUF loading support for GPT‑OSS (fixes #43366, supersedes #43757) #45118

Closed
sirzechs66 wants to merge 128 commits into huggingface:main from sirzechs66:fix-transformers-issue

Conversation

@sirzechs66
Contributor

@sirzechs66 sirzechs66 commented Mar 30, 2026

What does this PR do?

This PR adds full GGUF loading support for GPT‑OSS models (20B/120B). It allows Transformers (and consequently vLLM) to directly load GPT‑OSS GGUF files without falling back to a wrong architecture. The changes include:

  • Architecture registration in GGUF mappings.
  • A custom GptOssTensorProcessor to handle MoE expert splitting and gate/up interleaving.
  • Reconstruction of nested rope_scaling (YaRN) from flat GGUF metadata.
  • Tests: fast registration test + slow integration test using a real 20B GGUF file.
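
To make the gate/up interleaving step concrete, here is a minimal pure-Python sketch. It assumes the GGUF file stores separate per-expert gate and up matrices and that the target layout alternates their columns as `[g0, u0, g1, u1, ...]`. The helper names `interleave_gate_up` and `fuse_experts` are hypothetical and are not taken from this PR's `GptOssTensorProcessor`.

```python
def interleave_gate_up(gate_row, up_row):
    """Interleave two equal-length rows as [g0, u0, g1, u1, ...]."""
    if len(gate_row) != len(up_row):
        raise ValueError("gate and up rows must have the same length")
    fused = []
    for g, u in zip(gate_row, up_row):
        fused.extend((g, u))
    return fused


def fuse_experts(gate_experts, up_experts):
    """Fuse per-expert gate/up matrices (lists of rows), one expert at a time."""
    return [
        [interleave_gate_up(g_row, u_row) for g_row, u_row in zip(g_mat, u_mat)]
        for g_mat, u_mat in zip(gate_experts, up_experts)
    ]


# Two toy experts, each with a 1x2 gate matrix and a 1x2 up matrix.
gate = [[[1, 2]], [[5, 6]]]
up = [[[10, 20]], [[50, 60]]]
fused = fuse_experts(gate, up)
```

With this layout, even columns (`row[::2]`) recover the gate values and odd columns (`row[1::2]`) recover the up values, which gives a quick sanity check after conversion.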

Fixes #43366, supersedes #43757.
Related vLLM issue: vllm-project/vllm#22353

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). – Not applicable, it adds a feature.
  • Did you read the contributor guideline, Pull Request section? – Yes.
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. – Issue GGUF model with architecture gpt-oss support #43366, discussion in comments.
  • Did you make sure to update the documentation with your changes? – yes!
  • Did you write any new necessary tests? – Yes, in tests/quantization/ggml/test_ggml.py

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Tagging:

@sirzechs66 sirzechs66 force-pushed the fix-transformers-issue branch from 10463a1 to a5eef4f Compare March 30, 2026 16:43
@sirzechs66
Contributor Author

@SunMarc, @Cyrilvallez please take a look when you can. I am seeing test failures that come from files I did not edit.
Thank you.

Member

@SunMarc SunMarc left a comment

Thanks! I did a first pass.

Comment thread tests/models/gpt_oss/test_modeling_gpt_oss.py Outdated
Comment thread tests/models/gpt_oss/test_modeling_gpt_oss.py Outdated
Comment thread tests/models/gpt_oss/test_modeling_gpt_oss.py Outdated
Comment thread tests/models/gpt_oss/test_modeling_gpt_oss.py Outdated
Comment thread tests/models/gpt_oss/test_modeling_gpt_oss.py Outdated
Comment thread src/transformers/modeling_gguf_pytorch_utils.py Outdated
@sirzechs66
Contributor Author

The branch was behind main; I updated it and it merged without any conflicts. @SunMarc, please review and let me know if any more changes are required.
Apologies for my naivety.
Happy to contribute.

  • added dynamic rope config fetching, because gpt_oss and the current GGUF version carry different metadata (GGUF has only 3 rope configs, while gpt_oss has "factor", "attention_factor", "beta_fast", "beta_slow")
  • changed the tests to follow the ggml test conventions and removed the incorrect test function
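
As a rough illustration of the rope config point above, the sketch below collects flat `rope.scaling.*` metadata keys into one nested dict and fills in fields the file does not carry from caller-supplied defaults. The key names, the default values, and the function name `build_rope_scaling` are assumptions for illustration only; they are toy values, not the actual GPT‑OSS metadata or the code in this PR.

```python
def build_rope_scaling(flat_metadata, prefix="rope.scaling.", defaults=None):
    """Collect flat `rope.scaling.*` keys into one nested dict.

    Returns None when the metadata carries no scaling keys at all.
    """
    found = {
        key[len(prefix):]: value
        for key, value in flat_metadata.items()
        if key.startswith(prefix)
    }
    if not found:
        return None
    # Start from defaults so that values read from the file take precedence.
    nested = dict(defaults or {})
    nested.update(found)
    return nested


# Toy metadata with made-up values (hypothetical key names).
flat = {
    "rope.freq_base": 10000.0,
    "rope.scaling.type": "yarn",
    "rope.scaling.factor": 4.0,
    "rope.scaling.original_context_length": 2048,
}
rope_scaling = build_rope_scaling(flat, defaults={"beta_fast": 32.0, "beta_slow": 1.0})
```

Reading whatever scaling keys are present, instead of hard-coding a fixed set, is one way to tolerate files that expose only some of the fields the model config expects.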

please refer to the final commit for review

@sirzechs66 sirzechs66 requested a review from SunMarc April 2, 2026 11:22
@sirzechs66
Contributor Author

@SunMarc @Cyrilvallez @ArthurZucker Please review; the requested changes have been made.
Thanks.

Member

@SunMarc SunMarc left a comment

Thanks !

@SunMarc
Member

SunMarc commented Apr 9, 2026

@bot /style

@github-actions
Contributor

github-actions Bot commented Apr 9, 2026

Style fix bot fixed some files and pushed the changes.

@SunMarc SunMarc enabled auto-merge April 15, 2026 11:39
auto-merge was automatically disabled April 15, 2026 12:40

Head branch was pushed to by a user without write access

@SunMarc SunMarc enabled auto-merge April 15, 2026 13:46
@SunMarc SunMarc added this pull request to the merge queue Apr 15, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 15, 2026
@sirzechs66
Contributor Author

@SunMarc please review. There are some unexpected errors in "test_processors" in the CI run, which made the merge fail.
Thank you.

@SunMarc SunMarc added this pull request to the merge queue Apr 16, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 16, 2026
sirzechs66 and others added 11 commits April 18, 2026 13:02
* Music flamingo

* Fix pos embeddings

* Method arg docstrings

* Add tests & docs

* Fix AF3 dtype bug

* Fix the MF performance issue

* Fix pos embeddings

* Fix embeddings & format

* Remove external deps

* Update processor token names

* Cleanup

* Simplify RotaryEmbedding to lang-only

* Reuse AF3 config classes

* Trim+rename rotary embedding

* Call parent _init_weights first and drop rotary einsum

* Precompute rotary cache at init

* Use modular processor pattern for MusicFlamingo

* Remove audio-only inference example

* Refactor Audio Feature Casting Path

* Clarify private source repo

* Clean up modular

* Move config to modular

* Formatting

* Remove dummy

* Derive musicflamingo timing and rotary config

* Llama style rotary embeddings

* Added reproducer comments

* Expose _init_weights for modular.

* Satisfy repo checks

* Align MusicFlamingo rotary with Llama style

* Move MusicFlamingo _init_weights to encoder

* Keep old behavior

* Move MusicFlamingo rotary settings into encoder rope_parameters

* Use AutoConfig in AF3/MF

* Align MusicFlamingo RoTE with Llama RoPE conventions

* Update outdated fixtures

* init_weights without changing others

* FIx import

* Remove backward compat

* Regenerate modeling for MF

* Fix AF3 batch inference bug

* Simplify config and nit.

* Conform more to transformers convention, e.g. removing unused code paths.

* Add another possible AF3 prefix.

* Use auto_docstring and update docstrings.

* Nits

* Nit for review

* Shift RoTE to main model so that encoder can be directly used from AF3.

* Refactoring nit.

* Fix init

* Fix some failing tests

* Fix AF3 & MF and add batching tests

* Fix audio embedding masking (bad post length)

* Nits and remove since same as GLM was bug in post length computation

* Simplify MF as AF3, and style checks.

* New config after merge and modular update.

* Address music flamingo tests, and some cleanup.

* style check

* Regenerate config.

* Update fixtures.

* Nits

* Nit

* Improve RoTE config

* Refine MusicFlamingo rotary time handling

* Simplification, and update AF3 processor for better modular

* Fix torch export

* Simplify modular, including upstreaming input_ids input to get_audio_features

* Remove upstreaming of input_ids to get_audio_features, and remove audio_rotary_dim.

* Switch to MoonshineRotaryEmbedding, and cleanup.

* Remove hardcoded MusicFlamingo partial_rotary_factor

* Update fixtures

* Compile re.sub

* Update src/transformers/models/musicflamingo/modular_musicflamingo.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/musicflamingo/modular_musicflamingo.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Style

* Update fixtures.

* Conditional torch import for processor.

---------

Co-authored-by: Eric B <ebezzam@gmail.com>
Co-authored-by: Eric Bezzam <4757445+ebezzam@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* add Cache and test on Mamba

* fix

* fix

* fix

* fix

* fix

* final fix

* test hybrid with jamba

* fix tests

* fixes

* fix

* fix

* fix

* combine both types + zambas

* add config mapèping

* adjust tests

* fix

* fix

* fix

* more models

* final mambas

* config

* finalize almost everything

* simplify tests

* simplify tests further

* fix tests

* oupsi

* fix

* fix broken no_split_modules

* fix

* fixes

* fix

* fix

* fixes

* add layer type

* oupsi

* fix

* style

* fix

* fixes

* final fix

* forgot those qwens

* tests

* offloading

* much better static shape native design

* oupsi

* adjustments in generate

* allow cudagraphs

* small oupsi

* start renaming

* revert unrelated what are they doing here

* more renaming

* revert offloading change

* add offloading skips

* split shapes for tests

* comments and renaming
…face#45143)

* Add parse_response to Processor, make it a bit more official

* Make the parse_response annotation a string to avoid torch import issues

* Add the same logic to any-to-any
…5150)

* Correct comment in training.md for TrainingArguments

Fix comment formatting for TrainingArguments instantiation.

* Update docs/source/en/training.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix bug for janus model image generation

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update expected tokens

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update comment

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* use `_preapre_generation_config`

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update expected token

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update code

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update comments

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update

* update

* update

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…45166)

* Re-add regex substitutions to the spec

* make fix-repo

* Update the test schema to drop the empty content block

* Trigger tests
RudrenduPaul and others added 23 commits April 18, 2026 13:13
…rmatting) (huggingface#45370)

docs: fix docstring errors in Gemma3nTextConfig

Fix five documentation errors in Gemma3nTextConfig docstring:
- Typo: "emebeddings" → "embeddings"
- Incomplete sentence for altup_active_idx (truncated at "or correct")
- Grammar: "should be make" → "should make" in altup_num_inputs
- Grammar: "number of layer" → "number of layers" in num_kv_shared_layers
- Formatting: add missing backticks around type annotations for
  laurel_rank and activation_sparsity_pattern to match HF docstring
  conventions

Both modular_gemma3n.py (source of truth) and the generated
configuration_gemma3n.py are updated in sync.

Built by Rudrendu Paul, developed with Claude Code

Co-authored-by: Rudrendu <RudrenduPaul@users.noreply.github.com>
…uggingface#44949)

* Fix NotebookProgressCallback to allow evaluate() before and after train

* Add unit test for NotebookProgressCallback evaluating before and after training

* Skip NotebookProgressCallback tests when IPython is not installed

* Display eval metrics when training tracker is None on NotebookProgressCallback

* Add is_ipython_available and require_ipython test decorator

* Filter model_preparation_time metric and add code comments in on_eval
…ock.forward (huggingface#45352)

* fix(qwen3_moe): correct return type annotation on Qwen3MoeSparseMoeBlock.forward

* fix: propagate Qwen3MoeSparseMoeBlock forward return type fix to generated vl_moe and omni_moe files

Built by Rudrendu Paul, developed with Claude Code

---------

Co-authored-by: Rudrendu <RudrenduPaul@users.noreply.github.com>
…training (huggingface#45329)

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* Apply repo consistency fixes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* chore: empty commit

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Fix huggingface#45305 + add regression test GAS

* Refine test model_accepts_loss_kwargs

* fix style

* Fix properly setup model_accepts_loss_kwargs+True

* Update tests/trainer/test_trainer.py

remove unnecessary parameters

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fix: simplify error messages, back to a simpler test

* feat: add new test with actual training

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
… generation (huggingface#45368)

ProcessorMixin subclasses (e.g. Qwen3VLProcessor) expose the fast tokenizer
at .tokenizer, not ._tokenizer. Use getattr() to handle both ProcessorMixin
and PreTrainedTokenizerFast when extracting the rust tokenizer backend for
DirectStreamer and CBStreamer.

Fixes huggingface#45362

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
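
The `getattr()` fallback described in the commit message above can be sketched with stand-in classes. `FakeFastTokenizer`, `FakeProcessor`, and `get_rust_backend` are hypothetical names used only for illustration; the real classes and the actual change in that commit may differ.

```python
class FakeFastTokenizer:
    """Stand-in for a fast tokenizer that keeps its Rust backend at `._tokenizer`."""

    def __init__(self, backend):
        self._tokenizer = backend


class FakeProcessor:
    """Stand-in for a processor that wraps a fast tokenizer at `.tokenizer`."""

    def __init__(self, fast_tokenizer):
        self.tokenizer = fast_tokenizer


def get_rust_backend(tokenizer_or_processor):
    # Unwrap a processor if there is one; a bare tokenizer has no `.tokenizer`
    # attribute here, so getattr falls back to the object itself.
    fast = getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)
    return fast._tokenizer


backend = object()
fast_tokenizer = FakeFastTokenizer(backend)
processor = FakeProcessor(fast_tokenizer)
```

Both `get_rust_backend(fast_tokenizer)` and `get_rust_backend(processor)` return the same backend object, so calling code does not need to know which wrapper it was handed.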
…Error (huggingface#45359)

Fixes huggingface#45356

Remove `kimi_k25` from `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS` — its
remote `TikTokenTokenizer` is the only correct backend (no `tokenizer.json`,
non-sequential added-token IDs that `TokenizersBackend` cannot reproduce).

Also fix `_patch_mistral_regex`: the method receives the raw
`tokenizers.Tokenizer` object, which has `.pre_tokenizer` directly,
not `.backend_tokenizer.pre_tokenizer`.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…45041)

* ok

* fix consistency

* pass qwen35 reverse mapping

* update new failed test according to captured info

* Revert "update new failed test according to captured info"

This reverts commit 445a400.

* make it optional

* make fusion_mapping more general

* make conv3d conversion more general

* make fusion_mapping more general

* better name for conversion

* add fusion_mapping doc and clean tests

* fix reverse mapping test follow gemma3n

* chore: retrigger ci

* tests: move qwen3.5 reverse mapping fix to separate branch

* code clean!

* ruff format and clean test to make it simple

* richer doc

* get converters from config rather than each module

* add explict module_name check for fusion!

* better isolated test and code clean

* support serialized fusion_config

* ruff format

* config can handle unknown attributes

* move fused cls out of spec by mixin

* detailed comments

* ruff
…le set.__contains__ (huggingface#45282)

* fix torch.compile/export failures on amd

* test

* move imports
…uggingface#45414)

* Fix `IndexError: pop from an empty deque` under DeepSpeed ZeRO-3

When `kernels` is installed, `@use_kernelized_func` attaches a
`rotary_fn` child `nn.Module` to attention layers. DeepSpeed ZeRO-3's
parameter coordinator traces the module graph at init and expects
every registered submodule to be invoked during forward. The model's
forward still calls the plain Python `apply_rotary_pos_emb`, so
`rotary_fn` is never executed and the trace desynchronizes, raising
`IndexError: pop from an empty deque` on the second forward.

Skip attaching the kernelized submodule when ZeRO-3 is enabled; users
running under ZeRO-3 fall back to the Python implementation, which is
what they were getting before huggingface#41147.

Fixes huggingface#45137

* Add dates to new model cards to satisfy check-repository-consistency
* Fix

* First draft

* Add push-to-hub options for SAM3-LiteText conversion

* Fix SAM3-LiteText model tests and text encoder init stability

* Add LiteText ViT auto mappings and use LiteText config

* Improve conversion script

* Do not require triton

* Improve modeling

* Fix repo

* Fix repo

* Add vision model to auto mapping

* Add missing entries to auto mapping

* reverse serve.py

* simplify implementation

* fix modular

* Address review comments

* fix repo

* fix after review 2

* fix tests + repo

* Address comments

* Address comments

* Make fix-repo

* add to hub cache + fixup base sam3 as well

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
…nt (huggingface#45348)

* Make content truly optional

* style + test

* improve test
* inital protoype

* remove unneeded selected_experts

* Revert MoE expert replay; document pattern via monkey patching

Replace the intrusive record/replay implementation across modeling files
with a documentation-only guide. All three pieces — the replayable router
subclass, the replay context manager, and the runtime OutputRecorder
registration — can be built on top of the existing monkey_patching and
output_capturing APIs without touching core MoE modeling code.

Also shows the one-line conversion from vLLM's CompletionOutput.routed_experts
numpy array to the per-layer tuple this pattern expects, enabling RLHF
workflows that generate with vLLM and train with transformers.

* Preserve unrelated forward-progress in __init__.py and generic.py

The previous revert commit accidentally rolled back unrelated work on
these two files — version bump, TorchvisionBackend addition, and
module-alias refactor. Restore those while keeping the MoE-specific
additions (MoERouting export, output_moe_routing kwarg) removed.
…ggingface#45428)

* PEFT integration fixes preventing save/load & integration

* Rerun make style with newer ruff

---------

Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
* interesting

* oops

* test uses better temporal positions now

* fix repo

* re-unite glm and qwen3-vl

* add some fast tests

* dummy import

* missed another dummy import

* move comments around and add more comments
huggingface#43047)

* fix: add TimmWrapperForImageClassification to _no_split_modules in modular files and regenerate

- Update modular_pe_audio_video.py and modular_pe_video.py (source of truth)
- Regenerate modeling_pe_audio_video.py and modeling_pe_video.py via modular_model_converter.py
- Remove @unittest.skip on test_model_parallelism now that the crash is resolved

Fixes huggingface#42918

* fix: add TimmWrapperForImageClassification to _no_split_modules in pe_audio, pe_video, pe_audio_video

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
…vec2-lv-60-espeak-cv-ft) (huggingface#45199)

* fix: Resolve regressions from tokenizer refactor

* chore: Add regression test

* nit: Remove the test

* fix: Expand test coverage to all tests

---------

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
* Fix ty for transformers cli

* ty

* await class

* style

* adress comments

* Fix !

* fix
"docs Dinov2: fix typo in checkpoint path from google/dinov2-base-patch16-224 to facebook/dinov2-base"
…ingface#45207)

* [Gemma4] Add docstrings for Per-Layer Embeddings (PLE) pipeline

The PLE system is complex and underdocumented, which makes it hard
for third-party implementations (llama.cpp, candle, mlx, etc.) to
get right. This adds:

- Config docstring for hidden_size_per_layer_input explaining that
  the actual embedding dim is num_hidden_layers * hidden_size_per_layer_input,
  the embedding is scaled by sqrt(hidden_size_per_layer_input), and
  describing the full two-component pipeline

- Docstring for get_per_layer_inputs() explaining the token-identity
  component and the packed-to-4D reshape

- Docstring for project_per_layer_inputs() explaining the context-aware
  projection, normalization, and combination with scale factors

- Comment on the PLE init block pointing to the pipeline methods

Fixes huggingface#45206

* Address review: move PLE details to model doc, shorten config docstring

Move the detailed PLE pipeline description from the config docstring
to the Gemma4 model documentation page. The config docstring now just
describes the parameter shape and links to the full docs.

* Address review nits: move edits to modular_gemma4.py, simplify gemma4.md

- Remove bold formatting and config params section from gemma4.md per review
- Move docstrings and PLE comment from modeling_gemma4.py to modular_gemma4.py
- Revert modeling_gemma4.py (CI regenerates it from modular)

* fix: run make fix-repo to align modeling_gemma4.py with modular_gemma4.py
@sirzechs66 sirzechs66 force-pushed the fix-transformers-issue branch from ba3ebfc to 170394a Compare April 18, 2026 07:52
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, ggml

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45118&sha=ae93f5

@sirzechs66
Contributor Author

@SunMarc please check these merged commits and the related CI errors.

@sirzechs66
Contributor Author

@SunMarc this PR is outdated and would pollute the commit history, so I am closing it.
New PR: #45506

@sirzechs66 sirzechs66 closed this Apr 18, 2026
Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GGUF model with architecture gpt-oss support