Add full GGUF loading support for GPT‑OSS (fixes #43366, supersedes #43757) #45118
sirzechs66 wants to merge 128 commits into huggingface:main from
Conversation
force-pushed from 10463a1 to a5eef4f
@SunMarc, @Cyrilvallez please feel free to look into this, as I am getting failing tests from files I did not edit.
The branch was behind, so I merged main in, and it merged without any conflicts. Please review, @SunMarc, and let me know if any more changes are required. Please refer to the final commit for review.
@SunMarc @Cyrilvallez @ArthurZucker Please review, as the requested changes have been made.

@bot /style
Style fix bot fixed some files and pushed the changes. |
Head branch was pushed to by a user without write access
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@SunMarc please review; there are some unexpected errors in `test_processors` in the CI tests that made the merge fail.
force-pushed from ba3ebfc to 170394a
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, ggml

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45118&sha=ae93f5
@SunMarc please verify these merges and the related CI errors.
What does this PR do?
This PR adds full GGUF loading support for GPT‑OSS models (20B/120B). It allows Transformers (and consequently vLLM) to directly load GPT‑OSS GGUF files without falling back to a wrong architecture. The changes include:
- A `GptOssTensorProcessor` to handle MoE expert splitting and gate/up interleaving.
- Reconstruction of `rope_scaling` (YaRN) from flat GGUF metadata.

Fixes #43366, supersedes #43757.
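As a rough illustration of what the gate/up de-interleaving step looks like: GGUF checkpoints can store a fused gate-up projection as a single tensor, which the tensor processor must split back into separate matrices. The alternating-row layout below is an assumption for the sketch, not necessarily the exact layout GPT-OSS GGUF files use, and `split_gate_up_interleaved` is a hypothetical helper, not the actual `GptOssTensorProcessor` API.

```python
import numpy as np

def split_gate_up_interleaved(fused: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Assumed layout: rows alternate gate, up, gate, up, ... along axis 0.
    # Even rows form the gate projection, odd rows the up projection.
    gate = fused[0::2]
    up = fused[1::2]
    return gate, up

# Toy fused tensor: 4 interleaved rows -> 2 gate rows + 2 up rows.
fused = np.arange(12, dtype=np.float32).reshape(4, 3)
gate, up = split_gate_up_interleaved(fused)
```

The real processor additionally has to split the stacked expert tensors into per-expert weights; the same slicing idea applies along the expert axis.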
Related vLLM issue: vllm-project/vllm#22353
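For the `rope_scaling` reconstruction, GGUF metadata is a flat key-value store, while the Transformers config expects a nested dict. A minimal sketch of collecting flat keys back into a nested dict follows; the key prefix and field names here are illustrative, not the actual GGUF key names used by this PR.

```python
def rebuild_rope_scaling(metadata: dict, prefix: str = "gpt-oss.rope.scaling.") -> dict:
    # Gather every flat `<prefix><field>` entry into one nested dict,
    # stripping the prefix so the fields match config.rope_scaling keys.
    return {
        key[len(prefix):]: value
        for key, value in metadata.items()
        if key.startswith(prefix)
    }

# Hypothetical flat GGUF metadata; unrelated keys are left untouched.
flat_metadata = {
    "gpt-oss.rope.scaling.type": "yarn",
    "gpt-oss.rope.scaling.factor": 32.0,
    "gpt-oss.context_length": 131072,
}
rope_scaling = rebuild_rope_scaling(flat_metadata)
```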
Code Agent Policy
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Tagging: