Add HyperCLOVAX SEED Think 14B #44956
Conversation
Vendor the HyperCLOVAX Vision config into vLLM to fix transformers v5 compatibility. The upstream remote code config does not handle empty initialization (`text_config=None`), which breaks v5's `@strict` config validation added in huggingface/transformers#41250.

Fixes: vllm-project#38387

TODO: Remove vendored config once HyperCLOVAX is upstreamed to transformers. Tracking PR: huggingface/transformers#44956

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from b31ff44 to ef1e73f.
@zucchini-nlp, all CI checks have completed, except for one job that is still pending its status report.
bigshanedogg left a comment
This is a self-review of the key changes in this PR.
```python
attention_multiplier: float | None = None
residual_multiplier: float | None = None
embedding_multiplier: float | None = None
logits_scaling: float | None = None
```
These fields also exist in Granite, but are redefined here because the default values differ.
Although they are present in config.json, the dynamic default values set in post_init will not be applied unless the fields are explicitly declared here.
Following the modification noted in the comment below, these fields have since been removed, except for `attention_multiplier`.
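To make the post_init behavior being described concrete, here is a minimal standalone sketch. This is not the actual HyperCLOVAX config, and the `1/sqrt(head_dim)` fallback is an illustrative assumption:

```python
class MuPConfigSketch:
    """Toy config illustrating dynamic defaults applied in post_init."""

    def __init__(self, hidden_size=4096, num_attention_heads=32, attention_multiplier=None):
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        # The field must be declared here; otherwise post_init has nothing to fill in.
        self.attention_multiplier = attention_multiplier
        self.post_init()

    def post_init(self):
        # Dynamic default: only fires when the field was not explicitly set
        # (e.g., absent from config.json).
        if self.attention_multiplier is None:
            head_dim = self.hidden_size // self.num_attention_heads
            self.attention_multiplier = head_dim**-0.5
```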
```python
# Peri-Layer Normalization: additional RMSNorm after each sub-layer output
if self.use_post_norm:
    self.post_norm1 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    self.post_norm2 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
```
When `self.use_post_norm` is True, separate post-norms are declared for the attention and MLP outputs to match the Peri-LN structure.
Because of this branch on `self.use_post_norm`, the layer inherits from Granite instead of GLM4 (Granite's fields were also more similar).
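For illustration, a toy sketch of the Peri-LN residual path described here, assuming PyTorch ≥ 2.4 for `nn.RMSNorm` and using an `nn.Linear` stand-in for the real attention/MLP sub-layers:

```python
import torch
from torch import nn

class PeriLNBlockSketch(nn.Module):
    """Toy block: pre-norm -> sub-layer -> optional Peri-LN post-norm -> scaled residual."""

    def __init__(self, hidden_size: int, use_post_norm: bool, residual_multiplier: float = 1.0):
        super().__init__()
        self.use_post_norm = use_post_norm
        self.residual_multiplier = residual_multiplier
        self.input_norm = nn.RMSNorm(hidden_size)
        self.post_norm = nn.RMSNorm(hidden_size) if use_post_norm else None
        self.sublayer = nn.Linear(hidden_size, hidden_size)  # stand-in for attention or MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.sublayer(self.input_norm(x))      # standard pre-norm path
        if self.use_post_norm:
            h = self.post_norm(h)                  # Peri-LN: extra norm on the sub-layer output
        return x + h * self.residual_multiplier    # Granite-style MuP residual scaling
```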
zucchini-nlp left a comment
Great work on applying modular! I left a few comments on what can be deleted because it's already auto-resolved by modular.
Other than that, we're fine. After the comments are addressed, I will request a core maintainer review and we'll merge.
```python
hidden_states = outputs.last_hidden_state
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
# MuP: multiply logits by logits_scaling (cf. GraniteForCausalLM which divides)
logits = self.lm_head(hidden_states[:, slice_indices, :]) * self.config.logits_scaling
```
Can we adjust the scaling so we can copy fully? For example, in the config: `self.logits_scaling = 1 / self.logits_scaling`
Good idea!
However, I'm a bit concerned that storing the inverted value in `config.logits_scaling` could cause confusion,
since users inspecting config.json would see a different value than the one actually used in the forward pass.
Would it be okay to keep the explicit `* self.config.logits_scaling` in the forward for clarity, even if it means a small override?
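As a quick sanity check of the trade-off being discussed: the two conventions are numerically equivalent, so the question is purely about which value sits in the config (the value below is illustrative):

```python
# Multiplying by logits_scaling (HyperCLOVAX convention) equals Granite-style
# division by the inverted config value suggested by the reviewer.
logits_scaling = 0.25
raw_logit = 3.0

hyperclovax_style = raw_logit * logits_scaling    # explicit multiply in forward
granite_style = raw_logit / (1 / logits_scaling)  # divide by the inverted value
assert abs(hyperclovax_style - granite_style) < 1e-12
```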
run-slow: hyperclovax
This comment contains models: ["models/hyperclovax"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 6aa22bc to a0f82ba.
@zucchini-nlp, some of the failed tests appear to be outside the scope of this PR (e.g., …)
Force-pushed from a0f82ba to 9c3fd14.
```diff
@@ -0,0 +1,27 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
```
a few files left wrt 2026 😄
run-slow: hyperclovax
zucchini-nlp left a comment
Okay, seeing a bad rebase with unrelated diff 😄 and a tiny change in the RoPE doc. I will pass over the latest diff after the bad rebase is fixed, and a core maintainer will probably pass over soon.
Force-pushed from 331ed88 to 9600edb.
@zucchini-nlp, …
@bigshanedogg, one tiny unrelated diff left out. And vasqu will come to review next week :)
Force-pushed from 9600edb to d5a0472.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, hyperclovax
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44956&sha=d5a047
Sorry for all the delays, will be taking a look today!!
vasqu left a comment
Only some nits tbh, looks overall super good! Let's sync with main and fix up the last details 🤗
```python
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
```
Suggested change:
```diff
-    dtype=torch.bfloat16,
```
Shouldn't need this anymore; we use `dtype="auto"` by default nowadays.
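A sketch of the resulting load path under that suggestion. The model id is an assumption for illustration; substitute the actual checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"  # assumed id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# No explicit dtype: recent transformers defaults to dtype="auto", which reads
# the dtype stored in the checkpoint (bfloat16 here, per the snippet above).
model = AutoModelForCausalLM.from_pretrained(model_id)
```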
```python
    **model_inputs,
    max_new_tokens=200,
    tokenizer=tokenizer,
    stop_strings=["<|endofturn|>", "<|stop|>"],
```
Nit: might be nice to add this to the generation config instead, maybe.
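One possible shape for that nit, reusing `model`, `tokenizer`, and `model_inputs` from the excerpt above (a sketch, not the PR's code):

```python
# Register the stop strings once on the model's generation config, so each
# generate() call stays minimal and the defaults travel with the checkpoint.
model.generation_config.stop_strings = ["<|endofturn|>", "<|stop|>"]

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=200,
    tokenizer=tokenizer,  # still required so stop strings can be matched against text
)
```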
We changed this on main, so you don't need to manually add these here anymore. Just run `python utils/check_auto.py --fix_and_overwrite` for the auto mapping to register these (only for the configs).
| ("groupvit", "CLIPTokenizer" if is_tokenizers_available() else None), | ||
| ("herbert", "HerbertTokenizer" if is_tokenizers_available() else None), | ||
| ("hubert", "Wav2Vec2CTCTokenizer"), | ||
| ("hyperclovax", "TokenizersBackend" if is_tokenizers_available() else None), |
Suggested change:
```diff
-("hyperclovax", "TokenizersBackend" if is_tokenizers_available() else None),
```
Should not be needed; we auto-fallback to the tokenizers backend. Could you double-check?
```
HyperCLOVAX is a decoder-only transformer based on Granite with the following modifications:

- **Maximal Update Parametrization (MuP)**: uses per-config scaling factors
  (`attention_multiplier`, `residual_multiplier`, `embedding_multiplier`, `logits_scaling`)
  to enable stable training across model sizes.
- **Peri-Layer Normalization** (optional): applies an extra RMSNorm after each
  sub-layer output when `use_post_norm=True`.
```
Suggested change: remove the architecture description block above.
Nit: we don't really specify the architecture like this in the modular/modeling code - I think it suffices within the model_doc.
```diff
@@ -0,0 +1,225 @@
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
```
Suggested change:
```diff
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
```
```python
# Same as Granite — avoids edge cases with the causal_mask buffer during CPU offload
model_split_percents = [0.5, 0.7, 0.8]

_torch_compile_train_cls = HyperCLOVAXForCausalLM if is_torch_available() else None
```
Suggested change:
```diff
-_torch_compile_train_cls = HyperCLOVAXForCausalLM if is_torch_available() else None
```
Shouldn't be needed tbh, can you check?
```python
@unittest.skip(
    "In TP mode, Float8 quantization derives scales per shard rather than globally, "
    "so each TP rank observes different weight magnitudes than the full-weight non-TP "
    "baseline. HyperCLOVAX's Peri-Layer Normalization (post_norm1/post_norm2) amplifies "
    "this discrepancy past the 75% token-match threshold. Skipped pending an upstream fix."
)
@is_tensor_parallel_test
def test_tp_generation_quantized(self):
    pass
```
Interesting, cc @3outeille @SunMarc just for viz
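A tiny numeric illustration of the skip reason above. Absmax-style scale derivation is assumed here for simplicity; the point is only that per-shard scales differ from the global one:

```python
import torch

# Non-TP baseline: one scale derived from the full weight's max magnitude.
weight = torch.tensor([0.1, 0.5, 2.0, 4.0])
global_scale = weight.abs().max()

# TP mode: each rank sees only its shard, so it derives its own scale.
shard_scales = [shard.abs().max() for shard in weight.chunk(2)]

print(global_scale.item())               # 4.0
print([s.item() for s in shard_scales])  # [0.5, 4.0] -> different quantization grids
```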
```python
expected_slice = expected_slices.get_expectation().to(torch_device)
self.assertTrue(torch.allclose(out.logits[0, 0, :15].float(), expected_slice, atol=1e-2, rtol=1e-2))

@require_torch_large_accelerator
```
Suggested change:
```diff
-@require_torch_large_accelerator
```
```python
self.assertEqual(output_text, EXPECTED_TEXTS)

@require_torch_large_accelerator
```
Suggested change:
```diff
-@require_torch_large_accelerator
```
I don't think we need these anymore.
What does this PR do?
Adds native Transformers support for HyperCLOVA X SEED Think 14B, a 14.74B-parameter Korean reasoning LLM developed by NAVER Cloud.
Architecture
LLaMA-style decoder-only transformer with two modifications:

- **Peri-Layer Normalization** (optional, enabled via `use_post_norm`): an extra `RMSNorm` is applied after each sub-layer output (both attention and MLP), in addition to the standard pre-norm.
- **Maximal Update Parametrization (MuP)**: four per-config scaling factors (see the sketch after this list):
  - `attention_multiplier` — replaces `1/sqrt(head_dim)` in attention
  - `residual_multiplier` — scales each sub-layer output before adding to the residual stream
  - `embedding_multiplier` — scales the token embedding output
  - `logits_scaling` — scales final logits before softmax / sampling
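A toy sketch (assumed names, not the actual modeling code) of where each factor from the list above acts in a forward pass:

```python
def mup_forward_sketch(config, embed_tokens, decoder_layers, lm_head, input_ids):
    """Illustrative forward pass showing the placement of the four MuP factors."""
    hidden = embed_tokens(input_ids) * config.embedding_multiplier  # scaled embeddings
    for layer in decoder_layers:
        # Inside attention, scores use attention_multiplier in place of 1/sqrt(head_dim):
        #   scores = (q @ k.transpose(-2, -1)) * config.attention_multiplier
        hidden = hidden + layer(hidden) * config.residual_multiplier  # scaled residual add
    return lm_head(hidden) * config.logits_scaling  # scaled final logits
```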
Implementation approach

Following the maintainer's guidance in #44957, this PR uses the modular system (`modular_hyperclovax.py`) to minimise LOC and make the diff easy to review and iterate on. (Roughly 59% of lines are generated rather than manually maintained.)

The maintainer suggested inheriting the decoder layer with post-norms from GLM4. After evaluation, Granite was chosen as the decoder layer base instead, for the following reasons (a sketch of the resulting inheritance appears below):

- `use_post_norm` is optional (`False` by default). GLM4's decoder layer has post-norms always on — inheriting from it would require logic to conditionally disable `post_self_attn_layernorm`/`post_mlp_layernorm`, adding complexity rather than reducing it.
- Granite already provides `residual_multiplier` (always-active MuP). When `use_post_norm=False`, `HyperCLOVAXDecoderLayer` is identical to `GraniteDecoderLayer` — zero extra code.
- Inheriting from GLM4 would mean both adding `residual_multiplier` and conditionally disabling its built-in norms — two changes in opposite directions for no net gain in code reuse.

All other modules (RMSNorm, MLP, Attention, etc.) are inherited from Granite unchanged. The modular file is a few hundred LOC, as suggested.
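As a rough picture of the Granite-based inheritance, here is a hypothetical fragment reconstructed from the diff excerpts in the review threads above; the real `modular_hyperclovax.py` may differ in detail:

```python
from transformers.models.granite.modeling_granite import (
    GraniteDecoderLayer,
    GraniteRMSNorm,
)

class HyperCLOVAXDecoderLayer(GraniteDecoderLayer):
    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        # Peri-LN: extra post-norms only when enabled. With use_post_norm=False,
        # the layer is identical to GraniteDecoderLayer, which is the point of
        # choosing Granite as the base.
        if config.use_post_norm:
            self.post_norm1 = GraniteRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
            self.post_norm2 = GraniteRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
```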
Benchmark validation
External support
Code Agent Policy
A code agent was used for mechanical tasks such as aligning docstrings and comments. The core implementation was written directly by the submitter, who reviewed every changed line and personally ran the tests, including benchmark validation.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.