Add HyperCLOVAX SEED Think 14B #44956
Conversation
Vendor the HyperCLOVAX Vision config into vLLM to fix transformers v5 compatibility. The upstream remote code config does not handle empty initialization (`text_config=None`), which breaks v5's `@strict` config validation added in huggingface/transformers#41250.

Fixes: vllm-project#38387

TODO: Remove vendored config once HyperCLOVAX is upstreamed to transformers. Tracking PR: huggingface/transformers#44956

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from b31ff44 to ef1e73f.
@zucchini-nlp, all CI checks have completed, except for one job that is still pending its status report.
bigshanedogg left a comment
This is a self-review of the key changes in this PR.
```python
attention_multiplier: float | None = None
residual_multiplier: float | None = None
embedding_multiplier: float | None = None
logits_scaling: float | None = None
```
These fields also exist in Granite, but are redefined here because the default values differ.
Although they are present in config.json, the dynamic default values set in post_init will not be applied unless the fields are explicitly declared here.
Following the modification noted in the comment below, these fields have since been removed, except for `attention_multiplier`.
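To make the post_init behavior being described concrete, here is a minimal standalone sketch. This is not the actual HyperCLOVAX config, and the `1/sqrt(head_dim)` fallback is an illustrative assumption:

```python
class MuPConfigSketch:
    """Toy config illustrating dynamic defaults applied in post_init."""

    def __init__(self, hidden_size=4096, num_attention_heads=32, attention_multiplier=None):
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        # The field must be declared here; otherwise post_init has nothing to fill in.
        self.attention_multiplier = attention_multiplier
        self.post_init()

    def post_init(self):
        # Dynamic default: only fires when the field was not explicitly set
        # (e.g., absent from config.json).
        if self.attention_multiplier is None:
            head_dim = self.hidden_size // self.num_attention_heads
            self.attention_multiplier = head_dim**-0.5
```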
```python
# Peri-Layer Normalization: additional RMSNorm after each sub-layer output
if self.use_post_norm:
    self.post_norm1 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    self.post_norm2 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
```
When `self.use_post_norm` is True, separate post-norms are declared for the attention and MLP outputs to match the Peri-LN structure.
Because of this branch on `self.use_post_norm`, the layer inherits from Granite instead of GLM4 (Granite's fields were also more similar).
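For illustration, a toy sketch of the Peri-LN residual path described here, assuming PyTorch ≥ 2.4 for `nn.RMSNorm` and using an `nn.Linear` stand-in for the real attention/MLP sub-layers:

```python
import torch
from torch import nn

class PeriLNBlockSketch(nn.Module):
    """Toy block: pre-norm -> sub-layer -> optional Peri-LN post-norm -> scaled residual."""

    def __init__(self, hidden_size: int, use_post_norm: bool, residual_multiplier: float = 1.0):
        super().__init__()
        self.use_post_norm = use_post_norm
        self.residual_multiplier = residual_multiplier
        self.input_norm = nn.RMSNorm(hidden_size)
        self.post_norm = nn.RMSNorm(hidden_size) if use_post_norm else None
        self.sublayer = nn.Linear(hidden_size, hidden_size)  # stand-in for attention or MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.sublayer(self.input_norm(x))      # standard pre-norm path
        if self.use_post_norm:
            h = self.post_norm(h)                  # Peri-LN: extra norm on the sub-layer output
        return x + h * self.residual_multiplier    # Granite-style MuP residual scaling
```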
zucchini-nlp left a comment
Great work on applying modular! I left a few comments on what can be deleted because it's already auto-resolved by modular.
Other than that, we're fine. After the comments are addressed, I will request a core maintainer review and we'll merge.
```python
hidden_states = outputs.last_hidden_state
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
# MuP: multiply logits by logits_scaling (cf. GraniteForCausalLM which divides)
logits = self.lm_head(hidden_states[:, slice_indices, :]) * self.config.logits_scaling
```
Can we adjust the scaling so we can copy fully? For example, in the config: `self.logits_scaling = 1 / self.logits_scaling`
Good idea!
However, I'm a bit concerned that storing the inverted value in `config.logits_scaling` could cause confusion,
since users inspecting config.json would see a different value than the one actually used in the forward pass.
Would it be okay to keep the explicit `* self.config.logits_scaling` in the forward for clarity, even if it means a small override?
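As a quick sanity check of the trade-off being discussed: the two conventions are numerically equivalent, so the question is purely about which value sits in the config (the value below is illustrative):

```python
# Multiplying by logits_scaling (HyperCLOVAX convention) equals Granite-style
# division by the inverted config value suggested by the reviewer.
logits_scaling = 0.25
raw_logit = 3.0

hyperclovax_style = raw_logit * logits_scaling    # explicit multiply in forward
granite_style = raw_logit / (1 / logits_scaling)  # divide by the inverted value
assert abs(hyperclovax_style - granite_style) < 1e-12
```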
run-slow: hyperclovax
This comment contains models: ["models/hyperclovax"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 6aa22bc to a0f82ba.
@zucchini-nlp, some of the failed tests appear to be outside the scope of this PR (e.g., …)
Force-pushed from a0f82ba to 9c3fd14.
```diff
@@ -0,0 +1,27 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
```
a few files left wrt 2026 😄
run-slow: hyperclovax
zucchini-nlp left a comment
Okay, seeing a bad rebase with unrelated diff 😄 and a tiny change in the RoPE doc. I will pass over the latest diff after the bad rebase is fixed, and a core maintainer will probably pass over soon.
Force-pushed from 331ed88 to 9600edb.
@zucchini-nlp, …
@bigshanedogg, one tiny unrelated diff left out. And vasqu will come to review next week :)
Force-pushed from 9600edb to d5a0472.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, hyperclovax
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44956&sha=d5a047
Sorry for all the delays, will be taking a look today!!
vasqu left a comment
Only some nits tbh, looks overall super good! Let's sync with main and fix up the last details 🤗
```python
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
```
Suggested change:
```diff
-    dtype=torch.bfloat16,
```
Shouldn't need this anymore; we use `dtype="auto"` by default nowadays.
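A sketch of the resulting load path under that suggestion. The model id is an assumption for illustration; substitute the actual checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"  # assumed id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# No explicit dtype: recent transformers defaults to dtype="auto", which reads
# the dtype stored in the checkpoint (bfloat16 here, per the snippet above).
model = AutoModelForCausalLM.from_pretrained(model_id)
```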
```python
    **model_inputs,
    max_new_tokens=200,
    tokenizer=tokenizer,
    stop_strings=["<|endofturn|>", "<|stop|>"],
```
Nit: might be nice to add this to the generation config instead, maybe.
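One possible shape for that nit, reusing `model`, `tokenizer`, and `model_inputs` from the excerpt above (a sketch, not the PR's code):

```python
# Register the stop strings once on the model's generation config, so each
# generate() call stays minimal and the defaults travel with the checkpoint.
model.generation_config.stop_strings = ["<|endofturn|>", "<|stop|>"]

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=200,
    tokenizer=tokenizer,  # still required so stop strings can be matched against text
)
```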
We changed this on main, so you don't need to manually add these here anymore. Just run `python utils/check_auto.py --fix_and_overwrite` for the auto mapping to register these (only for the configs).
| ("groupvit", "CLIPTokenizer" if is_tokenizers_available() else None), | ||
| ("herbert", "HerbertTokenizer" if is_tokenizers_available() else None), | ||
| ("hubert", "Wav2Vec2CTCTokenizer"), | ||
| ("hyperclovax", "TokenizersBackend" if is_tokenizers_available() else None), |
Suggested change:
```diff
-("hyperclovax", "TokenizersBackend" if is_tokenizers_available() else None),
```
Should not be needed; we auto-fallback to the tokenizers backend. Could you double-check?
```
HyperCLOVAX is a decoder-only transformer based on Granite with the following modifications:

- **Maximal Update Parametrization (MuP)**: uses per-config scaling factors
  (`attention_multiplier`, `residual_multiplier`, `embedding_multiplier`, `logits_scaling`)
  to enable stable training across model sizes.
- **Peri-Layer Normalization** (optional): applies an extra RMSNorm after each
  sub-layer output when `use_post_norm=True`.
```
Suggested change: remove the architecture description block above.
Nit: we don't really specify the architecture like this in the modular/modeling code - I think it suffices within the model_doc.
```diff
@@ -0,0 +1,225 @@
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
```
Suggested change:
```diff
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
```
```python
# Same as Granite — avoids edge cases with the causal_mask buffer during CPU offload
model_split_percents = [0.5, 0.7, 0.8]

_torch_compile_train_cls = HyperCLOVAXForCausalLM if is_torch_available() else None
```
Suggested change:
```diff
-_torch_compile_train_cls = HyperCLOVAXForCausalLM if is_torch_available() else None
```
Shouldn't be needed tbh, can you check?
```python
@unittest.skip(
    "In TP mode, Float8 quantization derives scales per shard rather than globally, "
    "so each TP rank observes different weight magnitudes than the full-weight non-TP "
    "baseline. HyperCLOVAX's Peri-Layer Normalization (post_norm1/post_norm2) amplifies "
    "this discrepancy past the 75% token-match threshold. Skipped pending an upstream fix."
)
@is_tensor_parallel_test
def test_tp_generation_quantized(self):
    pass
```
Interesting, cc @3outeille @SunMarc just for viz
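A tiny numeric illustration of the skip reason above. Absmax-style scale derivation is assumed here for simplicity; the point is only that per-shard scales differ from the global one:

```python
import torch

# Non-TP baseline: one scale derived from the full weight's max magnitude.
weight = torch.tensor([0.1, 0.5, 2.0, 4.0])
global_scale = weight.abs().max()

# TP mode: each rank sees only its shard, so it derives its own scale.
shard_scales = [shard.abs().max() for shard in weight.chunk(2)]

print(global_scale.item())               # 4.0
print([s.item() for s in shard_scales])  # [0.5, 4.0] -> different quantization grids
```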
```python
expected_slice = expected_slices.get_expectation().to(torch_device)
self.assertTrue(torch.allclose(out.logits[0, 0, :15].float(), expected_slice, atol=1e-2, rtol=1e-2))

@require_torch_large_accelerator
```
Suggested change:
```diff
-@require_torch_large_accelerator
```
```python
self.assertEqual(output_text, EXPECTED_TEXTS)

@require_torch_large_accelerator
```
Suggested change:
```diff
-@require_torch_large_accelerator
```
I don't think we need these anymore.
What does this PR do?
Adds native Transformers support for HyperCLOVA X SEED Think 14B, a 14.74B-parameter Korean reasoning LLM developed by NAVER Cloud.
Architecture
LLaMA-style decoder-only transformer with two modifications:

- **Peri-Layer Normalization** (optional, enabled via `use_post_norm`): an extra `RMSNorm` is applied after each sub-layer output (both attention and MLP), in addition to the standard pre-norm.
- **Maximal Update Parametrization (MuP)**: four per-config scaling factors (see the sketch after this list):
  - `attention_multiplier` — replaces `1/sqrt(head_dim)` in attention
  - `residual_multiplier` — scales each sub-layer output before adding to the residual stream
  - `embedding_multiplier` — scales the token embedding output
  - `logits_scaling` — scales final logits before softmax / sampling
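A toy sketch (assumed names, not the actual modeling code) of where each factor from the list above acts in a forward pass:

```python
def mup_forward_sketch(config, embed_tokens, decoder_layers, lm_head, input_ids):
    """Illustrative forward pass showing the placement of the four MuP factors."""
    hidden = embed_tokens(input_ids) * config.embedding_multiplier  # scaled embeddings
    for layer in decoder_layers:
        # Inside attention, scores use attention_multiplier in place of 1/sqrt(head_dim):
        #   scores = (q @ k.transpose(-2, -1)) * config.attention_multiplier
        hidden = hidden + layer(hidden) * config.residual_multiplier  # scaled residual add
    return lm_head(hidden) * config.logits_scaling  # scaled final logits
```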
Implementation approach

Following the maintainer's guidance in #44957, this PR uses the modular system (`modular_hyperclovax.py`) to minimise LOC and make the diff easy to review and iterate on. (Roughly 59% of lines are generated rather than manually maintained.)

The maintainer suggested inheriting the decoder layer with post-norms from GLM4. After evaluation, Granite was chosen as the decoder layer base instead, for the following reasons (a sketch of the resulting inheritance appears below):

- `use_post_norm` is optional (`False` by default). GLM4's decoder layer has post-norms always on — inheriting from it would require logic to conditionally disable `post_self_attn_layernorm`/`post_mlp_layernorm`, adding complexity rather than reducing it.
- Granite already provides `residual_multiplier` (always-active MuP). When `use_post_norm=False`, `HyperCLOVAXDecoderLayer` is identical to `GraniteDecoderLayer` — zero extra code.
- Inheriting from GLM4 would mean both adding `residual_multiplier` and conditionally disabling its built-in norms — two changes in opposite directions for no net gain in code reuse.

All other modules (RMSNorm, MLP, Attention, etc.) are inherited from Granite unchanged. The modular file is a few hundred LOC, as suggested.
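As a rough picture of the Granite-based inheritance, here is a hypothetical fragment reconstructed from the diff excerpts in the review threads above; the real `modular_hyperclovax.py` may differ in detail:

```python
from transformers.models.granite.modeling_granite import (
    GraniteDecoderLayer,
    GraniteRMSNorm,
)

class HyperCLOVAXDecoderLayer(GraniteDecoderLayer):
    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        # Peri-LN: extra post-norms only when enabled. With use_post_norm=False,
        # the layer is identical to GraniteDecoderLayer, which is the point of
        # choosing Granite as the base.
        if config.use_post_norm:
            self.post_norm1 = GraniteRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
            self.post_norm2 = GraniteRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
```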
Benchmark validation
External support
Code Agent Policy
A code agent was used for mechanical tasks such as aligning docstrings and comments. The core implementation was written directly by the submitter, who reviewed every changed line and personally ran the tests, including benchmark validation.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.