Add xcodec2 model #44178
Conversation

run-slow: xcodec2

This comment contains models: ["models/xcodec2"]
ebezzam left a comment
@eustlb a self-review for X-Codec2!
Main things:
- Unique feature extraction for DAC-like and SeamlessM4T-like input processing, as the model needs both padded audio and spectrogram inputs.
- New type of components in modular: Xcodec2FiniteScalarQuantization and Xcodec2ISTFTHead (similar to what we saw in the Vocos PR)
- Small tweaks/fixes for models that Xcodec2 depends on for modular
Draft model page: https://huggingface.co/bezzam/xcodec2
| main_input_name = "input_features"
| input_modalities = "audio"
| supports_gradient_checkpointing = True
| _no_split_modules = ["Wav2Vec2BertEncoderLayer"]
To allow loading with device_map="auto"
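A hedged usage sketch: declaring `_no_split_modules` lets accelerate place whole `Wav2Vec2BertEncoderLayer` blocks on a single device when sharding. The repo id below is the draft checkpoint linked in this PR; the final class/checkpoint may differ.

```python
from transformers import AutoModel

# shard the model across available devices without splitting encoder layers
model = AutoModel.from_pretrained("bezzam/xcodec2", device_map="auto")
```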
| @torch.no_grad()
| def _init_weights(self, module):
|     """Initialize the weights"""
|     super()._init_weights(module)
XCodec2 uses a pretrained checkpoint of Wav2Vec2-BERT, but Xcodec2's test_can_init_all_missing_weights test was failing because Embedding wasn't initialized. We can rely on the base _init_weights and also remove some of the initialization below.
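For illustration, a hedged sketch of the kind of default initialization the inherited `_init_weights` provides (this is an assumption about the typical behavior, not the exact transformers base implementation): common modules such as `nn.Embedding` get initialized, so the subclass only adds the `no_grad` guard.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def init_weights(module, initializer_range=0.02):
    # illustrative defaults: normal init for Linear/Embedding, identity for LayerNorm
    if isinstance(module, nn.Linear):
        module.weight.normal_(mean=0.0, std=initializer_range)
        if module.bias is not None:
            module.bias.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.normal_(mean=0.0, std=initializer_range)
        if module.padding_idx is not None:
            module.weight[module.padding_idx].zero_()
    elif isinstance(module, nn.LayerNorm):
        module.weight.fill_(1.0)
        module.bias.zero_()
```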
| class SnakeBeta(SnakeBeta):
|     pass
|
|
| class AntiAliasedActivation1d(AntiAliasedActivation1d):
|     pass
I thought just importing above would have been enough, but it wasn't generating the classes without this 🤔
| # Back to audio (ISTFT with "same" padding)
| time_frames = torch.fft.irfft(spectrogram_complex, self.n_fft, dim=1, norm="backward")
| time_frames = time_frames * self.window[None, :, None]
| num_frames = spectrogram_complex.shape[-1]
| output_size = (num_frames - 1) * self.hop_length + self.win_length
| audio = F.fold(
|     time_frames,
|     output_size=(1, output_size),
|     kernel_size=(1, self.win_length),
|     stride=(1, self.hop_length),
| )[:, 0, 0, self.padding : -self.padding]
torch.istft doesn't support the custom padding needed here for the integration tests to match the expected output.
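A hedged, standalone sketch of the overlap-add that `F.fold` performs above (not the Xcodec2 head itself; the window normalization step shown here is standard ISTFT practice and may be handled differently or elsewhere in the actual model):

```python
import torch
import torch.nn.functional as F


def overlap_add(time_frames, window, hop_length, padding):
    # time_frames: (batch, win_length, num_frames), already multiplied by the window
    batch_size, win_length, num_frames = time_frames.shape
    output_size = (num_frames - 1) * hop_length + win_length

    # sum the overlapping frames back into a 1D signal
    audio = F.fold(
        time_frames,
        output_size=(1, output_size),
        kernel_size=(1, win_length),
        stride=(1, hop_length),
    )  # (batch, 1, 1, output_size)

    # undo the analysis/synthesis windowing by dividing by the summed squared window
    window_sq = (window**2)[None, :, None].repeat(batch_size, 1, num_frames)
    norm = F.fold(
        window_sq,
        output_size=(1, output_size),
        kernel_size=(1, win_length),
        stride=(1, hop_length),
    ).clamp_min(1e-11)

    # crop `padding` samples on each side to mimic "same" padding (assumes padding > 0)
    return (audio / norm)[:, 0, 0, padding:-padding]
```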
| hidden_states = self.finite_scalar_quantization.bound(
|     hidden_states
| )  # For consistency with original checkpoint
| quantized_out, indices = self.finite_scalar_quantization(hidden_states)
Calling self.finite_scalar_quantization.bound here is a bit redundant, as it's also called within self.finite_scalar_quantization(hidden_states). But the original modeling did it, and it's needed to match the expected outputs.
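A hedged FSQ sketch (illustrative level scheme, not the Xcodec2 implementation): `bound()` squashes each channel into its level range, and `forward()` applies `bound()` again before rounding. In this sketch, bounding twice is not a no-op, which mirrors why the extra call above is kept to match the original checkpoint's outputs.

```python
import torch
import torch.nn as nn


class TinyFSQ(nn.Module):
    def __init__(self, levels=(8, 5, 5, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def bound(self, z):
        # squash each dimension into roughly [-(levels - 1) / 2, (levels - 1) / 2]
        half = (self.levels - 1) / 2
        return torch.tanh(z) * half

    def forward(self, z):
        z = self.bound(z)  # bound is applied inside forward as well
        codes = torch.round(z)  # snap to the integer grid
        codes = z + (codes - z).detach()  # straight-through estimator for gradients
        # flatten per-dimension codes into a single codebook index
        half = (self.levels - 1) / 2
        bases = torch.cumprod(
            torch.cat([torch.ones(1, device=self.levels.device), self.levels[:-1]]), dim=0
        )
        indices = ((codes + half) @ bases).long()
        return codes, indices
```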
| return hidden_states + residual
|
|
| class Xcodec2FiniteScalarQuantization(nn.Module):
| return codes, indices
|
|
| class Xcodec2ISTFTHead(nn.Module):
Similar to what we saw in the Vocos PR
| if is_torchdynamo_compiling():
|     synced_gpus = False
| else:
|     synced_gpus = is_deepspeed_zero3_enabled() or is_fsdp_managed_module(self)
for torch.compile support
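A hedged usage sketch: with the guard above, the compiled graph hard-codes `synced_gpus=False` instead of querying DeepSpeed/FSDP state while tracing. The checkpoint id is the draft repo from this PR, and compiling the whole model (rather than a specific entry point) is an assumption.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bezzam/xcodec2")
compiled_model = torch.compile(model)  # traces without data-dependent distributed checks
```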
| @slow
| @require_torch
| class Xcodec2IntegrationTest(unittest.TestCase):
|     """NOTE (ebezzam): PyPI model does not support batch inference."""
As noted on their model card, their HF checkpoint and the corresponding modeling code in their PyPI package don't support batch inference: https://huggingface.co/HKUSTAudio/xcodec2
They claim it's possible to use their GitHub code for batch inference as noted here, but I think most people use the checkpoint/code from the model card (it's unclear which checkpoint works with their batch-inference code on GitHub; apparently this is the script for batch inference, but no HF checkpoint is mentioned for it...).
In any case, I agree we should test batched inference. So I've created a reproducer that computes outputs by looping over samples, and a new test_batch_integration compares batched outputs from the Transformers implementation against each output of the reproducer.
This indeed needed a padding mask for the spectrogram to be returned by the feature extractor (still need to clean that up!).
PS: more context on the PyPI package and some unconventional things done in their modeling that needed to be moved to the feature extractor: #37868 (comment)
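A hedged sketch of the reproducer/test logic described above (the `encode()` entry point, its return shape, and how padding is trimmed are assumptions; the real test_batch_integration lives in the PR's test file):

```python
import torch


def check_batch_matches_loop(model, feature_extractor, audio_samples, sampling_rate=16000):
    # batched pass: all samples padded together by the feature extractor
    batch = feature_extractor(audio_samples, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        batched_codes = model.encode(**batch)

    # looped reproducer: one sample at a time, no padding involved
    for i, sample in enumerate(audio_samples):
        single = feature_extractor([sample], sampling_rate=sampling_rate, return_tensors="pt")
        with torch.no_grad():
            single_codes = model.encode(**single)
        # compare the i-th row of the batched output against the per-sample result,
        # trimming the frames that only exist because of batch padding
        num_frames = single_codes.shape[-1]
        torch.testing.assert_close(batched_codes[i, ..., :num_frames], single_codes[0])
```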
| semantic_output = self.semantic_model(audio_spectrogram, output_hidden_states=True)
| semantic_hidden_16 = semantic_output.hidden_states[16]
| semantic_hidden_16 = semantic_hidden_16.transpose(1, 2)
I've checked the training code there: both inference and training use layer 16 of Wav2Vec2Bert. Why don't we just set num_layers = 16 in the semantic_model_config and take the last hidden state?
We'd save memory and compute twice over: we don't need every hidden state stored via output_hidden_states, and we don't load and uselessly run layers 16..25.
very good point and idea!
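A hedged sketch of the suggestion (the checkpoint id and exact config wiring are examples, not the PR's conversion code): truncate the semantic encoder to 16 layers so its last_hidden_state is exactly what hidden_states[16] used to be.

```python
from transformers import AutoModel, Wav2Vec2BertConfig

semantic_config = Wav2Vec2BertConfig.from_pretrained("facebook/w2v-bert-2.0")
semantic_config.num_hidden_layers = 16  # the later layers are never used by Xcodec2
semantic_model = AutoModel.from_config(semantic_config)

# the forward pass then becomes:
# semantic_hidden = semantic_model(audio_spectrogram).last_hidden_state.transpose(1, 2)
```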
| if acoustic_hidden_states.shape[-1] != semantic_hidden_states.shape[-1]:
|     min_len = min(acoustic_hidden_states.shape[-1], semantic_hidden_states.shape[-1])
|     acoustic_hidden_states = acoustic_hidden_states[:, :, :min_len]
|     semantic_hidden_states = semantic_hidden_states[:, :, :min_len]
| def apply_weight_norm(self, legacy=True):
|     weight_norm = nn.utils.weight_norm
|     if hasattr(nn.utils.parametrizations, "weight_norm") and not legacy:
not sure why we have this legacy kwarg
ah brings me back to my first weeks at HF with DAC 😆
I agree that the flag could be removed and we can simply use nn.utils.weight_norm, as the conversion script uses legacy=True:
model.apply_weight_norm()
model = convert_state_dict(original_checkpoint, model)
model.remove_weight_norm()
Using this legacy flag was something I came up with during this PR, because the apply_weight_norm method of DAC wrongly assumed that the newer nn.utils.parametrizations.weight_norm should be used just because it exists.
This causes an issue when converting a checkpoint that used the legacy method, as the two APIs produce different weight-norm tensors in the state dict, hence the legacy flag.
ALTERNATIVELY, 9 months later haha, I think it's better to move apply_weight_norm and remove_weight_norm directly into the conversion script rather than having them in the modeling, since they are only really used when converting. What do you think?
I see. Since it could be used for training I would leave it, but here we're not enabling training for now anyway so let's indeed remove it 👍
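A hedged sketch of moving the weight-norm handling into the conversion script (the convert_state_dict helper and which submodules carry weight norm are placeholders taken from the snippet quoted above, not the actual repo code):

```python
import torch.nn as nn


def convert_checkpoint(original_state_dict, model, convert_state_dict):
    # legacy nn.utils.weight_norm stores weight_g / weight_v, matching the original
    # checkpoint, unlike the newer nn.utils.parametrizations.weight_norm
    for module in model.modules():
        if isinstance(module, (nn.Conv1d, nn.ConvTranspose1d)):
            nn.utils.weight_norm(module)

    model = convert_state_dict(original_state_dict, model)

    # fold the weight norm back into plain weights before saving
    for module in model.modules():
        if isinstance(module, (nn.Conv1d, nn.ConvTranspose1d)):
            try:
                nn.utils.remove_weight_norm(module)
            except ValueError:
                pass  # this conv never had weight norm applied
    return model
```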
| self.semantic_model = AutoModel.from_config(config.semantic_model_config).eval()
| self.semantic_adapter = Xcodec2SemanticAdapter(config)
| self.acoustic_encoder = Xcodec2Encoder(config)
| self.fc_encoder = nn.Linear(
|     config.hidden_size + config.semantic_model_config.hidden_size,
|     config.hidden_size + config.semantic_model_config.hidden_size,
| )
| self.quantizer = Xcodec2Quantizer(config)
| self.decoder = Xcodec2Decoder(config)
semantic_model → semantic_encoder
also, even if we don't have the semantic_decoder here since we're not enabling training, I would rename the decoder to semantic_decoder
Renamed to semantic_encoder. Did you mean renaming self.decoder to self.acoustic_decoder?
ah yes! semantic decoder my bad
| super().__init__(config)
|
| self.hop_length = config.hop_length
| self.semantic_model = AutoModel.from_config(config.semantic_model_config).eval()
The original had it, as the semantic encoder was meant to be frozen. But you're right that it's not necessary, and if someone wants to train (which isn't supported now), they could always freeze it themselves.
Removing it
eval() does not freeze weights, it just disables dropout, batchnorm, etc.
Note that AutoModel.from_config does not call .eval() by default, but .from_pretrained does.
If the semantic encoder is always frozen, which is the case here, what we should do:
- let's not put eval/train in the init
- use torch.no_grad() in the forward when running inference through it. This will save memory by not storing activations anymore.
When we want to freeze parameters, we'd do: self.semantic_model.requires_grad_(False)
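A hedged sketch of the freezing pattern described above (attribute names follow the PR's naming, but the surrounding module is simplified):

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class EncoderWithFrozenSemantics(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.semantic_encoder = AutoModel.from_config(config.semantic_model_config)
        # freeze parameters explicitly instead of relying on .eval()
        self.semantic_encoder.requires_grad_(False)

    def forward(self, audio_spectrogram):
        # no_grad avoids storing activations for the frozen encoder
        with torch.no_grad():
            semantic_hidden = self.semantic_encoder(audio_spectrogram).last_hidden_state
        return semantic_hidden.transpose(1, 2)
```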
…Only keep necessary layers of semantic encoder.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, dac, higgs_audio_v2_tokenizer, pe_audio, qwen2_5_omni, seamless_m4t, wav2vec2_bert, xcodec, xcodec2

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44178&sha=4c6f00
What does this PR do?
Re-opening #37868
TODO
Original checkpoint: https://huggingface.co/HKUSTAudio/xcodec2
Original modeling code: https://huggingface.co/HKUSTAudio/xcodec2/blob/main/modeling_xcodec2.py