docs: Musicgen melody model card #38955

Open
AshAnand34 wants to merge 3 commits into huggingface:main from AshAnand34:musicgen-melody-model-card

Conversation

Contributor

@AshAnand34 AshAnand34 commented Jun 20, 2025

What does this PR do?

This pull request updates the documentation for the MusicGen Melody model in docs/source/en/model_doc/musicgen_melody.md. The changes aim to simplify and enhance the clarity of the documentation by restructuring the content, adding examples, and improving formatting.

Documentation Improvements

Overview and Structure

  • Reorganized the content to provide a concise overview of the MusicGen Melody model, highlighting its key features and differences from the original MusicGen.
  • Simplified the explanation of the model's architecture, breaking it down into three main components: text encoder, MusicGen Melody decoder, and audio decoder.

Examples and Usage

  • Replaced lengthy code snippets with streamlined examples for generating music using text and audio prompts, including text-only and unconditional generation scenarios.
  • Added detailed examples for using tools like Demucs for melody isolation and quantization techniques for memory optimization.

Formatting and Accessibility

  • Introduced a sidebar navigation and collapsible sections for better readability and user experience.
  • Updated links to external resources and added tooltips for key concepts like quantization and guidance scale.

#36979

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@stevhliu


@stevhliu stevhliu left a comment


Good start on this big doc!

Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen Melody. The mono channel versions generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), and each set of codebooks is decoded independently through the audio compression model. The audio streams for each channel are combined to give the final stereo output.
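The per-channel combination described above can be sketched with NumPy (an illustration only; `left` and `right` are hypothetical stand-ins for the waveforms decoded independently from each codebook set):

```python
import numpy as np

# Hypothetical stand-ins for the two independently decoded mono
# streams, one per codebook set (shape: (num_samples,)).
left = np.sin(np.linspace(0, 2 * np.pi, 8))
right = np.cos(np.linspace(0, 2 * np.pi, 8))

# Combine the two mono streams into one stereo signal of shape
# (2, num_samples): channel 0 = left, channel 1 = right.
stereo = np.stack([left, right], axis=0)
print(stereo.shape)  # (2, 8)
```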
# MusicGen Melody

[MusicGen Melody](https://huggingface.co/papers/2306.05284) is a single-stage, auto-regressive Transformer model designed for high-quality music generation, conditioned on both text and audio prompts. Unlike its predecessor, MusicGen Melody uses the audio prompt as a direct melodic guide, allowing for more precise control over the generated music.

Suggested change
[MusicGen Melody](https://huggingface.co/papers/2306.05284) is a single-stage, auto-regressive Transformer model designed for high-quality music generation, conditioned on both text and audio prompts. Unlike its predecessor, MusicGen Melody uses the audio prompt as a direct melodic guide, allowing for more precise control over the generated music.
[MusicGen Melody](https://huggingface.co/papers/2306.05284) builds on top of the [MusicGen](./musicgen) model by adding a melody-guided generation approach to enable more controllable audio generation. The model is conditioned on both input text and a chromagram, which better captures the harmonic and melodic features of music.
Unlike MusicGen, MusicGen Melody uses the audio prompt as a conditional signal for the generated audio sample, and the conditional text and audio signals are concatenated to the decoder's hidden states.
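As a rough illustration of what a chromagram captures (a sketch, not the model's actual feature extractor): a magnitude spectrogram is folded into 12 pitch classes by summing the energy of the frequency bins assigned to each class. The `bin_pitch_classes` mapping below is a toy assumption; a real mapping would be derived from the bin center frequencies.

```python
import numpy as np

def chromagram(spec, bin_pitch_classes):
    """Fold a (freq_bins, frames) magnitude spectrogram into a
    (12, frames) chromagram by summing bins per pitch class."""
    chroma = np.zeros((12, spec.shape[1]))
    for b, pc in enumerate(bin_pitch_classes):
        chroma[pc] += spec[b]
    return chroma

# Toy spectrogram: 24 frequency bins, 4 frames, with each bin
# assigned a pitch class (hypothetical round-robin mapping).
spec = np.ones((24, 4))
bin_pitch_classes = np.arange(24) % 12
print(chromagram(spec, bin_pitch_classes).shape)  # (12, 4)
```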


#### Audio Conditional Generation
You can find all the original [MusicGen Melody](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen) checkpoints on the Hugging Face Hub.

Suggested change
You can find all the original [MusicGen Melody](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen) checkpoints on the Hugging Face Hub.
You can find all the original MusicGen Melody checkpoints under the [AI at Meta](https://huggingface.co/facebook/models?search=musicgen-melody) organization.

Comment on lines +31 to +32
> [!TIP]
> Click on the MusicGen Melody models in the right sidebar for more examples of how to apply the model to various music generation tasks.

Suggested change
> [!TIP]
> Click on the MusicGen Melody models in the right sidebar for more examples of how to apply the model to various music generation tasks.
> [!TIP]
> This model was contributed by [ylacombe](https://huggingface.co/ylacombe).
>
> Click on the MusicGen Melody models in the right sidebar for more examples of how to apply the model to various music generation tasks.


In the following examples, we load an audio file using the 🤗 Datasets library, which can be installed with pip:
The example below demonstrates how to generate music conditioned on an audio melody and a text description using the [`AutoModel`] class.

Suggested change
The example below demonstrates how to generate music conditioned on an audio melody and a text description using the [`AutoModel`] class.
The example below demonstrates how to generate music with [`Pipeline`] or the [`AutoModel`] class.

Suggested change

The audio file we are about to use is loaded as follows:
```python
>>> from datasets import load_dataset
```

```python
import torch
from transformers import pipeline

pipeline = pipeline("text-to-audio", model="facebook/musicgen-melody", device=0, torch_dtype="auto")
pipeline("80s pop track with bassy drums and synth")
```

```python
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```
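For context, the `guidance_scale` argument used above controls classifier-free guidance. A minimal sketch of the combination step, assuming paired conditional and unconditional logits (an illustration, not the library's actual implementation):

```python
import numpy as np

def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    # Classifier-free guidance: move the unconditional logits toward
    # (and, for scales > 1, past) the conditional logits.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy logits for a 3-token vocabulary.
cond = np.array([2.0, 0.0, -1.0])
uncond = np.array([1.0, 0.5, -0.5])
print(apply_cfg(cond, uncond, 3.0))  # [ 4. -1. -2.]
```

A scale of 1.0 recovers the conditional logits unchanged; larger scales sharpen adherence to the prompt.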

**Unconditional Generation**

Suggested change
**Unconditional Generation**
- The example below demonstrates unconditional generation.

Comment on lines +142 to +143
**Generation Configuration**
You can inspect and update the model's generation configuration.

Suggested change
**Generation Configuration**
You can inspect and update the model's generation configuration.
- The generation config stores the default parameters that control the generation process such as sampling, guidance scale, and number of generated tokens.
Any arguments passed to the [`~GenerationMixin.generate`] method supersede the parameters in the generation config.
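The precedence rule can be sketched with a plain dictionary merge (an illustration with made-up values, not how transformers implements it):

```python
# Defaults stored in the model's generation config (illustrative values).
generation_config = {"do_sample": True, "guidance_scale": 3.0, "max_new_tokens": 1500}

# Keyword arguments passed directly to generate().
generate_kwargs = {"max_new_tokens": 256}

# Explicit generate() arguments supersede the stored defaults.
effective = {**generation_config, **generate_kwargs}
print(effective["max_new_tokens"])  # 256
```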

Comment on lines +155 to +158
### Other Information
- **Checkpoint Conversion**: Convert original checkpoints using the script at `src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py`.
- **`head_mask`**: The `head_mask` argument is only effective with `attn_implementation="eager"`.
- **Sampling**: For best results, use sampling (`do_sample=True`).

Suggested change
### Other Information
- **Checkpoint Conversion**: Convert original checkpoints using the script at `src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py`.
- **`head_mask`**: The `head_mask` argument is only effective with `attn_implementation="eager"`.
- **Sampling**: For best results, use sampling (`do_sample=True`).
- The `head_mask` argument is only effective with `attn_implementation="eager"`.
- For best results, set `do_sample=True`.


## Model Structure

Remove this section and replace it with the below. Remember, each code snippet should be indented under its list item.

Suggested change
## Model Structure
- [`MusicgenMelodyForCausalLM`] can be used as a standalone decoder model. Load it by specifying the correct config or accessing it through the `.decoder` attribute of [`MusicgenMelodyForConditionalGeneration`].
[`MusicgenMelodyForConditionalGeneration`] can be used as a composite model that includes the text and audio encoder.

```python
from transformers import MusicgenMelodyForConditionalGeneration

# Option 2: Access the decoder from the composite model
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
decoder = model.decoder
```

Add a few more notes:

- Ensure you're using a 32kHz checkpoint of the Encodec model because MusicGen was trained on it.
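A minimal sketch of bringing an audio prompt to a 32 kHz sampling rate with linear interpolation (an illustration only; in practice a dedicated resampler would be used):

```python
import numpy as np

def resample(audio, orig_sr, target_sr=32000):
    # Linear-interpolation resampling: map the target sample grid
    # back onto the original time axis.
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio)

audio_44k = np.zeros(44100)  # 1 second of silence at 44.1 kHz
print(len(resample(audio_44k, 44100)))  # 32000
```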

