docs: Musicgen melody model card #38955

Open
AshAnand34 wants to merge 3 commits into huggingface:main from AshAnand34:musicgen-melody-model-card

Conversation

Contributor

@AshAnand34 AshAnand34 commented Jun 20, 2025

What does this PR do?

This pull request updates the documentation for the MusicGen Melody model in docs/source/en/model_doc/musicgen_melody.md. The changes aim to simplify and enhance the clarity of the documentation by restructuring the content, adding examples, and improving formatting.

Documentation Improvements

Overview and Structure

  • Reorganized the content to provide a concise overview of the MusicGen Melody model, highlighting its key features and differences from the original MusicGen.
  • Simplified the explanation of the model's architecture, breaking it down into three main components: text encoder, MusicGen Melody decoder, and audio decoder.

Examples and Usage

  • Replaced lengthy code snippets with streamlined examples for generating music using text and audio prompts, including text-only and unconditional generation scenarios.
  • Added detailed examples for using tools like Demucs for melody isolation and quantization techniques for memory optimization.

Formatting and Accessibility

  • Introduced a sidebar navigation and collapsible sections for better readability and user experience.
  • Updated links to external resources and added tooltips for key concepts like quantization and guidance scale.

#36979

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@stevhliu


@stevhliu stevhliu left a comment


Good start on this big doc!

Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen Melody. The mono channel versions generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), and each set of codebooks is decoded independently through the audio compression model. The audio streams for each channel are combined to give the final stereo output.
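The per-channel combination described above can be sketched with NumPy (an illustration only; `left` and `right` are hypothetical stand-ins for the waveforms decoded independently from each codebook set):

```python
import numpy as np

# Hypothetical stand-ins for the two independently decoded mono
# streams, one per codebook set (shape: (num_samples,)).
left = np.sin(np.linspace(0, 2 * np.pi, 8))
right = np.cos(np.linspace(0, 2 * np.pi, 8))

# Combine the two mono streams into one stereo signal of shape
# (2, num_samples): channel 0 = left, channel 1 = right.
stereo = np.stack([left, right], axis=0)
print(stereo.shape)  # (2, 8)
```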
# MusicGen Melody

[MusicGen Melody](https://huggingface.co/papers/2306.05284) is a single-stage, auto-regressive Transformer model designed for high-quality music generation, conditioned on both text and audio prompts. Unlike its predecessor, MusicGen Melody uses the audio prompt as a direct melodic guide, allowing for more precise control over the generated music.

Suggested change
[MusicGen Melody](https://huggingface.co/papers/2306.05284) is a single-stage, auto-regressive Transformer model designed for high-quality music generation, conditioned on both text and audio prompts. Unlike its predecessor, MusicGen Melody uses the audio prompt as a direct melodic guide, allowing for more precise control over the generated music.
[MusicGen Melody](https://huggingface.co/papers/2306.05284) builds on top of the [MusicGen](./musicgen) model by adding a melody-guided generation approach to enable more controllable audio generation. The model is conditioned on both input text and a chromagram, which better captures the harmonic and melodic features of music.
Unlike MusicGen, MusicGen Melody uses the audio prompt as a conditional signal for the generated audio sample, and the conditional text and audio signals are concatenated to the decoder's hidden states.
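As a rough illustration of what a chromagram captures (a sketch, not the model's actual feature extractor): a magnitude spectrogram is folded into 12 pitch classes by summing the energy of the frequency bins assigned to each class. The `bin_pitch_classes` mapping below is a toy assumption; a real mapping would be derived from the bin center frequencies.

```python
import numpy as np

def chromagram(spec, bin_pitch_classes):
    """Fold a (freq_bins, frames) magnitude spectrogram into a
    (12, frames) chromagram by summing bins per pitch class."""
    chroma = np.zeros((12, spec.shape[1]))
    for b, pc in enumerate(bin_pitch_classes):
        chroma[pc] += spec[b]
    return chroma

# Toy spectrogram: 24 frequency bins, 4 frames, with each bin
# assigned a pitch class (hypothetical round-robin mapping).
spec = np.ones((24, 4))
bin_pitch_classes = np.arange(24) % 12
print(chromagram(spec, bin_pitch_classes).shape)  # (12, 4)
```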


#### Audio Conditional Generation
You can find all the original [MusicGen Melody](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen) checkpoints on the Hugging Face Hub.

Suggested change
You can find all the original [MusicGen Melody](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen) checkpoints on the Hugging Face Hub.
You can find all the original MusicGen Melody checkpoints under the [AI at Meta](https://huggingface.co/facebook/models?search=musicgen-melody) organization.

Comment on lines +31 to +32
> [!TIP]
> Click on the MusicGen Melody models in the right sidebar for more examples of how to apply the model to various music generation tasks.

Suggested change
> [!TIP]
> Click on the MusicGen Melody models in the right sidebar for more examples of how to apply the model to various music generation tasks.
> [!TIP]
> This model was contributed by [ylacombe](https://huggingface.co/ylacombe).
>
> Click on the MusicGen Melody models in the right sidebar for more examples of how to apply the model to various music generation tasks.


In the following examples, we load an audio file using the 🤗 Datasets library, which can be installed with pip:
The example below demonstrates how to generate music conditioned on an audio melody and a text description using the [`AutoModel`] class.

Suggested change
The example below demonstrates how to generate music conditioned on an audio melody and a text description using the [`AutoModel`] class.
The example below demonstrates how to generate music with [`Pipeline`] or the [`AutoModel`] class.

Suggested change

The audio file we are about to use is loaded as follows:
```python
>>> from datasets import load_dataset
```

```python
import torch
from transformers import pipeline

pipeline = pipeline("text-to-audio", model="facebook/musicgen-melody", device=0, torch_dtype="auto")
pipeline("80s pop track with bassy drums and synth")
```

```python
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```
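For context, the `guidance_scale` argument used above controls classifier-free guidance. A minimal sketch of the combination step, assuming paired conditional and unconditional logits (an illustration, not the library's actual implementation):

```python
import numpy as np

def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    # Classifier-free guidance: move the unconditional logits toward
    # (and, for scales > 1, past) the conditional logits.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy logits for a 3-token vocabulary.
cond = np.array([2.0, 0.0, -1.0])
uncond = np.array([1.0, 0.5, -0.5])
print(apply_cfg(cond, uncond, 3.0))  # [ 4. -1. -2.]
```

A scale of 1.0 recovers the conditional logits unchanged; larger scales sharpen adherence to the prompt.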

**Unconditional Generation**

Suggested change
**Unconditional Generation**
- The example below demonstrates unconditional generation.

Comment on lines +142 to +143
**Generation Configuration**
You can inspect and update the model's generation configuration.

Suggested change
**Generation Configuration**
You can inspect and update the model's generation configuration.
- The generation config stores the default parameters that control the generation process such as sampling, guidance scale, and number of generated tokens.
Any arguments passed to the [`~GenerationMixin.generate`] method supersede the parameters in the generation config.
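The precedence rule can be sketched with a plain dictionary merge (an illustration with made-up values, not how transformers implements it):

```python
# Defaults stored in the model's generation config (illustrative values).
generation_config = {"do_sample": True, "guidance_scale": 3.0, "max_new_tokens": 1500}

# Keyword arguments passed directly to generate().
generate_kwargs = {"max_new_tokens": 256}

# Explicit generate() arguments supersede the stored defaults.
effective = {**generation_config, **generate_kwargs}
print(effective["max_new_tokens"])  # 256
```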

Comment on lines +155 to +158
### Other Information
- **Checkpoint Conversion**: Convert original checkpoints using the script at `src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py`.
- **`head_mask`**: The `head_mask` argument is only effective with `attn_implementation="eager"`.
- **Sampling**: For best results, use sampling (`do_sample=True`).

Suggested change
### Other Information
- **Checkpoint Conversion**: Convert original checkpoints using the script at `src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py`.
- **`head_mask`**: The `head_mask` argument is only effective with `attn_implementation="eager"`.
- **Sampling**: For best results, use sampling (`do_sample=True`).
- The `head_mask` argument is only effective with `attn_implementation="eager"`.
- For best results, set `do_sample=True`.


## Model Structure

Remove this section and replace it with the below. Remember, each code snippet should be indented under its list item.

Suggested change
## Model Structure
- [`MusicgenMelodyForCausalLM`] can be used as a standalone decoder model. Load it by specifying the correct config or accessing it through the `.decoder` attribute of [`MusicgenMelodyForConditionalGeneration`].
[`MusicgenMelodyForConditionalGeneration`] can be used as a composite model that includes the text and audio encoder.

```python
from transformers import MusicgenMelodyForConditionalGeneration

# Option 2: Access the decoder from the composite model
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
decoder = model.decoder
```

Add a few more notes:

- Ensure you're using a 32kHz checkpoint of the Encodec model because MusicGen was trained on it.
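A minimal sketch of bringing an audio prompt to a 32 kHz sampling rate with linear interpolation (an illustration only; in practice a dedicated resampler would be used):

```python
import numpy as np

def resample(audio, orig_sr, target_sr=32000):
    # Linear-interpolation resampling: map the target sample grid
    # back onto the original time axis.
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio)

audio_44k = np.zeros(44100)  # 1 second of silence at 44.1 kHz
print(len(resample(audio_44k, 44100)))  # 32000
```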

