
Updating model card for wav2vec2#38956

Open
AshAnand34 wants to merge 2 commits into huggingface:main from AshAnand34:wav2vec2-model-card

Conversation

@AshAnand34
Contributor

@AshAnand34 AshAnand34 commented Jun 20, 2025

What does this PR do?

This pull request updates the wav2vec2 documentation to improve readability, enhance usability with examples, and reorganize the structure for better navigation. Key changes include the addition of usage examples, restructuring of the API reference, and updates to the model overview.

Documentation Enhancements:

  • Replaced the abstract with a concise explanation of Wav2Vec2 and added links to Hugging Face Hub for checkpoints and examples.
  • Added Python code examples demonstrating how to use the Wav2Vec2 model for automatic speech recognition and audio classification.
  • Included guidance on using Flash Attention 2 for faster inference and quantization techniques for memory optimization.

Structural Improvements:

  • Reorganized the API reference section by converting model components (Wav2Vec2Config, Wav2Vec2CTCTokenizer, etc.) into subsections for better navigation.

These changes make the documentation more user-friendly and accessible, especially for developers new to Wav2Vec2.

#36979

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Member

@stevhliu stevhliu left a comment

Thanks!

Please apply some of these same changes (such as changing the API reference) to your other PRs 🤗

# Wav2Vec2

The Wav2Vec2 model was proposed in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://huggingface.co/papers/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
[Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework for speech representations that masks speech input in the latent space and solves a contrastive task over quantized latent representations. It's like having a speech expert that learns the patterns of human speech from raw audio alone, then fine-tunes on transcribed speech to achieve remarkable accuracy.
Suggested change
[Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework for speech representations that masks speech input in the latent space and solves a contrastive task over quantized latent representations. It's like having a speech expert that learns the patterns of human speech from raw audio alone, then fine-tunes on transcribed speech to achieve remarkable accuracy.
[Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework that makes pretraining audio models easier and more efficient. It is pretrained on large amounts of unlabeled data where certain chunks of audio are masked and the model must predict it. The model is fine-tuned on a much smaller labeled dataset and demonstrates it can still achieve competitive results.


The abstract from the paper is the following:
You can find all the original [Wav2Vec2](https://huggingface.co/models?search=wav2vec2) checkpoints on the Hugging Face Hub.
Suggested change
You can find all the original [Wav2Vec2](https://huggingface.co/models?search=wav2vec2) checkpoints on the Hugging Face Hub.
You can find all the original Wav2Vec2 checkpoints in the [Wav2Vec2.0](https://huggingface.co/collections/facebook/wav2vec-20-651e865258e3dee2586c89f5) collection.

Comment on lines +34 to +35
> [!TIP]
> Click on the Wav2Vec2 models in the right sidebar for more examples of how to apply Wav2Vec2 to different speech recognition and audio classification tasks.

Suggested change
> [!TIP]
> Click on the Wav2Vec2 models in the right sidebar for more examples of how to apply Wav2Vec2 to different speech recognition and audio classification tasks.
> [!TIP]
> This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
>
> Click on the Wav2Vec2 models in the right sidebar for more examples of how to apply Wav2Vec2 to different speech recognition and audio classification tasks.


## Usage tips
```python
import torch
from transformers import pipeline

pipeline = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0, torch_dtype="auto")
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```

## Using Flash Attention 2
</hfoption>
<hfoption id="AutoModel">

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForCTC

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h", torch_dtype="auto", device_map="auto", attn_implementation="sdpa")

inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

transcription = processor.batch_decode(predicted_ids)
transcription[0]
```

</hfoptions>

To load a model using Flash Attention 2, we can pass the argument `attn_implementation="flash_attention_2"` to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
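As a rough, model-agnostic illustration of why half precision lowers memory usage (a minimal sketch, not specific to Wav2Vec2): a `torch.float16` tensor stores each element in 2 bytes instead of `torch.float32`'s 4, halving the memory footprint of the weights.

```python
import torch

# Compare the memory footprint of the same weights in float32 vs float16.
w32 = torch.zeros(1000, 1000, dtype=torch.float32)
w16 = w32.to(torch.float16)

print(w32.element_size() * w32.nelement())  # 4000000 bytes
print(w16.element_size() * w16.nelement())  # 2000000 bytes
```

The same ratio applies to a full model's parameters, which is why `torch.float16` (or `bfloat16`) is commonly paired with Flash Attention 2 for inference.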
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

No need to include a quantization example since the model isn't that large


### Expected speedups
## Notes

You can remove all the notes except for:

- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- If you have a `head_mask`, you must use the eager attention implementation for it to work.

   ```py
   from transformers import AutoModelForCTC

   model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h", attn_implementation="eager")
   ```

## Resources

Remove all the resources except for links to external ones (ie, not guides or scripts you can find in the docs)

- A blog post on how to deploy Wav2Vec2 for [Automatic Speech Recognition with Hugging Face's Transformers & Amazon SageMaker](https://www.philschmid.de/automatic-speech-recognition-sagemaker).

## Wav2Vec2Config
---

You don't need to change any of the API references

@AshAnand34
Contributor Author

Thanks!

Please apply some of these same changes (such as changing the API reference) to your other PRs 🤗

I am out of town at the moment. I will make the changes when I am back.
