Updating model card for wav2vec2 #38956
Conversation
stevhliu
left a comment
Thanks!
Please apply some of these same changes (such as changing the API reference) to your other PRs 🤗
```diff
  # Wav2Vec2
- The Wav2Vec2 model was proposed in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://huggingface.co/papers/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
+ [Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework for speech representations that masks speech input in the latent space and solves a contrastive task over quantized latent representations. It's like having a speech expert that learns the patterns of human speech from raw audio alone, then fine-tunes on transcribed speech to achieve remarkable accuracy.
```
```diff
- [Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework for speech representations that masks speech input in the latent space and solves a contrastive task over quantized latent representations. It's like having a speech expert that learns the patterns of human speech from raw audio alone, then fine-tunes on transcribed speech to achieve remarkable accuracy.
+ [Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework that makes pretraining audio models easier and more efficient. It is pretrained on large amounts of unlabeled data where certain chunks of audio are masked and the model must predict them. The model is then fine-tuned on a much smaller labeled dataset and still achieves competitive results.
```
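The masking idea in the suggested wording can be sketched in a few lines of plain Python: sample span-start positions over the latent time steps, then mask a fixed-length span from each start. The sequence length, masking probability, and span length below are assumptions chosen for illustration only, not the library's implementation.

```python
import random

# Toy span masking over a sequence of latent speech frames.
# All numbers here are illustrative assumptions.
random.seed(0)
seq_len = 100     # number of latent time steps
mask_prob = 0.065  # probability that a time step is a span start
span_len = 10      # consecutive steps masked per start

num_starts = max(1, round(seq_len * mask_prob))
starts = random.sample(range(seq_len - span_len + 1), num_starts)

mask = [False] * seq_len
for s in starts:
    for t in range(s, s + span_len):
        mask[t] = True  # spans may overlap, so masked count <= num_starts * span_len

print(sum(mask))  # number of masked time steps
```

During pretraining, the model would have to predict (via the contrastive task) the quantized latents at the `True` positions from the surrounding context.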
```diff
- The abstract from the paper is the following:
+ You can find all the original [Wav2Vec2](https://huggingface.co/models?search=wav2vec2) checkpoints on the Hugging Face Hub.
```
```diff
- You can find all the original [Wav2Vec2](https://huggingface.co/models?search=wav2vec2) checkpoints on the Hugging Face Hub.
+ You can find all the original Wav2Vec2 checkpoints in the [Wav2Vec2.0](https://huggingface.co/collections/facebook/wav2vec-20-651e865258e3dee2586c89f5) collection.
```
```md
> [!TIP]
> Click on the Wav2Vec2 models in the right sidebar for more examples of how to apply Wav2Vec2 to different speech recognition and audio classification tasks.
```
```diff
  > [!TIP]
+ > This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
+ >
  > Click on the Wav2Vec2 models in the right sidebar for more examples of how to apply Wav2Vec2 to different speech recognition and audio classification tasks.
```
```md
## Usage tips
```

```python
import torch
from transformers import pipeline

pipeline = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0, torch_dtype="auto")
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```
```md
## Using Flash Attention 2
</hfoption>
<hfoption id="AutoModel">
```
```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForCTC

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h", torch_dtype="auto", device_map="auto", attn_implementation="sdpa")

inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
transcription[0]
```
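For readers wondering what `batch_decode` does with the argmax ids in the snippet above: for a CTC model, decoding collapses repeated ids and then drops blank tokens. A minimal sketch of that greedy collapse step, using made-up token ids and a made-up blank id:

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse a frame-level greedy CTC path into output tokens.

    Repeated ids are merged first, then blanks are dropped -- the core of
    greedy CTC decoding after argmax over the logits. The ids and blank_id
    here are illustrative assumptions, not Wav2Vec2's actual vocabulary.
    """
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Frames: blank, c, c, blank, a, a, a, t, blank (as made-up integer ids)
print(ctc_greedy_collapse([0, 3, 3, 0, 1, 1, 1, 20, 0]))  # -> [3, 1, 20]
```

Note that merging repeats before dropping blanks is what lets CTC emit genuinely doubled characters: a blank between two identical ids keeps them separate.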
```diff
- To load a model using Flash Attention 2, we can pass the argument `attn_implementation="flash_attention_2"` to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
+ Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
```
No need to include a quantization example since the model isn't that large
```diff
- ### Expected speedups
+ ## Notes
```
You can remove all the notes except for:
- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- If you have a `head_mask`, you must use the eager attention implementation for it to work.
```py
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h", attn_implementation="eager")
```
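As an illustration of the first note kept above — the model consuming a float array of the raw waveform — here is a synthetic mono float32 signal of the shape the processor expects; Wav2Vec2 checkpoints such as `facebook/wav2vec2-base-960h` assume 16 kHz audio, and the 440 Hz tone is just a stand-in for real speech:

```python
import numpy as np

# Synthetic stand-in for a recording: 1 second of mono audio at 16 kHz.
# The tone frequency and amplitude are assumptions for illustration.
sampling_rate = 16_000
duration_s = 1.0
t = np.arange(int(sampling_rate * duration_s)) / sampling_rate
waveform = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# This 1-D float32 array is the kind of input passed to the processor, e.g.
# processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
```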
```md
## Resources
```
Remove all the resources except for links to external ones (i.e., not guides or scripts you can find in the docs)
```md
- A blog post on how to deploy Wav2Vec2 for [Automatic Speech Recognition with Hugging Face's Transformers & Amazon SageMaker](https://www.philschmid.de/automatic-speech-recognition-sagemaker).
```
```md
## Wav2Vec2Config
---
```
You don't need to change any of the API references
I am out of town at the moment. I will make the changes when I am back.
What does this PR do?
This pull request updates the `wav2vec2` documentation to improve readability, enhance usability with examples, and reorganize the structure for better navigation. Key changes include the addition of usage examples, restructuring of the API reference, and updates to the model overview.

Documentation Enhancements:

Structural Improvements:
- Grouped related API references (`Wav2Vec2Config`, `Wav2Vec2CTCTokenizer`, etc.) into subsections for better navigation.

These changes make the documentation more user-friendly and accessible, especially for developers new to Wav2Vec2.
#36979
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@stevhliu