Updating model card for wav2vec2 #38956
Conversation
stevhliu
left a comment
Thanks!
Please apply some of these same changes (such as changing the API reference) to your other PRs 🤗
```diff
  # Wav2Vec2
- The Wav2Vec2 model was proposed in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://huggingface.co/papers/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
+ [Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework for speech representations that masks speech input in the latent space and solves a contrastive task over quantized latent representations. It's like having a speech expert that learns the patterns of human speech from raw audio alone, then fine-tunes on transcribed speech to achieve remarkable accuracy.
```
```diff
- [Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework for speech representations that masks speech input in the latent space and solves a contrastive task over quantized latent representations. It's like having a speech expert that learns the patterns of human speech from raw audio alone, then fine-tunes on transcribed speech to achieve remarkable accuracy.
+ [Wav2Vec2](https://huggingface.co/papers/2006.11477) is a self-supervised learning framework that makes pretraining audio models easier and more efficient. It is pretrained on large amounts of unlabeled data where certain chunks of audio are masked and the model must predict them. The model is then fine-tuned on a much smaller labeled dataset and still achieves competitive results.
```
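The masking idea in the suggested wording can be sketched in a few lines of plain Python: sample span-start positions over the latent time steps, then mask a fixed-length span from each start. The sequence length, masking probability, and span length below are assumptions chosen for illustration only, not the library's implementation.

```python
import random

# Toy span masking over a sequence of latent speech frames.
# All numbers here are illustrative assumptions.
random.seed(0)
seq_len = 100     # number of latent time steps
mask_prob = 0.065  # probability that a time step is a span start
span_len = 10      # consecutive steps masked per start

num_starts = max(1, round(seq_len * mask_prob))
starts = random.sample(range(seq_len - span_len + 1), num_starts)

mask = [False] * seq_len
for s in starts:
    for t in range(s, s + span_len):
        mask[t] = True  # spans may overlap, so masked count <= num_starts * span_len

print(sum(mask))  # number of masked time steps
```

During pretraining, the model would have to predict (via the contrastive task) the quantized latents at the `True` positions from the surrounding context.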
```diff
- The abstract from the paper is the following:
+ You can find all the original [Wav2Vec2](https://huggingface.co/models?search=wav2vec2) checkpoints on the Hugging Face Hub.
```
```diff
- You can find all the original [Wav2Vec2](https://huggingface.co/models?search=wav2vec2) checkpoints on the Hugging Face Hub.
+ You can find all the original Wav2Vec2 checkpoints in the [Wav2Vec2.0](https://huggingface.co/collections/facebook/wav2vec-20-651e865258e3dee2586c89f5) collection.
```
```md
> [!TIP]
> Click on the Wav2Vec2 models in the right sidebar for more examples of how to apply Wav2Vec2 to different speech recognition and audio classification tasks.
```
```diff
  > [!TIP]
+ > This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
+ >
  > Click on the Wav2Vec2 models in the right sidebar for more examples of how to apply Wav2Vec2 to different speech recognition and audio classification tasks.
```
```md
## Usage tips
```

```python
import torch
from transformers import pipeline

pipeline = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0, torch_dtype="auto")
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```
```md
## Using Flash Attention 2
</hfoption>
<hfoption id="AutoModel">
```
```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForCTC

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h", torch_dtype="auto", device_map="auto", attn_implementation="sdpa")

inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
transcription[0]
```
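For readers wondering what `batch_decode` does with the argmax ids in the snippet above: for a CTC model, decoding collapses repeated ids and then drops blank tokens. A minimal sketch of that greedy collapse step, using made-up token ids and a made-up blank id:

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse a frame-level greedy CTC path into output tokens.

    Repeated ids are merged first, then blanks are dropped -- the core of
    greedy CTC decoding after argmax over the logits. The ids and blank_id
    here are illustrative assumptions, not Wav2Vec2's actual vocabulary.
    """
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Frames: blank, c, c, blank, a, a, a, t, blank (as made-up integer ids)
print(ctc_greedy_collapse([0, 3, 3, 0, 1, 1, 1, 20, 0]))  # -> [3, 1, 20]
```

Note that merging repeats before dropping blanks is what lets CTC emit genuinely doubled characters: a blank between two identical ids keeps them separate.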
```diff
- To load a model using Flash Attention 2, we can pass the argument `attn_implementation="flash_attention_2"` to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
+ Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
```
No need to include a quantization example since the model isn't that large
```diff
- ### Expected speedups
+ ## Notes
```
You can remove all the notes except for:
- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- If you have a `head_mask`, you must use the eager attention implementation for it to work.
```py
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h", attn_implementation="eager")
```
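As an illustration of the first note kept above — the model consuming a float array of the raw waveform — here is a synthetic mono float32 signal of the shape the processor expects; Wav2Vec2 checkpoints such as `facebook/wav2vec2-base-960h` assume 16 kHz audio, and the 440 Hz tone is just a stand-in for real speech:

```python
import numpy as np

# Synthetic stand-in for a recording: 1 second of mono audio at 16 kHz.
# The tone frequency and amplitude are assumptions for illustration.
sampling_rate = 16_000
duration_s = 1.0
t = np.arange(int(sampling_rate * duration_s)) / sampling_rate
waveform = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# This 1-D float32 array is the kind of input passed to the processor, e.g.
# processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
```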
```md
## Resources
```
Remove all the resources except for links to external ones (i.e., not guides or scripts you can find in the docs)
```md
- A blog post on how to deploy Wav2Vec2 for [Automatic Speech Recognition with Hugging Face's Transformers & Amazon SageMaker](https://www.philschmid.de/automatic-speech-recognition-sagemaker).
```
```md
## Wav2Vec2Config
---
```
You don't need to change any of the API references
I am out of town at the moment. I will make the changes when I am back.
What does this PR do?
This pull request updates the `wav2vec2` documentation to improve readability, enhance usability with examples, and reorganize the structure for better navigation. Key changes include the addition of usage examples, restructuring of the API reference, and updates to the model overview.

Documentation Enhancements:

Structural Improvements:
- Grouped related API references (`Wav2Vec2Config`, `Wav2Vec2CTCTokenizer`, etc.) into subsections for better navigation.

These changes make the documentation more user-friendly and accessible, especially for developers new to Wav2Vec2.
#36979
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@stevhliu