Batched extraction of speaker embeddings for NeMo word-level baseline #36
Conversation
@microsoft-github-policy-service agree company="BUT"
nidleo left a comment
Thank you for contributing to this repo!
If you have the numbers, could you please mention the expected speed-up from this change in the PR description?
Sure. The previous version of the extractor runs in 1 minute 36 seconds on an NVIDIA RTX A5000 on recording MTG_30860_plaza_0, while the batched version runs in 31 s. It could be improved further by inferring multiple segments in a single batch. However, I see that per-channel Whisper inference is now the primary bottleneck; I suggest using the HuggingFace implementation of Whisper long-form decoding.
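The "multiple segments in a batch" idea above amounts to padding variable-length audio segments into one tensor and running the extractor once. A minimal sketch of the padding step (the function name `pad_batch` is illustrative, not the actual extractor API):

```python
import numpy as np

def pad_batch(segments):
    """Zero-pad variable-length 1-D audio segments into one (B, T) batch.

    A single forward pass over the padded batch (plus the true lengths,
    so the model can mask the padding) replaces a per-segment loop.
    """
    lengths = np.array([len(s) for s in segments])
    batch = np.zeros((len(segments), lengths.max()), dtype=np.float32)
    for i, seg in enumerate(segments):
        batch[i, : len(seg)] = seg
    return batch, lengths

# Example: segments of 1 s, 0.5 s and 2 s at 16 kHz
segs = [np.ones(16000), np.ones(8000), np.ones(32000)]
batch, lengths = pad_batch(segs)
print(batch.shape)       # (3, 32000)
print(lengths.tolist())  # [16000, 8000, 32000]
```

The returned `lengths` matter: without masking, the zero padding would leak into the pooled speaker embedding.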
If you know of an implementation that is as accurate as Whisper large-v3 but faster (or at least one that supports batching independent streams), it would be great to hear about it. My impression was that there is a speed/accuracy tradeoff.
HuggingFace Transformers previously offered a chunked long-form algorithm. Recently, they also introduced a sequential algorithm (huggingface/transformers#27658), similar to the one in the original OpenAI repository, which supports batched inference, flash attention, and speculative decoding. After experimenting with it, I found the text outputs satisfactory, but I noticed some noise in the timestamps. To address this, I integrated an external model for forced alignment. As a result, the baseline ASR now runs in approximately 2-3 minutes with large-v3, compared to 8 minutes with the OpenAI implementation, on an RTX A5000. I initially considered opening a PR, but ultimately decided against it because of the reliance on an additional forced-alignment model. There is also a helpful blog post comparing various Whisper decoding implementations: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription.