
Batched extraction of speaker embeddings for Nemo word-level baseline#36

Merged
nidleo merged 1 commit into microsoft:main from Lakoc:word_based_diarization_batched
Apr 16, 2024

Conversation


@Lakoc Lakoc commented Apr 15, 2024

No description provided.

Author

Lakoc commented Apr 15, 2024

@microsoft-github-policy-service agree company="BUT"

Contributor

@nidleo nidleo left a comment


Thank you for contributing to this repo!

If you have the numbers, could you please mention the expected speed-up from this change in the PR's description?

Author

Lakoc commented Apr 16, 2024

Sure. The previous version of the extractor runs in 1 minute 36 seconds on an NVIDIA RTX A5000 on the recording MTG_30860_plaza_0, while the batched version runs in 31 s. It can be further improved by inferring multiple segments in a batch. However, I see that per-channel Whisper inference is now the primary bottleneck. I suggest using the Hugging Face implementation of Whisper long-form decoding.
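The speed-up comes from replacing a per-segment loop with a single batched forward pass. A minimal sketch of the batching idea, with hypothetical names (`pad_segments` and the masked-mean stand-in for the embedding model are illustrative, not code from this repo): variable-length speaker segments are zero-padded into one array plus a length mask, so one call processes all segments at once.

```python
import numpy as np

def pad_segments(segments):
    """Zero-pad variable-length 1-D audio segments into one (B, T_max) batch.

    Also returns each segment's true length so the model can mask
    out the padded tail.
    """
    lengths = np.array([len(s) for s in segments])
    batch = np.zeros((len(segments), lengths.max()), dtype=np.float32)
    for i, seg in enumerate(segments):
        batch[i, : len(seg)] = seg
    return batch, lengths

def toy_embeddings(batch, lengths, dim=4):
    """Stand-in for a speaker-embedding model: masked mean over samples.

    A real extractor (e.g. NeMo's TitaNet) would run a neural network
    here; the point is that one forward pass covers every segment.
    """
    mask = np.arange(batch.shape[1])[None, :] < lengths[:, None]
    means = (batch * mask).sum(axis=1) / lengths  # padded tail excluded
    return np.tile(means[:, None], (1, dim))      # (B, dim)

# Three segments of different lengths, batched in one call:
segments = [np.ones(16000), np.full(8000, 2.0), np.full(4000, 3.0)]
batch, lengths = pad_segments(segments)
emb = toy_embeddings(batch, lengths)
```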

@nidleo nidleo merged commit 0993574 into microsoft:main Apr 16, 2024
Contributor

nidleo commented Apr 16, 2024

> I suggest using the Hugging Face implementation of Whisper long-form decoding.

If you know of an implementation as accurate as Whisper large-v3 and also faster (or at least one that supports batching independent streams), it would be great to know. My impression was that there's a speed/accuracy tradeoff.

@Lakoc
Copy link
Copy Markdown
Author

Lakoc commented Apr 18, 2024

Hugging Face Transformers previously used a chunked algorithm. Recently, they also introduced a sequential algorithm (huggingface/transformers#27658), similar to the one used in the original OpenAI repository, which supports batched inference, Flash Attention, and speculative decoding.
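The core difference is how the long recording is split: the chunked algorithm cuts it into fixed, overlapping windows that can be decoded independently (and therefore batched), while the sequential algorithm advances one window at a time from the last decoded timestamp. A toy sketch of the chunked windowing only, with illustrative parameter values (30 s chunks, 5 s overlap on each side; not the exact defaults of any library):

```python
def chunk_spans(n_samples, sr=16000, chunk_s=30.0, stride_s=5.0):
    """Overlapping sample spans for chunked long-form decoding (toy sketch).

    Each chunk is chunk_s seconds long and overlaps its neighbours by
    stride_s seconds on each side. Because the chunks do not depend on
    each other, they can all be decoded in one batch.
    """
    chunk, stride = int(chunk_s * sr), int(stride_s * sr)
    step = chunk - 2 * stride  # advance so neighbouring chunks overlap
    spans, start = [], 0
    while start < n_samples:
        spans.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += step
    return spans

# A 60-second recording at 16 kHz yields three overlapping chunks:
spans = chunk_spans(60 * 16000)
```

The overlap exists so tokens near chunk boundaries, which each window decodes unreliably, can be reconciled between neighbouring chunks.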

After experimenting with it, I found the text outputs satisfactory. However, I noticed some noise in the timestamps. To address this, I integrated an external model for forced alignment. As a result, the baseline ASR now runs in approximately 2-3 minutes with large-v3, compared to 8 minutes with the OpenAI implementation, on an RTX A5000. Although I initially considered opening a PR, I ultimately decided against it because of the reliance on an additional forced-alignment model.
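What forced alignment computes can be illustrated with a toy dynamic program — a sketch of the idea only, not the external aligner actually used (which would be a neural model producing per-frame token scores): given a score for each (frame, token) pair, find the best monotonic assignment of frames to the known token sequence; word timestamps then follow from each token's first and last frame.

```python
import numpy as np

def monotonic_align(frame_scores):
    """Best monotonic assignment of frames to tokens via DP.

    frame_scores: (T_frames, N_tokens), higher is better. Token order
    is preserved and each frame is assigned to exactly one token.
    Returns the token index for every frame.
    """
    T, N = frame_scores.shape
    NEG = -np.inf
    dp = np.full((T, N), NEG)       # best score ending at (frame, token)
    back = np.zeros((T, N), dtype=int)
    dp[0, 0] = frame_scores[0, 0]   # the first frame must start token 0
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]                       # same token
            move = dp[t - 1, n - 1] if n > 0 else NEG  # next token
            if move > stay:
                dp[t, n] = move + frame_scores[t, n]
                back[t, n] = n - 1
            else:
                dp[t, n] = stay + frame_scores[t, n]
                back[t, n] = n
    # Backtrack from the last token at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Frames 0-1 favour token 0, frames 2-4 favour token 1:
scores = np.array([[5, 0], [5, 0], [0, 5], [0, 5], [0, 5]], dtype=float)
```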

Additionally, there is a helpful blog post that compares various Whisper decoding implementations: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription.
