Optimize Parakeet feature extraction on CUDA#45134

Open
milesial wants to merge 3 commits intohuggingface:mainfrom
milesial:codex/parakeet-gpu-transformers

Conversation

@milesial
Contributor

@milesial milesial commented Mar 31, 2026

What does this PR do?

Add CUDA support to the Parakeet preprocessor, running STFT and mel-spectrogram extraction on the GPU.
This refactor also speeds up the CPU implementation.

Tested on nvidia/parakeet-ctc-0.6b, B200, 300s audio:

  • Before this PR, CPU: 28ms
  • After this PR, CPU: 21ms
  • After this PR, GPU: 1.7ms

No impact on accuracy (VoxPopuli).
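Measuring numbers like these requires synchronizing the device, since CUDA kernels launch asynchronously. A minimal timing sketch under that assumption (the `extract` callable below is a stand-in for the Parakeet feature extractor, not the actual code):

```python
import time
import torch

# Hypothetical micro-benchmark harness; "extract" stands in for the
# feature extractor's forward pass and is illustrative only.
def bench(extract, waveform, iters=10):
    extract(waveform)  # warmup (also triggers any torch.compile tracing)
    if waveform.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        extract(waveform)
    if waveform.is_cuda:
        torch.cuda.synchronize()  # CUDA kernels are async; wait before stopping the clock
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 300 * 16000)  # 300 s of 16 kHz audio, on CPU here
ms = bench(
    lambda w: torch.abs(
        torch.stft(w, 512, window=torch.hann_window(512), return_complex=True)
    ),
    x,
) * 1e3
```

The same harness works for a CUDA tensor; only the `synchronize` calls become active.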

Context for this one is to accelerate vLLM for our multimodal nemotron model. Processing audio inputs bottlenecks on this CPU feature extractor. This PR does a small refactor that gives a good speedup on CPU, and also enables the CUDA backend for even further acceleration.

Some rough numbers, processing a 30-minute 16 kHz audio clip:

  • Before this PR, CPU: 770ms
  • After this PR, CPU: 265ms (~3x)
  • After this PR, CUDA: 20ms (~38x)

Several changes in this PR:

  • Enabling CUDA backend
  • Caching of the Hann window and mel filters
  • Dynamic torch compile at strategic locations
  • GPU-friendly padding
  • More efficient and accurate complex number magnitude computation
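Taken together, the cached-window, device-aware path might look something like the sketch below. All names (`hann_window`, `log_mel_spectrogram`, the module-level cache) are illustrative, not the actual transformers implementation:

```python
import torch

# Illustrative sketch only: cache the Hann window per (size, device, dtype)
# so repeated calls don't reallocate it on every utterance.
_window_cache = {}

def hann_window(n_fft, device, dtype):
    key = (n_fft, str(device), dtype)
    if key not in _window_cache:
        _window_cache[key] = torch.hann_window(n_fft, device=device, dtype=dtype)
    return _window_cache[key]

def log_mel_spectrogram(waveform, mel_filters, n_fft=512, hop_length=160):
    # waveform: (batch, samples); runs on whatever device the input lives on,
    # so the same code path serves both CPU and CUDA.
    window = hann_window(n_fft, waveform.device, waveform.dtype)
    stft = torch.stft(waveform, n_fft, hop_length=hop_length,
                      window=window, center=True, return_complex=True)
    # torch.abs on a complex tensor gives the magnitude directly, avoiding
    # a hand-rolled sqrt(re**2 + im**2)
    magnitudes = torch.abs(stft) ** 2
    mel = mel_filters.to(waveform.device, waveform.dtype) @ magnitudes
    return torch.log(mel.clamp(min=1e-10))
```

A compiled variant could then be obtained with `torch.compile(log_mel_spectrogram, dynamic=True)` to handle varying audio lengths without retracing.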
  • I confirm that this is not a pure code agent PR.

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

cc @eustlb @ebezzam

@eustlb
Contributor

eustlb commented Mar 31, 2026

Hey @milesial, interesting PR! This should be covered out of the box by #44394, so I'll wait for it to land (coming days) before doing a full review. Context on what motivated you to do this and the limitations you hit with the current implementation would be golden feedback for us 🙏

@milesial milesial marked this pull request as ready for review April 3, 2026 04:11
@milesial
Contributor Author

milesial commented Apr 3, 2026

Hi @eustlb , thanks for linking your PR!
Context for this one is to accelerate vLLM for our multimodal nemotron model. Processing audio inputs bottlenecks on this CPU feature extractor. This PR does a small refactor that gives a good speedup on CPU, and also enables the CUDA backend for even further acceleration.

Some rough numbers, processing a 30-minute 16 kHz audio clip:

  • Before this PR, CPU: 770ms
  • After this PR, CPU: 265ms (~3x)
  • After this PR, CUDA: 5.5ms (~140x)

Several changes in this PR:

  • Enabling CUDA backend
  • Caching of the Hann window and mel filters
  • Dynamic torch compile at strategic locations
  • GPU-friendly padding
  • More efficient and accurate complex number magnitude computation
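As one illustration of the padding point: padding through NumPy forces a device-to-host copy, whereas `torch.nn.functional.pad` stays on the tensor's device. A minimal sketch (the helper name is hypothetical, not the actual transformers code):

```python
import torch
import torch.nn.functional as F

# Illustrative helper only: right-pad a (batch, samples) waveform so its
# length is a multiple of hop_length, without leaving the tensor's device.
def pad_to_frame_multiple(waveform: torch.Tensor, hop_length: int) -> torch.Tensor:
    remainder = waveform.shape[-1] % hop_length
    if remainder:
        # F.pad runs on CPU or CUDA alike; no NumPy round-trip needed
        waveform = F.pad(waveform, (0, hop_length - remainder))
    return waveform

x = torch.randn(2, 16321)               # same call works on a CUDA tensor
padded = pad_to_frame_multiple(x, 160)
```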

I see your PR is a major refactor of audio processors, does it address these points as well?
I tried your PR and these are the numbers I got:

  • CPU: 690ms
  • CUDA (after some patches, did not work OOTB): 16.5ms

My other question is about timelines: I'm guessing your PR could take a while to get merged, while this smaller one could be merged faster in the meantime and unblock us. What do you think?

@milesial milesial force-pushed the codex/parakeet-gpu-transformers branch from b5a601d to 6703a6d Compare April 3, 2026 05:08
@github-actions
Contributor

github-actions Bot commented Apr 3, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: parakeet

@milesial
Contributor Author

milesial commented Apr 7, 2026

Gentle ping @eustlb @ebezzam on the above comment

