Optimize Parakeet feature extraction on CUDA#45134

Open
milesial wants to merge 3 commits intohuggingface:mainfrom
milesial:codex/parakeet-gpu-transformers

Conversation

@milesial
Contributor

@milesial milesial commented Mar 31, 2026

What does this PR do?

Add CUDA support to the Parakeet preprocessor, running STFT and mel-spectrogram extraction on the GPU.
This refactor also speeds up the CPU implementation.

Tested on nvidia/parakeet-ctc-0.6b, B200, 300s audio:

  • Before this PR, CPU: 28ms
  • After this PR, CPU: 21ms
  • After this PR, GPU: 1.7ms

No impact on accuracy (VoxPopuli).
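Measuring numbers like these requires synchronizing the device, since CUDA kernels launch asynchronously. A minimal timing sketch under that assumption (the `extract` callable below is a stand-in for the Parakeet feature extractor, not the actual code):

```python
import time
import torch

# Hypothetical micro-benchmark harness; "extract" stands in for the
# feature extractor's forward pass and is illustrative only.
def bench(extract, waveform, iters=10):
    extract(waveform)  # warmup (also triggers any torch.compile tracing)
    if waveform.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        extract(waveform)
    if waveform.is_cuda:
        torch.cuda.synchronize()  # CUDA kernels are async; wait before stopping the clock
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 300 * 16000)  # 300 s of 16 kHz audio, on CPU here
ms = bench(
    lambda w: torch.abs(
        torch.stft(w, 512, window=torch.hann_window(512), return_complex=True)
    ),
    x,
) * 1e3
```

The same harness works for a CUDA tensor; only the `synchronize` calls become active.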

Context for this one is to accelerate vLLM for our multimodal nemotron model. Processing audio inputs bottlenecks on this CPU feature extractor. This PR does a small refactor that gives a good speedup on CPU, and also enables the CUDA backend for even further acceleration.

Some rough numbers, processing a 30-minute 16 kHz audio clip:

  • Before this PR, CPU: 770ms
  • After this PR, CPU: 265ms (~3x)
  • After this PR, CUDA: 20ms (~38x)

Several changes in this PR:

  • Enabling CUDA backend
  • Caching of the Hann window and mel filters
  • Dynamic torch compile at strategic locations
  • GPU-friendly padding
  • More efficient and accurate complex number magnitude computation
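Taken together, the cached-window, device-aware path might look something like the sketch below. All names (`hann_window`, `log_mel_spectrogram`, the module-level cache) are illustrative, not the actual transformers implementation:

```python
import torch

# Illustrative sketch only: cache the Hann window per (size, device, dtype)
# so repeated calls don't reallocate it on every utterance.
_window_cache = {}

def hann_window(n_fft, device, dtype):
    key = (n_fft, str(device), dtype)
    if key not in _window_cache:
        _window_cache[key] = torch.hann_window(n_fft, device=device, dtype=dtype)
    return _window_cache[key]

def log_mel_spectrogram(waveform, mel_filters, n_fft=512, hop_length=160):
    # waveform: (batch, samples); runs on whatever device the input lives on,
    # so the same code path serves both CPU and CUDA.
    window = hann_window(n_fft, waveform.device, waveform.dtype)
    stft = torch.stft(waveform, n_fft, hop_length=hop_length,
                      window=window, center=True, return_complex=True)
    # torch.abs on a complex tensor gives the magnitude directly, avoiding
    # a hand-rolled sqrt(re**2 + im**2)
    magnitudes = torch.abs(stft) ** 2
    mel = mel_filters.to(waveform.device, waveform.dtype) @ magnitudes
    return torch.log(mel.clamp(min=1e-10))
```

A compiled variant could then be obtained with `torch.compile(log_mel_spectrogram, dynamic=True)` to handle varying audio lengths without retracing.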
  • I confirm that this is not a pure code agent PR.

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

cc @eustlb @ebezzam

@eustlb
Contributor

eustlb commented Mar 31, 2026

Hey @milesial, interesting PR! This should be covered out of the box by #44394, so I'll wait for it to land (coming days) before doing a full review. Context on what motivated you to do this and the limitations you hit with the current implementation would be golden feedback for us 🙏

@milesial milesial marked this pull request as ready for review April 3, 2026 04:11
@milesial
Contributor Author

milesial commented Apr 3, 2026

Hi @eustlb , thanks for linking your PR!
Context for this one is to accelerate vLLM for our multimodal nemotron model. Processing audio inputs bottlenecks on this CPU feature extractor. This PR does a small refactor that gives a good speedup on CPU, and also enables the CUDA backend for even further acceleration.

Some rough numbers, processing a 30-minute 16 kHz audio clip:

  • Before this PR, CPU: 770ms
  • After this PR, CPU: 265ms (~3x)
  • After this PR, CUDA: 5.5ms (~140x)

Several changes in this PR:

  • Enabling CUDA backend
  • Caching of the Hann window and mel filters
  • Dynamic torch compile at strategic locations
  • GPU-friendly padding
  • More efficient and accurate complex number magnitude computation
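As one illustration of the padding point: padding through NumPy forces a device-to-host copy, whereas `torch.nn.functional.pad` stays on the tensor's device. A minimal sketch (the helper name is hypothetical, not the actual transformers code):

```python
import torch
import torch.nn.functional as F

# Illustrative helper only: right-pad a (batch, samples) waveform so its
# length is a multiple of hop_length, without leaving the tensor's device.
def pad_to_frame_multiple(waveform: torch.Tensor, hop_length: int) -> torch.Tensor:
    remainder = waveform.shape[-1] % hop_length
    if remainder:
        # F.pad runs on CPU or CUDA alike; no NumPy round-trip needed
        waveform = F.pad(waveform, (0, hop_length - remainder))
    return waveform

x = torch.randn(2, 16321)               # same call works on a CUDA tensor
padded = pad_to_frame_multiple(x, 160)
```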

I see your PR is a major refactor of audio processors, does it address these points as well?
I tried your PR and these are the numbers I got:

  • CPU: 690ms
  • CUDA (after some patches, did not work OOTB): 16.5ms

My other question is about timelines: I'm guessing your PR could take a while to get merged, while this smaller one could be merged faster in the meantime and unblock us. What do you think?

@milesial milesial force-pushed the codex/parakeet-gpu-transformers branch from b5a601d to 6703a6d Compare April 3, 2026 05:08
@github-actions
Contributor

github-actions Bot commented Apr 3, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: parakeet

@milesial
Contributor Author

milesial commented Apr 7, 2026

Gentle ping @eustlb @ebezzam on the above comment

