perf(qwen3-tts): use LATENCY hint for OV compilations; fix streaming defaults to Qwen upstream #78
Conversation
You're correct; I found we were running with the wrong parameters a little
later on. I'm still working on this with my development team.
…On Thu, Apr 9, 2026 at 9:18 PM Emerson Tatelbaum ***@***.***> wrote:
*SearchSavior* left a comment (SearchSavior/OpenArc#78)
<#78 (comment)>
@Conradzz <https://github.com/Conradzz> Hello, thank you for the PR!
Have you tested longer than a few seconds? Admittedly 50 was an AI mistake
I missed; the default value of `stream_chunk_frames` is 300, which is what
Qwen recommends, and partly a constraint of the architecture, since audio
codebooks must be generated in a sequence that depends on the previous set
of codebooks. To stream *coherent* chunks we need enough of them; see here
https://github.com/QwenLM/Qwen3-TTS/blob/022e286b98fbec7e1e916cb940cdf532cd9f488e/qwen_tts/core/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py#L886
For me, your defaults are breaking prosody with frequent pauses, vs
waiting for larger chunks to accumulate. However the latency hint is a good
addition, and I would be open to merging that.
------------------------------
I suggest using the paper for qwen3-tts in your agent's context.
This PR does not demonstrate implementation considerations about what the
code was designed to do vs what it is in the current form. Notice how the
agent writes "Default `stream_chunk_frames=50` corresponds to ~4.2s of audio
at the 12 Hz codec rate" and rushes into details present in the code now,
versus asking why this default was set this way in the first place. Those
details aren't in the OpenArc codebase, and your agent did not have the
right context to make good decisions.
Before we go any further, I encourage you to test `stream_chunk_frames=50`
vs `stream_chunk_frames=300` and let me know how each value sounds. Hold
`stream_left_context=25` constant.
Two small but impactful perf fixes for the Qwen3-TTS pipeline.

1. `PERFORMANCE_HINT=LATENCY` for all Qwen3-TTS OV compilations.

The pipeline is a single-stream autoregressive decode loop at batch=1. OpenVINO's default THROUGHPUT hint provisions multiple streams/threads optimized for batched inference, which adds significant per-infer dispatch overhead for tight AR loops. LATENCY pins one execution stream and minimizes launch latency. Measured ~3-4x speedup on talker decode (~22 ms/frame vs ~68-92 ms/frame) on Battlemage / OpenVINO 2024.x GPU plugin.

2. Drop streaming chunk defaults so short phrases actually stream.

Default `stream_chunk_frames=50` corresponds to ~4.2s of audio at the 12 Hz codec rate. Phrases shorter than that (most conversational TTS output) finished decoding before a chunk boundary was reached, so the client saw the full response arrive as a single final chunk, with streaming effectively disabled. New defaults: `stream_chunk_frames=8` (~0.67s), `stream_left_context=4` (half of the chunk size). Callers wanting the old behavior can pass `stream_chunk_frames` explicitly.
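The arithmetic behind fix 2 can be sketched in a few lines. This is a simplified model of emit-on-chunk-boundary streaming, not OpenArc's actual loop; the frame counts are illustrative and only the 12 Hz codec rate comes from the description above.

```python
CODEC_RATE_HZ = 12  # Qwen3-TTS codec frame rate, per the PR description

def chunks_emitted(total_frames: int, stream_chunk_frames: int) -> int:
    """Simplified model: a chunk is emitted at every chunk boundary,
    and any remaining frames are flushed as one final chunk."""
    full = total_frames // stream_chunk_frames
    return full + (1 if total_frames % stream_chunk_frames else 0)

# A ~3 s conversational phrase is 36 frames at 12 Hz:
short = 3 * CODEC_RATE_HZ
print(chunks_emitted(short, 50))  # 1 -> whole response lands as a single chunk
print(chunks_emitted(short, 8))   # 5 -> audio trickles out incrementally
```

Under this model, any utterance shorter than one chunk arrives in a single blob, which is the "streaming effectively disabled" behavior the description calls out.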
Force-pushed from caeb702 to ab10ef6.
Summary

1. Use `PERFORMANCE_HINT=LATENCY` for all Qwen3-TTS OV compilations

This pipeline is a single-stream autoregressive decode loop at batch=1. Without an explicit hint, the GPU plugin uses `PerformanceMode.UNDEFINED`, which doesn't optimize for single-stream latency. Setting `LATENCY` pins one execution stream and minimizes per-infer dispatch overhead. CPU already defaults to LATENCY-like behavior; the hint is set explicitly there for consistency.

Measured impact (B70 / Xe2 / OpenVINO 2024.x GPU plugin):
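For reference, a minimal sketch of passing the hint at compile time with OpenVINO's Python API; the IR filename below is a placeholder, not an actual OpenArc path.

```python
# Explicit latency hint for a batch=1 autoregressive loop: one execution
# stream, minimal per-infer dispatch overhead.
config = {"PERFORMANCE_HINT": "LATENCY"}

try:
    import openvino as ov
    core = ov.Core()
    # "qwen3_tts_talker.xml" is a hypothetical IR filename; the same config
    # would be passed for every Qwen3-TTS submodel compilation:
    # compiled = core.compile_model("qwen3_tts_talker.xml",
    #                               device_name="GPU", config=config)
except ImportError:
    pass  # sketch only; the config dict above is the substance of the change
```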
2. Restore streaming defaults to Qwen-recommended values

`stream_chunk_frames` was 50, which diverges from Qwen's recommended `chunk_size=300` (ref). Restored to 300. `stream_left_context` kept at 25, matching upstream's `left_context_size=25`.

Test plan
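One starting point for the A/B listening comparison the reviewer requested. The parameter names come from this thread; how OpenArc actually consumes them is an assumption.

```python
# Two streaming configurations to compare by ear, listening for prosody
# breaks or pauses at chunk boundaries. Left context is held constant.
restored = {"stream_chunk_frames": 300, "stream_left_context": 25}  # upstream values
candidate = {"stream_chunk_frames": 50, "stream_left_context": 25}  # pre-restore value

for name, cfg in (("restored", restored), ("candidate", candidate)):
    secs = cfg["stream_chunk_frames"] / 12  # 12 Hz codec rate
    print(f"{name}: {cfg['stream_chunk_frames']} frames ~= {secs:.2f}s per chunk")
```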