perf(qwen3-tts): use LATENCY hint for OV compilations; fix streaming defaults to Qwen upstream #78
Conversation
You're correct; I found we were running with the wrong parameters a little
later on. I'm still working on this with my development team.
…On Thu, Apr 9, 2026 at 9:18 PM Emerson Tatelbaum ***@***.***> wrote:
*SearchSavior* left a comment (SearchSavior/OpenArc#78)
<#78 (comment)>
@Conradzz <https://github.com/Conradzz> Hello, thank you for the PR!
Have you tested longer than a few seconds? Admittedly 50 was an AI mistake
I missed; the default value of `stream_chunk_frames` is 300, which is what
Qwen recommends, and partly a constraint of the architecture, since audio
codebooks must be generated in a sequence that depends on the previous set
of codebooks. To stream *coherent* chunks we need enough of them; see here
https://github.com/QwenLM/Qwen3-TTS/blob/022e286b98fbec7e1e916cb940cdf532cd9f488e/qwen_tts/core/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py#L886
For me, your defaults are breaking prosody with frequent pauses, vs
waiting for larger chunks to accumulate. However the latency hint is a good
addition, and I would be open to merging that.
------------------------------
I suggest using the paper for qwen3-tts in your agent's context.
This PR does not demonstrate implementation considerations about what the
code was designed to do vs what it is in the current form. Notice how the
agent writes "Default `stream_chunk_frames=50` corresponds to ~4.2s of audio
at the 12 Hz codec rate" and rushes into details present in the code now,
versus asking why this default was set this way in the first place. Those
details aren't in the OpenArc codebase, and your agent did not have the
right context to make good decisions.
Before we go any further, I encourage you to test `stream_chunk_frames=50`
vs `stream_chunk_frames=300` and let me know how each value sounds. Hold
`stream_left_context=25` constant.
Two small but impactful perf fixes for the Qwen3-TTS pipeline.

1. `PERFORMANCE_HINT=LATENCY` for all Qwen3-TTS OV compilations.

The pipeline is a single-stream autoregressive decode loop at batch=1. OpenVINO's default THROUGHPUT hint provisions multiple streams/threads optimized for batched inference, which adds significant per-infer dispatch overhead for tight AR loops. LATENCY pins one execution stream and minimizes launch latency. Measured ~3-4x speedup on talker decode (~22 ms/frame vs ~68-92 ms/frame) on Battlemage / OpenVINO 2024.x GPU plugin.

2. Drop streaming chunk defaults so short phrases actually stream.

Default `stream_chunk_frames=50` corresponds to ~4.2s of audio at the 12 Hz codec rate. Phrases shorter than that (most conversational TTS output) finished decoding before a chunk boundary was reached, so the client saw the full response arrive as a single final chunk, with streaming effectively disabled. New defaults: `stream_chunk_frames=8` (~0.67s), `stream_left_context=4` (half of the chunk size). Callers wanting the old behavior can pass `stream_chunk_frames` explicitly.
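The arithmetic behind fix 2 can be sketched in a few lines. This is a simplified model of emit-on-chunk-boundary streaming, not OpenArc's actual loop; the frame counts are illustrative and only the 12 Hz codec rate comes from the description above.

```python
CODEC_RATE_HZ = 12  # Qwen3-TTS codec frame rate, per the PR description

def chunks_emitted(total_frames: int, stream_chunk_frames: int) -> int:
    """Simplified model: a chunk is emitted at every chunk boundary,
    and any remaining frames are flushed as one final chunk."""
    full = total_frames // stream_chunk_frames
    return full + (1 if total_frames % stream_chunk_frames else 0)

# A ~3 s conversational phrase is 36 frames at 12 Hz:
short = 3 * CODEC_RATE_HZ
print(chunks_emitted(short, 50))  # 1 -> whole response lands as a single chunk
print(chunks_emitted(short, 8))   # 5 -> audio trickles out incrementally
```

Under this model, any utterance shorter than one chunk arrives in a single blob, which is the "streaming effectively disabled" behavior the description calls out.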
Force-pushed from caeb702 to ab10ef6.
Summary

1. Use `PERFORMANCE_HINT=LATENCY` for all Qwen3-TTS OV compilations

This pipeline is a single-stream autoregressive decode loop at batch=1. Without an explicit hint, the GPU plugin uses `PerformanceMode.UNDEFINED`, which doesn't optimize for single-stream latency. Setting `LATENCY` pins one execution stream and minimizes per-infer dispatch overhead. CPU already defaults to LATENCY-like behavior; the hint is set explicitly there for consistency.

Measured impact (B70 / Xe2 / OpenVINO 2024.x GPU plugin):
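For reference, a minimal sketch of passing the hint at compile time with OpenVINO's Python API; the IR filename below is a placeholder, not an actual OpenArc path.

```python
# Explicit latency hint for a batch=1 autoregressive loop: one execution
# stream, minimal per-infer dispatch overhead.
config = {"PERFORMANCE_HINT": "LATENCY"}

try:
    import openvino as ov
    core = ov.Core()
    # "qwen3_tts_talker.xml" is a hypothetical IR filename; the same config
    # would be passed for every Qwen3-TTS submodel compilation:
    # compiled = core.compile_model("qwen3_tts_talker.xml",
    #                               device_name="GPU", config=config)
except ImportError:
    pass  # sketch only; the config dict above is the substance of the change
```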
2. Restore streaming defaults to Qwen-recommended values

`stream_chunk_frames` was 50, which diverges from Qwen's recommended `chunk_size=300` (ref). Restored to 300. `stream_left_context` kept at 25, matching upstream's `left_context_size=25`.

Test plan
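One starting point for the A/B listening comparison the reviewer requested. The parameter names come from this thread; how OpenArc actually consumes them is an assumption.

```python
# Two streaming configurations to compare by ear, listening for prosody
# breaks or pauses at chunk boundaries. Left context is held constant.
restored = {"stream_chunk_frames": 300, "stream_left_context": 25}  # upstream values
candidate = {"stream_chunk_frames": 50, "stream_left_context": 25}  # pre-restore value

for name, cfg in (("restored", restored), ("candidate", candidate)):
    secs = cfg["stream_chunk_frames"] / 12  # 12 Hz codec rate
    print(f"{name}: {cfg['stream_chunk_frames']} frames ~= {secs:.2f}s per chunk")
```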