feat: recover dropped dictation commands via blank-gap TDT repair by timlenardo · Pull Request #2 · synth-inc/FluidAudio

timlenardo · 2026-04-22T01:10:38Z

Summary

TDT suppresses dictation commands ("period", "comma", "end quote", etc.) when LM context makes blank more probable than the token — producing a silent ≥960ms blank gap with real speech inside it
After the main TDT inference, scan for inter-token gaps ≥12 frames, confirm speech via Silero VAD, run a focused re-inference on the gap window, and insert the result if it matches a known dictation command
VadManager is initialized once in AsrManager.initialize() and reused; repair is a no-op if VAD is unavailable

Validation

Validated on 536 eval recordings and 2,757 real history audio files:

7.9% of real utterances have a qualifying gap (≥960ms with VAD-confirmed speech)
3 dictation command hits found in history audio (all "end quote" / "unquote")
Window TDT inference measured at 72ms median (not a full 15s pass)
Expected average latency overhead: ~6ms per transcription

Files

`Sources/FluidAudio/ASR/DictationRepair.swift` — repair logic as an `AsrManager` extension
`Sources/FluidAudio/ASR/AsrManager.swift` — adds `vadManager` property, initializes in `initialize(models:)`
`Sources/FluidAudio/ASR/AsrTranscription.swift` — calls `repairDictationGaps` in the single-chunk path after `processTranscriptionResult`
`Tests/.../EvalHistoryRepairTests.swift` — eval harness: runs repair against all eval recording audio files, writes `/tmp/repair_comparison.json`
`Tests/.../HistoryAudioGapScanTests.swift` — scan harness: measures trigger rate and recovery on history audio

Test plan

Build passes with no new errors
`swift test --filter EvalHistoryRepairTests` produces repair output at `/tmp/repair_comparison.json`
Transcribing audio with a spoken "period" or "end quote" gap correctly recovers the command
Audio without qualifying gaps shows no change in output or latency

TDT suppresses dictation commands (period, comma, end quote, etc.) via LM context when the preceding word makes blank more probable than the token. This surfaces as a blank gap ≥960ms in the token timestamps with VAD-confirmed speech inside the gap. The repair runs after the main TDT inference: 1. Find inter-token gaps ≥12 frames (960ms) in hypothesis.timestamps 2. Run Silero VAD on the full audio to confirm speech in each gap 3. Run a focused TDT re-inference on the gap window (afterFrame+3 to beforeFrame+2) 4. If recovered text is a known dictation command, insert it into the transcript Validated on 536 eval recordings (3 hits) and 2,757 real history audio files (3 hits, 7.9% trigger rate). Window inference measured at 72ms median — not a full 15s inference cost. Expected average latency overhead: ~6ms/transcription. Also adds two eval harnesses: - EvalHistoryRepairTests: runs repair against all eval recording audio files - HistoryAudioGapScanTests: scans history audio for gap trigger rate and recovery

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: recover dropped dictation commands via blank-gap TDT repair#2

feat: recover dropped dictation commands via blank-gap TDT repair#2
timlenardo wants to merge 1 commit intomainfrom
feat/dictation-repair

timlenardo commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

timlenardo commented Apr 22, 2026

Summary

Validation

Files

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant