Skip to content

feat: recover dropped dictation commands via blank-gap TDT repair#2

Open
timlenardo wants to merge 1 commit intomainfrom
feat/dictation-repair
Open

feat: recover dropped dictation commands via blank-gap TDT repair#2
timlenardo wants to merge 1 commit intomainfrom
feat/dictation-repair

Conversation

@timlenardo
Copy link
Copy Markdown

Summary

  • TDT suppresses dictation commands ("period", "comma", "end quote", etc.) when LM context makes blank more probable than the token — producing a silent ≥960ms blank gap with real speech inside it
  • After the main TDT inference, scan for inter-token gaps ≥12 frames, confirm speech via Silero VAD, run a focused re-inference on the gap window, and insert the result if it matches a known dictation command
  • VadManager is initialized once in AsrManager.initialize() and reused; repair is a no-op if VAD is unavailable

Validation

Validated on 536 eval recordings and 2,757 real history audio files:

  • 7.9% of real utterances have a qualifying gap (≥960ms with VAD-confirmed speech)
  • 3 dictation command hits found in history audio (all "end quote" / "unquote")
  • Window TDT inference measured at 72ms median (not a full 15s pass)
  • Expected average latency overhead: ~6ms per transcription

Files

  • `Sources/FluidAudio/ASR/DictationRepair.swift` — repair logic as an `AsrManager` extension
  • `Sources/FluidAudio/ASR/AsrManager.swift` — adds `vadManager` property, initializes in `initialize(models:)`
  • `Sources/FluidAudio/ASR/AsrTranscription.swift` — calls `repairDictationGaps` in the single-chunk path after `processTranscriptionResult`
  • `Tests/.../EvalHistoryRepairTests.swift` — eval harness: runs repair against all eval recording audio files, writes `/tmp/repair_comparison.json`
  • `Tests/.../HistoryAudioGapScanTests.swift` — scan harness: measures trigger rate and recovery on history audio

Test plan

  • Build passes with no new errors
  • `swift test --filter EvalHistoryRepairTests` produces repair output at `/tmp/repair_comparison.json`
  • Transcribing audio with a spoken "period" or "end quote" gap correctly recovers the command
  • Audio without qualifying gaps shows no change in output or latency

TDT suppresses dictation commands (period, comma, end quote, etc.) via LM
context when the preceding word makes blank more probable than the token.
This surfaces as a blank gap ≥960ms in the token timestamps with VAD-confirmed
speech inside the gap.

The repair runs after the main TDT inference:
1. Find inter-token gaps ≥12 frames (960ms) in hypothesis.timestamps
2. Run Silero VAD on the full audio to confirm speech in each gap
3. Run a focused TDT re-inference on the gap window (afterFrame+3 to beforeFrame+2)
4. If recovered text is a known dictation command, insert it into the transcript

Validated on 536 eval recordings (3 hits) and 2,757 real history audio files
(3 hits, 7.9% trigger rate). Window inference measured at 72ms median — not
a full 15s inference cost. Expected average latency overhead: ~6ms/transcription.

Also adds two eval harnesses:
- EvalHistoryRepairTests: runs repair against all eval recording audio files
- HistoryAudioGapScanTests: scans history audio for gap trigger rate and recovery
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant