Migrate to inworld-tts-2 + Soniox STT, add 54 languages #75

Merged
cshape merged 7 commits into main from tts2-soniox-multilang
May 5, 2026


@cshape cshape commented May 5, 2026

Summary

  • TTS → inworld-tts-2 everywhere (realtime + REST). Language is supplied per-call: BCP-47 via session.providerData.tts.language for the realtime session, and the REST language field for Anki batch + flashcard pronunciation. Voices stay cross-lingual but grounded in the target locale.
  • STT → soniox/stt-rt-v4 with the ISO 639-1 language hint via transcription.language. Soniox emits cumulative partial deltas (each delta is the full transcript so far) — switched the buffer logic from append to replace so partials no longer render as "hi hi there hi there this...".
  • +54 languages: reshuffled LanguageConfig to make BCP-47 the canonical form (dropped sttLanguageCode and ttsConfig.languageCode). Added all Soniox-supported languages with alternating Sarah/Jason TTS-2 voices and templated personas. The 6 existing curated personas/voices/topics are preserved as-is.
  • Frontend polish: welcome modal now reads "60+ languages"; the sidebar language dropdown scrolls (max-height: 60vh, overflow-y: auto, overscroll-behavior: contain).
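
The append-versus-replace change for Soniox partials can be illustrated with a minimal sketch. The names `createPartialBuffer` and `onDelta` are hypothetical and are not taken from session-manager.ts; the only behavior assumed from the PR is that each Soniox delta carries the full transcript so far.

```typescript
// Hypothetical helper, not the code in session-manager.ts.
// "incremental": each delta is only the new text, so it is appended.
// "cumulative": each delta is the whole transcript so far, so it replaces the buffer.
type DeltaMode = "incremental" | "cumulative";

function createPartialBuffer(mode: DeltaMode) {
  let text = "";
  return {
    onDelta(delta: string): string {
      text = mode === "cumulative" ? delta : text + delta;
      return text;
    },
    reset(): void {
      text = "";
    },
  };
}

// The same cumulative deltas fed through both strategies.
const deltas = ["hi", "hi there", "hi there this"];
const appending = createPartialBuffer("incremental");
const replacing = createPartialBuffer("cumulative");

let appendedView = "";
let replacedView = "";
for (const d of deltas) {
  appendedView = appending.onDelta(d);
  replacedView = replacing.onDelta(d);
}

console.log(appendedView); // "hihi therehi there this" (the duplicated-text bug)
console.log(replacedView); // "hi there this"
```

Appending cumulative deltas is what produces the repeated-prefix rendering described above; replacing the buffer per delta keeps only the latest full partial.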

Test plan

  • cd backend && npx tsc --noEmit
  • cd frontend && npx tsc --noEmit
  • cd backend && npx vitest run — all 60 tests pass
  • Pick Spanish (existing) → conversation works, Señor Gael Herrera persona + Rafael voice
  • Pick Finnish (new) → Sarah/Jason speaks Finnish; Soniox transcribes Finnish speech
  • Mid-session language switch → fresh session.update with new BCP-47
  • Watch backend logs for inworld_error_event (TTS-2 / Soniox surface here)
  • Anki export with a non-English deck — verify generated WAVs sound right

- TTS: switch realtime + REST calls to inworld-tts-2; pass BCP-47 via
  session.providerData.tts.language and the REST language field so
  cross-lingual voices stay grounded in the target locale.
- STT: switch to soniox/stt-rt-v4 with ISO 639-1 language hint via
  transcription.language. Soniox emits cumulative partial deltas, so
  replace the buffer per delta instead of appending (fixes
  "hi hi there hi there this..." UI bug).
- Languages: reshuffle LanguageConfig to make BCP-47 canonical (drop
  sttLanguageCode + ttsConfig.languageCode); add 54 Soniox-supported
  languages with alternating Sarah/Jason TTS-2 voices and templated
  personas. Existing 6 curated personas/voices/topics preserved.
- Frontend: bump welcome modal copy to "60+ languages" and let the
  language dropdown scroll (max-height: 60vh, overflow-y: auto).

Copilot AI review requested due to automatic review settings May 5, 2026 19:10
@cshape cshape requested review from a team as code owners May 5, 2026 19:10

Copilot AI left a comment

Pull request overview

This PR expands the language-tutor app from a small curated set to a 60-language catalog while migrating speech integration to Inworld TTS 2 and Soniox realtime STT. It updates both realtime session wiring and REST pronunciation/export paths so language selection now drives provider-specific locale hints.

Changes:

  • Switch realtime transcription to soniox/stt-rt-v4 and update partial-transcript handling for Soniox’s cumulative deltas.
  • Migrate TTS requests to inworld-tts-2, passing BCP-47 language hints for realtime and REST pronunciation/export flows.
  • Restructure LanguageConfig around code + bcp47, add 54 language entries, and make small frontend UI updates for the larger language list.
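
As a rough illustration of how a `code` + `bcp47` config could drive both provider hints, here is a sketch under stated assumptions. The PR names `session.providerData.tts.language` and `transcription.language`; everything else (the `LanguageConfig` field list, the `buildSessionUpdate` helper, the placement of `transcription` under `session`, and the `model` key) is hypothetical and may not match the real payload.

```typescript
// Hypothetical shape; the real LanguageConfig in backend/src/config/languages.ts has more fields.
interface LanguageConfig {
  code: string; // ISO 639-1 hint for STT (ISO 639-2 where no 639-1 code exists, e.g. "fil")
  bcp47: string; // canonical locale tag used for TTS, e.g. "fi-FI"
  name: string;
}

// Assumed payload layout: only the two `language` paths are confirmed by the PR text.
function buildSessionUpdate(lang: LanguageConfig) {
  return {
    type: "session.update",
    session: {
      transcription: { model: "soniox/stt-rt-v4", language: lang.code },
      providerData: { tts: { language: lang.bcp47 } },
    },
  };
}

const finnish: LanguageConfig = { code: "fi", bcp47: "fi-FI", name: "Finnish" };
const spanish: LanguageConfig = { code: "es", bcp47: "es-ES", name: "Spanish" };

// A mid-session language switch would send a fresh update built from the new config.
const first = buildSessionUpdate(spanish);
const afterSwitch = buildSessionUpdate(finnish);
console.log(JSON.stringify(afterSwitch.session));
```

Keeping both hints derived from one config entry means a language switch cannot leave STT and TTS pointing at different languages.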

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `frontend/src/styles/main.css` | Makes the new-chat language dropdown scrollable for the expanded language list. |
| `frontend/src/components/WelcomeModal.tsx` | Updates the welcome copy to advertise broader language support. |
| `backend/src/services/websocket-handler.ts` | Passes the selected language locale into websocket-triggered pronunciation requests. |
| `backend/src/services/session-manager.ts` | Switches realtime STT provider/model, updates partial-delta handling, and sends TTS locale hints in session.update. |
| `backend/src/services/inworld-llm.ts` | Updates REST pronunciation requests to TTS 2 and adds a language parameter. |
| `backend/src/helpers/tts-audio-generator.ts` | Updates batch Anki/export TTS generation to TTS 2 with per-language locale hints. |
| `backend/src/config/server.ts` | Changes the default configured TTS model string. |
| `backend/src/config/languages.ts` | Refactors language config shape and adds the new supported language catalog. |
| `backend/src/__tests__/session-manager.test.ts` | Adjusts realtime/session tests for Soniox cumulative deltas and new session payload fields. |
| `backend/src/__tests__/inworld-llm.test.ts` | Updates pronunciation tests for the new method signature and TTS 2 request body. |
| `backend/src/__tests__/config/languages.test.ts` | Updates config tests for the new bcp47 field. |

Comment thread backend/src/config/languages.ts Outdated
Comment thread backend/src/config/languages.ts Outdated
Comment thread backend/src/config/languages.ts Outdated
Comment thread backend/src/services/inworld-llm.ts
Comment thread backend/src/helpers/tts-audio-generator.ts
Comment thread backend/src/config/languages.ts Outdated
cshape added 6 commits May 5, 2026 12:37
Teach the model to use TTS-2's three expressivity mechanisms (English
steering tags, English non-verbal tags, target-language disfluencies),
and use locale-native voices where the Inworld catalog has them.

- New `disfluencies` field on LanguageConfig with 2-4 fillers per locale
  (e.g. えーと/あの for ja, este/eh/pues for es) seeded across all 60 languages
- Expanded session.update instructions with a `# Voice & expressivity` section
  distinguishing English-only control tags ([speak warmly], [laugh]) from
  inline target-language disfluencies
- Strip bracketed control tags from streaming and final assistant transcripts
  so the UI shows clean text; held-back buffer handles tags spanning chunks
- Switch ar/nl/he/hi/ja/ko/pl/ru/zh to native voices (Nour, Katrien, Yael,
  Aarav, Hina, Hyunwoo, Szymon, Elena, Mei). Sarah/Jason remain fallbacks
  for the ~45 languages without natives in the catalog
- Polish only has male natives, so persona renamed Kasia → Szymon
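
The tag stripping with a held-back buffer described above might look roughly like the sketch below. `createTagStripper`, `MAX_TAG_LENGTH`, and the 64-character cap are illustrative assumptions, not the PR's actual implementation; the idea shown is only that an unclosed `[` at the end of a chunk is withheld until the closing `]` arrives.

```typescript
// Hypothetical streaming stripper for bracketed control tags such as [speak warmly] or [laugh].
const MAX_TAG_LENGTH = 64; // assumed cap so a stray "[" cannot stall output forever

function createTagStripper() {
  let held = "";
  return {
    // Returns the text that is safe to show for this chunk.
    push(chunk: string): string {
      const text = held + chunk;
      held = "";
      const lastOpen = text.lastIndexOf("[");
      const lastClose = text.lastIndexOf("]");
      let ready = text;
      // A "[" with no "]" after it may be a tag split across chunks: hold it back.
      if (lastOpen > lastClose && text.length - lastOpen <= MAX_TAG_LENGTH) {
        ready = text.slice(0, lastOpen);
        held = text.slice(lastOpen);
      }
      return ready.replace(/\[[^\[\]]*\]/g, "");
    },
    // At end of stream, anything still held was never closed and is emitted as literal text.
    flush(): string {
      const rest = held;
      held = "";
      return rest;
    },
  };
}

const stripper = createTagStripper();
const chunks = ["[speak ", "warmly] Hola, ", "este... ¿qué tal? [la", "ugh]"];
let shown = "";
for (const c of chunks) shown += stripper.push(c);
shown += stripper.flush();

console.log(shown.trim()); // "Hola, este... ¿qué tal?"
```

Target-language disfluencies such as "este" pass through untouched because they are plain inline text rather than bracketed tags.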
The previous "0–2 per turn, only when thinking" wording was too permissive
and disfluencies were buried as the third bullet under steering and
non-verbal tags. With responses already capped at 1–2 sentences, the model
read it as optional and skipped them entirely.

Lead the expressivity section with disfluencies, frame them as expected
in MOST responses, and demote bracketed control tags to optional/rare.
Also seed two concrete inline templates so the model has a usage pattern
to mimic.

A more capable model should follow the new TTS-2 expressivity guidance
(disfluencies, steering tags) more reliably. Updates all three call
sites that were unified on the previous model so realtime conversation,
flashcard/feedback generation, and memory operations stay aligned.

Live testing showed the model picking the first item in the list ("um")
turn after turn — the seeded list was being read as "use these specific
words" rather than "natural fillers like these".

- Reframe the list as non-exhaustive examples; explicitly welcome any
  common ${name} filler the model knows
- Add a "VARY" rule: never reuse the same disfluency in consecutive
  responses, don't lean on the most generic one every turn
- Drop the example template that hardcoded disfluencies[0], which was
  reinforcing the lock-in
- Expand the curated 6 (en/es/fr/de/it/pt) and ja/ko/zh from 3-4 fillers
  to 8-10 each so the model has more native variety to draw from
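
A sketch of how the reframed guidance could be assembled from a language's `disfluencies` list follows. The PR does not show the actual prompt text, so the wording below is a paraphrase of the commit message and `buildDisfluencyGuidance` is a hypothetical helper.

```typescript
// Hypothetical prompt builder; the instruction wording paraphrases the commit message.
function buildDisfluencyGuidance(name: string, disfluencies: string[]): string {
  const examples = disfluencies.join(", ");
  return [
    "# Voice & expressivity",
    `- Use a natural ${name} filler in MOST responses. Non-exhaustive examples: ${examples}.`,
    `- Any other common ${name} filler you know is welcome.`,
    "- VARY: never reuse the same disfluency in consecutive responses,",
    "  and do not lean on the most generic one every turn.",
    "- Bracketed control tags such as [speak warmly] or [laugh] are English-only, optional, and rare.",
  ].join("\n");
}

const guidance = buildDisfluencyGuidance("Spanish", ["este", "eh", "pues"]);
console.log(guidance);
```

Listing the whole array as examples, instead of interpolating `disfluencies[0]` into a template, avoids singling out one filler for the model to copy.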
Adopts the inworld-golden-demo pattern: route TTS playback through a
two-peer local WebRTC loopback so Chrome's hardware AEC sees the
playback as an explicit reference signal. Without that signal the
echoCancellation: true constraint we were already passing has nothing
concrete to subtract, which is why the agent could hear itself when
the user listens through speakers.

Architecture:
- New AudioPipeline owns one shared 24kHz playback AudioContext and a
  master output node. Both AudioPlayer instances (streaming voice + TTS
  pronunciation) connect into it so a single WebRTC reference covers
  all playback paths.
- New WebRtcLoopback (ported from golden demo) wires two RTCPeerConnections
  locally via ICE; mic stream goes to client peer, playback stream to
  server peer. The AEC-processed mic stream comes back out for capture.
- Loopback playback routes through a hidden <audio> element so Chrome
  uses its standard output pipeline (required for the AEC reference).

Smaller wins bundled in:
- getUserMedia constraints add suppressLocalAudioPlayback: { ideal: true }
  (Chrome-only hint) and sampleRate: 24000.
- Capture AudioContext is created at 24kHz so the worklet receives samples
  at the target rate — drops the per-quantum linear-interp resampling.
- Worklet simplified accordingly: just buffer 2400 samples (100ms) and
  post Int16 PCM. Drop the ScriptProcessorNode fallback (AudioWorklet is
  universally supported on browsers we target).
- AudioPlayer fade duration bumped 15ms → 25ms; the longer ramp lets
  Chrome's AEC re-converge cleanly on user interrupt without an audible
  click. Documented inline.
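
The simplified worklet logic (buffer 2400 samples, post Int16 PCM) can be sketched as a plain class that runs outside the browser. `PcmFramer` and `floatToInt16` are hypothetical names; in a real AudioWorkletProcessor the `push` call would sit inside `process()` and each returned frame would be posted through the port.

```typescript
// 100 ms of audio at the 24 kHz capture rate described above.
const FRAME_SAMPLES = 2400;

// Clamp a float sample to [-1, 1] and scale it to signed 16-bit PCM.
function floatToInt16(sample: number): number {
  const s = Math.max(-1, Math.min(1, sample));
  return s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
}

// Hypothetical framer: accumulates input blocks and emits full 2400-sample frames.
class PcmFramer {
  private buf = new Int16Array(FRAME_SAMPLES);
  private filled = 0;

  push(input: Float32Array): Int16Array[] {
    const frames: Int16Array[] = [];
    for (let i = 0; i < input.length; i++) {
      this.buf[this.filled++] = floatToInt16(input[i] ?? 0);
      if (this.filled === FRAME_SAMPLES) {
        frames.push(this.buf);
        this.buf = new Int16Array(FRAME_SAMPLES);
        this.filled = 0;
      }
    }
    return frames;
  }

  pending(): number {
    return this.filled;
  }
}

// Web Audio render quanta are typically 128 samples, so a frame completes on the 19th block.
const framer = new PcmFramer();
let emitted = 0;
for (let block = 0; block < 19; block++) {
  emitted += framer.push(new Float32Array(128).fill(0.5)).length;
}
console.log(emitted, framer.pending());
```

Because the capture context already runs at the target rate, the framer only needs to convert and chunk; no resampling step appears in the loop.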

Graceful degradation: WebRTC loopback only enables on Chrome/Edge. On
Firefox/iOS Safari the pipeline detects the lack of support and routes
playback directly to ctx.destination, returning the original mic stream
unchanged — same behavior as before this change. iOS handler path is
fully preserved.

- Stop hardcoding 'inworld-tts-2' in REST TTS paths (inworld-llm.ts
  pronounce + tts-audio-generator.ts batch export). Both now read
  modelId from langConfig.ttsConfig so future model migrations stay
  in one place.
- Match Sarah/Jason fallback voices to each persona's gender across
  all 25 non-native-voice languages (e.g. Pieter→Jason, Aino→Sarah,
  Iker→Jason). Native-voice languages were already correct.
- Indonesian nativeName: 'Indonesia' (country) → 'Bahasa Indonesia'.
- Filipino: rename code 'tl' → 'fil' so Soniox STT and TTS-2 BCP-47
  hint reference the same language designator. Updated the LanguageConfig
  comment to acknowledge ISO 639-2 fallback when 639-1 doesn't apply
  (Filipino has no 639-1 code).
- Distinguish Basque/Catalan/Galician sidebar flags (all rendered as
  🏴 because Spanish-region tag sequences aren't in standard emoji
  fonts): 🟥 / 🟨 / 🟦 matching each flag's dominant color. Welsh keeps
  the standard 🏴󠁧󠁢󠁷󠁬󠁳󠁿 dragon glyph.
- Run prettier across backend so the formatter-driven CI lint job passes.
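
The first two bullets above can be sketched together: a REST request body that takes `modelId` from the language config instead of a hardcoded string, and a gender-matched Sarah/Jason fallback. The PR confirms only the `language` field and that `modelId` lives on `langConfig.ttsConfig`; `voiceId`, `text`, `personaGender`, and both helper names are assumptions for illustration.

```typescript
// Hypothetical config shape; only ttsConfig.modelId and bcp47 are implied by the PR text.
type Gender = "female" | "male";

interface LangConfig {
  code: string;
  bcp47: string;
  personaGender: Gender; // assumed field, used here to pick a fallback voice
  ttsConfig: { modelId: string; voiceId: string };
}

// Fallback for languages without a native voice: match the persona's gender.
function fallbackVoice(gender: Gender): string {
  return gender === "female" ? "Sarah" : "Jason";
}

// Assumed REST body layout; the model string is read from config, never hardcoded here.
function buildTtsRequestBody(text: string, lang: LangConfig) {
  return {
    text,
    voiceId: lang.ttsConfig.voiceId,
    modelId: lang.ttsConfig.modelId,
    language: lang.bcp47,
  };
}

const fi: LangConfig = {
  code: "fi",
  bcp47: "fi-FI",
  personaGender: "female", // e.g. the Aino persona maps to Sarah
  ttsConfig: { modelId: "inworld-tts-2", voiceId: fallbackVoice("female") },
};

const body = buildTtsRequestBody("Hyvää huomenta", fi);
console.log(JSON.stringify(body));
```

With this layout, a future model migration would touch only the config entries, which is the stated goal of the change.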
@cshape cshape merged commit 9ad7ca3 into main May 5, 2026
2 checks passed
@cshape cshape deleted the tts2-soniox-multilang branch May 5, 2026 21:25

3 participants