Migrate to inworld-tts-2 + Soniox STT, add 54 languages by cshape · Pull Request #75 · inworld-ai/language-learning-node

cshape · 2026-05-05T19:10:43Z

Summary

TTS → inworld-tts-2 everywhere (realtime + REST). Language is supplied per-call: BCP-47 via session.providerData.tts.language for the realtime session, and the REST language field for Anki batch + flashcard pronunciation. Voices stay cross-lingual but grounded in the target locale.
STT → soniox/stt-rt-v4 with the ISO 639-1 language hint via transcription.language. Soniox emits cumulative partial deltas (each delta is the full transcript so far) — switched the buffer logic from append to replace so partials no longer render as "hi hi there hi there this...".
+54 languages: reshuffled LanguageConfig to make BCP-47 the canonical form (dropped sttLanguageCode and ttsConfig.languageCode). Added all Soniox-supported languages with alternating Sarah/Jason TTS-2 voices and templated personas. The 6 existing curated personas/voices/topics are preserved as-is.
Frontend polish: welcome modal now reads "60+ languages"; the sidebar language dropdown scrolls (max-height: 60vh, overflow-y: auto, overscroll-behavior: contain).
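
The append-versus-replace fix for Soniox partials can be illustrated with a small sketch. The names `PartialTranscriptBuffer`, `applyDelta`, and `render` are hypothetical and not taken from this PR's code; the only assumption carried over from the summary is that each partial delta contains the full transcript so far.

```typescript
// Hypothetical illustration, not the PR's actual session-manager code.
type DeltaMode = "append" | "replace";

class PartialTranscriptBuffer {
  private text = "";
  private readonly mode: DeltaMode;

  constructor(mode: DeltaMode) {
    this.mode = mode;
  }

  // Apply one partial delta and return what the UI would show.
  applyDelta(delta: string): string {
    // "replace" suits cumulative deltas; "append" suits incremental ones.
    this.text = this.mode === "replace" ? delta : this.text + delta;
    return this.text;
  }

  // Clear the buffer once the final transcript for the turn arrives.
  reset(): void {
    this.text = "";
  }
}

// Cumulative deltas: each one repeats the whole transcript so far.
const cumulativeDeltas: string[] = ["hi", "hi there", "hi there this"];

function render(mode: DeltaMode, deltas: string[]): string {
  const buffer = new PartialTranscriptBuffer(mode);
  let shown = "";
  for (const delta of deltas) {
    shown = buffer.applyDelta(delta);
  }
  return shown;
}
```

With cumulative input, `render("replace", cumulativeDeltas)` yields the latest transcript, while `render("append", cumulativeDeltas)` stacks every partial on top of the previous ones, which is the duplication the summary describes.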

Test plan

cd backend && npx tsc --noEmit
cd frontend && npx tsc --noEmit
cd backend && npx vitest run — all 60 tests pass
Pick Spanish (existing) → conversation works, Señor Gael Herrera persona + Rafael voice
Pick Finnish (new) → Sarah/Jason speaks Finnish; Soniox transcribes Finnish speech
Mid-session language switch → fresh session.update with new BCP-47
Watch backend logs for inworld_error_event (TTS-2 / Soniox surface here)
Anki export with a non-English deck — verify generated WAVs sound right

- TTS: switch realtime + REST calls to inworld-tts-2; pass BCP-47 via session.providerData.tts.language and the REST language field so cross-lingual voices stay grounded in the target locale.
- STT: switch to soniox/stt-rt-v4 with ISO 639-1 language hint via transcription.language. Soniox emits cumulative partial deltas, so replace the buffer per delta instead of appending (fixes "hi hi there hi there this..." UI bug).
- Languages: reshuffle LanguageConfig to make BCP-47 canonical (drop sttLanguageCode + ttsConfig.languageCode); add 54 Soniox-supported languages with alternating Sarah/Jason TTS-2 voices and templated personas. Existing 6 curated personas/voices/topics preserved.
- Frontend: bump welcome modal copy to "60+ languages" and let the language dropdown scroll (max-height: 60vh, overflow-y: auto).

Copilot

Pull request overview

This PR expands the language-tutor app from a small curated set to a 60-language catalog while migrating speech integration to Inworld TTS 2 and Soniox realtime STT. It updates both realtime session wiring and REST pronunciation/export paths so language selection now drives provider-specific locale hints.

Changes:

Switch realtime transcription to soniox/stt-rt-v4 and update partial-transcript handling for Soniox’s cumulative deltas.
Migrate TTS requests to inworld-tts-2, passing BCP-47 language hints for realtime and REST pronunciation/export flows.
Restructure LanguageConfig around code + bcp47, add 54 language entries, and make small frontend UI updates for the larger language list.
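
As a rough sketch of how a `code` + `bcp47` config could drive both provider hints, the snippet below builds a `session.update` payload. Only the two paths named in the PR description (`session.providerData.tts.language` and `transcription.language`) come from the source; the trimmed `LanguageConfig` shape, the `voice` field, the placement of `transcription` directly under `session`, and the `buildSessionUpdate` helper are assumptions made for illustration.

```typescript
// Hypothetical, trimmed-down shape; the real LanguageConfig has more fields.
interface LanguageConfig {
  code: string; // short language code used as the STT hint, e.g. "fi"
  bcp47: string; // canonical locale tag used as the TTS hint, e.g. "fi-FI"
  name: string;
  ttsConfig: { modelId: string; voice: string }; // "voice" is an assumed field name
}

const finnish: LanguageConfig = {
  code: "fi",
  bcp47: "fi-FI",
  name: "Finnish",
  ttsConfig: { modelId: "inworld-tts-2", voice: "Sarah" },
};

// Assumed payload layout: only the two language paths are from the PR text.
function buildSessionUpdate(lang: LanguageConfig) {
  return {
    type: "session.update",
    session: {
      transcription: { model: "soniox/stt-rt-v4", language: lang.code },
      providerData: { tts: { language: lang.bcp47 } },
    },
  };
}

const finnishUpdate = buildSessionUpdate(finnish);
```

A mid-session language switch would then amount to calling the builder again with the new config and sending the fresh payload.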

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| `frontend/src/styles/main.css` | Makes the new-chat language dropdown scrollable for the expanded language list. |
| `frontend/src/components/WelcomeModal.tsx` | Updates the welcome copy to advertise broader language support. |
| `backend/src/services/websocket-handler.ts` | Passes the selected language locale into websocket-triggered pronunciation requests. |
| `backend/src/services/session-manager.ts` | Switches realtime STT provider/model, updates partial-delta handling, and sends TTS locale hints in `session.update`. |
| `backend/src/services/inworld-llm.ts` | Updates REST pronunciation requests to TTS 2 and adds a language parameter. |
| `backend/src/helpers/tts-audio-generator.ts` | Updates batch Anki/export TTS generation to TTS 2 with per-language locale hints. |
| `backend/src/config/server.ts` | Changes the default configured TTS model string. |
| `backend/src/config/languages.ts` | Refactors language config shape and adds the new supported language catalog. |
| `backend/src/__tests__/session-manager.test.ts` | Adjusts realtime/session tests for Soniox cumulative deltas and new session payload fields. |
| `backend/src/__tests__/inworld-llm.test.ts` | Updates pronunciation tests for the new method signature and TTS 2 request body. |
| `backend/src/__tests__/config/languages.test.ts` | Updates config tests for the new `bcp47` field. |


Teach the model to use TTS-2's three expressivity mechanisms (English steering tags, English non-verbal tags, target-language disfluencies), and use locale-native voices where the Inworld catalog has them.

- New `disfluencies` field on LanguageConfig with 2-4 fillers per locale (e.g. えーと/あの for ja, este/eh/pues for es) seeded across all 60 languages
- Expanded session.update instructions with a `# Voice & expressivity` section distinguishing English-only control tags ([speak warmly], [laugh]) from inline target-language disfluencies
- Strip bracketed control tags from streaming and final assistant transcripts so the UI shows clean text; held-back buffer handles tags spanning chunks
- Switch ar/nl/he/hi/ja/ko/pl/ru/zh to native voices (Nour, Katrien, Yael, Aarav, Hina, Hyunwoo, Szymon, Elena, Mei). Sarah/Jason remain fallbacks for the ~45 languages without natives in the catalog
- Polish only has male natives, so persona renamed Kasia → Szymon
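
The tag stripping with a held-back buffer might look roughly like the sketch below. `ControlTagStripper` and its methods are hypothetical names, and the logic is one plausible reading of "held-back buffer handles tags spanning chunks", not the PR's implementation: anything from an unmatched `[` onward is withheld until a later chunk supplies the closing `]`.

```typescript
// Hypothetical sketch: strips [bracketed] control tags from streamed text.
class ControlTagStripper {
  // Text from an unmatched "[" onward, withheld until "]" arrives.
  private held = "";

  // Feed one streamed chunk; returns the text that is safe to display now.
  push(chunk: string): string {
    const text = this.held + chunk;
    this.held = "";
    let out = "";
    let i = 0;
    while (i < text.length) {
      const open = text.indexOf("[", i);
      if (open === -1) {
        out += text.slice(i); // no more tags in this chunk
        break;
      }
      out += text.slice(i, open);
      const close = text.indexOf("]", open);
      if (close === -1) {
        this.held = text.slice(open); // tag may continue in the next chunk
        break;
      }
      i = close + 1; // skip the complete tag
    }
    return out;
  }

  // Call at end of turn: an unterminated "[..." is returned as literal text.
  flush(): string {
    const rest = this.held;
    this.held = "";
    return rest;
  }
}
```

This sketch leaves whitespace untouched, so removing a tag in the middle of a sentence can leave a double space; a real implementation would probably also cap how much text it is willing to hold back.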

The previous "0–2 per turn, only when thinking" wording was too permissive and disfluencies were buried as the third bullet under steering and non-verbal tags. With responses already capped at 1–2 sentences, the model read it as optional and skipped them entirely. Lead the expressivity section with disfluencies, frame them as expected in MOST responses, and demote bracketed control tags to optional/rare. Also seed two concrete inline templates so the model has a usage pattern to mimic.

A more capable model should follow the new TTS-2 expressivity guidance (disfluencies, steering tags) more reliably. Updates all three call sites that were unified on the previous model so realtime conversation, flashcard/feedback generation, and memory operations stay aligned.

Live testing showed the model picking the first item in the list ("um") turn after turn: the seeded list was being read as "use these specific words" rather than "natural fillers like these".

- Reframe the list as non-exhaustive examples; explicitly welcome any common ${name} filler the model knows
- Add a "VARY" rule: never reuse the same disfluency in consecutive responses, don't lean on the most generic one every turn
- Drop the example template that hardcoded disfluencies[0], which was reinforcing the lock-in
- Expand the curated 6 (en/es/fr/de/it/pt) and ja/ko/zh from 3-4 fillers to 8-10 each so the model has more native variety to draw from
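
To make the reframing concrete, here is a hedged sketch of how the expressivity section of the instructions could be assembled from a language's `disfluencies` list. The function name and the exact prompt wording are invented for illustration; only the ideas (disfluencies first, non-exhaustive examples, a VARY rule, control tags demoted to optional) come from the commit messages.

```typescript
// Hypothetical prompt builder; the wording is illustrative, not the PR's text.
function buildExpressivitySection(name: string, disfluencies: string[]): string {
  const examples = disfluencies.join(", ");
  return [
    "# Voice & expressivity",
    `- Disfluencies: most responses should include a natural ${name} filler.`,
    `  Examples (non-exhaustive): ${examples}. Any common ${name} filler is welcome.`,
    "- VARY: never reuse the same disfluency in consecutive responses, " +
      "and do not lean on the most generic one every turn.",
    "- Bracketed control tags such as [speak warmly] or [laugh] are " +
      "English-only, optional, and rare.",
  ].join("\n");
}

const japaneseSection = buildExpressivitySection("Japanese", ["えーと", "あの"]);
```

Note that the builder lists every filler as an example and never singles out `disfluencies[0]`, in line with the lock-in problem described above.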

Adopts the inworld-golden-demo pattern: route TTS playback through a two-peer local WebRTC loopback so Chrome's hardware AEC sees the playback as an explicit reference signal. Without that signal the echoCancellation: true constraint we were already passing has nothing concrete to subtract, which is why the agent could hear itself when the user listens through speakers.

Architecture:

- New AudioPipeline owns one shared 24kHz playback AudioContext and a master output node. Both AudioPlayer instances (streaming voice + TTS pronunciation) connect into it so a single WebRTC reference covers all playback paths.
- New WebRtcLoopback (ported from golden demo) wires two RTCPeerConnections locally via ICE; mic stream goes to client peer, playback stream to server peer. The AEC-processed mic stream comes back out for capture.
- Loopback playback routes through a hidden <audio> element so Chrome uses its standard output pipeline (required for the AEC reference).

Smaller wins bundled in:

- getUserMedia constraints add suppressLocalAudioPlayback: { ideal: true } (Chrome-only hint) and sampleRate: 24000.
- Capture AudioContext is created at 24kHz so the worklet receives samples at the target rate, which drops the per-quantum linear-interp resampling.
- Worklet simplified accordingly: just buffer 2400 samples (100ms) and post Int16 PCM. Drop the ScriptProcessorNode fallback (AudioWorklet is universally supported on browsers we target).
- AudioPlayer fade duration bumped 15ms → 25ms; the longer ramp lets Chrome's AEC re-converge cleanly on user interrupt without an audible click. Documented inline.

Graceful degradation: WebRTC loopback only enables on Chrome/Edge. On Firefox/iOS Safari the pipeline detects the lack of support and routes playback directly to ctx.destination, returning the original mic stream unchanged, the same behavior as before this change. iOS handler path is fully preserved.
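
The simplified worklet logic (buffer 2400 samples, then post Int16 PCM) can be sketched as a plain class that is independent of the Web Audio API. `PcmChunker` and `floatToInt16` are hypothetical names; in a real AudioWorkletProcessor the `push` call would sit inside `process()` and the callback would wrap `port.postMessage`. The sketch assumes the capture context already runs at 24 kHz, so no resampling is needed.

```typescript
const CHUNK_SAMPLES = 2400; // 100 ms of audio at 24 kHz

// Convert one float sample in [-1, 1] to signed 16-bit PCM, clamping overshoot.
function floatToInt16(sample: number): number {
  const s = Math.max(-1, Math.min(1, sample));
  return s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
}

// Hypothetical helper: accumulates render quanta and emits fixed-size chunks.
class PcmChunker {
  private buf = new Int16Array(CHUNK_SAMPLES);
  private filled = 0;
  private readonly onChunk: (chunk: Int16Array) => void;

  constructor(onChunk: (chunk: Int16Array) => void) {
    this.onChunk = onChunk;
  }

  // Feed one block of float samples (a worklet quantum is typically 128).
  push(quantum: Float32Array): void {
    for (let i = 0; i < quantum.length; i++) {
      this.buf[this.filled++] = floatToInt16(quantum[i] ?? 0);
      if (this.filled === CHUNK_SAMPLES) {
        this.onChunk(this.buf);
        // Allocate a fresh buffer so the emitted chunk is never overwritten.
        this.buf = new Int16Array(CHUNK_SAMPLES);
        this.filled = 0;
      }
    }
  }
}
```

Because 2400 is not a multiple of 128, a chunk boundary usually falls in the middle of a quantum; the leftover samples simply start the next chunk.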

- Stop hardcoding 'inworld-tts-2' in REST TTS paths (inworld-llm.ts pronounce + tts-audio-generator.ts batch export). Both now read modelId from langConfig.ttsConfig so future model migrations stay in one place.
- Match Sarah/Jason fallback voices to each persona's gender across all 25 non-native-voice languages (e.g. Pieter→Jason, Aino→Sarah, Iker→Jason). Native-voice languages were already correct.
- Indonesian nativeName: 'Indonesia' (country) → 'Bahasa Indonesia'.
- Filipino: rename code 'tl' → 'fil' so Soniox STT and TTS-2 BCP-47 hint reference the same language designator. Updated the LanguageConfig comment to acknowledge ISO 639-2 fallback when 639-1 doesn't apply (Filipino has no 639-1 code).
- Distinguish Basque/Catalan/Galician sidebar flags (all rendered as 🏴 because Spanish-region tag sequences aren't in standard emoji fonts): 🟥 / 🟨 / 🟦 matching each flag's dominant color. Welsh keeps the standard 🏴󠁧󠁢󠁷󠁬󠁳󠁿 dragon glyph.
- Run prettier across backend so the formatter-driven CI lint job passes.
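
A minimal sketch of the "read modelId from langConfig.ttsConfig" idea follows. The request field names other than `language` (`text`, `modelId`, `voiceId`), the `voice` config field, the `fil-PH` tag, and the `buildTtsRequestBody` helper are assumptions for illustration; the actual REST body in the repository may differ.

```typescript
// Hypothetical, trimmed-down config shape for this example only.
interface TtsLanguageConfig {
  code: string; // "fil" for Filipino, which has no ISO 639-1 code
  bcp47: string;
  ttsConfig: { modelId: string; voice: string }; // "voice" is an assumed field
}

const filipino: TtsLanguageConfig = {
  code: "fil",
  bcp47: "fil-PH", // assumed locale tag
  ttsConfig: { modelId: "inworld-tts-2", voice: "Sarah" }, // voice chosen arbitrarily
};

// Build a REST TTS body without hardcoding the model string at the call site.
function buildTtsRequestBody(text: string, lang: TtsLanguageConfig) {
  return {
    text,
    modelId: lang.ttsConfig.modelId, // single source of truth for the model
    voiceId: lang.ttsConfig.voice,
    language: lang.bcp47,
  };
}

const filipinoBody = buildTtsRequestBody("Magandang umaga", filipino);
```

With this arrangement, a future model migration would touch only the config entries, and both the pronunciation and batch-export paths would pick up the change.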

Copilot AI review requested due to automatic review settings May 5, 2026 19:10

cshape requested review from a team as code owners May 5, 2026 19:10

Copilot started reviewing on behalf of cshape May 5, 2026 19:11

Copilot AI reviewed May 5, 2026

cshape added 6 commits May 5, 2026 12:37

SuomiKP31 approved these changes May 5, 2026

cshape merged commit 9ad7ca3 into main May 5, 2026
2 checks passed

cshape deleted the tts2-soniox-multilang branch May 5, 2026 21:25

cshape merged 7 commits into main from tts2-soniox-multilang


3 participants
