feat(stt): add diarization capabilities and speaker_id support #1267
Merged
toubatbrian merged 8 commits into main on Apr 17, 2026
Port of livekit/agents#5438 — adds STT diarization capability detection and speaker_id passthrough from the Python agents framework. https://claude.ai/code/session_01VtE2b4qcjcN21cvDhsdcFo
🦋 Changeset detected
Latest commit: 6398aef
The changes in this PR will be included in the next version bump. This PR includes changesets to release 25 packages.
cc @toubatbrian @livekit/agent-devs for review: this is a port of livekit/agents#5438 (STT diarization capabilities + speaker_id on TimedString/SpeechData).
Generated by Claude Code
- Add `diarize?: boolean` to DeepgramOptions so typed users of STT<'deepgram/nova-3'> can enable diarization without type casts.
- Fix SpeechStream.updateOptions to merge modelOptions instead of overwriting the stream's local state, preserving prior values when callers update only a subset of keys.

https://claude.ai/code/session_01VtE2b4qcjcN21cvDhsdcFo
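The difference between merging and overwriting that this commit describes can be pictured with a small sketch. It is illustrative only: `ModelOptions` and `mergeModelOptions` are hypothetical names, not the actual `SpeechStream` internals, and the option keys are a simplified subset.

```typescript
// Hypothetical, simplified option shape for illustration.
interface ModelOptions {
  diarize?: boolean;
  interim_results?: boolean;
  endpointing?: number;
}

// Merge an update into the current options. Keys that are absent from the
// update keep their previous values; keys present in the update win.
function mergeModelOptions(current: ModelOptions, update: ModelOptions): ModelOptions {
  return { ...current, ...update };
}

const initial: ModelOptions = { diarize: true, interim_results: true };

// Merging: `diarize` and `interim_results` survive a partial update.
const merged = mergeModelOptions(initial, { endpointing: 300 });

// Overwriting (the behavior being fixed): prior keys are lost.
const overwritten: ModelOptions = { endpointing: 300 };
```

One caveat of spread-based merging: a key that is present in the update with an explicit `undefined` value still overwrites the earlier value, so callers should omit keys they do not intend to change.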
Per CLAUDE.md porting rules, every JS change corresponding to a Python change must carry an inline // Ref comment. https://claude.ai/code/session_01VtE2b4qcjcN21cvDhsdcFo
theomonnom approved these changes on Apr 17, 2026
Summary
Port of livekit/agents#5438: adds STT diarization capability detection and `speakerId` passthrough from the Python agents framework.

What's ported
- `speakerId` on `TimedString` (`voice/io.ts`): Added optional `speakerId?: string | null` field to the `TimedString` interface and `createTimedString()` factory, matching Python's `TimedString.speaker_id`.
- `speakerId` on `SpeechData` (`stt/stt.ts`): Added optional `speakerId?: string | null` field to the `SpeechData` interface, so transcript events can carry speaker identity at the utterance level.
- `diarization` on `STTCapabilities` (`stt/stt.ts`): Added optional `diarization?: boolean` capability flag. Also added a `protected updateCapabilities()` method on the base `STT` class so subclasses (like inference STT) can dynamically toggle capabilities after construction.
- Diarization capability detection (`inference/stt.ts`): Ported the `_DIARIZATION_EXTRA_KEYS` / `_diarization_enabled()` pattern from Python. The inference STT now infers `diarization: true` from provider-specific `modelOptions` keys (`diarize` for Deepgram/xAI, `speaker_labels` for AssemblyAI) both at construction and when `updateOptions()` is called.
- xAI STT model type in inference (`inference/stt.ts`): Added `XaiSTTModels` (`'xai/stt-1'`) and `XaiOptions` interface with `diarize`, `endpointing`, `format`, and `interim_results` fields. Updated the `STTOptions` conditional type to resolve `XaiOptions` for xAI models.
- `speaker_id` in wire protocol (`inference/api_protos.ts`): Added optional `speaker_id` field to the Zod schemas for `sttWordSchema`, `sttInterimTranscriptEventSchema`, and `sttFinalTranscriptEventSchema`.
- `speaker_id` passthrough in `processTranscript()` (`inference/stt.ts`): The speech data builder now extracts `speaker_id` from both event-level and word-level server responses and maps them to `speakerId` on `SpeechData` and `TimedString`.
- `speaker_labels` on `AssemblyAIOptions`: Added the missing `speaker_labels` boolean option.

Implementation nuances (Python → TypeScript)
- `speaker_id: str | None` attribute on a `str` subclass → `speakerId?: string | null` field on the `TimedString` interface
- `self._capabilities = replace(self._capabilities, diarization=...)` using `dataclasses.replace()` → `this.updateCapabilities({ diarization: ... })`, a new protected method, since `#capabilities` is a true private field inaccessible to subclasses
- `@overload` per provider with `TypedDict` → `STTOptions<TModel>` that resolves per-model
- `updateOptions`: `self._opts.extra_kwargs.update(extra)` (dict merge) → `{ ...this.opts.modelOptions, ...opts.modelOptions }` (spread merge)
- `speaker_id` (snake_case) → `speakerId` (camelCase) on public interfaces; `speaker_id` preserved in wire protocol Zod schemas

What's NOT ported (Python-specific)
- `uv.lock` changes (Python lockfile / new plugin registrations)
- `@overload` type stubs (TypeScript uses conditional types instead)
- `aiohttp.ClientSession` typing (Node.js uses different HTTP primitives)

Test plan
- … (`pnpm build:agents`)
- … `speakerId` is populated when `diarize: true`
- … `speakerId` flows through when `diarize: true`

https://claude.ai/code/session_01VtE2b4qcjcN21cvDhsdcFo
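The capability detection described under "What's ported" can be sketched roughly as follows. This is a minimal illustration, not the code in `inference/stt.ts`: `SketchSTT`, `SketchCapabilities`, `diarizationEnabled`, and `getCapabilities` are hypothetical names, the capability shape is simplified, and the rule that only a literal `true` enables diarization is an assumption. The two option keys are the ones named in the summary.

```typescript
// Keys named in the PR summary: `diarize` (Deepgram/xAI), `speaker_labels` (AssemblyAI).
const DIARIZATION_EXTRA_KEYS = ['diarize', 'speaker_labels'];

type ModelOptionsBag = Record<string, unknown>;

// Hypothetical helper in the spirit of Python's _diarization_enabled().
// Assumption: only a literal `true` counts as enabling diarization.
function diarizationEnabled(modelOptions: ModelOptionsBag): boolean {
  return DIARIZATION_EXTRA_KEYS.some((key) => modelOptions[key] === true);
}

// Simplified capability shape; the real STTCapabilities may carry more fields.
interface SketchCapabilities {
  streaming: boolean;
  diarization?: boolean;
}

class SketchSTT {
  private caps: SketchCapabilities;
  private modelOptions: ModelOptionsBag;

  constructor(modelOptions: ModelOptionsBag = {}) {
    this.modelOptions = { ...modelOptions };
    // Infer the capability once at construction.
    this.caps = { streaming: true, diarization: diarizationEnabled(this.modelOptions) };
  }

  getCapabilities(): SketchCapabilities {
    return this.caps;
  }

  // Lets this class (and subclasses) patch capabilities after construction.
  protected updateCapabilities(patch: Partial<SketchCapabilities>): void {
    this.caps = { ...this.caps, ...patch };
  }

  updateOptions(modelOptions: ModelOptionsBag): void {
    // Merge rather than overwrite, then re-evaluate the capability.
    this.modelOptions = { ...this.modelOptions, ...modelOptions };
    this.updateCapabilities({ diarization: diarizationEnabled(this.modelOptions) });
  }
}

// Diarization starts off and is switched on by a later options update.
const deepgramLike = new SketchSTT({ interim_results: true });
deepgramLike.updateOptions({ diarize: true });

// Diarization is detected at construction from the AssemblyAI-style key.
const assemblyLike = new SketchSTT({ speaker_labels: true });
```

Re-running the check inside `updateOptions` keeps the advertised capability in step with the merged options, which is the reason the summary gives for adding a protected `updateCapabilities()` on the base class.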