
Conversation

@theomonnom (Member) commented Jan 17, 2026

Summary by CodeRabbit

  • New Features
    • Pluggable evaluation system with multiple judges, aggregated session scoring, and example integration for session-end evaluation.
    • Client-events streaming and RPCs to fetch session state, chat history, agent info, and to send messages; remote subscription support.
    • Conversation history now records agent configuration updates (instructions/tools) and initial configs.
    • Session tagging and telemetry now include evaluation results and outcome metadata.
  • Chores
    • Utilities: wait_for_agent helper and related exports added.


@chenghao-mou requested a review from a team January 17, 2026 21:12
@coderabbitai coderabbitai bot (Contributor) commented Jan 17, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Adds a pluggable LLM-driven evaluation subsystem (judges, JudgeGroup, EvaluationResult), records and exposes agent configuration updates in chat context, introduces client-facing event streaming (ClientEventsHandler/RemoteSession), and wires observability Tagger into JobContext and telemetry for evaluation/outcome tagging.

Changes

Cohort / File(s) / Summary

• Evaluation core
  Files: livekit-agents/livekit/agents/evals/judge.py, livekit-agents/livekit/agents/evals/evaluation.py, livekit-agents/livekit/agents/evals/__init__.py
  Adds the judging framework: Judge and specialized judges (task_completion, handoff, accuracy, tool_use, safety, relevancy, coherence, conciseness), JudgmentResult/Verdict, the Evaluator protocol, EvaluationResult, JudgeGroup, and re-exports in evals.__all__.

• Client events & remote session
  Files: livekit-agents/livekit/agents/voice/client_events.py, livekit-agents/livekit/agents/voice/agent_session.py, livekit-agents/livekit/agents/voice/room_io/room_io.py
  Adds ClientEventsHandler and RemoteSession with typed client event models and RPCs; integrates the event-handler lifecycle into AgentSession; removes legacy text-input registration from RoomIO.

• Chat context & LLM exports
  Files: livekit-agents/livekit/agents/llm/chat_context.py, livekit-agents/livekit/agents/llm/__init__.py, livekit-agents/livekit/agents/__init__.py
  Introduces the AgentConfigUpdate model, extends the ChatItem union, and re-exports AgentConfigUpdate (and AgentHandoff) from llm and the top-level agents package.

• Agent runtime tracking
  Files: livekit-agents/livekit/agents/voice/agent_activity.py
  Emits AgentConfigUpdate on instruction/tool changes and at startup; computes tools_added/tools_removed; stores full tool definitions in the in-memory _tools on updates.

• Observability & job context
  Files: livekit-agents/livekit/agents/observability.py, livekit-agents/livekit/agents/job.py
  Adds a Tagger class to collect tags, evaluations, and outcome reasons; JobContext now creates and exposes the tagger, exposes primary_session, forwards the tagger into the session-report upload, and removes run_job.

• Telemetry & traces
  Files: livekit-agents/livekit/agents/telemetry/traces.py
  _upload_session_report now accepts a tagger; introduces a per-name logger helper and generalized logging; emits evaluation and outcome telemetry derived from the Tagger.

• Types & RPC constants
  Files: livekit-agents/livekit/agents/types.py
  Adds agent/topic/RPC constants (ATTRIBUTE_AGENT_NAME, TOPIC_CLIENT_EVENTS, RPC_* constants) for client events and agent RPCs.

• Utilities & participant helper
  Files: livekit-agents/livekit/agents/utils/participant.py, livekit-agents/livekit/agents/utils/__init__.py
  Adds a wait_for_agent helper to await an agent participant by kind/name and exports it from utils.

• Examples
  Files: examples/frontdesk/frontdesk_agent.py
  Integrates JudgeGroup into on_session_end, adds appointment_booked to userdata, and uses evaluation outcomes to mark session success or failure.

• Worker & types usage
  Files: livekit-agents/livekit/agents/worker.py
  Sets the participant attribute lk.agent.name to the worker's agent name on accepted availability.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant JG as JudgeGroup
    participant J as Judge
    participant LLM as LLM
    participant T as Tagger

    Client->>JG: evaluate(chat_ctx, reference)
    JG->>J: evaluate(chat_ctx, reference, llm)
    J->>LLM: request evaluation (prompt with chat + instructions)
    LLM-->>J: verdict + reasoning
    J-->>JG: JudgmentResult
    JG->>T: _evaluation(EvaluationResult)
    JG-->>Client: EvaluationResult (aggregate)
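
The flow above maps to code roughly as follows. This is a minimal sketch modeled on the frontdesk example reviewed later in this thread; the model string and judge selection are illustrative choices:

```python
# Minimal sketch of the diagrammed flow (illustrative model and judges).
from livekit.agents.evals import JudgeGroup, safety_judge, task_completion_judge


async def evaluate_session(chat_ctx) -> None:
    judges = JudgeGroup(
        llm="openai/gpt-4o-mini",  # example model string
        judges=[task_completion_judge(), safety_judge()],
    )
    # Fans out to each judge's LLM call, aggregates the verdicts,
    # and tags the session when running inside a job context.
    result = await judges.evaluate(chat_ctx)
    print(result.score, result.all_passed)
```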

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • longcw

Poem

🐰 I hopped through chats and config blooms,
I gathered verdicts in tidy rooms,
Tags and tools I tucked away,
Judges chimed at close of day,
I nibbled code and hummed hoorays! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
• Docstring Coverage (⚠️ Warning): docstring coverage is 42.59%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
• Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
• Title check (✅ Passed): the PR title accurately summarizes the two main changes: introducing the AgentConfigUpdate class and initial judge implementations for evaluation.



📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76ab013 and 2d4f33a.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/evals/evaluation.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/telemetry/traces.py
  • livekit-agents/livekit/agents/types.py
  • livekit-agents/livekit/agents/utils/__init__.py
  • livekit-agents/livekit/agents/utils/participant.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/voice/agent_session.py
  • livekit-agents/livekit/agents/voice/client_events.py
  • livekit-agents/livekit/agents/voice/room_io/room_io.py
  • livekit-agents/livekit/agents/worker.py




@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/llm/chat_context.py`:
- Around line 213-229: AgentConfigUpdate is missing the agent_id field, so
callers that set agent_id and the formatter _format_chat_ctx that reads it can
fail; add an agent_id: str | None = None (or appropriate type) to the
AgentConfigUpdate model declaration so the value is preserved and safe to
access, and ensure the new field is included before PrivateAttr/_tools in the
AgentConfigUpdate class so ChatItem (which unions AgentConfigUpdate) will carry
agent_id correctly.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

314-323: agent_id field missing in AgentConfigUpdate (covered in model)

This block passes agent_id; ensure the model defines it so the value isn’t lost.

livekit-agents/livekit/agents/evals/judge.py (1)

16-40: Guard agent_id access for config updates

agent_id is referenced here; ensure the model defines it (see AgentConfigUpdate).

🧹 Nitpick comments (3)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

331-349: Stabilize tool diff ordering for deterministic updates

Converting the set difference to a list yields nondeterministic ordering; sorting keeps logs and tests stable.

🔧 Proposed fix
-        tools_added = list(new_tool_names - old_tool_names) or None
-        tools_removed = list(old_tool_names - new_tool_names) or None
+        tools_added = sorted(new_tool_names - old_tool_names) or None
+        tools_removed = sorted(old_tool_names - new_tool_names) or None
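
For illustration, a quick way to see why this matters: with string hash randomization, the unsorted list can differ across runs. The tool names here are hypothetical.

```python
# Hypothetical tool names; only the ordering behavior is the point here.
old_tool_names = {"transfer_call", "end_call", "lookup"}
new_tool_names = {"end_call", "lookup", "book_slot", "send_sms"}

print(list(new_tool_names - old_tool_names))    # order may vary between runs
print(sorted(new_tool_names - old_tool_names))  # always ['book_slot', 'send_sms']
```
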
livekit-agents/livekit/agents/evals/judge.py (2)

8-13: Add a Google‑style class docstring for JudgmentResult

🔧 Proposed fix
 @dataclass
 class JudgmentResult:
+    """Result of a judge evaluation.
+
+    Attributes:
+        passed: Whether the evaluation passed.
+        reasoning: Model reasoning for the judgment.
+    """
     passed: bool
     """Whether the evaluation passed."""
     reasoning: str
     """Chain-of-thought reasoning for the judgment."""
As per coding guidelines, please add Google-style docstrings.

43-87: Make PASS/FAIL parsing deterministic

rfind can be tripped by “PASS/FAIL” in the reasoning. Require a final verdict line and parse only that.

🔧 Proposed fix
         prompt_parts.extend(
             [
                 "",
                 "Does the conversation meet the criteria? Don't overthink it.",
-                "Explain your reasoning step by step, then answer Pass or Fail.",
+                "Provide a brief justification, then output a final line with exactly PASS or FAIL.",
             ]
         )
@@
-        response = "".join(response_chunks)
-
-        response_upper = response.upper()
-        pass_pos = response_upper.rfind("PASS")
-        fail_pos = response_upper.rfind("FAIL")
-        passed = pass_pos > fail_pos if pass_pos != -1 else False
+        response = "".join(response_chunks).strip()
+        last_line = response.splitlines()[-1].strip().upper() if response else ""
+        passed = last_line.startswith("PASS")
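
A standalone sketch of the last-line parser suggested above, assuming the model follows the revised instruction to end with a line that is exactly PASS or FAIL:

```python
def parse_verdict(response: str) -> bool:
    # Only the final line decides; mentions of PASS/FAIL in the
    # reasoning body no longer affect the outcome.
    lines = response.strip().splitlines()
    last = lines[-1].strip().upper() if lines else ""
    return last.startswith("PASS")


assert parse_verdict("The agent did PASS the caller to billing...\nFAIL") is False
assert parse_verdict("One FAIL case was considered and ruled out.\nPASS") is True
```
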
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 853bc41 and 9af2755.

📒 Files selected for processing (6)
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/chat_context.py
🧬 Code graph analysis (6)
livekit-agents/livekit/agents/evals/__init__.py (1)
livekit-agents/livekit/agents/evals/judge.py (6)
  • Judge (43-87)
  • JudgmentResult (9-13)
  • accuracy_judge (112-128)
  • safety_judge (151-168)
  • task_completion_judge (90-109)
  • tool_use_judge (131-148)
livekit-agents/livekit/agents/llm/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (2)
  • AgentConfigUpdate (213-224)
  • AgentHandoff (205-210)
livekit-agents/livekit/agents/voice/agent_activity.py (2)
livekit-agents/livekit/agents/llm/tool_context.py (4)
  • get_fnc_tool_names (283-292)
  • tools (44-46)
  • ToolContext (295-418)
  • flatten (320-325)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/evals/judge.py (4)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • llm (2815-2819)
livekit-agents/livekit/agents/llm/chat_context.py (7)
  • ChatContext (232-670)
  • items (241-242)
  • items (245-246)
  • text_content (164-173)
  • copy (297-354)
  • copy (690-691)
  • add_message (248-281)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  • output (394-395)
livekit-agents/livekit/agents/voice/agent.py (1)
  • instructions (99-104)
livekit-agents/livekit/agents/llm/chat_context.py (2)
livekit-agents/livekit/agents/utils/misc.py (1)
  • shortuuid (21-22)
livekit-agents/livekit/agents/llm/tool_context.py (1)
  • Tool (31-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: livekit-plugins-cartesia
  • GitHub Check: livekit-plugins-deepgram
  • GitHub Check: livekit-plugins-inworld
  • GitHub Check: livekit-plugins-openai
  • GitHub Check: livekit-plugins-elevenlabs
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (11)
livekit-agents/livekit/agents/llm/__init__.py (2)

1-15: LGTM: AgentConfigUpdate re-exported from llm


55-69: LGTM: __all__ updated to include AgentConfigUpdate

livekit-agents/livekit/agents/__init__.py (2)

39-49: LGTM: top-level imports updated


117-156: LGTM: __all__ export list updated

livekit-agents/livekit/agents/voice/agent_activity.py (2)

18-25: LGTM: tool diff helpers wired in


603-611: LGTM: initial config snapshot recorded

livekit-agents/livekit/agents/evals/judge.py (4)

90-109: LGTM: task completion judge instructions are clear


112-128: LGTM: accuracy judge instructions look solid


131-148: LGTM: tool-use judge instructions look solid


151-167: LGTM: safety judge instructions look solid

livekit-agents/livekit/agents/evals/__init__.py (1)

1-17: LGTM: judge APIs re-exported

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 213 to 230

class AgentConfigUpdate(BaseModel):
    id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
    type: Literal["agent_config_update"] = Field(default="agent_config_update")

    instructions: str | None = None
    tools_added: list[str] | None = None
    tools_removed: list[str] | None = None

    created_at: float = Field(default_factory=time.time)

    _tools: list[Tool] = PrivateAttr(default_factory=list)
    """Full tool definitions (in-memory only, not serialized)."""


 ChatItem = Annotated[
-    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff], Field(discriminator="type")
+    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff, AgentConfigUpdate],
+    Field(discriminator="type"),
 ]
coderabbitai bot (Contributor):

⚠️ Potential issue | 🟠 Major

Add agent_id to AgentConfigUpdate to match callers and formatter

Call sites set agent_id and _format_chat_ctx reads it, but the model doesn’t declare it. Add the field so the value is preserved and attribute access is safe.

🔧 Proposed fix
 class AgentConfigUpdate(BaseModel):
     id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
     type: Literal["agent_config_update"] = Field(default="agent_config_update")
 
+    agent_id: str | None = None
     instructions: str | None = None
     tools_added: list[str] | None = None
     tools_removed: list[str] | None = None
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/llm/chat_context.py` around lines 213 - 229,
AgentConfigUpdate is missing the agent_id field, so callers that set agent_id
and the formatter _format_chat_ctx that reads it can fail; add an agent_id: str
| None = None (or appropriate type) to the AgentConfigUpdate model declaration
so the value is preserved and safe to access, and ensure the new field is
included before PrivateAttr/_tools in the AgentConfigUpdate class so ChatItem
(which unions AgentConfigUpdate) will carry agent_id correctly.

    created_at: float = Field(default_factory=time.time)


class AgentConfigUpdate(BaseModel):
Member:

I am still very fuzzy about the name. Judging by the name, I would assume it is related to changing stt/llm/tts of the agent.

Member Author (@theomonnom):

yeah.. do you have any name suggestion?

Member:

Maybe LLMConfigUpdate?

@theomonnom requested a review from a team January 21, 2026 19:43
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
livekit-agents/livekit/agents/telemetry/traces.py (1)

356-365: Custom Tagger tags are never emitted.
tagger.add/remove won’t surface anywhere because tagger.tags isn’t logged or uploaded. This makes the new tagging API effectively a no-op for custom tags.

🐛 Suggested fix: include tags in the session report log
     _log(
         chat_logger,
         body="session report",
         timestamp=int((report.started_at or report.timestamp or 0) * 1e9),
         attributes={
             "session.options": vars(report.options),
             "session.report_timestamp": report.timestamp,
             "agent_name": agent_name,
+            "session.tags": sorted(tagger.tags),
         },
     )

Also applies to: 385-412

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/evals/evaluation.py`:
- Around line 61-66: The majority_passed property compares a fractional score to
a count, which is wrong; update the logic in majority_passed (in evaluation.py)
to compare like-for-like: either check if self.score > 0.5 (since score is in
[0,1]) or compute passed_count = self.score * len(self.judgments) and compare
passed_count > len(self.judgments)/2; keep the existing empty-judgments shortcut
(return True when not self.judgments).
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9af2755 and 754ce38.

📒 Files selected for processing (7)
  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/evals/evaluation.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/telemetry/traces.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • livekit-agents/livekit/agents/evals/__init__.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/job.py
  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/telemetry/traces.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/evals/evaluation.py
🧬 Code graph analysis (5)
livekit-agents/livekit/agents/observability.py (1)
livekit-agents/livekit/agents/evals/evaluation.py (2)
  • EvaluationResult (32-71)
  • name (18-20)
livekit-agents/livekit/agents/job.py (1)
livekit-agents/livekit/agents/observability.py (1)
  • Tagger (9-101)
examples/frontdesk/frontdesk_agent.py (4)
livekit-agents/livekit/agents/evals/evaluation.py (4)
  • JudgeGroup (73-165)
  • judges (123-125)
  • evaluate (22-28)
  • evaluate (127-165)
livekit-agents/livekit/agents/job.py (3)
  • userdata (662-663)
  • make_session_report (269-304)
  • tagger (255-267)
livekit-agents/livekit/agents/voice/agent_session.py (2)
  • userdata (371-375)
  • userdata (378-379)
livekit-agents/livekit/agents/observability.py (2)
  • success (37-46)
  • fail (48-57)
livekit-agents/livekit/agents/telemetry/traces.py (1)
livekit-agents/livekit/agents/observability.py (3)
  • Tagger (9-101)
  • evaluations (81-83)
  • outcome_reason (86-88)
livekit-agents/livekit/agents/evals/evaluation.py (2)
livekit-agents/livekit/agents/evals/judge.py (10)
  • JudgmentResult (15-34)
  • name (148-149)
  • name (199-200)
  • name (267-268)
  • evaluate (151-185)
  • evaluate (202-252)
  • evaluate (270-318)
  • passed (22-24)
  • uncertain (32-34)
  • failed (27-29)
livekit-agents/livekit/agents/job.py (4)
  • job (323-325)
  • job (692-693)
  • get_job_context (57-64)
  • tagger (255-267)
🔇 Additional comments (30)
livekit-agents/livekit/agents/job.py (3)

41-41: Tagger initialization looks solid.
Creates a per-job Tagger instance and wires it into context state cleanly.

Also applies to: 186-187


254-267: Nice public Tagger accessor.
Clear docstring and straightforward API surface.


225-231: All _upload_session_report call sites have been updated correctly.
Only one call site exists in the codebase (job.py:225), and it includes the required tagger argument along with all other parameters.

examples/frontdesk/frontdesk_agent.py (4)

28-38: Eval judge imports look good.


45-48: Userdata flag default is fine.


111-112: Marks booking success appropriately.


176-204: No changes needed. The code is correct and follows the recommended pattern from the library documentation.

  • ChatContext.copy() does support both exclude_function_call and exclude_instructions parameters as used in the code.
  • The pattern of calling judges.evaluate() without error handling matches the official JudgeGroup docstring example, which explicitly documents this as the recommended approach. Results are automatically tagged to the session.

Likely an incorrect or invalid review comment.

livekit-agents/livekit/agents/telemetry/traces.py (1)

320-383: Session-scoped logger reuse is a solid improvement.
Keeps consistent logger attributes across session report and chat-item logs.

livekit-agents/livekit/agents/observability.py (5)

32-35: State initialization is clear and minimal.


37-57: Outcome tagging logic is straightforward.


59-73: Add/remove tag helpers are clean.


75-88: Accessors return copies as expected.


90-101: Evaluation tagging hook integrates cleanly.

livekit-agents/livekit/agents/evals/evaluation.py (6)

14-28: Evaluator protocol reads well.


38-49: Score computation is consistent with the docstring.


51-60: Pass/fail helpers are clear.


68-71: none_failed helper looks good.


95-115: JudgeGroup initialization is straightforward.


127-165: Concurrent evaluation flow looks good.

livekit-agents/livekit/agents/evals/judge.py (11)

14-34: JudgmentResult definition is clear.


37-65: Chat item formatting covers all relevant item types.


68-70: ChatContext formatting helper is concise.


73-81: Latest-instructions extraction is sensible.


84-86: Handoff detection helper is fine.


89-110: Verdict parsing logic looks reasonable.


113-136: LLM evaluation flow is clean and readable.


139-185: Generic Judge evaluation path is solid.


188-253: Task completion judge prompt flow is well-structured.


255-318: Handoff judge logic is clear and pragmatic.


321-500: Built-in judge factories are well documented and consistent.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +61 to +69

    @property
    def majority_passed(self) -> bool:
        """True if more than half of the judgments passed."""
        if not self.judgments:
            return True
        return self.score > len(self.judgments) / 2
coderabbitai bot (Contributor):

⚠️ Potential issue | 🟠 Major

majority_passed is mathematically incorrect.
self.score is in [0,1], but it’s compared against len(self.judgments)/2 (≥1 for 2+ judges), so this almost always returns False.

🐛 Suggested fix
     def majority_passed(self) -> bool:
         """True if more than half of the judgments passed."""
         if not self.judgments:
             return True
-        return self.score > len(self.judgments) / 2
+        passed = sum(1 for j in self.judgments.values() if j.passed)
+        return passed > len(self.judgments) / 2
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/evals/evaluation.py` around lines 61 - 66, The
majority_passed property compares a fractional score to a count, which is wrong;
update the logic in majority_passed (in evaluation.py) to compare like-for-like:
either check if self.score > 0.5 (since score is in [0,1]) or compute
passed_count = self.score * len(self.judgments) and compare passed_count >
len(self.judgments)/2; keep the existing empty-judgments shortcut (return True
when not self.judgments).
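
A worked example of the bug, with judgments simplified to booleans for illustration: with 3 judges and 2 passes, score = 2/3 ≈ 0.67 while len(judgments)/2 = 1.5, so the original check reports False despite a clear majority.

```python
# Simplified stand-in for JudgmentResult values (booleans for illustration).
judgments = {"accuracy": True, "safety": True, "relevancy": False}

score = sum(judgments.values()) / len(judgments)      # 0.666...
print(score > len(judgments) / 2)                     # False -- buggy comparison
print(sum(judgments.values()) > len(judgments) / 2)   # True  -- count vs. count
```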

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
livekit-agents/livekit/agents/job.py (1)

220-231: Tags silently fail to upload when recording is disabled, contradicting documentation.

The Tagger API documentation states "Tags are uploaded to LiveKit Cloud at session end" without condition, but _upload_session_report() (which uploads tags along with session report) is only invoked when report.enable_recording is true. This means tags are silently dropped whenever recording is disabled, creating a mismatch between documented behavior and actual behavior. Either upload tags independently of the recording flag or update documentation to clarify tags only upload when recording is enabled.

🤖 Fix all issues with AI agents
In `@examples/frontdesk/frontdesk_agent.py`:
- Around line 176-204: The early return in on_session_end prevents tagging short
sessions; instead of returning when len(chat.items) < 3, skip running JudgeGroup
evaluation but still call the tagging logic: create the report
(ctx.make_session_report()), skip or bypass judges.evaluate(...) when the chat
is short, then call ctx.tagger.success() if
ctx.primary_session.userdata.appointment_booked is true or ctx.tagger.fail(...)
otherwise, ensuring the tagging runs regardless of whether judges were executed.

In `@livekit-agents/livekit/agents/evals/judge.py`:
- Around line 1-10: The import block is unsorted causing Ruff I001; reorder the
imports into proper groups: keep "from __future__ import annotations" first,
then standard library imports (re, dataclasses/dataclass, typing/Any, Literal)
sorted alphabetically, then third-party (none here), then local/package imports
sorted alphabetically — specifically ensure "from ..llm import LLM,
ChatContext", "from ..log import logger", and "from ..types import NOT_GIVEN,
NotGivenOr" are in the local imports group and sorted; run ruff to verify the
import order is now compliant.
🧹 Nitpick comments (2)
livekit-agents/livekit/agents/job.py (2)

254-267: Align the new Tagger docstring with Google-style format.

This docstring adds a new public API; please standardize it with Returns / Examples sections for consistency.

♻️ Proposed docstring format
     def tagger(self) -> Tagger:
         """Returns the Tagger for adding tags and outcomes to the session.
 
-        Tags are uploaded to LiveKit Cloud at session end.
-
-        Example:
-            ```python
-            ctx.tagger.success(reason="Task completed successfully")
-            ctx.tagger.fail(reason="User hung up before completing")
-            ctx.tagger.add("voicemail:true")
-            ```
+        Notes:
+            Tags are uploaded to LiveKit Cloud at session end.
+
+        Returns:
+            Tagger: Tagger instance scoped to this job.
+
+        Examples:
+            ctx.tagger.success(reason="Task completed successfully")
+            ctx.tagger.fail(reason="User hung up before completing")
+            ctx.tagger.add("voicemail:true")
         """
         return self._tagger

As per coding guidelines, please use Google-style docstrings.


345-351: Document primary_session return/raise behavior in Google style.

Since the accessor raises when unset, the docstring should include a Raises section for clarity and compliance.

♻️ Proposed docstring format
     def primary_session(self) -> AgentSession:
-        """Returns the primary AgentSession for this job."""
+        """Returns the primary AgentSession for this job.
+
+        Returns:
+            AgentSession: The primary session instance.
+
+        Raises:
+            RuntimeError: If no AgentSession was started for this job.
+        """
         if not self._primary_agent_session:
             raise RuntimeError("No AgentSession was started for this job")
         return self._primary_agent_session

As per coding guidelines, please use Google-style docstrings.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 754ce38 and dfd036f.

📒 Files selected for processing (3)
  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • examples/frontdesk/frontdesk_agent.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
🪛 GitHub Actions: CI
livekit-agents/livekit/agents/evals/judge.py

[error] 1-1: Ruff I001 Import block is un-sorted or un-formatted.

🪛 GitHub Check: ruff
livekit-agents/livekit/agents/evals/judge.py

[failure] 1-9: Ruff (I001)
livekit-agents/livekit/agents/evals/judge.py:1:1: I001 Import block is un-sorted or un-formatted

🔇 Additional comments (23)
livekit-agents/livekit/agents/evals/judge.py (18)

11-34: Clear verdict modeling and helpers.

The Verdict alias plus JudgmentResult helpers are concise and easy to consume.


37-65: Chat item formatting is consistent and readable.

Good coverage across message, tool, handoff, and config update item types.


68-70: Nice, minimal wrapper for chat context formatting.


73-81: Latest-instructions lookup is straightforward and correct.


84-86: Handoff detection is clean and efficient.


89-110: Verdict parsing logic is clear and well-structured.


113-136: LLM evaluation flow is well-scoped and deterministic.


139-185: Judge base class interface is clean and reusable.


188-253: Task-completion judge prompt construction looks solid.


255-318: Handoff judge behavior and prompt are consistent.


321-332: Factory: task_completion_judge is clear and well-scoped.


334-345: Factory: handoff_judge is clear and well-scoped.


347-365: Factory: accuracy_judge criteria are precise and actionable.


367-387: Factory: tool_use_judge criteria are comprehensive.


389-408: Factory: safety_judge criteria are appropriate for regulated contexts.


410-429: Factory: relevancy_judge criteria are clear and testable.


431-448: Factory: coherence_judge criteria are clear and concise.


451-468: Factory: conciseness_judge criteria are well-calibrated for voice.

examples/frontdesk/frontdesk_agent.py (3)

28-38: Evals import additions are clean and localized.


45-48: appointment_booked flag is a useful session signal.


102-114: Booking flag is set at the right point in the flow.

livekit-agents/livekit/agents/job.py (2)

41-42: LGTM for Tagger import.

Clean, scoped addition for the new observability capability.


186-187: LGTM for per-context Tagger initialization.

Keeping a Tagger instance on the JobContext is a clean lifecycle boundary.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 176 to 204
async def on_session_end(ctx: JobContext) -> None:
    # import json
    report = ctx.make_session_report()

    # Skip evaluation for very short conversations
    chat = report.chat_history.copy(exclude_function_call=True, exclude_instructions=True)
    if len(chat.items) < 3:
        return

    judges = JudgeGroup(
        llm="openai/gpt-4o-mini",
        judges=[
            task_completion_judge(),
            accuracy_judge(),
            tool_use_judge(),
            handoff_judge(),
            safety_judge(),
            relevancy_judge(),
            coherence_judge(),
            conciseness_judge(),
        ],
    )

    # report = ctx.make_session_report()
    # report_json = json.dumps(report.to_cloud_data(), indent=2)
    await judges.evaluate(report.chat_history)

    pass
    if ctx.primary_session.userdata.appointment_booked:
        ctx.tagger.success()
    else:
        ctx.tagger.fail(reason="Appointment was not booked")

@coderabbitai coderabbitai bot (Contributor) commented Jan 21, 2026

⚠️ Potential issue | 🟠 Major

Short-session early return skips tagging outcomes.

At Line 180–182, returning early prevents ctx.tagger.success() / fail(...) from being called, so short sessions never get tagged—even if a booking completed. Consider skipping evaluation but still tagging.

🔧 Suggested restructure
-    if len(chat.items) < 3:
-        return
-
-    judges = JudgeGroup(
-        llm="openai/gpt-4o-mini",
-        judges=[
-            task_completion_judge(),
-            accuracy_judge(),
-            tool_use_judge(),
-            handoff_judge(),
-            safety_judge(),
-            relevancy_judge(),
-            coherence_judge(),
-            conciseness_judge(),
-        ],
-    )
-
-    await judges.evaluate(report.chat_history)
+    if len(chat.items) >= 3:
+        judges = JudgeGroup(
+            llm="openai/gpt-4o-mini",
+            judges=[
+                task_completion_judge(),
+                accuracy_judge(),
+                tool_use_judge(),
+                handoff_judge(),
+                safety_judge(),
+                relevancy_judge(),
+                coherence_judge(),
+                conciseness_judge(),
+            ],
+        )
+
+        await judges.evaluate(report.chat_history)
 
     if ctx.primary_session.userdata.appointment_booked:
         ctx.tagger.success()
     else:
         ctx.tagger.fail(reason="Appointment was not booked")
🤖 Prompt for AI Agents
In `@examples/frontdesk/frontdesk_agent.py` around lines 176 - 204, The early
return in on_session_end prevents tagging short sessions; instead of returning
when len(chat.items) < 3, skip running JudgeGroup evaluation but still call the
tagging logic: create the report (ctx.make_session_report()), skip or bypass
judges.evaluate(...) when the chat is short, then call ctx.tagger.success() if
ctx.primary_session.userdata.appointment_booked is true or ctx.tagger.fail(...)
otherwise, ensuring the tagging runs regardless of whether judges were executed.

Member Author (@theomonnom):

it's fine bro

coderabbitai bot (Contributor):

@theomonnom, understood! I'll leave the implementation as is. 👍

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/evals/judge.py`:
- Around line 123-145: The loop overwrites `arguments` for each streamed chunk;
change it to accumulate the streamed tool call payload from `llm.chat` so
partial tool call arguments across multiple chunks are concatenated into a
single full payload before calling `llm_utils.prepare_function_arguments`.
Specifically, when inspecting `chunk.delta.tool_calls` (and the extracted
`tool`), append `tool.arguments` to an accumulator (initialize `arguments` as an
empty string or buffer) instead of assigning, preserve ordering of chunks, and
only call `llm_utils.prepare_function_arguments(fnc=submit_verdict,
json_arguments=arguments)` after the loop; keep references to `llm.chat`,
`submit_verdict`, `prepare_function_arguments`, `arguments`, and
`chunk.delta.tool_calls` to locate and modify the code.

In `@livekit-agents/livekit/agents/voice/client_events.py`:
- Around line 268-270: Remove the extra blank line after the assignment
self._rpc_handlers_registered = True to satisfy ruff formatting; open the
function or class containing that assignment (look for the attribute set in
client_events.py) and delete the stray empty line so the assignment is
immediately followed by the next statement or end of block, keeping surrounding
indentation intact.
🧹 Nitpick comments (4)
livekit-agents/livekit/agents/evals/evaluation.py (1)

11-11: Consider handling non-numeric environment variable values.

int(os.getenv("LIVEKIT_EVALS_VERBOSE", 0)) will raise ValueError if the env var contains a non-numeric string. Consider a safer fallback.

🛡️ Proposed defensive fix
-_evals_verbose = int(os.getenv("LIVEKIT_EVALS_VERBOSE", 0))
+try:
+    _evals_verbose = int(os.getenv("LIVEKIT_EVALS_VERBOSE", "0"))
+except ValueError:
+    _evals_verbose = 0
livekit-agents/livekit/agents/voice/client_events.py (3)

441-451: Add exception handling in _read_text to prevent silent task failures.

If text_input_cb raises an exception, the task will fail without logging. Consider wrapping the callback invocation with error handling to improve observability.

♻️ Proposed fix
         async def _read_text(text_input_cb: TextInputCallback) -> None:
             from .room_io.types import TextInputEvent

             text = await reader.read_all()

-            text_input_result = text_input_cb(
-                self._session,
-                TextInputEvent(text=text, info=reader.info, participant=participant),
-            )
-            if asyncio.iscoroutine(text_input_result):
-                await text_input_result
+            try:
+                text_input_result = text_input_cb(
+                    self._session,
+                    TextInputEvent(text=text, info=reader.info, participant=participant),
+                )
+                if asyncio.iscoroutine(text_input_result):
+                    await text_input_result
+            except Exception:
+                logger.exception("error in text input callback")

558-592: Consider adding timeouts to RPC calls.

The fetch_* and send_message methods call perform_rpc without timeouts. If the remote agent becomes unresponsive, these calls could hang indefinitely. Consider adding a configurable timeout parameter or a sensible default.
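
Where a built-in timeout isn't used, a simple wrapper bounds the wait. This is a sketch assuming the perform_rpc call shape used elsewhere in this PR; the method name and time budget are illustrative.

```python
import asyncio


async def fetch_with_timeout(room, agent_identity: str) -> str:
    # Bound the RPC wait so an unresponsive remote agent can't hang the caller.
    return await asyncio.wait_for(
        room.local_participant.perform_rpc(
            destination_identity=agent_identity,
            method="fetch_session_state",  # illustrative RPC name
            payload="",
        ),
        timeout=10.0,  # illustrative budget
    )
```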


396-397: Accessing private attribute _started_at.

self._session._started_at accesses an internal attribute. Consider exposing this via a public property on AgentSession if it's part of the intended API.
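
If it is meant to be public, a small accessor would avoid the private reach-in. A minimal sketch; `started_at` is a hypothetical property name, not part of the reviewed API:

```python
from __future__ import annotations


class AgentSession:
    def __init__(self) -> None:
        self._started_at: float | None = None

    @property
    def started_at(self) -> float | None:
        """Hypothetical public accessor for the session start time."""
        return self._started_at
```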

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a2c218 and 4412560.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • livekit-agents/livekit/agents/evals/evaluation.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/job.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/telemetry/traces.py
  • livekit-agents/livekit/agents/types.py
  • livekit-agents/livekit/agents/voice/agent_session.py
  • livekit-agents/livekit/agents/voice/client_events.py
  • livekit-agents/livekit/agents/voice/room_io/room_io.py
💤 Files with no reviewable changes (1)
  • livekit-agents/livekit/agents/voice/room_io/room_io.py
✅ Files skipped from review due to trivial changes (1)
  • livekit-agents/livekit/agents/types.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/job.py
  • livekit-agents/livekit/agents/telemetry/traces.py
  • livekit-agents/livekit/agents/voice/agent_session.py
  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/evals/evaluation.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/voice/client_events.py
🧠 Learnings (3)
📚 Learning: 2026-01-22T03:28:16.289Z
Learnt from: longcw
Repo: livekit/agents PR: 4563
File: livekit-agents/livekit/agents/beta/tools/end_call.py:65-65
Timestamp: 2026-01-22T03:28:16.289Z
Learning: In code paths that check capabilities or behavior of the LLM processing the current interaction, prefer using the activity's LLM obtained via ctx.session.current_agent._get_activity_or_raise().llm instead of ctx.session.llm. The session-level LLM may be a fallback and not reflect the actual agent handling the interaction. Use the activity LLM to determine capabilities and to make capability checks or feature toggles relevant to the current processing agent.

Applied to files:

  • livekit-agents/livekit/agents/job.py
  • livekit-agents/livekit/agents/telemetry/traces.py
  • livekit-agents/livekit/agents/voice/agent_session.py
  • livekit-agents/livekit/agents/observability.py
  • livekit-agents/livekit/agents/evals/evaluation.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/voice/client_events.py
📚 Learning: 2026-01-30T12:53:12.738Z
Learnt from: milanperovic
Repo: livekit/agents PR: 4660
File: livekit-plugins/livekit-plugins-personaplex/livekit/plugins/personaplex/__init__.py:19-21
Timestamp: 2026-01-30T12:53:12.738Z
Learning: In the livekit/agents repository, plugin __init__.py files follow a convention where `from livekit.agents import Plugin` and `from .log import logger` imports are placed after the `__all__` definition. These are internal imports for plugin registration and are not part of the public API. This pattern is used consistently across plugins like openai, deepgram, and ultravox, and does not trigger ruff E402 violations.

Applied to files:

  • livekit-agents/livekit/agents/evals/judge.py
📚 Learning: 2026-01-16T07:44:56.353Z
Learnt from: CR
Repo: livekit/agents PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-16T07:44:56.353Z
Learning: Applies to **/*.py : Run ruff linter and auto-fix issues

Applied to files:

  • livekit-agents/livekit/agents/evals/judge.py
🧬 Code graph analysis (8)
livekit-agents/livekit/agents/job.py (2)
livekit-agents/livekit/agents/observability.py (1)
  • Tagger (9-103)
livekit-agents/livekit/agents/telemetry/traces.py (1)
  • _upload_session_report (320-462)
livekit-agents/livekit/agents/telemetry/traces.py (1)
livekit-agents/livekit/agents/observability.py (3)
  • Tagger (9-103)
  • evaluations (81-83)
  • outcome_reason (86-88)
livekit-agents/livekit/agents/voice/agent_session.py (1)
livekit-agents/livekit/agents/voice/client_events.py (6)
  • ClientEventsHandler (158-459)
  • register_text_input (233-249)
  • start (188-194)
  • start (512-517)
  • aclose (196-231)
  • aclose (519-530)
livekit-agents/livekit/agents/observability.py (2)
livekit-agents/livekit/agents/evals/evaluation.py (2)
  • EvaluationResult (35-74)
  • name (21-23)
livekit-agents/livekit/agents/evals/judge.py (3)
  • name (155-156)
  • name (202-203)
  • name (265-266)
livekit-agents/livekit/agents/evals/evaluation.py (3)
livekit-agents/livekit/agents/evals/judge.py (10)
  • JudgmentResult (14-33)
  • name (155-156)
  • name (202-203)
  • name (265-266)
  • evaluate (158-188)
  • evaluate (205-250)
  • evaluate (268-311)
  • passed (21-23)
  • uncertain (31-33)
  • failed (26-28)
livekit-agents/livekit/agents/job.py (4)
  • job (321-323)
  • job (697-698)
  • get_job_context (55-62)
  • tagger (253-265)
livekit-agents/livekit/agents/observability.py (1)
  • _evaluation (90-103)
livekit-agents/livekit/agents/llm/chat_context.py (2)
livekit-agents/livekit/agents/utils/misc.py (1)
  • shortuuid (21-22)
livekit-agents/livekit/agents/llm/tool_context.py (1)
  • Tool (31-32)
livekit-agents/livekit/agents/evals/judge.py (7)
livekit-agents/livekit/agents/evals/evaluation.py (2)
  • llm (122-124)
  • name (21-23)
livekit-agents/livekit/agents/voice/agent_session.py (3)
  • llm (1281-1282)
  • output (396-397)
  • tools (431-432)
livekit-agents/livekit/agents/voice/agent_activity.py (2)
  • llm (2815-2819)
  • tools (291-294)
livekit-agents/livekit/agents/llm/chat_context.py (7)
  • ChatContext (233-671)
  • items (242-243)
  • items (246-247)
  • text_content (164-173)
  • add_message (249-282)
  • copy (298-355)
  • copy (691-692)
livekit-agents/livekit/agents/voice/speech_handle.py (1)
  • interrupted (83-84)
livekit-agents/livekit/agents/voice/agent.py (1)
  • instructions (99-104)
livekit-agents/livekit/agents/llm/utils.py (1)
  • prepare_function_arguments (359-414)
livekit-agents/livekit/agents/voice/client_events.py (5)
livekit-agents/livekit/agents/voice/agent_session.py (4)
  • llm (1281-1282)
  • room_io (740-746)
  • agent_state (420-421)
  • run (434-441)
livekit-agents/livekit/agents/llm/chat_context.py (5)
  • ChatMessage (151-173)
  • FunctionCall (179-192)
  • FunctionCallOutput (195-202)
  • items (242-243)
  • items (246-247)
livekit-agents/livekit/agents/voice/events.py (1)
  • AgentStateChangedEvent (108-112)
livekit-agents/livekit/agents/voice/run_result.py (1)
  • RunResult (70-214)
livekit-agents/livekit/agents/voice/room_io/types.py (1)
  • TextInputEvent (32-35)
🪛 GitHub Actions: CI
livekit-agents/livekit/agents/voice/client_events.py

[error] 1-1: ruff format --check detected that 1 file would be reformatted: livekit-agents/livekit/agents/voice/client_events.py. Run 'ruff format' to fix formatting.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (34)
livekit-agents/livekit/agents/llm/chat_context.py (1)

227-230: LGTM!

The ChatItem union correctly includes AgentConfigUpdate with the discriminator field "type", maintaining consistency with the existing pattern for other item types.

livekit-agents/livekit/agents/evals/evaluation.py (3)

17-31: LGTM!

The Evaluator Protocol is well-defined with clear type hints and docstrings, providing a clean interface for implementing custom evaluators.


34-62: LGTM!

EvaluationResult dataclass with score, all_passed, any_passed, and none_failed properties are correctly implemented. The scoring logic properly handles pass/maybe/fail verdicts.


77-190: LGTM!

JudgeGroup implementation is solid:

  • Dynamic LLM instantiation from string is handled correctly.
  • Parallel judge execution with proper exception handling.
  • Auto-tagging with graceful fallback when not in job context.
  • Verbose output for debugging is well-structured.
livekit-agents/livekit/agents/evals/judge.py (7)

1-8: Import order appears correct now.

The imports are properly sorted: __future__ first, standard library (dataclasses, typing), then local imports alphabetically (..llm, ..log). Past CI failure should be resolved.


13-34: LGTM!

JudgmentResult dataclass is well-designed with clear semantics for passed, failed, and uncertain properties. The docstrings clearly explain the "maybe" handling.


36-70: LGTM!

The helper functions _format_items, _format_chat_ctx, _get_latest_instructions, and _has_handoffs are clean utilities that properly handle all ChatItem types including the new agent_config_update.


148-189: LGTM!

The Judge base class is well-structured with proper LLM resolution logic and clean prompt construction. The reference context handling is appropriate.


191-312: LGTM!

_TaskCompletionJudge and _HandoffJudge are well-implemented with:

  • Proper LLM validation
  • Meaningful warning when instructions are missing
  • Smart handling of no-handoff case (auto-pass)
  • Clear evaluation criteria in prompts

314-462: LGTM!

The factory functions (task_completion_judge, handoff_judge, accuracy_judge, tool_use_judge, safety_judge, relevancy_judge, coherence_judge, conciseness_judge) are well-documented with clear use cases and comprehensive evaluation criteria.


117-121: The model attribute is guaranteed and the exclusion is justified.

The model property is a guaranteed attribute on all LLM instances (defined in the base LLM class with a default fallback to "unknown"), so the substring check is safe and does not require defensive programming. Additionally, the hardcoded "gpt-5" exclusion is justified: OpenAI's GPT-5 models raise an error when temperature is included in requests, so this exclusion prevents API errors. The substring matching approach is also appropriate as it correctly handles model variants (e.g., "gpt-5-turbo", "gpt-5-mini").
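
A self-contained sketch of that gate, where the temperature value itself is an illustrative choice:

```python
def build_extra_kwargs(model: str) -> dict[str, float]:
    # GPT-5 variants reject `temperature`, so omit it whenever the model
    # name contains "gpt-5" (covers "gpt-5-mini", "gpt-5-turbo", ...).
    if "gpt-5" in model:
        return {}
    return {"temperature": 0.0}  # illustrative default


assert build_extra_kwargs("gpt-5-mini") == {}
assert build_extra_kwargs("gpt-4o-mini") == {"temperature": 0.0}
```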

livekit-agents/livekit/agents/voice/agent_session.py (4)

44-44: LGTM!

Import of ClientEventsHandler is correctly placed with other voice module imports.


338-338: LGTM!

Initializing _client_events_handler to None follows the pattern of other optional handlers in the session.


845-848: LGTM!

Closing ClientEventsHandler before RoomIO is the correct order to ensure clean shutdown. The handler is properly nullified after close.


582-593: LGTM!

The ClientEventsHandler integration follows proper lifecycle patterns:

  • Created after RoomIO.start() completes
  • Text input registered conditionally based on configuration
  • Handler started asynchronously
livekit-agents/livekit/agents/job.py (5)

39-40: LGTM!

Imports for Tagger and _upload_session_report are correctly placed and sorted.


184-184: LGTM!

Initializing Tagger in JobContext.__init__ ensures each job context has its own isolated tagging state.


252-265: LGTM!

The tagger property is well-documented with a Google-style docstring including practical usage examples. The property correctly exposes the internal _tagger instance.


343-348: LGTM!

The primary_session property provides convenient access with appropriate error handling when no session exists.


223-228: LGTM!

The tagger is correctly passed to _upload_session_report, enabling evaluation and outcome data to be included in the telemetry upload.

livekit-agents/livekit/agents/telemetry/traces.py (4)

42-42: LGTM!

TYPE_CHECKING import for Tagger avoids circular import issues while enabling type hints.


320-336: LGTM!

The _get_logger factory function cleanly encapsulates logger creation with session metadata, improving code reusability.


338-352: LGTM!

The _log helper function refactoring generalizes logging to accept any logger instance, enabling reuse for both chat and evaluation logging.


385-410: LGTM!

Evaluation and outcome logging is well-implemented:

  • Creates a dedicated eval_logger for separation of concerns
  • Maps "fail" verdicts to ERROR severity appropriately
  • Logs outcome reason when present
  • Uses consistent timestamp formatting
livekit-agents/livekit/agents/observability.py (6)

1-6: LGTM!

Imports are minimal and correctly use TYPE_CHECKING to avoid circular imports with EvaluationResult.


9-35: LGTM!

The Tagger class has excellent documentation with practical examples. The internal state is well-structured with appropriate types.


37-57: LGTM!

success() and fail() methods correctly implement mutual exclusivity by discarding the opposite tag before adding the new one. The _outcome_reason is properly stored for both.


59-73: LGTM!

add() and remove() methods are simple and correct. Using discard() for removal is appropriate as it doesn't raise an error if the tag doesn't exist.


75-88: LGTM!

Properties correctly return copies of internal state to prevent external mutation of _tags and _evaluation_results.


90-103: LGTM!

The _evaluation() internal method properly:

  • Iterates over judgment results
  • Creates tags in the format lk.judge.{name}:{verdict}
  • Appends structured evaluation data for telemetry

The underscore prefix and docstring clearly indicate this is for internal use by JudgeGroup.evaluate().

livekit-agents/livekit/agents/voice/client_events.py (4)

1-46: LGTM!

The imports and module structure are well-organized. The use of from __future__ import annotations correctly enables PEP 604 union syntax (|) for Python 3.9+ compatibility.


50-106: LGTM!

The client event models are well-defined Pydantic models with appropriate type discriminators. The ClientEvent discriminated union using Annotated with Field(discriminator="type") enables proper JSON deserialization.


148-155: LGTM!

The _tool_names utility correctly handles recursive extraction from nested Toolset objects.


473-530: LGTM!

The RemoteSession class is well-documented with a clear docstring and usage example. The lifecycle management (start/aclose) properly handles text stream registration and task cleanup.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +123 to +145
arguments: str | None = None
async for chunk in llm.chat(
    chat_ctx=eval_ctx,
    tools=[submit_verdict],
    tool_choice={"type": "function", "function": {"name": "submit_verdict"}},
    extra_kwargs=extra_kwargs,
):
    if not chunk.delta:
        continue

    if chunk.delta.tool_calls:
        tool = chunk.delta.tool_calls[0]
        arguments = tool.arguments

if not arguments:
    raise ValueError("LLM did not return verdict arguments")

fnc_args, fnc_kwargs = llm_utils.prepare_function_arguments(
    fnc=submit_verdict, json_arguments=arguments
)
verdict, reasoning = await submit_verdict(*fnc_args, **fnc_kwargs)

return JudgmentResult(verdict=verdict, reasoning=reasoning)
coderabbitai bot (Contributor):

⚠️ Potential issue | 🟡 Minor

Tool call argument accumulation may be incomplete.

The loop overwrites arguments on each chunk rather than accumulating. If tool call arguments are streamed across multiple chunks, only the last chunk's arguments will be captured.

🔧 Proposed fix for argument accumulation
-    arguments: str | None = None
+    arguments_parts: list[str] = []
     async for chunk in llm.chat(
         chat_ctx=eval_ctx,
         tools=[submit_verdict],
         tool_choice={"type": "function", "function": {"name": "submit_verdict"}},
         extra_kwargs=extra_kwargs,
     ):
         if not chunk.delta:
             continue
 
         if chunk.delta.tool_calls:
             tool = chunk.delta.tool_calls[0]
-            arguments = tool.arguments
+            if tool.arguments:
+                arguments_parts.append(tool.arguments)
 
+    arguments = "".join(arguments_parts) if arguments_parts else None
     if not arguments:
         raise ValueError("LLM did not return verdict arguments")
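
To see why accumulation matters, here is a tiny illustration with a hand-split payload; providers may chunk the JSON arbitrarily, so only the concatenation parses:

```python
import json

# Hypothetical deltas: one JSON payload split mid-token across two chunks.
deltas = ['{"verdict": "pa', 'ss", "reasoning": "ok"}']

parts: list[str] = []
for delta in deltas:
    parts.append(delta)  # accumulate; assigning would keep only the last chunk

print(json.loads("".join(parts)))  # {'verdict': 'pass', 'reasoning': 'ok'}
```
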
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/evals/judge.py` around lines 123 - 145, The
loop overwrites `arguments` for each streamed chunk; change it to accumulate the
streamed tool call payload from `llm.chat` so partial tool call arguments across
multiple chunks are concatenated into a single full payload before calling
`llm_utils.prepare_function_arguments`. Specifically, when inspecting
`chunk.delta.tool_calls` (and the extracted `tool`), append `tool.arguments` to
an accumulator (initialize `arguments` as an empty string or buffer) instead of
assigning, preserve ordering of chunks, and only call
`llm_utils.prepare_function_arguments(fnc=submit_verdict,
json_arguments=arguments)` after the loop; keep references to `llm.chat`,
`submit_verdict`, `prepare_function_arguments`, `arguments`, and
`chunk.delta.tool_calls` to locate and modify the code.

Comment on lines +268 to +270
        self._rpc_handlers_registered = True


coderabbitai bot (Contributor):

⚠️ Potential issue | 🟡 Minor

Fix formatting: remove extra blank line.

The pipeline reports a ruff format failure. There's an extra blank line after self._rpc_handlers_registered = True that should be removed.

🔧 Proposed fix
         self._rpc_handlers_registered = True
-

     def _register_event_handlers(self) -> None:
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/voice/client_events.py` around lines 268 - 270,
Remove the extra blank line after the assignment self._rpc_handlers_registered =
True to satisfy ruff formatting; open the function or class containing that
assignment (look for the attribute set in client_events.py) and delete the stray
empty line so the assignment is immediately followed by the next statement or
end of block, keeping surrounding indentation intact.

@theomonnom merged commit bbfe854 into main Jan 30, 2026
15 of 18 checks passed
@theomonnom deleted the theo/config-update branch January 30, 2026 23:03