fix(tb-lf): align eval response structs with devportal API #34
Merged
EvalItem field names and types were out of sync with the server
response from SpaApi::Ai::Eval::RunsController#item_json:
- conversation_log was typed Option<String>; server returns a JSON
array of {role, content} (JSON column, surfaced via as_json). This
caused `tb-lf eval run <id>` to fail with "invalid type: sequence,
expected a string". Switched to Option<serde_json::Value>.
- Renamed suite/case to suite_key/suite_name + case_key/case_name,
duration_seconds to duration_ms, trace_langfuse_id to trace_id —
matching the keys actually emitted by the controller. Previously
these fields silently deserialized to None, leaving the items list
blank in the CLI output.
- Display formatting in main.rs updated to use the new fields and
format duration_ms / 1000.0 for the seconds display.
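
The seconds display mentioned in the last bullet can be sketched as follows. This is an illustration only: `format_duration` is a hypothetical helper, and both the `f64` type for `duration_ms` and the `-` placeholder for a missing value are assumptions, not taken from main.rs.

```rust
/// Hypothetical helper: renders an optional millisecond duration as seconds.
/// Assumes `duration_ms` arrives as a number that fits in f64.
fn format_duration(duration_ms: Option<f64>) -> String {
    match duration_ms {
        // Convert milliseconds to seconds and keep one decimal place.
        Some(ms) => format!("{:.1}s", ms / 1000.0),
        // Assumed placeholder when the server omits the field.
        None => "-".to_string(),
    }
}
```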
EvalAction::Cases was deserializing into Vec<EvalCase>, but
CoverageController#cases renders { data: [...], meta: {...} }. Switched
to PaginatedResponse<EvalCase> and use resp.data. Also fixed the suite
filter query param: server expects suite_key (line 14 of the
controller), CLI was sending suite — so --suite silently didn't filter.
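
A minimal sketch of the query-param fix described above. `cases_query` is a hypothetical helper, not the CLI's actual code; the only detail taken from this PR is that the value of `--suite` has to travel under the name `suite_key`.

```rust
/// Hypothetical helper: builds the query pairs for the cases request.
fn cases_query(suite: Option<&str>) -> Vec<(&'static str, String)> {
    let mut params = Vec::new();
    if let Some(key) = suite {
        // The server reads `suite_key`; a pair named `suite` would be
        // ignored, which is why --suite previously had no effect.
        params.push(("suite_key", key.to_string()));
    }
    params
}
```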
Added deserialize tests against representative payloads for both
endpoints so the contracts can't drift again silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trogulja
approved these changes
May 21, 2026
Summary
`tb-lf eval run <id>` (v0.6.0) failed with `invalid type: sequence, expected a string at line 1 column 998`, and `tb-lf eval cases [--suite <key>]` failed with `invalid type: map, expected a sequence at line 1 column 0`. Both bugs were client-side struct/shape mismatches against DevPortal's actual response.

`EvalItem` (`crates/tb-lf/src/types.rs`) was out of sync with `SpaApi::Ai::Eval::RunsController#item_json`:

- `conversation_log` was typed `Option<String>`, but the server returns a JSON array of `{role, content}` (a JSON column, surfaced via `as_json`). Switched to `Option<serde_json::Value>`. This is the field that triggered the deserialize failure.
- `suite` → `suite_key` + added `suite_name`, `case` → `case_key` + added `case_name`, `duration_seconds` → `duration_ms`, `trace_langfuse_id` → `trace_id`, matching the keys actually emitted by the controller. Previously these silently deserialized to `None`, leaving the items list rendered as blank rows.
- `main.rs` updated accordingly; duration is now `duration_ms / 1000.0` for the `Xs` display.

`EvalAction::Cases` (`crates/tb-lf/src/main.rs`) was decoding into `Vec<EvalCase>`, but `CoverageController#cases` renders `{ data: [...], meta: {...} }`. Switched to `PaginatedResponse<EvalCase>` and use `resp.data`. Also fixed the suite filter query param: server expects `suite_key`, CLI was sending `suite`, so `--suite` silently didn't filter.

Added deserialize tests covering both endpoints with representative payloads so the contracts can't drift again silently.
Test plan
Local verification against live DevPortal (Development project):
- `cargo run -p tb-lf -- eval run 60 --project Development`: renders 352 items with populated suite / case / status / score / duration columns (previously failed at first item with `conversation_log`)
- `cargo run -p tb-lf -- eval run 60 --failed --project Development`: filters to failed only
- `cargo run -p tb-lf -- eval run 60 --full --project Development`: prints truncated JSON conversation log per item
- `cargo run -p tb-lf -- eval cases --project Development`: lists 50 cases (default `per_page`) from the paginated wrapper
- `cargo run -p tb-lf -- eval cases --suite crm-agent --project Development`: filters to 2 CRM Agent cases (previously: did not filter, deserialize-failed before display)
- `cargo fmt --check`, `cargo clippy --workspace -- -D warnings`, `cargo test --workspace`: all clean (5 tb-lf unit tests now, +2 regression tests)

Task
n/a — incidental tooling fix surfaced while running evals.