add skills_trajectory evaluator #11

frivas-at-navteca wants to merge 2 commits into agentevals-dev:main
Conversation
Deterministic trajectory-based evaluator that scores whether a configured set of skills (tool names) was observed in each invocation. Supports three match modes:

- ANY_ORDER: fractional credit via Counter, order ignored
- IN_ORDER: fractional credit via subsequence scan
- EXACT: binary, called list must match required exactly

Returns NOT_EVALUATED for missing, empty, or invalid config. Partial credit distinguishes this from the binary tool_sequence_match.
- Fix critical bug: tool_calls are ToolCallData objects, not dicts; the isinstance(call, dict) guard silently dropped all tool calls, causing score=0.0 regardless of agent behavior. Use call.name directly.
- Merge redundant skills None+isinstance guards into a single check
- Add NOT_EVALUATED for an empty invocations list
- Replace list[dict] annotation with a _Comparison TypedDict
- Simplify tool name extraction to a single list comprehension
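The extraction bug can be illustrated with a stand-in for the framework's `ToolCallData` type (the real class has more fields; only `.name` matters here):

```python
from dataclasses import dataclass


@dataclass
class ToolCallData:  # stand-in; the real framework type is richer
    name: str


calls = [ToolCallData("search"), ToolCallData("summarize")]

# Before: the isinstance(call, dict) guard is always False for ToolCallData,
# so every tool call was silently dropped and the score was stuck at 0.0.
before = [call["name"] for call in calls if isinstance(call, dict)]

# After: read the attribute directly, in a single list comprehension.
after = [call.name for call in calls]
```

With the guard in place `before` is always an empty list, which is why the score never moved off `0.0`.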
krisztianfekete
left a comment
Thanks for pushing this forward!
I tried to look up docs on how skills work without too much luck, but based on this example here: https://code.claude.com/docs/en/skills#restrict-claude%E2%80%99s-skill-access
```
# Allow only specific skills
Skill(commit)
Skill(review-pr *)

# Deny specific skills
Skill(deploy *)
```

It's almost like Anthropic is re-using tool calls, where the tool is always named Skill and the skill's name is its first argument. We should check what conventions other providers are following.
It's not 1:1 applicable to the approach OTel genai semconv is taking, but maybe we can do something like this:
```yaml
skills:
  - review                  # what is being proposed here
  - tool: Skill             # this is Anthropic's approach where Skill is a tool that takes args
    args: { skill: review }
  - skill: review           # once we have OTel support, we could just add support for this
```

What do you all think? Also cc. @peterj
```
@@ -0,0 +1,148 @@
"""Skills trajectory evaluator.

Scores whether a configured set of skills (tool names) was observed in each
```
I think we have to take tool arguments into consideration as well.
This doc talks about how skills are activated, and this will differ depending on the harness the agent is using: activation happens either by calling the file-read tool, or via a dedicated tool that takes a skill name and returns its content.
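Depending on the harness, the same skill activation can therefore surface in the trajectory in two different shapes, only one of which a name-only matcher would see. A toy sketch (all names and paths hypothetical):

```python
# Two ways the same "review" skill activation might appear in a trajectory.
read_file_style = {
    "tool": "read_file",
    "args": {"path": ".claude/skills/review/SKILL.md"},  # hypothetical path
}
dedicated_style = {"tool": "Skill", "args": {"skill": "review"}}


def looks_like_skill(call: dict) -> bool:
    """A tool-name-only matcher catches the dedicated shape but not the read."""
    return call["tool"] == "Skill"
```

This is one reason matching on tool names alone can only be a best-effort proxy for skill invocation.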
Summary
This PR is a conversation starter around issue #10. First attempt; completely open to improvements, comments, etc. Thanks in advance. It adds a new deterministic trajectory-based evaluator that scores whether a configured set of skills (tool names) was observed in each agent invocation.
Motivation
While `tool_sequence_match` covers the generic case, it uses binary scoring and exact/multiset matching only. `skills_trajectory` adds:

- Fractional partial credit: the score is the fraction of required skills that were actually called. For example, if the agent called 3 of 4 required skills, the score is `0.75` rather than `0.0`.
- True subsequence mode (`IN_ORDER`): allows extra tool calls between required ones, which is more realistic than exact matching.
- Skills framing: explicitly scoped as a best-effort proxy for skill invocation while skills are not yet first-class citizens in the OTel GenAI semconv (see open-telemetry/semantic-conventions#3540, "Add skill span").
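The scoring described above can be sketched as follows; function names and signatures are illustrative, not the PR's actual API:

```python
from collections import Counter


def score_any_order(required: list[str], called: list[str]) -> float:
    """ANY_ORDER: fraction of required skills observed, order ignored."""
    need, have = Counter(required), Counter(called)
    matched = sum(min(n, have[name]) for name, n in need.items())
    return matched / len(required)


def score_in_order(required: list[str], called: list[str]) -> float:
    """IN_ORDER: subsequence scan; extra calls between required ones are fine."""
    idx = 0
    for name in called:
        if idx < len(required) and name == required[idx]:
            idx += 1
    return idx / len(required)


def score_exact(required: list[str], called: list[str]) -> float:
    """EXACT: binary; the called list must equal the required list."""
    return 1.0 if called == required else 0.0


# 3 of 4 required skills called -> 0.75 rather than a binary 0.0:
score = score_any_order(["plan", "search", "code", "test"],
                        ["plan", "code", "test"])
```

Note how `score_in_order` tolerates unrelated calls between the required ones: `["plan", "log", "code"]` still fully matches a required `["plan", "code"]`.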
Match modes

| `match_type` | Behavior |
| --- | --- |
| `ANY_ORDER` (default) | Fractional credit via `Counter`; order ignored |
| `IN_ORDER` | Fractional credit via subsequence scan; extra calls allowed between required ones |
| `EXACT` | Binary; the called list must match the required list exactly |

Config
| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `skills` | `list[str]` | required | The skill (tool) names expected in the trajectory |
| `match_type` | `str` | `"ANY_ORDER"` | One of `"ANY_ORDER"`, `"IN_ORDER"`, or `"EXACT"` |
Returns `NOT_EVALUATED` when `skills` is missing, empty, or invalid, or when `match_type` is invalid.

Example
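A minimal sketch of the config validation implied by the table and the `NOT_EVALUATED` rule; the sentinel is shown here as a plain string and the function name is illustrative:

```python
NOT_EVALUATED = "NOT_EVALUATED"  # stand-in sentinel value
_VALID_MODES = ("ANY_ORDER", "IN_ORDER", "EXACT")


def check_config(config: dict):
    """Return NOT_EVALUATED for unusable configs, None when usable."""
    skills = config.get("skills")
    if not isinstance(skills, list) or not skills:
        return NOT_EVALUATED  # skills missing, empty, or invalid
    if config.get("match_type", "ANY_ORDER") not in _VALID_MODES:
        return NOT_EVALUATED  # invalid match_type
    return None


# A usable config passes the checks:
ok = check_config({"skills": ["review"], "match_type": "IN_ORDER"})
```

Here `ok` is `None`, meaning the evaluator would proceed to score the trajectory.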
Notes: