
add skills_trajectory evaluator#11

Open
frivas-at-navteca wants to merge 2 commits into agentevals-dev:main from Navteca:feature/skills-trajectory-evaluator

Conversation

@frivas-at-navteca

Summary

This PR is a conversation starter around issue #10 and a first attempt; I'm completely open to improvements, comments, etc. Thanks in advance. It adds a new deterministic trajectory-based evaluator that scores whether a configured set of skills (tool names) was observed in each agent invocation.

Motivation

It is worth noting that while tool_sequence_match covers the generic case, it uses binary scoring and exact/multiset matching only. skills_trajectory adds:

  • Fractional partial credit — the score is the fraction of required skills that were actually called; for example, if the agent called 3 of 4 required skills, the score is 0.75 rather than 0.0.

  • True subsequence mode (IN_ORDER) — allows extra tool calls between required ones, which is more realistic than exact matching.

  • Skills framing — explicitly scoped as a best-effort proxy for skill invocation while skills are not yet first-class citizens in OTel GenAI semconv (see Add skill span open-telemetry/semantic-conventions#3540).

Match modes

| match_type | Behaviour |
|---|---|
| `ANY_ORDER` (default) | All required skills must appear; order and extras ignored. Duplicate requirements handled via `Counter`. |
| `IN_ORDER` | Required skills must appear as a subsequence; extras between hits allowed, order preserved. |
| `EXACT` | Called tool names must match required skills exactly: same names, same order, no extras. Binary score only. |
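As a rough illustration of the scoring described above (a sketch, not the PR's actual code), the three modes could look like this:

```python
from collections import Counter


def score_any_order(called: list[str], required: list[str]) -> float:
    """Fraction of required skills observed; order and extras ignored.

    Duplicate requirements are matched via multiset intersection, so a skill
    required twice must be called at least twice to count fully.
    """
    need, have = Counter(required), Counter(called)
    hits = sum(min(have[name], count) for name, count in need.items())
    return hits / len(required)


def score_in_order(called: list[str], required: list[str]) -> float:
    """Fraction of required skills matched as a subsequence of the calls.

    Extra tool calls between required ones are allowed; order is preserved.
    """
    idx = 0
    for name in called:
        if idx < len(required) and name == required[idx]:
            idx += 1
    return idx / len(required)


def score_exact(called: list[str], required: list[str]) -> float:
    """Binary: the called names must equal the required names exactly."""
    return 1.0 if called == required else 0.0
```

The "3 of 4 required skills" example from the motivation section maps directly onto `score_any_order`, which returns 0.75 in that case.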

Config

| Option | Type | Default | Description |
|---|---|---|---|
| `skills` | `list[str]` | Required | Names of skills/tools that must be observed. |
| `match_type` | `str` | `"ANY_ORDER"` | Match mode: `"ANY_ORDER"`, `"IN_ORDER"`, or `"EXACT"`. |

Returns NOT_EVALUATED when skills is missing, empty, or invalid, or when match_type is invalid.
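Those guard conditions could be sketched like so (`NOT_EVALUATED` here is a hypothetical sentinel standing in for the framework's actual status value):

```python
NOT_EVALUATED = object()  # hypothetical sentinel; the real evaluator has its own

VALID_MATCH_TYPES = {"ANY_ORDER", "IN_ORDER", "EXACT"}


def validate_config(config: dict):
    """Return (skills, match_type), or NOT_EVALUATED for invalid config."""
    skills = config.get("skills")
    # skills must be a non-empty list of strings
    if (
        not isinstance(skills, list)
        or not skills
        or not all(isinstance(s, str) for s in skills)
    ):
        return NOT_EVALUATED
    match_type = config.get("match_type", "ANY_ORDER")
    if match_type not in VALID_MATCH_TYPES:
        return NOT_EVALUATED
    return skills, match_type
```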

Example

```yaml
evaluators:
  - name: skills_trajectory
    type: remote
    source: github
    ref: evaluators/skills_trajectory/skills_trajectory.py
    threshold: 0.7
    config:
      skills: ["search", "summarize"]
      match_type: ANY_ORDER
```

Notes:

Deterministic trajectory-based evaluator that scores whether a
configured set of skills (tool names) was observed in each invocation.

Supports three match modes:
- ANY_ORDER: fractional credit via Counter, order ignored
- IN_ORDER:  fractional credit via subsequence scan
- EXACT:     binary, called list must match required exactly

Returns NOT_EVALUATED for missing, empty, or invalid config.
Partial credit distinguishes this from the binary tool_sequence_match.
- Fix critical bug: tool_calls are ToolCallData objects not dicts;
  isinstance(call, dict) guard silently dropped all tool calls, causing
  score=0.0 regardless of agent behavior. Use call.name directly.
- Merge redundant skills None+isinstance guards into single check
- Add NOT_EVALUATED for empty invocations list
- Replace list[dict] annotation with _Comparison TypedDict
- Simplify tool name extraction to single list comprehension
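The ToolCallData fix noted above can be illustrated with a minimal sketch (the dataclass here is a stand-in for the real type, which only needs a `name` attribute for this purpose):

```python
from dataclasses import dataclass


@dataclass
class ToolCallData:
    """Minimal stand-in for the real ToolCallData; only `name` is modeled."""
    name: str


def called_tool_names(tool_calls: list[ToolCallData]) -> list[str]:
    # The buggy version filtered with `isinstance(call, dict)`, which
    # silently dropped every ToolCallData object and forced score=0.0
    # regardless of agent behavior. The fix reads the attribute directly:
    return [call.name for call in tool_calls]
```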
Collaborator

@krisztianfekete left a comment


Thanks for pushing this forward!

I tried to look up docs on how skills work without too much luck, but based on this example here: https://code.claude.com/docs/en/skills#restrict-claude%E2%80%99s-skill-access

```
# Allow only specific skills
Skill(commit)
Skill(review-pr *)

# Deny specific skills
Skill(deploy *)
```

It's almost like Anthropic is re-using tool calls, where the name of the tool is always `Skill`, and the skill's name is its first argument. We should check what conventions other providers are following.

It's not 1:1 applicable to the approach OTel genai semconv is taking, but maybe we can do something like this:

```yaml
skills:
  - review                              # what is being proposed here
  - tool: Skill                         # this is Anthropic's approach where Skill is a tool that takes args
    args: { skill: review }
  - skill: review                       # once we have OTel support, we could just add support for this
```
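A mixed list like that could be normalized before matching; here is one hypothetical sketch (all three entry shapes, and the `normalize_skill` helper itself, are assumptions for illustration, not existing API):

```python
def normalize_skill(entry) -> "str | None":
    """Map one entry of a hypothetical mixed `skills` list to a skill name.

    Handles three proposed forms:
      - "review"                                        plain skill/tool name
      - {"tool": "Skill", "args": {"skill": "review"}}  Anthropic-style Skill tool call
      - {"skill": "review"}                             future OTel-style skill reference
    Returns None for unrecognized entries.
    """
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict):
        if "skill" in entry:
            return entry["skill"]
        if entry.get("tool") == "Skill":
            return (entry.get("args") or {}).get("skill")
    return None
```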

What do you all think? Also cc. @peterj

@@ -0,0 +1,148 @@
"""Skills trajectory evaluator.

Scores whether a configured set of skills (tool names) was observed in each
Collaborator


I think we have to take tool arguments into consideration as well.

@peterj
Collaborator

peterj commented Apr 21, 2026

This doc talks about how skills are activated, and this will differ depending on the harness the agent is using: activation happens either by calling the file-read tool, or via a dedicated tool that takes a skill name and returns its content.

