add skills_trajectory evaluator #11

frivas-at-navteca wants to merge 2 commits into agentevals-dev:main
Conversation
Deterministic trajectory-based evaluator that scores whether a configured set of skills (tool names) was observed in each invocation. Supports three match modes:

- ANY_ORDER: fractional credit via Counter, order ignored
- IN_ORDER: fractional credit via subsequence scan
- EXACT: binary, called list must match required exactly

Returns NOT_EVALUATED for missing, empty, or invalid config. Partial credit distinguishes this from the binary tool_sequence_match.
- Fix critical bug: tool_calls are ToolCallData objects, not dicts; the isinstance(call, dict) guard silently dropped all tool calls, causing score=0.0 regardless of agent behavior. Use call.name directly.
- Merge redundant skills None+isinstance guards into a single check
- Add NOT_EVALUATED for an empty invocations list
- Replace list[dict] annotation with a _Comparison TypedDict
- Simplify tool name extraction to a single list comprehension
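The extraction bug can be illustrated with a stand-in for the framework's `ToolCallData` type (the real class has more fields; only `.name` matters here):

```python
from dataclasses import dataclass


@dataclass
class ToolCallData:  # stand-in; the real framework type is richer
    name: str


calls = [ToolCallData("search"), ToolCallData("summarize")]

# Before: the isinstance(call, dict) guard is always False for ToolCallData,
# so every tool call was silently dropped and the score was stuck at 0.0.
before = [call["name"] for call in calls if isinstance(call, dict)]

# After: read the attribute directly, in a single list comprehension.
after = [call.name for call in calls]
```

With the guard in place `before` is always an empty list, which is why the score never moved off `0.0`.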
krisztianfekete
left a comment
Thanks for pushing this forward!
I tried to look up docs on how skills work without too much luck, but based on this example here: https://code.claude.com/docs/en/skills#restrict-claude%E2%80%99s-skill-access
```
# Allow only specific skills
Skill(commit)
Skill(review-pr *)

# Deny specific skills
Skill(deploy *)
```

It's almost like Anthropic is re-using tool calls, where the tool is always named Skill and the skill's name is its first argument. We should check what conventions other providers are following.
It's not 1:1 applicable to the approach OTel genai semconv is taking, but maybe we can do something like this:
```yaml
skills:
  - review                  # what is being proposed here
  - tool: Skill             # this is Anthropic's approach where Skill is a tool that takes args
    args: { skill: review }
  - skill: review           # once we have OTel support, we could just add support for this
```

What do you all think? Also cc. @peterj
```
@@ -0,0 +1,148 @@
"""Skills trajectory evaluator.

Scores whether a configured set of skills (tool names) was observed in each
```
I think we have to take tool arguments into consideration as well.
This doc talks about how skills are activated, and this will differ depending on the harness the agent is using: activation happens either by calling the file-read tool, or via a dedicated tool that takes a skill name and returns its content.
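Depending on the harness, the same skill activation can therefore surface in the trajectory in two different shapes, only one of which a name-only matcher would see. A toy sketch (all names and paths hypothetical):

```python
# Two ways the same "review" skill activation might appear in a trajectory.
read_file_style = {
    "tool": "read_file",
    "args": {"path": ".claude/skills/review/SKILL.md"},  # hypothetical path
}
dedicated_style = {"tool": "Skill", "args": {"skill": "review"}}


def looks_like_skill(call: dict) -> bool:
    """A tool-name-only matcher catches the dedicated shape but not the read."""
    return call["tool"] == "Skill"
```

This is one reason matching on tool names alone can only be a best-effort proxy for skill invocation.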
Summary
This PR is a conversation starter around issue #10. First attempt; completely open to improvements, comments, etc. Thanks in advance. It adds a new deterministic trajectory-based evaluator that scores whether a configured set of skills (tool names) was observed in each agent invocation.
Motivation
While `tool_sequence_match` covers the generic case, it uses binary scoring and exact/multiset matching only. `skills_trajectory` adds:

- Fractional partial credit: the score is the fraction of required skills that were actually called. For example, if the agent called 3 of 4 required skills, the score is `0.75` rather than `0.0`.
- True subsequence mode (`IN_ORDER`): allows extra tool calls between required ones, which is more realistic than exact matching.
- Skills framing: explicitly scoped as a best-effort proxy for skill invocation while skills are not yet first-class citizens in the OTel GenAI semconv (see open-telemetry/semantic-conventions#3540, "Add skill span").
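The scoring described above can be sketched as follows; function names and signatures are illustrative, not the PR's actual API:

```python
from collections import Counter


def score_any_order(required: list[str], called: list[str]) -> float:
    """ANY_ORDER: fraction of required skills observed, order ignored."""
    need, have = Counter(required), Counter(called)
    matched = sum(min(n, have[name]) for name, n in need.items())
    return matched / len(required)


def score_in_order(required: list[str], called: list[str]) -> float:
    """IN_ORDER: subsequence scan; extra calls between required ones are fine."""
    idx = 0
    for name in called:
        if idx < len(required) and name == required[idx]:
            idx += 1
    return idx / len(required)


def score_exact(required: list[str], called: list[str]) -> float:
    """EXACT: binary; the called list must equal the required list."""
    return 1.0 if called == required else 0.0


# 3 of 4 required skills called -> 0.75 rather than a binary 0.0:
score = score_any_order(["plan", "search", "code", "test"],
                        ["plan", "code", "test"])
```

Note how `score_in_order` tolerates unrelated calls between the required ones: `["plan", "log", "code"]` still fully matches a required `["plan", "code"]`.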
Match modes

| `match_type` | Behavior |
| --- | --- |
| `ANY_ORDER` (default) | Fractional credit via `Counter`; order ignored |
| `IN_ORDER` | Fractional credit via subsequence scan; extra calls allowed between required ones |
| `EXACT` | Binary; the called list must match the required list exactly |

Config
| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `skills` | `list[str]` | required | The skill (tool) names expected in the trajectory |
| `match_type` | `str` | `"ANY_ORDER"` | One of `"ANY_ORDER"`, `"IN_ORDER"`, or `"EXACT"` |
Returns `NOT_EVALUATED` when `skills` is missing, empty, or invalid, or when `match_type` is invalid.

Example
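A minimal sketch of the config validation implied by the table and the `NOT_EVALUATED` rule; the sentinel is shown here as a plain string and the function name is illustrative:

```python
NOT_EVALUATED = "NOT_EVALUATED"  # stand-in sentinel value
_VALID_MODES = ("ANY_ORDER", "IN_ORDER", "EXACT")


def check_config(config: dict):
    """Return NOT_EVALUATED for unusable configs, None when usable."""
    skills = config.get("skills")
    if not isinstance(skills, list) or not skills:
        return NOT_EVALUATED  # skills missing, empty, or invalid
    if config.get("match_type", "ANY_ORDER") not in _VALID_MODES:
        return NOT_EVALUATED  # invalid match_type
    return None


# A usable config passes the checks:
ok = check_config({"skills": ["review"], "match_type": "IN_ORDER"})
```

Here `ok` is `None`, meaning the evaluator would proceed to score the trajectory.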
Notes: