
[Feature]: Support expected_output / ground truth in evaluate_workflow for reference-based evaluators #5135

@pablocast

Description

I need to evaluate workflow outputs against golden answers using Foundry evaluators that require a reference answer, such as SIMILARITY.

I have golden answers for each workflow query and need to use reference-based evaluators on the final workflow output.

My workflow evaluation code is using SIMILARITY and calls evaluate_workflow. There is no supported way to pass the golden answers alongside the queries.

The result is a provider-side failure from Foundry:

FAILED_EXECUTION: (UserError) Either 'conversation' or individual inputs must be provided. 'ground_truth' is missing.

Expected behavior
evaluate_workflow should accept expected_output, or an equivalent ground truth parameter, with one value per query.
The public API should preferably use expected_output for consistency with evaluate_agent, and map it internally to the provider field expected by Foundry.
Those values should be stamped onto EvalItem.expected_output for overall workflow items and, where appropriate, per-agent items.
FoundryEvals should include the expected output field in its JSONL item schema and add the corresponding data mapping for evaluators that require ground truth.
If an evaluator requires ground truth and none is provided, the library should fail early with a clear validation error instead of surfacing a remote provider error.
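The early-validation behavior requested above could look roughly like the following. This is a minimal sketch, not the actual agent_framework internals: the `REQUIRES_GROUND_TRUTH` set, the `validate_inputs` helper, and its signature are all illustrative assumptions.

```python
# Hypothetical client-side validation; evaluator names and the helper
# itself are assumptions, not part of the real agent_framework API.
REQUIRES_GROUND_TRUTH = {"SIMILARITY", "F1"}  # assumed set of reference-based evaluators

def validate_inputs(evaluators, queries, expected_output=None):
    """Fail fast locally instead of surfacing a remote Foundry error."""
    needs_gt = [e for e in evaluators if e in REQUIRES_GROUND_TRUTH]
    if needs_gt and expected_output is None:
        raise ValueError(
            f"Evaluators {needs_gt} require ground truth; "
            "pass expected_output with one value per query."
        )
    if expected_output is not None and len(expected_output) != len(queries):
        raise ValueError(
            f"expected_output has {len(expected_output)} values "
            f"for {len(queries)} queries; counts must match."
        )
```

With a check like this, the `'ground_truth' is missing` failure would be caught before any request reaches Foundry.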

Alternatives considered
evaluate_agent with expected_output is not sufficient because I am evaluating a multi-agent workflow, not a single agent.
Rebuilding workflow eval items manually would require relying on internal helpers and still would not fix the missing Foundry JSONL mapping.
Doing local exact-match or F1 scoring outside the framework bypasses Foundry-managed evaluation and reporting, which is not what I want.

Code sample

from agent_framework import evaluate_workflow
from agent_framework.foundry import FoundryEvals
evals = FoundryEvals(
    client=client,
    evaluators=[FoundryEvals.SIMILARITY],
)

await evaluate_workflow(
    workflow=workflow,
    queries=["How many plans does Netlife offer in Ecuador in B1 2025?"],
    evaluators=evals,
    # Wanted:
    # expected_output=["Ofrece 41 planes"],
)

Current result

FAILED_EXECUTION: (UserError) Either 'conversation' or individual inputs must be provided. 'ground_truth' is missing.
One recommendation: expose expected_output in the public API rather than ground_truth, since expected_output already exists on the agent path; that makes this feature a parity improvement rather than a new concept.
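The internal mapping from the proposed public field to the provider field could be a one-line rename when building each JSONL item. A sketch, assuming a minimal item shape; the `to_foundry_jsonl_item` helper and the exact JSONL schema are illustrative, not FoundryEvals' actual format.

```python
import json

# Illustrative mapping: the public expected_output value is written out
# under the ground_truth key that Foundry's reference-based evaluators
# (e.g. SIMILARITY) expect. The item shape here is an assumption.
def to_foundry_jsonl_item(query, response, expected_output=None):
    item = {"query": query, "response": response}
    if expected_output is not None:
        item["ground_truth"] = expected_output  # provider-side field name
    return json.dumps(item)
```

Keeping the rename in one place means evaluate_agent and evaluate_workflow can share the same public vocabulary while still emitting what the provider requires.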

Code sample

Keep the existing API for backward compatibility:

queries=["q1", "q2"]

But also support a structured form like:

queries=[
    {
        "query": "How many plans does Netlife offer in Ecuador in B1 2025?",
        "ground_truth": "Ofrece 41 planes"
    }
]
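Accepting both the plain-string form and the structured form could be handled by a small normalization step before evaluation. A sketch under the assumption that both shapes are allowed in the same list; the `normalize_queries` helper is illustrative, not existing library code.

```python
# Hypothetical normalization of the two proposed query shapes into
# parallel lists; str entries carry no reference answer.
def normalize_queries(queries):
    """Return (query_texts, ground_truths) as parallel lists."""
    texts, truths = [], []
    for q in queries:
        if isinstance(q, str):
            texts.append(q)
            truths.append(None)  # no golden answer for this query
        else:
            texts.append(q["query"])
            truths.append(q.get("ground_truth"))
    return texts, truths
```

This keeps `queries=["q1", "q2"]` working unchanged while letting callers attach a golden answer per query where they have one.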

Language/SDK

Both
