Description
I need to evaluate workflow outputs against golden answers using Foundry evaluators that require a reference answer, such as SIMILARITY.
I have golden answers for each workflow query and need to use reference-based evaluators on the final workflow output.
My workflow evaluation code uses SIMILARITY and calls evaluate_workflow, but there is no supported way to pass the golden answers alongside the queries.
The result is a provider-side failure from Foundry:
FAILED_EXECUTION: (UserError) Either 'conversation' or individual inputs must be provided. 'ground_truth' is missing.
Expected behavior
evaluate_workflow should accept expected_output, or an equivalent ground truth parameter, with one value per query.
The public API should preferably use expected_output for consistency with evaluate_agent, and map it internally to the provider field expected by Foundry.
Those values should be stamped onto EvalItem.expected_output for overall workflow items and, where appropriate, per-agent items.
FoundryEvals should include the expected output field in its JSONL item schema and add the corresponding data mapping for evaluators that require ground truth.
If an evaluator requires ground truth and none is provided, the library should fail early with a clear validation error instead of surfacing a remote provider error.
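A minimal sketch of what that early validation could look like, assuming a hypothetical internal helper and an illustrative set of ground-truth evaluators (neither name reflects actual agent_framework internals):

GROUND_TRUTH_EVALUATORS = {"similarity"}  # illustrative; the real set would come from Foundry evaluator metadata

def _validate_expected_output(queries, expected_output, evaluator_names):
    # Hypothetical helper: fail locally before anything is submitted to Foundry.
    needs_gt = [name for name in evaluator_names if name in GROUND_TRUTH_EVALUATORS]
    if needs_gt and expected_output is None:
        raise ValueError(
            f"Evaluators {needs_gt} require ground truth; "
            "pass expected_output with one value per query."
        )
    if expected_output is not None and len(expected_output) != len(queries):
        raise ValueError(
            f"expected_output has {len(expected_output)} values "
            f"but {len(queries)} queries were given."
        )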
Alternatives considered
evaluate_agent with expected_output is not sufficient because I am evaluating a multi-agent workflow, not a single agent.
Rebuilding workflow eval items manually would require relying on internal helpers and still would not fix the missing Foundry JSONL mapping.
Doing local exact-match or F1 scoring outside the framework bypasses Foundry-managed evaluation and reporting, which is not what I want.
Code sample
from agent_framework import evaluate_workflow
from agent_framework.foundry import FoundryEvals
evals = FoundryEvals(
    client=client,
    evaluators=[FoundryEvals.SIMILARITY],
)

await evaluate_workflow(
    workflow=workflow,
    queries=["How many plans does Netlife offer in Ecuador in B1 2025?"],
    evaluators=evals,
    # Wanted:
    # expected_output=["Ofrece 41 planes"],
)
Current result
FAILED_EXECUTION: (UserError) Either 'conversation' or individual inputs must be provided. 'ground_truth' is missing.
One recommendation: ask for expected_output in the public API, not ground_truth, because expected_output already exists on the agent path, so this becomes a parity improvement rather than a new concept.
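For comparison, the agent path already accepts expected_output; roughly like this (the exact evaluate_agent signature may differ, this is only to illustrate the parity argument):

await evaluate_agent(
    agent=agent,
    queries=["How many plans does Netlife offer in Ecuador in B1 2025?"],
    expected_output=["Ofrece 41 planes"],
    evaluators=evals,
)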
Code sample
Keep the existing API (queries as a plain list of strings) for backward compatibility.
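That could mean queries stays a flat list and the proposed expected_output is an optional parallel list, i.e. the call wanted in the sample above:

await evaluate_workflow(
    workflow=workflow,
    queries=["How many plans does Netlife offer in Ecuador in B1 2025?"],
    expected_output=["Ofrece 41 planes"],  # proposed optional parameter, does not exist today
    evaluators=evals,
)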
But also support a structured form like:
queries=[
    {
        "query": "How many plans does Netlife offer in Ecuador in B1 2025?",
        "ground_truth": "Ofrece 41 planes"
    }
]
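A rough sketch of how both shapes could be normalized and then mapped onto the Foundry JSONL row; the helper names are hypothetical, and only the ground_truth field name is taken from the error message above:

from typing import Optional, Union

def _normalize_queries(
    queries: list[Union[str, dict]],
    expected_output: Optional[list[str]] = None,
) -> list[dict]:
    # Accept either plain strings plus a parallel expected_output list,
    # or structured dicts carrying their own ground truth.
    items = []
    for i, q in enumerate(queries):
        if isinstance(q, dict):
            items.append({
                "query": q["query"],
                "expected_output": q.get("ground_truth", q.get("expected_output")),
            })
        else:
            items.append({
                "query": q,
                "expected_output": expected_output[i] if expected_output else None,
            })
    return items

def _to_foundry_row(item: dict, response: str) -> dict:
    # Map the public expected_output onto the provider field that SIMILARITY expects.
    row = {"query": item["query"], "response": response}
    if item["expected_output"] is not None:
        row["ground_truth"] = item["expected_output"]
    return row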
Language/SDK
Both