Description
I need to evaluate workflow outputs against golden answers using Foundry evaluators that require a reference answer, such as SIMILARITY.
I have golden answers for each workflow query and need to use reference-based evaluators on the final workflow output.
My workflow evaluation code uses SIMILARITY and calls evaluate_workflow, but there is no supported way to pass the golden answers alongside the queries.
The result is a provider-side failure from Foundry:
FAILED_EXECUTION: (UserError) Either 'conversation' or individual inputs must be provided. 'ground_truth' is missing.
Expected behavior
evaluate_workflow should accept expected_output, or an equivalent ground truth parameter, with one value per query.
The public API should preferably use expected_output for consistency with evaluate_agent, and map it internally to the provider field expected by Foundry.
Those values should be stamped onto EvalItem.expected_output for overall workflow items and, where appropriate, per-agent items.
FoundryEvals should include the expected output field in its JSONL item schema and add the corresponding data mapping for evaluators that require ground truth.
If an evaluator requires ground truth and none is provided, the library should fail early with a clear validation error instead of surfacing a remote provider error.
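A minimal sketch of what that early validation could look like, assuming a hypothetical internal helper and an illustrative set of ground-truth evaluators (neither name reflects actual agent_framework internals):

GROUND_TRUTH_EVALUATORS = {"similarity"}  # illustrative; the real set would come from Foundry evaluator metadata

def _validate_expected_output(queries, expected_output, evaluator_names):
    # Hypothetical helper: fail locally before anything is submitted to Foundry.
    needs_gt = [name for name in evaluator_names if name in GROUND_TRUTH_EVALUATORS]
    if needs_gt and expected_output is None:
        raise ValueError(
            f"Evaluators {needs_gt} require ground truth; "
            "pass expected_output with one value per query."
        )
    if expected_output is not None and len(expected_output) != len(queries):
        raise ValueError(
            f"expected_output has {len(expected_output)} values "
            f"but {len(queries)} queries were given."
        )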
Alternatives considered
evaluate_agent with expected_output is not sufficient because I am evaluating a multi-agent workflow, not a single agent.
Rebuilding workflow eval items manually would require relying on internal helpers and still would not fix the missing Foundry JSONL mapping.
Doing local exact-match or F1 scoring outside the framework bypasses Foundry-managed evaluation and reporting, which is not what I want.
Code sample
from agent_framework import evaluate_workflow
from agent_framework.foundry import FoundryEvals
evals = FoundryEvals(
    client=client,
    evaluators=[FoundryEvals.SIMILARITY],
)

await evaluate_workflow(
    workflow=workflow,
    queries=["How many plans does Netlife offer in Ecuador in B1 2025?"],
    evaluators=evals,
    # Wanted:
    # expected_output=["Ofrece 41 planes"],
)
Current result
FAILED_EXECUTION: (UserError) Either 'conversation' or individual inputs must be provided. 'ground_truth' is missing.
One recommendation: ask for expected_output in the public API, not ground_truth, because expected_output already exists on the agent path, so this becomes a parity improvement rather than a new concept.
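For comparison, the agent path already accepts expected_output; roughly like this (the exact evaluate_agent signature may differ, this is only to illustrate the parity argument):

await evaluate_agent(
    agent=agent,
    queries=["How many plans does Netlife offer in Ecuador in B1 2025?"],
    expected_output=["Ofrece 41 planes"],
    evaluators=evals,
)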
Code sample
Keep the existing API (queries as a plain list of strings) for backward compatibility.
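That could mean queries stays a flat list and the proposed expected_output is an optional parallel list, i.e. the call wanted in the sample above:

await evaluate_workflow(
    workflow=workflow,
    queries=["How many plans does Netlife offer in Ecuador in B1 2025?"],
    expected_output=["Ofrece 41 planes"],  # proposed optional parameter, does not exist today
    evaluators=evals,
)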
But also support a structured form like:
queries=[
    {
        "query": "How many plans does Netlife offer in Ecuador in B1 2025?",
        "ground_truth": "Ofrece 41 planes"
    }
]
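A rough sketch of how both shapes could be normalized and then mapped onto the Foundry JSONL row; the helper names are hypothetical, and only the ground_truth field name is taken from the error message above:

from typing import Optional, Union

def _normalize_queries(
    queries: list[Union[str, dict]],
    expected_output: Optional[list[str]] = None,
) -> list[dict]:
    # Accept either plain strings plus a parallel expected_output list,
    # or structured dicts carrying their own ground truth.
    items = []
    for i, q in enumerate(queries):
        if isinstance(q, dict):
            items.append({
                "query": q["query"],
                "expected_output": q.get("ground_truth", q.get("expected_output")),
            })
        else:
            items.append({
                "query": q,
                "expected_output": expected_output[i] if expected_output else None,
            })
    return items

def _to_foundry_row(item: dict, response: str) -> dict:
    # Map the public expected_output onto the provider field that SIMILARITY expects.
    row = {"query": item["query"], "response": response}
    if item["expected_output"] is not None:
        row["ground_truth"] = item["expected_output"]
    return row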
Language/SDK
Both