feature/PAAL-212-trace-ids-in-testworkflow by qa-jil-kamerling · Pull Request #11 · agentic-layer/testbench

qa-jil-kamerling · 2025-12-11T13:42:59Z

OTEL Tracing Implementation

Phase 2: test run

Created the otel_setup.py module and instrumented run.py so that each agent query is wrapped in a span, with trace context propagated via HTTPX. This enables end-to-end trace visualization in Tempo/Grafana.
Captured trace IDs in the experiment results.
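
For readers unfamiliar with trace context propagation: with OpenTelemetry's default propagator, the context travels in a W3C `traceparent` HTTP header, and the trace ID is the second field of that header. The sketch below is a stdlib-only illustration of that header format. It is not the code in otel_setup.py or run.py, and all helper names are hypothetical.

```python
import secrets


def new_trace_id() -> str:
    """Return a random 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random 64-bit span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent value: version-traceid-parentid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_trace_id(traceparent: str) -> str:
    """Extract the trace ID field from a traceparent value."""
    _version, trace_id, _span_id, _flags = traceparent.split("-")
    return trace_id


if __name__ == "__main__":
    trace_id = new_trace_id()
    header = build_traceparent(trace_id, new_span_id())
    print(header)
    # The same ID is what would be stored next to the query result.
    print(parse_trace_id(header) == trace_id)
```

In the real pipeline the OpenTelemetry SDK and the HTTPX instrumentation would generate the IDs and inject the header; the sketch only shows what ends up on the wire and which part is recorded as the trace ID.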

Phase 3: evaluation

Each test case in ragas_experiment.jsonl now includes a trace_id field that links evaluation results to distributed traces, enabling correlation between test execution and agent behavior.
Fixed trace ID preservation through evaluation: evaluate.py now manually preserves trace IDs when RAGAS processes the experiment data (RAGAS drops custom fields), so evaluation_scores.json keeps the trace context needed to debug failed tests.
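
One way to picture the preservation step is as a re-join after scoring. The sketch below is an illustration under assumptions, not the actual evaluate.py change: it assumes the scored rows come back in the same order as the experiment rows, and every field name other than trace_id (as well as the function name) is hypothetical.

```python
from typing import Any


def reattach_trace_ids(
    experiment_rows: list[dict[str, Any]],
    scored_rows: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Copy trace_id from each experiment row onto the matching scored row.

    Assumes both lists have the same length and the same order.
    """
    if len(experiment_rows) != len(scored_rows):
        raise ValueError("experiment and scored rows differ in length")

    merged = []
    for source, scored in zip(experiment_rows, scored_rows):
        row = dict(scored)  # copy so the scorer's output is not mutated
        row["trace_id"] = source.get("trace_id")
        merged.append(row)
    return merged


if __name__ == "__main__":
    experiment = [
        {"user_input": "q1", "response": "a1", "trace_id": "0af7651916cd43dd8448eb211c80319c"},
        {"user_input": "q2", "response": "a2", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    ]
    # Hypothetical scorer output: the custom trace_id field is gone.
    scored = [
        {"user_input": "q1", "faithfulness": 0.9},
        {"user_input": "q2", "faithfulness": 0.4},
    ]
    for row in reattach_trace_ids(experiment, scored):
        print(row)
```

Joining by position is the simplest option; if the scorer could reorder or drop rows, a join on a stable key would be the safer choice.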

Dependencies added: opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http, opentelemetry-instrumentation-httpx

deploy/local/ragas-evaluation-workflow.yaml

Tiltfile

felixk101 · 2025-12-15T12:47:21Z

CLAUDE.md

+# Required environment variable for local testing
+export OPENAI_API_BASE="http://localhost:11001"  # AI Gateway endpoint
+export GOOGLE_API_KEY="your-api-key"            # Required for Gemini models
+```
+
+### Running the 4-Phase Pipeline Locally
+
+```shell
+# Phase 1: Download and convert dataset to RAGAS format
+uv run python3 scripts/setup.py "http://localhost:11020/dataset.csv"
+
+# Phase 2: Execute queries through agent via A2A protocol
+uv run python3 scripts/run.py "http://localhost:11010"
+
+# Phase 3: Evaluate responses using RAGAS metrics
+uv run python3 scripts/evaluate.py gemini-2.5-flash-lite "faithfulness answer_relevancy"
+
+# Phase 4: Publish metrics to OTLP endpoint
+uv run python3 scripts/publish.py "workflow-name"
+```
+
+### Testkube Execution
+
+```shell
+# Run complete evaluation workflow in Kubernetes
+kubectl testkube run testworkflow ragas-evaluation-workflow \
+    --config datasetUrl="http://data-server.data-server:8000/dataset.csv" \
+    --config agentUrl="http://weather-agent.sample-agents:8000" \
+    --config metrics="nv_accuracy context_recall" \
+    --config workflowName="Test-Run" \
+    -n testkube
+
+# Watch workflow execution


You might just want to use @README.md here.

Thanks, but we had decided to keep CLAUDE.md files and READMEs separate.

felixk101 · 2025-12-15T13:01:27Z

scripts/evaluate.py


    # Save to file
    with open(output_file, "w") as f:
        json.dump(asdict(evaluation_scores), f, indent=2)


I have had problems with this code before. Sometimes NaN is written to the JSON output file as a number, which is not valid JSON. 😕

In my code I had to add the following to publish.py, which is not very pretty.

    def _is_metric_value(value: Any) -> TypeGuard[int | float]:
        """Check if a value is a valid metric score (numeric and not NaN)."""
        if not isinstance(value, (int, float)):
            return False
        if isinstance(value, float) and math.isnan(value):
            return False
        return True

(This is not part of the ticket, though.)

FYI @fmallmann
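
As background for the NaN issue: Python's `json.dump` writes a float NaN as the bare token `NaN` by default, which strict JSON parsers reject. A minimal sketch of one possible fix on the writing side follows. It is not part of this PR, the helper names are hypothetical, and mapping NaN to `null` is an assumption about what downstream consumers such as publish.py would want.

```python
import json
import math
from typing import Any


def replace_nan(value: Any) -> Any:
    """Recursively replace float NaN with None in dicts, lists, and tuples."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: replace_nan(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_nan(item) for item in value]
    return value


def dumps_strict(data: Any) -> str:
    """Serialize to standards-compliant JSON.

    allow_nan=False makes json raise ValueError for any remaining
    non-finite float (for example Infinity) instead of emitting it.
    """
    return json.dumps(replace_nan(data), indent=2, allow_nan=False)


if __name__ == "__main__":
    scores = {"faithfulness": [0.8, float("nan")], "answer_relevancy": 0.7}
    print(json.dumps(float("nan")))  # default behaviour: not valid JSON
    print(dumps_strict(scores))
```

With this approach a reader of the file sees `null` for a missing score and still has to decide how to treat it, so a check like `_is_metric_value` might remain useful for filtering out non-numeric values.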

qa-jil-kamerling added 9 commits December 10, 2025 14:40

feat: PAAL-212 update operator versions

c2902ec

feat: PAAL-212 set up otel

c485bbb

feat: PAAL-212 add tracing to run.py

11b89d6

feat: PAAL-212 update unit tests; by adding otel mock, updating parameters

7db8ff3

feat: PAAL-212 add claude.md

eeb653e

feat: PAAL-212 add otel service name

a2a982c

feat: PAAL-212 update readme examples with values from tiltfile

4de1d39

feat: PAAL-212 update evaluate step

9509447

feat: PAAL-212 add otel port to tiltfile

35bb526

qa-jil-kamerling marked this pull request as ready for review December 12, 2025 10:39

feat: PAAL-212 add correct otel endpoint to workflow

4675104

qa-jil-kamerling requested a review from g3force December 12, 2025 10:44

g3force reviewed Dec 12, 2025

deploy/local/ragas-evaluation-workflow.yaml (outdated, resolved)

feat: PAAL-212 add timeout for helm resource

ed9c8f9

qa-jil-kamerling requested a review from fmallmann December 15, 2025 08:09

g3force reviewed Dec 15, 2025

Tiltfile (outdated, resolved)

felixk101 approved these changes Dec 15, 2025


feat: PAAL-212 implement review comments

d9f4d56

qa-jil-kamerling merged commit b79bfe5 into main Dec 15, 2025
6 checks passed

qa-jil-kamerling deleted the feature/PAAL-212-trace-ids-in-testworkflow branch December 15, 2025 13:31

qa-jil-kamerling merged 12 commits into main from feature/PAAL-212-trace-ids-in-testworkflow


3 participants
