
feat: PAAL-233 refactor gauges to differentiate metrics and user_input #12

Closed
felixk101 wants to merge 7 commits into main from
feature/PAAL-233-publish-gauge-per-sample-and-metric

Conversation


@felixk101 felixk101 commented Dec 12, 2025

This branch is based on @qa-jil-kamerling's feature branch (not main). This PR should stay in draft until feature/PAAL-212-trace-ids-in-testworkflow is merged.

felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 5fad9e3 to 5eceb19 on December 12, 2025 15:59
felixk101 changed the title from "Feature/paal 233 publish gauge per sample and metric" to "feat: PAAL-233 refactor gauges to differentiate metrics and user_input" on Dec 12, 2025
felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 45f1fc1 to e9ad6c0 on December 12, 2025 16:11
felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch 2 times, most recently from 7db9c80 to a5bed35 on December 15, 2025 15:42
@felixk101
Contributor Author

@claude please review


claude bot commented Dec 15, 2025

Claude finished @felixk101's task in 1m 59s


Code Review Complete

I've reviewed the PR that refactors gauge metrics to differentiate between metrics and user_input. Here's my comprehensive feedback:


Overview

What Changed:

  • Refactored scripts/publish.py to publish per-sample gauges instead of overall scores
  • Changed from multiple metric-specific gauges to a single gauge differentiated by name attribute
  • Added user_input and trace_id as metric attributes for better observability
  • Updated tests to match the new architecture
  • Fixed CI configuration and deployment issues

Architecture Change:

  • Before: One gauge per metric (e.g., ragas_evaluation_faithfulness, ragas_evaluation_answer_relevancy)
  • After: Single gauge testbench_evaluation_metric with attributes: name, workflow_name, user_input, trace_id

Critical Issues

1. High Cardinality Risk in OpenTelemetry Metrics 🔴

Location: scripts/publish.py:95-101

attributes = {
    "name": metric_name,
    "workflow_name": workflow_name,
    "user_input": user_input,  # ⚠️ HIGH CARDINALITY
    "trace_id": trace_id,       # ⚠️ HIGH CARDINALITY
}
metric_gauge.set(score, attributes)

Problem: Using user_input and trace_id as metric attributes creates extremely high cardinality. Each unique combination creates a new time series in your metrics backend (LGTM/Prometheus).

Impact:

  • Running 1000 samples with 4 metrics yields 4,000 time series per workflow run
  • This can overwhelm Prometheus/Mimir, causing:
    • Memory exhaustion
    • Query performance degradation
    • Potential data loss
    • Cardinality limits being hit
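
The scale of the problem can be sketched with a back-of-envelope calculation (the figures are illustrative assumptions, not measured from this repo):

```python
# Rough cardinality estimate: every distinct attribute combination
# (name, workflow_name, user_input, trace_id) becomes its own time series.
samples_per_run = 1000    # distinct user_input values (illustrative)
metrics_per_sample = 4    # e.g. faithfulness, answer_relevancy, ...
runs_per_day = 10         # each run generates fresh trace_ids

# Because trace_id is unique per run, series are never reused across runs:
new_series_per_day = samples_per_run * metrics_per_sample * runs_per_day
print(new_series_per_day)  # 40000
```

Unlike bounded labels, these series accumulate indefinitely, which is why backends enforce cardinality limits.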

Best Practices:

  • OpenTelemetry/Prometheus recommend keeping metric cardinality low (< 10 dimensions typically)
  • High-cardinality data (like individual queries/traces) should use traces or logs, not metrics
  • Metrics should aggregate data, not store individual events

Recommendation:

# Option 1: Use aggregated metrics + traces for details
# Keep metrics for aggregations only
attributes = {
    "name": metric_name,
    "workflow_name": workflow_name,
    # Don't include user_input or trace_id
}

# Option 2: Use exemplars (if supported by LGTM stack)
# Exemplars let you attach trace IDs to metrics without creating new time series

Alternative: If you need per-sample visibility, consider:

  1. Using OpenTelemetry Logs with structured attributes
  2. Creating a custom trace span for each evaluation with span attributes
  3. Storing results in a database and using metrics only for aggregates

2. Incomplete OTLP Endpoint Migration

Location: deploy/local/ragas-evaluation-workflow.yaml:60

otlpEndpoint:
  default: "http://otel-collector.monitoring.svc.cluster.local:4318"

Problem: You changed the default OTLP endpoint from http://lgtm.monitoring:4318 to http://otel-collector.monitoring.svc.cluster.local:4318, but:

  1. No OTel collector exists in this PR - I don't see an otel-collector deployment in deploy/local/
  2. Inconsistent with Tiltfile - Tiltfile:42 still forwards the port from the lgtm service
  3. May break local dev - The new endpoint won't work if no collector is deployed

Questions:

  • Is there supposed to be an OTel collector deployment in this PR?
  • Or should this change be reverted to use lgtm.monitoring:4318 directly?
  • Was this intended for the base branch (PAAL-212)?

Recommendation: Either:

  • Add the otel-collector deployment manifests
  • Or revert this change and keep using lgtm.monitoring:4318

Code Quality Issues

3. Missing Error Handling for Invalid Trace IDs

Location: scripts/publish.py:94

trace_id = result.get("trace_id", "unknown")

Issue: Using "unknown" as a fallback adds noise to metrics. If trace IDs are missing, you should:

  1. Log a warning
  2. Skip the sample or use a standardized placeholder
  3. Consider making trace_id optional if it's from the base branch

Recommendation:

trace_id = result.get("trace_id")
if not trace_id:
    logger.warning(f"Missing trace_id for sample: {user_input[:50]}...")
    trace_id = "missing-trace-id"

4. Silent Skipping of Invalid Metrics

Location: scripts/publish.py:91-92

if not _is_metric_value(score):
    continue

Issue: Invalid metrics are silently skipped. This could hide data quality issues.

Recommendation:

if not _is_metric_value(score):
    logger.debug(f"Skipping invalid metric value for {metric_name} in sample {user_input[:50]}: {score}")
    continue

5. Inconsistent Error Handling in Tests

Location: tests/test_publish.py:415-420

The mock provider raises an exception but doesn't use _OtelMockMeter() like other test fixtures. This inconsistency could hide bugs if the real provider's get_meter is called during initialization.

Recommendation: Make mock patterns consistent across all tests.
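
A consistent pattern might look like the following sketch (`_OtelMockMeter` is the name used in the PR's tests; its shape and the provider helper here are assumed):

```python
from unittest import mock

class _OtelMockMeter:
    """Hypothetical shared mock meter: records every gauge it creates."""

    def __init__(self):
        self.gauges = {}

    def create_gauge(self, name, **kwargs):
        gauge = mock.MagicMock(name=name)
        self.gauges[name] = gauge
        return gauge

def make_mock_provider(raise_on_flush=False):
    # Every fixture goes through this helper, so get_meter() always returns
    # the shared mock meter even in tests that simulate export failures.
    provider = mock.MagicMock()
    provider.get_meter.return_value = _OtelMockMeter()
    if raise_on_flush:
        provider.force_flush.side_effect = RuntimeError("export failed")
    return provider
```

With one factory, the error-path test and the happy-path tests exercise the same meter surface.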


Minor Issues & Suggestions

6. Unused Overall Scores

You removed overall_scores from being published, but they're still in the JSON file. Consider documenting why they're not published anymore, or remove them from evaluate.py if they're truly unused.


7. Metric Naming Convention

Location: scripts/publish.py:82

name="testbench_evaluation_metric",

Observation: You changed from ragas_evaluation_* to testbench_evaluation_*. This is a breaking change for any existing dashboards/alerts.

Recommendations:

  • Document this breaking change in the PR description
  • Provide migration notes for dashboard updates
  • Consider if backward compatibility is needed
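
If backward compatibility turns out to be needed, a transition shim could mirror values to the legacy gauge names (a sketch; `publish_score` and the minimal meter interface are assumptions, only the gauge names come from this PR):

```python
def publish_score(meter, score, attributes):
    # New scheme: one shared gauge, metric identified by the "name" attribute.
    new_gauge = meter.create_gauge("testbench_evaluation_metric")
    new_gauge.set(score, attributes)

    # Transition period: mirror the value to the legacy per-metric gauge so
    # existing dashboards and alerts keep working until they are migrated.
    legacy_gauge = meter.create_gauge(f"ragas_evaluation_{attributes['name']}")
    legacy_gauge.set(score, {"workflow_name": attributes["workflow_name"]})
```

The legacy emission can be deleted in a follow-up once dashboards are updated.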

8. CI Improvements Are Good

The CI fixes in .github/workflows/ci.yml are excellent:

  • Increased timeout to 10 minutes (line 34 in Tiltfile)
  • Added retry logic with cleanup (lines 118-132)
  • Clean up stuck Helm releases (line 128)

These should help with the flaky Tilt setup mentioned in the code.


9. Test Coverage Looks Comprehensive

The test updates are thorough:

  • Tests for _is_metric_value() type guard
  • Per-sample gauge value tests
  • Tests verify all three gauges are created
  • Proper mocking of OpenTelemetry components

Good job maintaining test quality!


10. Code Style Issue in scripts/run.py

Location: scripts/run.py:19

from otel_setup import setup_otel

This import was moved (likely by a linter), but the change is purely cosmetic. However, imports should follow PEP 8 order:

  1. Standard library
  2. Third-party
  3. Local imports

The new position is correct. ✅


Security Considerations

11. Potential PII in Metrics

Location: scripts/publish.py:93-98

If user_input contains personally identifiable information (PII), you're now publishing PII to your metrics backend. This could violate:

  • GDPR
  • CCPA
  • Internal data policies

Recommendations:

  1. Hash or truncate user_input before using it as an attribute
  2. Add documentation about PII handling
  3. Consider using only the first N characters or a hash
  4. Review your organization's data retention policies

Example:

import hashlib

user_input = result.get("user_input", "unknown")
# Use hash instead of raw input
user_input_hash = hashlib.sha256(user_input.encode()).hexdigest()[:16]
attributes = {
    "name": metric_name,
    "workflow_name": workflow_name,
    "user_input_hash": user_input_hash,  # Use hash instead
    "trace_id": trace_id,
}

Performance Considerations

12. Metric Export Performance

Location: scripts/publish.py:88-102

You're calling gauge.set() in nested loops. For a dataset with:

  • 100 samples
  • 4 metrics

That's 400 gauge.set() calls. While not inherently bad, be aware this could slow down for large datasets (1000+ samples).

Recommendation: This is acceptable for now, but consider batching if you hit performance issues.


Documentation Gaps

13. Missing Migration Notes

This PR introduces breaking changes but doesn't update:

  • CLAUDE.md - Should document new metric structure
  • README (if exists) - Should document metric naming changes
  • Grafana dashboard updates needed

Summary

Critical Issues to Address:

  1. 🔴 High cardinality risk - Using user_input and trace_id as metric attributes is an anti-pattern
  2. 🟡 OTLP endpoint change - May break if otel-collector doesn't exist

Recommendations:

  1. Remove user_input and trace_id from metric attributes, use traces/logs instead
  2. Clarify the OTLP endpoint situation
  3. Add PII handling if user_input contains sensitive data
  4. Document the breaking changes in metric names

What's Good:

  • Comprehensive test coverage
  • CI improvements are solid
  • Code structure is clean
  • Type safety with TypeGuard

Overall Assessment: The refactoring is well-implemented from a code perspective, but the high cardinality concern is critical and should be addressed before merging. Consider whether per-sample metrics are the right approach, or if traces/logs would be more appropriate for this use case.



felixk101 commented Dec 15, 2025

🔴 High cardinality risk - Using user_input and trace_id as metric attributes is an anti-pattern

This is explicitly desired. The issue will be mitigated by capping how often our user_inputs change over time.

🟡 OTLP endpoint change - May break if otel-collector doesn't exist

This was modified in ba92327 (not part of my changes).

Code Quality Issues

✅ addressed

felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from a5bed35 to aa44bff on December 15, 2025 15:54
felixk101 marked this pull request as ready for review on December 15, 2025 15:55
 k8s_resource('ai-gateway-litellm', port_forwards=['11001:4000'])
 k8s_resource('weather-agent', port_forwards='11010:8000', labels=['agents'], resource_deps=['agent-runtime'])
-k8s_resource('lgtm', port_forwards=['11000:3000'])
+k8s_resource('lgtm', port_forwards=['11000:3000', '4318:4318'])
Contributor


Why is this port needed?

Contributor Author


I have dropped the commit from Jil's old branch. Fixed.

Contributor Author


Nevermind, this is needed - we want to be able to test run.py locally (See the README). That only works if our locally running python script can send spans to the otel-collector running in our Tiltfile.

felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 2d794d0 to 8eeab05 on December 16, 2025 15:38
felixk101 requested a review from fmallmann on December 16, 2025 15:39
felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 8eeab05 to 3ebbe26 on December 17, 2025 12:28
@felixk101
Contributor Author

I am closing this PR in favor of doing everything all-at-once in #13. This is because dashboard changes are tightly bound to metric changes.

@felixk101 felixk101 closed this Dec 19, 2025