
feat: PAAL-233 refactor gauges to differentiate metrics and user_input #12

Closed
felixk101 wants to merge 7 commits into main from
feature/PAAL-233-publish-gauge-per-sample-and-metric

Conversation


@felixk101 felixk101 commented Dec 12, 2025

This branch is based on @qa-jil-kamerling's feature branch (not main). This PR should stay in draft until feature/PAAL-212-trace-ids-in-testworkflow is merged.

felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 5fad9e3 to 5eceb19 on December 12, 2025 15:59
felixk101 changed the title from "Feature/paal 233 publish gauge per sample and metric" to "feat: PAAL-233 refactor gauges to differentiate metrics and user_input" on Dec 12, 2025
felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 45f1fc1 to e9ad6c0 on December 12, 2025 16:11
felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch 2 times, most recently from 7db9c80 to a5bed35 on December 15, 2025 15:42
@felixk101
Contributor Author

@claude please review


claude bot commented Dec 15, 2025

Claude finished @felixk101's task in 1m 59s


Code Review Complete

I've reviewed the PR that refactors gauge metrics to differentiate between metrics and user_input. Here's my comprehensive feedback:


Overview

What Changed:

  • Refactored scripts/publish.py to publish per-sample gauges instead of overall scores
  • Changed from multiple metric-specific gauges to a single gauge differentiated by name attribute
  • Added user_input and trace_id as metric attributes for better observability
  • Updated tests to match the new architecture
  • Fixed CI configuration and deployment issues

Architecture Change:

  • Before: One gauge per metric (e.g., ragas_evaluation_faithfulness, ragas_evaluation_answer_relevancy)
  • After: Single gauge testbench_evaluation_metric with attributes: name, workflow_name, user_input, trace_id

Critical Issues

1. High Cardinality Risk in OpenTelemetry Metrics 🔴

Location: scripts/publish.py:95-101

attributes = {
    "name": metric_name,
    "workflow_name": workflow_name,
    "user_input": user_input,  # ⚠️ HIGH CARDINALITY
    "trace_id": trace_id,       # ⚠️ HIGH CARDINALITY
}
metric_gauge.set(score, attributes)

Problem: Using user_input and trace_id as metric attributes creates extremely high cardinality. Each unique combination creates a new time series in your metrics backend (LGTM/Prometheus).

Impact:

  • Running 1000 samples with 4 metrics yields 4,000 time series per workflow run
  • This can overwhelm Prometheus/Mimir, causing:
    • Memory exhaustion
    • Query performance degradation
    • Potential data loss
    • Cardinality limits being hit
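
The scale of the problem can be sketched with a back-of-envelope calculation (the figures are illustrative assumptions, not measured from this repo):

```python
# Rough cardinality estimate: every distinct attribute combination
# (name, workflow_name, user_input, trace_id) becomes its own time series.
samples_per_run = 1000    # distinct user_input values (illustrative)
metrics_per_sample = 4    # e.g. faithfulness, answer_relevancy, ...
runs_per_day = 10         # each run generates fresh trace_ids

# Because trace_id is unique per run, series are never reused across runs:
new_series_per_day = samples_per_run * metrics_per_sample * runs_per_day
print(new_series_per_day)  # 40000
```

Unlike bounded labels, these series accumulate indefinitely, which is why backends enforce cardinality limits.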

Best Practices:

  • OpenTelemetry/Prometheus recommend keeping metric cardinality low (< 10 dimensions typically)
  • High-cardinality data (like individual queries/traces) should use traces or logs, not metrics
  • Metrics should aggregate data, not store individual events

Recommendation:

# Option 1: Use aggregated metrics + traces for details
# Keep metrics for aggregations only
attributes = {
    "name": metric_name,
    "workflow_name": workflow_name,
    # Don't include user_input or trace_id
}

# Option 2: Use exemplars (if supported by LGTM stack)
# Exemplars let you attach trace IDs to metrics without creating new time series

Alternative: If you need per-sample visibility, consider:

  1. Using OpenTelemetry Logs with structured attributes
  2. Creating a custom trace span for each evaluation with span attributes
  3. Storing results in a database and using metrics only for aggregates

2. Incomplete OTLP Endpoint Migration

Location: deploy/local/ragas-evaluation-workflow.yaml:60

otlpEndpoint:
  default: "http://otel-collector.monitoring.svc.cluster.local:4318"

Problem: You changed the default OTLP endpoint from http://lgtm.monitoring:4318 to http://otel-collector.monitoring.svc.cluster.local:4318, but:

  1. No OTel collector exists in this PR - I don't see an otel-collector deployment in deploy/local/
  2. Inconsistent with Tiltfile - Tiltfile:42 still forwards the port from the lgtm service
  3. May break local dev - The new endpoint won't work if no collector is deployed

Questions:

  • Is there supposed to be an OTel collector deployment in this PR?
  • Or should this change be reverted to use lgtm.monitoring:4318 directly?
  • Was this intended for the base branch (PAAL-212)?

Recommendation: Either:

  • Add the otel-collector deployment manifests
  • Or revert this change and keep using lgtm.monitoring:4318

Code Quality Issues

3. Missing Error Handling for Invalid Trace IDs

Location: scripts/publish.py:94

trace_id = result.get("trace_id", "unknown")

Issue: Using "unknown" as a fallback adds noise to metrics. If trace IDs are missing, you should:

  1. Log a warning
  2. Skip the sample or use a standardized placeholder
  3. Consider making trace_id optional if it's from the base branch

Recommendation:

trace_id = result.get("trace_id")
if not trace_id:
    logger.warning(f"Missing trace_id for sample: {user_input[:50]}...")
    trace_id = "missing-trace-id"

4. Silent Skipping of Invalid Metrics

Location: scripts/publish.py:91-92

if not _is_metric_value(score):
    continue

Issue: Invalid metrics are silently skipped. This could hide data quality issues.

Recommendation:

if not _is_metric_value(score):
    logger.debug(f"Skipping invalid metric value for {metric_name} in sample {user_input[:50]}: {score}")
    continue

5. Inconsistent Error Handling in Tests

Location: tests/test_publish.py:415-420

The mock provider raises an exception but doesn't use _OtelMockMeter() like other test fixtures. This inconsistency could hide bugs if the real provider's get_meter is called during initialization.

Recommendation: Make mock patterns consistent across all tests.
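
A consistent pattern might look like the following sketch (`_OtelMockMeter` is the name used in the PR's tests; its shape and the provider helper here are assumed):

```python
from unittest import mock

class _OtelMockMeter:
    """Hypothetical shared mock meter: records every gauge it creates."""

    def __init__(self):
        self.gauges = {}

    def create_gauge(self, name, **kwargs):
        gauge = mock.MagicMock(name=name)
        self.gauges[name] = gauge
        return gauge

def make_mock_provider(raise_on_flush=False):
    # Every fixture goes through this helper, so get_meter() always returns
    # the shared mock meter even in tests that simulate export failures.
    provider = mock.MagicMock()
    provider.get_meter.return_value = _OtelMockMeter()
    if raise_on_flush:
        provider.force_flush.side_effect = RuntimeError("export failed")
    return provider
```

With one factory, the error-path test and the happy-path tests exercise the same meter surface.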


Minor Issues & Suggestions

6. Unused Overall Scores

You removed overall_scores from being published, but they're still in the JSON file. Consider documenting why they're not published anymore, or remove them from evaluate.py if they're truly unused.


7. Metric Naming Convention

Location: scripts/publish.py:82

name="testbench_evaluation_metric",

Observation: You changed from ragas_evaluation_* to testbench_evaluation_*. This is a breaking change for any existing dashboards/alerts.

Recommendations:

  • Document this breaking change in the PR description
  • Provide migration notes for dashboard updates
  • Consider if backward compatibility is needed
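
If backward compatibility turns out to be needed, a transition shim could mirror values to the legacy gauge names (a sketch; `publish_score` and the minimal meter interface are assumptions, only the gauge names come from this PR):

```python
def publish_score(meter, score, attributes):
    # New scheme: one shared gauge, metric identified by the "name" attribute.
    new_gauge = meter.create_gauge("testbench_evaluation_metric")
    new_gauge.set(score, attributes)

    # Transition period: mirror the value to the legacy per-metric gauge so
    # existing dashboards and alerts keep working until they are migrated.
    legacy_gauge = meter.create_gauge(f"ragas_evaluation_{attributes['name']}")
    legacy_gauge.set(score, {"workflow_name": attributes["workflow_name"]})
```

The legacy emission can be deleted in a follow-up once dashboards are updated.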

8. CI Improvements Are Good

The CI fixes in .github/workflows/ci.yml are excellent:

  • Increased timeout to 10 minutes (line 34 in Tiltfile)
  • Added retry logic with cleanup (lines 118-132)
  • Clean up stuck Helm releases (line 128)

These should help with the flaky Tilt setup mentioned in the code.


9. Test Coverage Looks Comprehensive

The test updates are thorough:

  • Tests for _is_metric_value() type guard
  • Per-sample gauge value tests
  • Tests verify all three gauges are created
  • Proper mocking of OpenTelemetry components

Good job maintaining test quality!


10. Code Style Issue in scripts/run.py

Location: scripts/run.py:19

from otel_setup import setup_otel

This import was moved (likely by a linter), but the change is purely cosmetic. However, imports should follow PEP 8 order:

  1. Standard library
  2. Third-party
  3. Local imports

The new position is correct. ✅


Security Considerations

11. Potential PII in Metrics

Location: scripts/publish.py:93-98

If user_input contains personally identifiable information (PII), you're now publishing PII to your metrics backend. This could violate:

  • GDPR
  • CCPA
  • Internal data policies

Recommendations:

  1. Hash or truncate user_input before using it as an attribute
  2. Add documentation about PII handling
  3. Consider using only the first N characters or a hash
  4. Review your organization's data retention policies

Example:

import hashlib

user_input = result.get("user_input", "unknown")
# Use hash instead of raw input
user_input_hash = hashlib.sha256(user_input.encode()).hexdigest()[:16]
attributes = {
    "name": metric_name,
    "workflow_name": workflow_name,
    "user_input_hash": user_input_hash,  # Use hash instead
    "trace_id": trace_id,
}

Performance Considerations

12. Metric Export Performance

Location: scripts/publish.py:88-102

You're calling gauge.set() in nested loops. For a dataset with:

  • 100 samples
  • 4 metrics

That's 400 gauge.set() calls. While not inherently bad, be aware this could slow down for large datasets (1000+ samples).

Recommendation: This is acceptable for now, but consider batching if you hit performance issues.


Documentation Gaps

13. Missing Migration Notes

This PR introduces breaking changes but doesn't update:

  • CLAUDE.md - Should document new metric structure
  • README (if exists) - Should document metric naming changes
  • Grafana dashboard updates needed

Summary

Critical Issues to Address:

  1. 🔴 High cardinality risk - Using user_input and trace_id as metric attributes is an anti-pattern
  2. 🟡 OTLP endpoint change - May break if otel-collector doesn't exist

Recommendations:

  1. Remove user_input and trace_id from metric attributes, use traces/logs instead
  2. Clarify the OTLP endpoint situation
  3. Add PII handling if user_input contains sensitive data
  4. Document the breaking changes in metric names

What's Good:

  • Comprehensive test coverage
  • CI improvements are solid
  • Code structure is clean
  • Type safety with TypeGuard

Overall Assessment: The refactoring is well-implemented from a code perspective, but the high cardinality concern is critical and should be addressed before merging. Consider whether per-sample metrics are the right approach, or if traces/logs would be more appropriate for this use case.



felixk101 commented Dec 15, 2025

🔴 High cardinality risk - Using user_input and trace_id as metric attributes is an anti-pattern

This is explicitly desired. The issue will be mitigated by capping how often our user_inputs change over time.

🟡 OTLP endpoint change - May break if otel-collector doesn't exist

This was modified in ba92327 (not part of my changes).

Code Quality Issues

✅ addressed

felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from a5bed35 to aa44bff on December 15, 2025 15:54
felixk101 marked this pull request as ready for review on December 15, 2025 15:55
 k8s_resource('ai-gateway-litellm', port_forwards=['11001:4000'])
 k8s_resource('weather-agent', port_forwards='11010:8000', labels=['agents'], resource_deps=['agent-runtime'])
-k8s_resource('lgtm', port_forwards=['11000:3000'])
+k8s_resource('lgtm', port_forwards=['11000:3000', '4318:4318'])
Contributor


Why is this port needed?

Contributor Author


I have dropped the commit from Jil's old branch. Fixed.

Contributor Author


Nevermind, this is needed - we want to be able to test run.py locally (See the README). That only works if our locally running python script can send spans to the otel-collector running in our Tiltfile.

felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 2d794d0 to 8eeab05 on December 16, 2025 15:38
felixk101 requested a review from fmallmann on December 16, 2025 15:39
felixk101 force-pushed the feature/PAAL-233-publish-gauge-per-sample-and-metric branch from 8eeab05 to 3ebbe26 on December 17, 2025 12:28
@felixk101
Contributor Author

I am closing this PR in favor of doing everything all-at-once in #13. This is because dashboard changes are tightly bound to metric changes.

@felixk101 felixk101 closed this Dec 19, 2025