feat(job_queue): export cancel_stats and active_jobs to OpenTelemetry #13123
Draft
ogabrielluiz wants to merge 1 commit into
Wire the new Redis job queue counters into the existing langflow OTel
instrumentation so they flow to the Prometheus exporter when
LANGFLOW_PROMETHEUS_ENABLED=true. Until now cancel_stats lived only in
the JSON /monitor/job_queue snapshot — fine for ad-hoc inspection,
useless for ongoing ops dashboards or alerting.
Two new metrics are registered:
* langflow_job_queue_cancel_events_total (Counter, label: event_type) —
a single Counter with one series per event_type value, covering all eleven cancel_stats
keys (published, marker_hit, dispatched_owned, dispatched_foreign,
publish_errors, dispatcher_reconnects, dispatcher_internal_errors,
polling_watchdog_kills, activity_touch_errors, activity_get_errors,
activity_parse_errors).
* langflow_job_queue_active_jobs (UpDownCounter, label: backend) —
bumped +1 in JobQueueService.create_queue and -1 in cleanup_job, so
both the in-memory ("memory") and Redis ("redis") backends report.
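A minimal sketch of the active_jobs bookkeeping described above. StandInUpDownCounter and SketchJobQueue are hypothetical names; the stand-in only mimics the add(amount, attributes) call shape of an OTel UpDownCounter and is not the langflow implementation. Guarding the decrement so a repeated cleanup does not double-count is an assumption of this sketch.

```python
from collections import defaultdict


class StandInUpDownCounter:
    """Stand-in for an OTel UpDownCounter: one running total per label set."""

    def __init__(self) -> None:
        self.values = defaultdict(int)

    def add(self, amount: int, attributes: dict) -> None:
        # Label sets are normalized to a sorted tuple so they can be dict keys.
        self.values[tuple(sorted(attributes.items()))] += amount


class SketchJobQueue:
    """Hypothetical queue service; only the metric bookkeeping is shown."""

    def __init__(self, backend: str, active_jobs: StandInUpDownCounter) -> None:
        self.backend = backend  # "memory" or "redis"
        self._active_jobs = active_jobs
        self._queues: dict = {}

    def create_queue(self, job_id: str) -> None:
        if job_id in self._queues:
            raise ValueError(f"queue already exists for {job_id}")
        self._queues[job_id] = []
        self._active_jobs.add(1, {"backend": self.backend})

    def cleanup_job(self, job_id: str) -> None:
        # Assumption: decrement only when a queue was actually removed,
        # so calling cleanup twice cannot push the value below the truth.
        if self._queues.pop(job_id, None) is not None:
            self._active_jobs.add(-1, {"backend": self.backend})
```

Because both backends share one instrument and differ only in the backend label, a dashboard can sum across labels or split by them without a second metric.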
A single _bump_cancel_stat helper on RedisJobQueueService is now the
sole mutation point for the cancel_stats dict, guaranteeing the JSON
snapshot and Prometheus stay in lockstep.
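The single-mutation-point idea can be sketched roughly as follows. SketchRedisJobQueue and the emit callback are hypothetical stand-ins for RedisJobQueueService and the OTel counter; only the eleven key names and the _bump_cancel_stat name come from this PR. Error handling around the emit is left out of this sketch.

```python
from collections.abc import Callable

# The eleven cancel_stats keys listed in this PR.
CANCEL_STAT_KEYS = (
    "published",
    "marker_hit",
    "dispatched_owned",
    "dispatched_foreign",
    "publish_errors",
    "dispatcher_reconnects",
    "dispatcher_internal_errors",
    "polling_watchdog_kills",
    "activity_touch_errors",
    "activity_get_errors",
    "activity_parse_errors",
)


class SketchRedisJobQueue:
    """Hypothetical service showing one helper that updates both sinks."""

    def __init__(self, emit: Callable[[str, int], None]) -> None:
        # JSON snapshot source: every key starts at zero.
        self.cancel_stats = dict.fromkeys(CANCEL_STAT_KEYS, 0)
        # Assumed callback that forwards to the labeled counter.
        self._emit = emit

    def _bump_cancel_stat(self, event_type: str, amount: int = 1) -> None:
        # An unknown key raises KeyError, so a typo cannot create a new series.
        self.cancel_stats[event_type] += amount
        self._emit(event_type, amount)
```

Call sites then use self._bump_cancel_stat("published") instead of touching the dict directly, which is what keeps the two views from drifting apart.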
The OTel emit path is best-effort: a cached lazy resolver fetches the
telemetry singleton once, and contextlib.suppress wraps the emit so a
broken or unavailable telemetry layer never propagates into the queue
hot path.
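One way the best-effort path could look, as a sketch under stated assumptions: BestEffortEmitter, the loader argument, and telemetry.record are hypothetical names, not langflow APIs. Caching a failed lookup as None (so the lookup is not retried on every emit) is also an assumption; the PR only states that the singleton is fetched once.

```python
import contextlib
from collections.abc import Callable
from typing import Any

_UNSET = object()  # sentinel: the resolver has not run yet


class BestEffortEmitter:
    """Hypothetical wrapper: resolve telemetry lazily, never raise on emit."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._telemetry: Any = _UNSET
        self.load_attempts = 0  # exposed only so the caching is observable

    def _resolve(self) -> Any:
        if self._telemetry is _UNSET:
            self.load_attempts += 1
            self._telemetry = None  # a failed lookup is cached as None
            with contextlib.suppress(Exception):
                self._telemetry = self._loader()
        return self._telemetry

    def emit(self, event_type: str, amount: int = 1) -> None:
        # Any failure in lookup or recording is swallowed here, so the
        # queue code calling emit() never sees a telemetry exception.
        with contextlib.suppress(Exception):
            telemetry = self._resolve()
            if telemetry is not None:
                telemetry.record(event_type, amount)
```

The trade-off is that swallowed errors are invisible unless logged elsewhere, which is acceptable here because the cancel_stats dict still records every event.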
Tests cover dict + counter parity, active_jobs delta on create/cleanup,
silent failure when telemetry is unavailable, and that every
cancel_stats key routes through the helper.
Summary
Wires the Redis job queue's cancel_stats and active_jobs into the existing OpenTelemetry instrumentation. When LANGFLOW_PROMETHEUS_ENABLED=true, the new /monitor/job_queue counters now flow to the Prometheus exporter alongside file_uploads/num_files_uploaded. Until now they only existed in the JSON snapshot: usable for ad-hoc debugging, useless for ongoing ops dashboards or alerting.

What's new
Two metrics registered in services/telemetry/opentelemetry.py:
* langflow_job_queue_cancel_events_total: Counter, label event_type. One Counter covering all 11 cancel_stats keys (published, marker_hit, dispatched_owned, dispatched_foreign, publish_errors, dispatcher_reconnects, dispatcher_internal_errors, polling_watchdog_kills, activity_touch_errors, activity_get_errors, activity_parse_errors).
* langflow_job_queue_active_jobs: UpDownCounter, label backend (memory or redis). +1 in JobQueueService.create_queue, -1 in cleanup_job.

A new _bump_cancel_stat helper on RedisJobQueueService is the sole mutation point for the dict, so the JSON snapshot and Prometheus stay in lockstep, with no risk of forgetting to bump one or the other.

Reliability
The OTel emit path is best-effort: contextlib.suppress(Exception) wraps the emit so a broken or unavailable telemetry layer never propagates into the queue hot path.

Stacks on #13084
Depends on the cancel_stats introduced there. Until #13084 merges, this PR's diff will show those commits too; only 89379bb46e is unique to this branch.