fix(telemetry): schedule ExtendedHeartbeat on worker start#1910
fix(telemetry): schedule ExtendedHeartbeat on worker start#1910khanayan123 wants to merge 6 commits intomainfrom
Conversation
The `ExtendedHeartbeat` lifecycle action was present in the scheduler's `delays` catalog (populated at `build_worker`) but was never added to the `deadlines` queue, so its handler was never invoked and `app-extended-heartbeat` payloads were never emitted. `Scheduler::new(delays)` always starts with an empty `deadlines` vec, and `next_deadline()` only ever reads from `deadlines`. Events must be explicitly scheduled via `schedule_event(event)` to actually fire. `Lifecycle(Start)` only scheduled `FlushMetricAggr` and `FlushData`. The `Lifecycle(ExtendedHeartbeat)` handler self-reschedules after its first fire — which meant bootstrapping a chicken-and-egg that never resolved. Fix: schedule `ExtendedHeartbeat` alongside `FlushMetricAggr` and `FlushData` inside `Lifecycle(Start)`. The bug went unnoticed because: - Default `telemetry_extended_heartbeat_interval` is 24h - Prior to #1824 the scheduler used a hardcoded 24h anyway, so it was impossible to shorten the interval in tests - No existing unit / integration test waited long enough (or used a short enough interval) to observe the first extended heartbeat Surfaced by system-tests PR 6338, which adds a `TELEMETRY_EXTENDED_HEARTBEAT` scenario with a 2s interval. All libdatadog-based tracers (PHP, Go, .NET, Java Spring-Boot-3-native, etc.) fail that scenario with `app-extended-heartbeat event not found`. Added unit test `lifecycle_start_schedules_extended_heartbeat` that verifies all three events are scheduled after processing `Lifecycle(Start)`. Fails without the fix, passes with it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📚 Documentation Check Results📦
|
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
🔒 Cargo Deny Results📦
|
Replace the ExtendedHeartbeat-specific assertion with an invariant test that walks the scheduler's `delays` catalog and asserts every entry is present in `deadlines` after `Lifecycle(Start)`. A specific test would only catch a regression of this exact bug; an invariant test catches the whole class — if a future periodic `LifecycleAction` is added with a delay but nobody schedules it on Start, the test fails with a message naming the forgotten variant. Also add a negative guard for the `MetricsLogs` flavor to lock in its intentional exclusion of lifecycle events like ExtendedHeartbeat, so a future change that starts emitting lifecycle telemetry from the metrics-only worker has to update the test explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: 64ce4b6 | Docs | Datadog PR Page | Give us feedback! |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1910 +/- ##
==========================================
- Coverage 71.79% 71.79% -0.01%
==========================================
Files 434 434
Lines 69978 70050 +72
==========================================
+ Hits 50239 50290 +51
- Misses 19739 19760 +21
🚀 New features to boost your workflow:
|
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
Both tests build a real `TelemetryWorker`. `dispatch_action(Start)` issues an HTTP `app-started` request via reqwest, and the worker's http client itself is constructed via reqwest — neither of which miri supports. The full-flavor invariant test was hanging miri >540s before timing out. Add `#[cfg_attr(miri, ignore)]` matching the pattern used by 360+ other reqwest-touching tests across the repo. Tests still run on regular `cargo test`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9efac85472
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self.deadlines | ||
| .schedule_event(LifecycleAction::ExtendedHeartbeat) | ||
| .unwrap(); |
There was a problem hiding this comment.
Avoid scheduling extended heartbeat before normal heartbeat
Scheduling LifecycleAction::ExtendedHeartbeat on startup causes a starvation loop when telemetry_extended_heartbeat_interval < telemetry_heartbeat_interval: each extended-heartbeat execution re-schedules FlushData from “now” (schedule_events([FlushData, ExtendedHeartbeat])), and Scheduler::schedule_event_with_from replaces the existing FlushData deadline, so FlushData keeps getting pushed out and never fires. In that configuration, app heartbeats and observability payload delivery from FlushData can stop indefinitely.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — addressed in 7d30638c9.
The ExtendedHeartbeat handler now only re-schedules itself; it no longer touches FlushData's deadline. FlushData already self-reschedules in its own handler, so the two timers operate independently and the starvation case you described can't occur regardless of the relative interval values.
Added a regression test (extended_heartbeat_does_not_reset_flush_data) that captures FlushData's deadline before and after dispatching ExtendedHeartbeat and asserts it's unchanged. Verified the test fails without the fix.
Previously the ExtendedHeartbeat handler called schedule_events([FlushData, ExtendedHeartbeat]). Since schedule_event_with_from removes-and-reinserts the deadline, this replaced FlushData's existing deadline with `now + heartbeat_interval`. When `extended_heartbeat_interval < heartbeat_interval`, each ExtendedHeartbeat firing pushes FlushData out further than the next ExtendedHeartbeat deadline, so FlushData never fires — starving app-heartbeat and observability payload delivery. Fix: only re-schedule self in the ExtendedHeartbeat handler. FlushData already self-reschedules in its own handler; the two timers operate independently. This was latent before #1910 because ExtendedHeartbeat never fired at all. Caught by codex review on the PR. Added regression test `extended_heartbeat_does_not_reset_flush_data` that captures FlushData's deadline before and after dispatching ExtendedHeartbeat and asserts it is unchanged. Verified the test fails without the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug
ExtendedHeartbeatis in the scheduler'sdelayscatalog but never gets a deadline, so its handler never fires andapp-extended-heartbeatis never emitted by any libdatadog-based tracer.Scheduler::newinitializesdeadlines: Vec::new().next_deadline()only readsdeadlines. Events must be moved fromdelays→deadlinesviaschedule_event(event).Lifecycle(Start)only scheduledFlushMetricAggr+FlushData; theExtendedHeartbeathandler self-reschedules but needs a first fire to do so — chicken-and-egg.Hidden until now because the default interval is 24h, and prior to #1824 it was hardcoded to 24h regardless of config. Surfaced by system-tests PR 6338 (
TELEMETRY_EXTENDED_HEARTBEATscenario, 2s interval)Fix
Schedule
ExtendedHeartbeatalongside the others indispatch_action'sLifecycle(Start).Tests
full_flavor_start_schedules_every_periodic_action— invariant: walksdelays, asserts each is indeadlinesafter Start. Catches the whole class of bug, not just this variant. Verified to fail without the fix and pass with it.metrics_logs_flavor_start_does_not_schedule_extended_heartbeat— negative guard locking in theMetricsLogsflavor's intentional exclusion of lifecycle events.Both are
#[cfg_attr(miri, ignore)]since they exercise reqwest.