
fix(telemetry): schedule ExtendedHeartbeat on worker start#1910

Open
khanayan123 wants to merge 6 commits into main from
ayan.khan/fix-extended-heartbeat-scheduling

Conversation

@khanayan123
Contributor

@khanayan123 khanayan123 commented Apr 22, 2026

Bug

ExtendedHeartbeat is in the scheduler's delays catalog but never gets a deadline, so its handler never fires and app-extended-heartbeat is never emitted by any libdatadog-based tracer.

Scheduler::new initializes deadlines as Vec::new(), and next_deadline() only reads deadlines. Events must be moved from delays to deadlines via schedule_event(event). Lifecycle(Start) only scheduled FlushMetricAggr and FlushData; the ExtendedHeartbeat handler reschedules itself, but only after it has fired once, so it never got started (a chicken-and-egg problem).

The bug stayed hidden because the default interval is 24h, and before #1824 the interval was hardcoded to 24h regardless of config. It was surfaced by system-tests PR 6338 (TELEMETRY_EXTENDED_HEARTBEAT scenario, 2s interval).

Fix

Schedule ExtendedHeartbeat alongside the others in dispatch_action's Lifecycle(Start).
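
To make the mechanism concrete, here is a minimal, self-contained model of the behavior described above. It is a sketch, not the libdd-telemetry code: the `Scheduler` layout, the extra `now` parameter, the `is_scheduled` helper, the `start_before_fix` / `start_after_fix` functions, and the 10s / 60s delays are all assumptions made for illustration. Only the 24h default and the event names come from this PR.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LifecycleAction {
    FlushMetricAggr,
    FlushData,
    ExtendedHeartbeat,
}

/// Simplified stand-in for the worker's scheduler (hypothetical layout).
struct Scheduler {
    /// Catalog of known events and their periods.
    delays: Vec<(LifecycleAction, Duration)>,
    /// Events that will actually fire, with their next fire time.
    deadlines: Vec<(Instant, LifecycleAction)>,
}

impl Scheduler {
    /// The catalog is filled in, but nothing is scheduled yet.
    fn new(delays: Vec<(LifecycleAction, Duration)>) -> Self {
        Scheduler { delays, deadlines: Vec::new() }
    }

    /// Looks up the event's delay and (re)inserts its deadline.
    fn schedule_event(&mut self, event: LifecycleAction, now: Instant) -> Result<(), String> {
        let delay = self
            .delays
            .iter()
            .find(|(e, _)| *e == event)
            .map(|(_, d)| *d)
            .ok_or_else(|| format!("no delay registered for {event:?}"))?;
        self.deadlines.retain(|(_, e)| *e != event);
        self.deadlines.push((now + delay, event));
        Ok(())
    }

    /// Only consults `deadlines`; the `delays` catalog is never read here.
    fn next_deadline(&self) -> Option<(Instant, LifecycleAction)> {
        self.deadlines.iter().copied().min_by_key(|(at, _)| *at)
    }

    fn is_scheduled(&self, event: LifecycleAction) -> bool {
        self.deadlines.iter().any(|(_, e)| *e == event)
    }
}

/// Illustrative delays; only the 24h value reflects a default named in this PR.
fn example_delays() -> Vec<(LifecycleAction, Duration)> {
    vec![
        (LifecycleAction::FlushMetricAggr, Duration::from_secs(10)),
        (LifecycleAction::FlushData, Duration::from_secs(60)),
        (LifecycleAction::ExtendedHeartbeat, Duration::from_secs(24 * 60 * 60)),
    ]
}

/// Start handling as described before the fix: two of the three events.
fn start_before_fix(s: &mut Scheduler, now: Instant) -> Result<(), String> {
    s.schedule_event(LifecycleAction::FlushMetricAggr, now)?;
    s.schedule_event(LifecycleAction::FlushData, now)
}

/// Start handling with the fix: `ExtendedHeartbeat` gets its first deadline too.
fn start_after_fix(s: &mut Scheduler, now: Instant) -> Result<(), String> {
    start_before_fix(s, now)?;
    s.schedule_event(LifecycleAction::ExtendedHeartbeat, now)
}
```

In this model, `start_before_fix` leaves `ExtendedHeartbeat` in `delays` only, so `next_deadline()` can never return it, while `start_after_fix` gives it a first deadline from which a self-rescheduling handler could continue.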

Tests

  • full_flavor_start_schedules_every_periodic_action — invariant: walks delays, asserts each is in deadlines after Start. Catches the whole class of bug, not just this variant. Verified to fail without the fix and pass with it.
  • metrics_logs_flavor_start_does_not_schedule_extended_heartbeat — negative guard locking in the MetricsLogs flavor's intentional exclusion of lifecycle events.

Both are #[cfg_attr(miri, ignore)] since they exercise reqwest.
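
The shape of the invariant check can be sketched as follows. This is a simplified illustration rather than the test added in this PR: `unscheduled` and `assert_every_periodic_action_scheduled` are hypothetical helpers, and the catalog and deadline queue are reduced to plain slices of actions.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum LifecycleAction {
    FlushMetricAggr,
    FlushData,
    ExtendedHeartbeat,
}

/// Returns every action that has a delay but no deadline, in catalog order.
fn unscheduled(delays: &[LifecycleAction], deadlines: &[LifecycleAction]) -> Vec<LifecycleAction> {
    let scheduled: HashSet<LifecycleAction> = deadlines.iter().copied().collect();
    delays
        .iter()
        .copied()
        .filter(|action| !scheduled.contains(action))
        .collect()
}

/// Panics with a message naming the forgotten variants, if there are any.
fn assert_every_periodic_action_scheduled(
    delays: &[LifecycleAction],
    deadlines: &[LifecycleAction],
) {
    let missing = unscheduled(delays, deadlines);
    assert!(
        missing.is_empty(),
        "actions with a delay but no deadline after Start: {missing:?}"
    );
}
```

Because the check iterates over the catalog instead of naming `ExtendedHeartbeat`, a new periodic action that is added to `delays` but not scheduled on Start would show up in `missing` without the test being edited.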

The `ExtendedHeartbeat` lifecycle action was present in the scheduler's
`delays` catalog (populated at `build_worker`) but was never added to
the `deadlines` queue, so its handler was never invoked and
`app-extended-heartbeat` payloads were never emitted.

`Scheduler::new(delays)` always starts with an empty `deadlines` vec,
and `next_deadline()` only ever reads from `deadlines`. Events must be
explicitly scheduled via `schedule_event(event)` to actually fire.

`Lifecycle(Start)` only scheduled `FlushMetricAggr` and `FlushData`.
The `Lifecycle(ExtendedHeartbeat)` handler reschedules itself, but only
after its first fire, and nothing ever scheduled that first fire: a
chicken-and-egg problem that never resolved.

Fix: schedule `ExtendedHeartbeat` alongside `FlushMetricAggr` and
`FlushData` inside `Lifecycle(Start)`.

The bug went unnoticed because:
- Default `telemetry_extended_heartbeat_interval` is 24h
- Prior to #1824 the scheduler used a hardcoded 24h anyway, so it was
  impossible to shorten the interval in tests
- No existing unit / integration test waited long enough (or used a
  short enough interval) to observe the first extended heartbeat

Surfaced by system-tests PR 6338, which adds a
`TELEMETRY_EXTENDED_HEARTBEAT` scenario with a 2s interval. All
libdatadog-based tracers (PHP, Go, .NET, Java Spring-Boot-3-native,
etc.) fail that scenario with `app-extended-heartbeat event not found`.

Added unit test `lifecycle_start_schedules_extended_heartbeat` that
verifies all three events are scheduled after processing
`Lifecycle(Start)`. Fails without the fix, passes with it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

📚 Documentation Check Results

⚠️ 495 documentation warning(s) found

📦 libdd-telemetry - 495 warning(s)


Updated: 2026-04-28 17:08:10 UTC | Commit: 99657c4 | missing-docs job results

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

  • Base Branch: origin/main
  • PR Branch: origin/ayan.khan/fix-extended-heartbeat-scheduling

Summary by Rule

| Rule | Base Branch | PR Branch | Change |
|---|---|---|---|
| unwrap_used | 13 | 14 | ⚠️ +1 (+7.7%) |
| Total | 13 | 14 | ⚠️ +1 (+7.7%) |

Annotation Counts by File

| File | Base Branch | PR Branch | Change |
|---|---|---|---|
| libdd-telemetry/src/worker/mod.rs | 13 | 14 | ⚠️ +1 (+7.7%) |

Annotation Stats by Crate

| Crate | Base Branch | PR Branch | Change |
|---|---|---|---|
| clippy-annotation-reporter | 5 | 5 | No change (0%) |
| datadog-ffe-ffi | 1 | 1 | No change (0%) |
| datadog-ipc | 21 | 21 | No change (0%) |
| datadog-live-debugger | 6 | 6 | No change (0%) |
| datadog-live-debugger-ffi | 10 | 10 | No change (0%) |
| datadog-profiling-replayer | 4 | 4 | No change (0%) |
| datadog-remote-config | 3 | 3 | No change (0%) |
| datadog-sidecar | 56 | 56 | No change (0%) |
| libdd-common | 10 | 10 | No change (0%) |
| libdd-common-ffi | 12 | 12 | No change (0%) |
| libdd-data-pipeline | 5 | 5 | No change (0%) |
| libdd-ddsketch | 2 | 2 | No change (0%) |
| libdd-dogstatsd-client | 1 | 1 | No change (0%) |
| libdd-profiling | 13 | 13 | No change (0%) |
| libdd-telemetry | 19 | 20 | ⚠️ +1 (+5.3%) |
| libdd-tinybytes | 4 | 4 | No change (0%) |
| libdd-trace-normalization | 2 | 2 | No change (0%) |
| libdd-trace-obfuscation | 8 | 8 | No change (0%) |
| libdd-trace-stats | 1 | 1 | No change (0%) |
| libdd-trace-utils | 15 | 15 | No change (0%) |
| Total | 198 | 199 | ⚠️ +1 (+0.5%) |

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

🔒 Cargo Deny Results

⚠️ 4 issue(s) found, showing only errors (advisories, bans, sources)

📦 libdd-telemetry - 4 error(s)

error[unsound]: Rand is unsound with a custom logger using `rand::rng()`
   ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:78:1
   │
78 │ rand 0.8.5 registry+https://github.com/rust-lang/crates.io-index
   │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ unsound advisory detected
   │
   ├ ID: RUSTSEC-2026-0097
   ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0097
   ├ It has been reported (by @lopopolo) that the `rand` library is [unsound](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library) (i.e. that safe code using the public API can cause Undefined Behaviour) when all the following conditions are met:
     
     - The `log` and `thread_rng` features are enabled
     - A [custom logger](https://docs.rs/log/latest/log/#implementing-a-logger) is defined
     - The custom logger accesses `rand::rng()` (previously `rand::thread_rng()`) and calls any `TryRng` (previously `RngCore`) methods on `ThreadRng`
     - The `ThreadRng` (attempts to) reseed while called from the custom logger (this happens every 64 kB of generated data)
     - Trace-level logging is enabled or warn-level logging is enabled and the random source (the `getrandom` crate) is unable to provide a new seed
     
     `TryRng` (previously `RngCore`) methods for `ThreadRng` use `unsafe` code to cast `*mut BlockRng<ReseedingCore>` to `&mut BlockRng<ReseedingCore>`. When all the above conditions are met this results in an aliased mutable reference, violating the Stacked Borrows rules. Miri is able to detect this violation in sample code. Since construction of [aliased mutable references is Undefined Behaviour](https://doc.rust-lang.org/stable/nomicon/references.html), the behaviour of optimized builds is hard to predict.
   ├ Announcement: https://github.com/rust-random/rand/pull/1763
   ├ Solution: Upgrade to >=0.10.1 OR <0.10.0, >=0.9.3 OR <0.9.0, >=0.8.6 (try `cargo update -p rand`)
   ├ rand v0.8.5
     └── libdd-common v4.0.0
         ├── libdd-shared-runtime v0.1.0
         │   └── libdd-telemetry v4.0.0
         └── libdd-telemetry v4.0.0 (*)

error[vulnerability]: Name constraints for URI names were incorrectly accepted
   ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:89:1
   │
89 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
   │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
   │
   ├ ID: RUSTSEC-2026-0098
   ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0098
   ├ Name constraints for URI names were ignored and therefore accepted.
     
     Note this library does not provide an API for asserting URI names, and URI name constraints are otherwise not implemented.  URI name constraints are now rejected unconditionally.
     
     Since name constraints are restrictions on otherwise properly-issued certificates, this bug is reachable only after signature verification and requires misissuance to exploit.
     
     This vulnerability is identified as [GHSA-965h-392x-2mh5](https://github.com/rustls/webpki/security/advisories/GHSA-965h-392x-2mh5). Thank you to @1seal for the report.
   ├ Solution: Upgrade to >=0.103.12, <0.104.0-alpha.1 OR >=0.104.0-alpha.6 (try `cargo update -p rustls-webpki`)
   ├ rustls-webpki v0.103.10
     └── rustls v0.23.37
         ├── hyper-rustls v0.27.7
         │   └── libdd-common v4.0.0
         │       ├── libdd-shared-runtime v0.1.0
         │       │   └── libdd-telemetry v4.0.0
         │       └── libdd-telemetry v4.0.0 (*)
         ├── libdd-common v4.0.0 (*)
         └── tokio-rustls v0.26.0
             ├── hyper-rustls v0.27.7 (*)
             └── libdd-common v4.0.0 (*)

error[vulnerability]: Name constraints were accepted for certificates asserting a wildcard name
   ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:89:1
   │
89 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
   │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
   │
   ├ ID: RUSTSEC-2026-0099
   ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0099
   ├ Permitted subtree name constraints for DNS names were accepted for certificates asserting a wildcard name.
     
     This was incorrect because, given a name constraint of `accept.example.com`, `*.example.com` could feasibly allow a name of `reject.example.com` which is outside the constraint.
     This is very similar to [CVE-2025-61727](https://go.dev/issue/76442).
     
     Since name constraints are restrictions on otherwise properly-issued certificates, this bug is reachable only after signature verification and requires misissuance to exploit.
     
     This vulnerability is identified as [GHSA-xgp8-3hg3-c2mh](https://github.com/rustls/webpki/security/advisories/GHSA-xgp8-3hg3-c2mh). Thank you to @1seal for the report.
   ├ Solution: Upgrade to >=0.103.12, <0.104.0-alpha.1 OR >=0.104.0-alpha.6 (try `cargo update -p rustls-webpki`)
   ├ rustls-webpki v0.103.10
     └── rustls v0.23.37
         ├── hyper-rustls v0.27.7
         │   └── libdd-common v4.0.0
         │       ├── libdd-shared-runtime v0.1.0
         │       │   └── libdd-telemetry v4.0.0
         │       └── libdd-telemetry v4.0.0 (*)
         ├── libdd-common v4.0.0 (*)
         └── tokio-rustls v0.26.0
             ├── hyper-rustls v0.27.7 (*)
             └── libdd-common v4.0.0 (*)

error[vulnerability]: Reachable panic in certificate revocation list parsing
   ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:89:1
   │
89 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
   │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
   │
   ├ ID: RUSTSEC-2026-0104
   ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0104
   ├ A panic was reachable when parsing certificate revocation lists via [`BorrowedCertRevocationList::from_der`]
     or [`OwnedCertRevocationList::from_der`].  This was the result of mishandling a syntactically valid empty
     `BIT STRING` appearing in the `onlySomeReasons` element of a `IssuingDistributionPoint` CRL extension.
     
     This panic is reachable prior to a CRL's signature being verified.
     
     Applications that do not use CRLs are not affected.
     
     Thank you to @tynus3 for the report.
   ├ Solution: Upgrade to >=0.103.13, <0.104.0-alpha.1 OR >=0.104.0-alpha.7 (try `cargo update -p rustls-webpki`)
   ├ rustls-webpki v0.103.10
     └── rustls v0.23.37
         ├── hyper-rustls v0.27.7
         │   └── libdd-common v4.0.0
         │       ├── libdd-shared-runtime v0.1.0
         │       │   └── libdd-telemetry v4.0.0
         │       └── libdd-telemetry v4.0.0 (*)
         ├── libdd-common v4.0.0 (*)
         └── tokio-rustls v0.26.0
             ├── hyper-rustls v0.27.7 (*)
             └── libdd-common v4.0.0 (*)

advisories FAILED, bans ok, sources ok

Updated: 2026-04-28 17:09:45 UTC | Commit: 99657c4 | dependency-check job results

Replace the ExtendedHeartbeat-specific assertion with an invariant
test that walks the scheduler's `delays` catalog and asserts every
entry is present in `deadlines` after `Lifecycle(Start)`.

A specific test would only catch a regression of this exact bug; an
invariant test catches the whole class — if a future periodic
`LifecycleAction` is added with a delay but nobody schedules it on
Start, the test fails with a message naming the forgotten variant.

Also add a negative guard for the `MetricsLogs` flavor to lock in
its intentional exclusion of lifecycle events like ExtendedHeartbeat,
so a future change that starts emitting lifecycle telemetry from the
metrics-only worker has to update the test explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-prod-us1-6

datadog-prod-us1-6 Bot commented Apr 22, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 71.79% (-0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 64ce4b6 | Docs | Datadog PR Page | Give us feedback!

@codecov-commenter

codecov-commenter commented Apr 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.79%. Comparing base (cff7291) to head (64ce4b6).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1910      +/-   ##
==========================================
- Coverage   71.79%   71.79%   -0.01%     
==========================================
  Files         434      434              
  Lines       69978    70050      +72     
==========================================
+ Hits        50239    50290      +51     
- Misses      19739    19760      +21     

| Components | Coverage Δ |
|---|---|
| libdd-crashtracker | 65.94% <ø> (-0.08%) ⬇️ |
| libdd-crashtracker-ffi | 34.47% <ø> (ø) |
| libdd-alloc | 98.77% <ø> (ø) |
| libdd-data-pipeline | 85.86% <ø> (ø) |
| libdd-data-pipeline-ffi | 71.94% <ø> (ø) |
| libdd-common | 79.61% <ø> (+0.20%) ⬆️ |
| libdd-common-ffi | 73.87% <ø> (ø) |
| libdd-telemetry | 69.35% <100.00%> (+1.25%) ⬆️ |
| libdd-telemetry-ffi | 19.37% <ø> (ø) |
| libdd-dogstatsd-client | 82.64% <ø> (ø) |
| datadog-ipc | 74.84% <ø> (-1.33%) ⬇️ |
| libdd-profiling | 81.61% <ø> (ø) |
| libdd-profiling-ffi | 64.36% <ø> (ø) |
| datadog-sidecar | 29.34% <ø> (ø) |
| datdog-sidecar-ffi | 8.41% <ø> (ø) |
| spawn-worker | 54.69% <ø> (ø) |
| libdd-tinybytes | 93.16% <ø> (ø) |
| libdd-trace-normalization | 81.71% <ø> (ø) |
| libdd-trace-obfuscation | 87.26% <ø> (ø) |
| libdd-trace-protobuf | 68.25% <ø> (ø) |
| libdd-trace-utils | 89.27% <ø> (ø) |
| libdd-tracer-flare | 86.88% <ø> (ø) |
| libdd-log | 74.69% <ø> (ø) |

@dd-octo-sts
Contributor

dd-octo-sts Bot commented Apr 22, 2026

Artifact Size Benchmark Report

aarch64-alpine-linux-musl

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /aarch64-alpine-linux-musl/lib/libdatadog_profiling.a | 83.31 MB | 83.31 MB | +0% (+96 B) 👌 |
| /aarch64-alpine-linux-musl/lib/libdatadog_profiling.so | 7.63 MB | 7.63 MB | 0% (0 B) 👌 |

aarch64-unknown-linux-gnu

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /aarch64-unknown-linux-gnu/lib/libdatadog_profiling.a | 99.42 MB | 99.42 MB | +0% (+672 B) 👌 |
| /aarch64-unknown-linux-gnu/lib/libdatadog_profiling.so | 10.10 MB | 10.10 MB | +0% (+24 B) 👌 |

libdatadog-x64-windows

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.dll | 25.19 MB | 25.19 MB | --.01% (-5.00 KB) 💪 |
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.lib | 79.90 KB | 79.90 KB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.pdb | 184.54 MB | 184.48 MB | --.02% (-56.00 KB) 💪 |
| /libdatadog-x64-windows/debug/static/datadog_profiling_ffi.lib | 918.37 MB | 918.30 MB | -0% (-70.10 KB) 👌 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.dll | 7.89 MB | 7.89 MB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.lib | 79.90 KB | 79.90 KB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.pdb | 23.67 MB | 23.67 MB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/static/datadog_profiling_ffi.lib | 46.19 MB | 46.19 MB | -0% (-48 B) 👌 |

libdatadog-x86-windows

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.dll | 21.67 MB | 21.67 MB | -0% (-1.00 KB) 👌 |
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.lib | 81.14 KB | 81.14 KB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.pdb | 188.61 MB | 188.59 MB | -0% (-16.00 KB) 👌 |
| /libdatadog-x86-windows/debug/static/datadog_profiling_ffi.lib | 904.02 MB | 903.96 MB | -0% (-60.83 KB) 👌 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.dll | 6.13 MB | 6.13 MB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.lib | 81.14 KB | 81.14 KB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.pdb | 25.35 MB | 25.35 MB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/static/datadog_profiling_ffi.lib | 43.67 MB | 43.67 MB | +0% (+120 B) 👌 |

x86_64-alpine-linux-musl

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /x86_64-alpine-linux-musl/lib/libdatadog_profiling.a | 74.27 MB | 74.27 MB | +0% (+416 B) 👌 |
| /x86_64-alpine-linux-musl/lib/libdatadog_profiling.so | 8.55 MB | 8.55 MB | 0% (0 B) 👌 |

x86_64-unknown-linux-gnu

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /x86_64-unknown-linux-gnu/lib/libdatadog_profiling.a | 91.78 MB | 91.78 MB | +0% (+528 B) 👌 |
| /x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so | 10.20 MB | 10.20 MB | 0% (0 B) 👌 |

khanayan123 and others added 3 commits April 27, 2026 11:20
Both tests build a real `TelemetryWorker`. `dispatch_action(Start)`
issues an HTTP `app-started` request via reqwest, and the worker's
http client itself is constructed via reqwest — neither of which
miri supports. The full-flavor invariant test was hanging miri >540s
before timing out.

Add `#[cfg_attr(miri, ignore)]` matching the pattern used by 360+
other reqwest-touching tests across the repo. Tests still run on
regular `cargo test`.
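
For readers unfamiliar with the attribute, the pattern looks roughly like this. It is a minimal sketch with a hypothetical helper (`heartbeat_due`) and a hypothetical test name, not one of the tests touched by this commit.

```rust
/// Hypothetical helper standing in for the code under test.
fn heartbeat_due(elapsed_secs: u64, interval_secs: u64) -> bool {
    interval_secs > 0 && elapsed_secs >= interval_secs
}

#[cfg(test)]
mod tests {
    use super::*;

    // Miri builds set the `miri` cfg, so under `cargo miri test` this
    // attribute expands to `#[ignore]`. Under a regular `cargo test` it
    // expands to nothing and the test runs as usual.
    #[test]
    #[cfg_attr(miri, ignore)]
    fn heartbeat_is_due_once_interval_has_elapsed() {
        assert!(heartbeat_due(3, 2));
        assert!(!heartbeat_due(1, 2));
    }
}
```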

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@khanayan123 khanayan123 marked this pull request as ready for review April 28, 2026 16:55
@khanayan123 khanayan123 requested a review from a team as a code owner April 28, 2026 16:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9efac85472


Comment on lines +432 to +434
self.deadlines
    .schedule_event(LifecycleAction::ExtendedHeartbeat)
    .unwrap();

P1: Avoid scheduling extended heartbeat before normal heartbeat

Scheduling LifecycleAction::ExtendedHeartbeat on startup causes a starvation loop when telemetry_extended_heartbeat_interval < telemetry_heartbeat_interval: each extended-heartbeat execution re-schedules FlushData from “now” (schedule_events([FlushData, ExtendedHeartbeat])), and Scheduler::schedule_event_with_from replaces the existing FlushData deadline, so FlushData keeps getting pushed out and never fires. In that configuration, app heartbeats and observability payload delivery from FlushData can stop indefinitely.


Contributor Author


Good catch — addressed in 7d30638c9.

The ExtendedHeartbeat handler now only re-schedules itself; it no longer touches FlushData's deadline. FlushData already self-reschedules in its own handler, so the two timers operate independently and the starvation case you described can't occur regardless of the relative interval values.

Added a regression test (extended_heartbeat_does_not_reset_flush_data) that captures FlushData's deadline before and after dispatching ExtendedHeartbeat and asserts it's unchanged. Verified the test fails without the fix.

Previously the ExtendedHeartbeat handler called
schedule_events([FlushData, ExtendedHeartbeat]). Since
schedule_event_with_from removes-and-reinserts the deadline, this
replaced FlushData's existing deadline with `now + heartbeat_interval`.

When `extended_heartbeat_interval < heartbeat_interval`, each
ExtendedHeartbeat firing pushes FlushData out further than the next
ExtendedHeartbeat deadline, so FlushData never fires — starving
app-heartbeat and observability payload delivery.
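
The starvation can be reproduced with a small tick-based simulation. This is an illustrative model only: `flush_data_firings`, the `Handler` enum, and the integer tick clock are assumptions made for the sketch, and it presumes both timers receive their first deadline at t = 0.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Handler {
    /// Handler re-schedules FlushData and itself.
    ReschedulesFlushData,
    /// Handler re-schedules only itself.
    ReschedulesSelfOnly,
}

/// Counts how many times FlushData fires within `horizon` ticks.
/// `heartbeat` and `extended` are the two intervals, in ticks.
fn flush_data_firings(heartbeat: u64, extended: u64, horizon: u64, handler: Handler) -> u32 {
    // Both timers get their first deadline at Start (t = 0).
    let mut flush_at = heartbeat;
    let mut extended_at = extended;
    let mut firings = 0;

    for now in 0..=horizon {
        if now == extended_at {
            // The extended heartbeat always re-schedules itself.
            extended_at = now + extended;
            if handler == Handler::ReschedulesFlushData {
                // Replaces the pending FlushData deadline with one
                // measured from "now", pushing it further out.
                flush_at = now + heartbeat;
            }
        }
        if now == flush_at {
            firings += 1;
            // FlushData re-schedules itself in its own handler.
            flush_at = now + heartbeat;
        }
    }
    firings
}
```

With `heartbeat = 10` and `extended = 2` over 100 ticks, the first handler variant never lets FlushData reach its deadline, while the second lets it fire on every multiple of 10.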

Fix: only re-schedule self in the ExtendedHeartbeat handler. FlushData
already self-reschedules in its own handler; the two timers operate
independently.

This was latent before #1910 because ExtendedHeartbeat never fired at
all. Caught by codex review on the PR.

Added regression test `extended_heartbeat_does_not_reset_flush_data`
that captures FlushData's deadline before and after dispatching
ExtendedHeartbeat and asserts it is unchanged. Verified the test
fails without the fix.
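
The before/after check described here might look like the following sketch. It uses a toy scheduler with integer ticks rather than the real `TelemetryWorker`; the `Scheduler` fields, the `deadline_of` accessor, and the two handler functions are hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Action {
    FlushData,
    ExtendedHeartbeat,
}

/// Toy scheduler: time is a plain tick counter so results are deterministic.
struct Scheduler {
    delays: HashMap<Action, u64>,
    deadlines: HashMap<Action, u64>,
}

impl Scheduler {
    fn new(flush_every: u64, extended_every: u64) -> Self {
        Scheduler {
            delays: HashMap::from([
                (Action::FlushData, flush_every),
                (Action::ExtendedHeartbeat, extended_every),
            ]),
            deadlines: HashMap::new(),
        }
    }

    /// Inserting replaces any deadline the action already had.
    fn schedule_event(&mut self, action: Action, now: u64) {
        let delay = self.delays[&action];
        self.deadlines.insert(action, now + delay);
    }

    fn deadline_of(&self, action: Action) -> Option<u64> {
        self.deadlines.get(&action).copied()
    }
}

/// Previous behavior: re-schedules FlushData together with itself.
fn on_extended_heartbeat_old(s: &mut Scheduler, now: u64) {
    s.schedule_event(Action::FlushData, now);
    s.schedule_event(Action::ExtendedHeartbeat, now);
}

/// Fixed behavior: re-schedules only itself.
fn on_extended_heartbeat_new(s: &mut Scheduler, now: u64) {
    s.schedule_event(Action::ExtendedHeartbeat, now);
}
```

A regression check in this model captures `deadline_of(Action::FlushData)` before calling `on_extended_heartbeat_new` and asserts that the value is identical afterwards; running the same check against `on_extended_heartbeat_old` would fail because the deadline moves.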

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>