Skip to content

fix(dsm): move dsm plugin init to start from bindStart#7395

Merged
robcarlan-datadog merged 23 commits intomasterfrom
rob.carlan/dsm-context-prop-race-conditions
Feb 23, 2026
Merged

fix(dsm): move dsm plugin init to start from bindStart#7395
robcarlan-datadog merged 23 commits intomasterfrom
rob.carlan/dsm-context-prop-race-conditions

Conversation

@robcarlan-datadog
Copy link
Copy Markdown
Contributor

@robcarlan-datadog robcarlan-datadog commented Jan 30, 2026

Summary

  • DSM context is correctly bound when set in start, which runs inside the async context that bindStart establishes. Setting it in bindStart itself caused race conditions where the DSM context could leak across concurrent operations. This can cause DSM to read context from incorrect operations, leading to incorrect pathways and latencies.
  • Move DSM checkpoint logic from bindStart to start in messaging plugins (kafkajs, amqplib, bullmq), which also fixes Confluent Kafka because that extends kafkajs.
  • I moved Rhea and Google Cloud Pubsub out of this because the tests are failing. I will fix those implementations separately.

Test plan

  • Regression tests (see failing PR before changes here)
  • [ x] Tested locally

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jan 30, 2026

Overall package size

Self size: 4.69 MB
Deduped: 5.53 MB
No deduping: 5.53 MB

Dependency sizes | name | version | self size | total size | |------|---------|-----------|------------| | import-in-the-middle | 2.0.6 | 81.92 kB | 816.75 kB | | dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented Jan 30, 2026

Benchmarks

Benchmark execution time: 2026-02-19 20:16:27

Comparing candidate commit 2720a11 in PR branch rob.carlan/dsm-context-prop-race-conditions with baseline commit 631fb6a in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 226 metrics, 34 unstable metrics.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.30%. Comparing base (631fb6a) to head (2720a11).
⚠️ Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
packages/datadog-plugin-kafkajs/src/consumer.js 9.09% 10 Missing ⚠️
packages/datadog-plugin-kafkajs/src/producer.js 28.57% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7395      +/-   ##
==========================================
- Coverage   80.32%   80.30%   -0.02%     
==========================================
  Files         733      733              
  Lines       31546    31560      +14     
==========================================
+ Hits        25338    25343       +5     
- Misses       6208     6217       +9     
Flag Coverage Δ
aiguard-macos 38.93% <ø> (-0.11%) ⬇️
aiguard-ubuntu 39.05% <ø> (-0.11%) ⬇️
aiguard-windows 38.79% <ø> (-0.11%) ⬇️
apm-capabilities-tracing-macos 48.60% <0.00%> (-0.03%) ⬇️
apm-capabilities-tracing-ubuntu 48.63% <0.00%> (-0.03%) ⬇️
apm-capabilities-tracing-windows 48.33% <0.00%> (-0.02%) ⬇️
apm-integrations-child-process 38.50% <ø> (-0.10%) ⬇️
apm-integrations-couchbase-18 37.42% <ø> (-0.10%) ⬇️
apm-integrations-couchbase-eol 37.75% <ø> (-0.25%) ⬇️
apm-integrations-oracledb 37.73% <ø> (-0.10%) ⬇️
appsec-express 55.53% <ø> (-0.08%) ⬇️
appsec-fastify 51.84% <ø> (-0.07%) ⬇️
appsec-graphql 52.03% <ø> (-0.07%) ⬇️
appsec-kafka 44.45% <20.00%> (-0.18%) ⬇️
appsec-ldapjs 44.09% <ø> (-0.09%) ⬇️
appsec-lodash 43.77% <ø> (-0.08%) ⬇️
appsec-macos 58.61% <ø> (-0.07%) ⬇️
appsec-mongodb-core 48.84% <ø> (-0.19%) ⬇️
appsec-mongoose 49.64% <ø> (-0.08%) ⬇️
appsec-mysql 51.01% <ø> (-0.07%) ⬇️
appsec-node-serialize 43.29% <ø> (-0.09%) ⬇️
appsec-passport 47.78% <ø> (-0.09%) ⬇️
appsec-postgres 50.77% <ø> (-0.08%) ⬇️
appsec-sourcing 42.64% <ø> (-0.09%) ⬇️
appsec-template 43.46% <ø> (-0.09%) ⬇️
appsec-ubuntu 58.68% <ø> (-0.07%) ⬇️
appsec-windows 58.46% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-bluebird 32.20% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-body-parser 40.51% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-child_process 37.82% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-cookie-parser 34.24% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-express 34.58% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-express-mongo-sanitize 34.37% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-express-session 40.13% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-fs 31.80% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-generic-pool 29.76% <ø> (ø)
instrumentations-instrumentation-http 39.85% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-knex 32.20% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-mongoose 33.37% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-multer 40.25% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-mysql2 38.29% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-passport 44.09% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-passport-http 43.76% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-passport-local 44.30% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-pg 37.71% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-promise 32.13% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-promise-js 32.13% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-q 32.18% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-url 32.10% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-when 32.15% <ø> (-0.11%) ⬇️
llmobs-ai 41.33% <ø> (-0.24%) ⬇️
llmobs-anthropic 40.32% <ø> (-0.10%) ⬇️
llmobs-bedrock 39.25% <ø> (-0.09%) ⬇️
llmobs-google-genai 39.84% <ø> (-0.09%) ⬇️
llmobs-langchain 39.43% <ø> (-0.08%) ⬇️
llmobs-openai 44.14% <ø> (-0.09%) ⬇️
llmobs-vertex-ai 40.11% <ø> (-0.04%) ⬇️
platform-core 29.71% <ø> (ø)
platform-esbuild 32.89% <ø> (ø)
platform-instrumentations-misc 40.53% <ø> (ø)
platform-shimmer 36.14% <ø> (ø)
platform-unit-guardrails 31.27% <ø> (ø)
plugins-azure-event-hubs 24.02% <ø> (ø)
plugins-azure-service-bus 23.42% <ø> (ø)
plugins-bullmq 43.61% <100.00%> (-0.16%) ⬇️
plugins-cassandra 37.77% <ø> (-0.10%) ⬇️
plugins-cookie 25.08% <ø> (ø)
plugins-cookie-parser 24.87% <ø> (ø)
plugins-crypto 24.72% <ø> (ø)
plugins-dd-trace-api 38.36% <ø> (-0.11%) ⬇️
plugins-express-mongo-sanitize 25.04% <ø> (ø)
plugins-express-session 24.83% <ø> (ø)
plugins-fastify 42.27% <ø> (-0.10%) ⬇️
plugins-fetch 38.32% <ø> (-0.10%) ⬇️
plugins-fs 38.61% <ø> (-0.11%) ⬇️
plugins-generic-pool 24.06% <ø> (ø)
plugins-google-cloud-pubsub 45.46% <ø> (-0.09%) ⬇️
plugins-grpc 40.97% <ø> (-0.10%) ⬇️
plugins-handlebars 25.08% <ø> (ø)
plugins-hapi 40.14% <ø> (-0.10%) ⬇️
plugins-hono 40.41% <ø> (-0.10%) ⬇️
plugins-ioredis 38.41% <ø> (-0.10%) ⬇️
plugins-knex 24.80% <ø> (ø)
plugins-ldapjs 22.61% <ø> (ø)
plugins-light-my-request 24.48% <ø> (ø)
plugins-limitd-client 32.50% <ø> (-0.11%) ⬇️
plugins-lodash 24.13% <ø> (ø)
plugins-mariadb 39.49% <ø> (-0.10%) ⬇️
plugins-memcached 38.15% <ø> (-0.11%) ⬇️
plugins-microgateway-core 39.17% <ø> (-0.10%) ⬇️
plugins-moleculer 40.53% <ø> (-0.10%) ⬇️
plugins-mongodb 39.20% <ø> (-0.17%) ⬇️
plugins-mongodb-core 39.03% <ø> (-0.10%) ⬇️
plugins-mongoose 38.85% <ø> (-0.10%) ⬇️
plugins-multer 24.83% <ø> (ø)
plugins-mysql 39.17% <ø> (-0.10%) ⬇️
plugins-mysql2 39.27% <ø> (-0.10%) ⬇️
plugins-node-serialize 25.12% <ø> (ø)
plugins-opensearch 37.60% <ø> (-0.10%) ⬇️
plugins-passport-http 24.91% <ø> (ø)
plugins-postgres 35.69% <ø> (-0.09%) ⬇️
plugins-process 24.72% <ø> (ø)
plugins-pug 25.08% <ø> (ø)
plugins-redis 38.89% <ø> (-0.10%) ⬇️
plugins-router 43.03% <ø> (-0.10%) ⬇️
plugins-sequelize 23.66% <ø> (ø)
plugins-test-and-upstream-amqp10 38.48% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-amqplib 43.90% <100.00%> (-0.06%) ⬇️
plugins-test-and-upstream-apollo 39.03% <ø> (-0.09%) ⬇️
plugins-test-and-upstream-avsc 38.70% <ø> (-0.11%) ⬇️
plugins-test-and-upstream-bunyan 33.79% <ø> (-0.11%) ⬇️
plugins-test-and-upstream-connect 40.81% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-graphql 40.15% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-koa 40.39% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-protobufjs 38.93% <ø> (-0.11%) ⬇️
plugins-test-and-upstream-rhea 44.10% <ø> (-0.13%) ⬇️
plugins-undici 39.11% <ø> (-0.09%) ⬇️
plugins-url 24.72% <ø> (ø)
plugins-valkey 38.07% <ø> (-0.10%) ⬇️
plugins-vm 24.72% <ø> (ø)
plugins-winston 33.99% <ø> (-0.10%) ⬇️
plugins-ws 41.91% <ø> (-0.10%) ⬇️
profiling-macos 39.84% <ø> (-0.10%) ⬇️
profiling-ubuntu 39.97% <ø> (-0.10%) ⬇️
profiling-windows 41.20% <ø> (-0.10%) ⬇️
serverless-azure-functions-client 23.75% <ø> (ø)
serverless-azure-functions-eventhubs 23.75% <ø> (ø)
serverless-azure-functions-servicebus 23.75% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@watson
Copy link
Copy Markdown
Collaborator

watson commented Feb 1, 2026

FYI: Due to changes in master you need to rebase and run the linter. There's changes in this PR that will break if they land as-is, even though the tests currently pass here.

@robcarlan-datadog robcarlan-datadog force-pushed the rob.carlan/dsm-context-prop-race-conditions branch from 50700ad to df9f01e Compare February 2, 2026 17:11
@robcarlan-datadog robcarlan-datadog added bug Something isn't working semver-patch labels Feb 2, 2026
@robcarlan-datadog robcarlan-datadog changed the title fix(kafkajs): sync DSM context to currentStore to prevent context leaking fix(DSM): sync DSM context to currentStore to prevent context leaks Feb 2, 2026
@robcarlan-datadog robcarlan-datadog changed the title fix(DSM): sync DSM context to currentStore to prevent context leaks fix(dsm): sync context to currentStore to prevent leaking between concurrent handlers Feb 2, 2026
@robcarlan-datadog robcarlan-datadog force-pushed the rob.carlan/dsm-context-prop-race-conditions branch from 08689a7 to 20b2e55 Compare February 2, 2026 17:48
@datadog-datadog-prod-us1
Copy link
Copy Markdown

datadog-datadog-prod-us1 Bot commented Feb 2, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 2720a11 | Docs | Datadog PR Page | Was this helpful? Give us feedback!

@tlhunter
Copy link
Copy Markdown
Member

tlhunter commented Feb 2, 2026

Maybe we would be better off having some sort of bindStart static helper in the producer / consumer base classes? Something to signal that it should always happen at the end of bindStart? That would at least reduce the number of requires and hopefully ensure that the author of the next messaging plugin remembers to call the method.

Copy link
Copy Markdown
Member

@rochdev rochdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a helper should not be needed if we use Diagnostics Channel properly. There should either be a completely separate store or the syncing should happen implicitly.

@BridgeAR BridgeAR marked this pull request as draft February 10, 2026 11:28
@BridgeAR
Copy link
Copy Markdown
Member

I marked it as draft while we do not have a conclusion. @rochdev @robcarlan-datadog what about having a call to discuss the implementation together? :)

@robcarlan-datadog
Copy link
Copy Markdown
Contributor Author

@BridgeAR thank you!
We had a call last week, I have a path forwards and I just came back from PTO :)

robcarlan-datadog and others added 4 commits February 17, 2026 10:23
…king

When processing concurrent Kafka messages, the DSM context was being set
via enterWith() on AsyncLocalStorage but not synced to ctx.currentStore.
Since ctx.currentStore is what gets returned from bindStart and bound to
async continuations via runStores, this caused DSM context to leak between
concurrent message handlers.

The fix syncs the DSM context to ctx.currentStore after DSM operations
complete, ensuring each handler's async continuations maintain the correct
DSM context.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests that verify DSM context is properly scoped to each handler's
async continuations when using diagnostic channels with runStores.

The tests verify that:
1. ctx.currentStore has dataStreamsContext after setDataStreamsContext
2. Concurrent handlers maintain isolated DSM contexts
3. DSM context persists through multiple async boundaries

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add syncToStore helper in DSM context module that syncs DSM context
from AsyncLocalStorage to ctx.currentStore after DSM operations.

This fixes a race condition where DSM context was being set via
enterWith() but not synced to ctx.currentStore, which is what gets
bound to async continuations via store.run(). Without syncing, DSM
context would leak between concurrent message handlers.

Updated plugins:
- kafkajs (consumer, producer)
- amqplib (consumer, producer)
- bullmq (consumer, producer)
- rhea (consumer)
- google-cloud-pubsub (consumer, producer)
- aws-sdk (sqs, kinesis)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The actual fix for the context propagation race condition is moving DSM
logic from bindStart to start. The syncToStore workaround is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robcarlan-datadog robcarlan-datadog force-pushed the rob.carlan/dsm-context-prop-race-conditions branch from 5c06340 to beea240 Compare February 17, 2026 15:24
@robcarlan-datadog robcarlan-datadog changed the title fix(dsm): sync context to currentStore to prevent leaking between concurrent handlers fix(dsm): move dsm init to start from bindStart Feb 17, 2026
@robcarlan-datadog robcarlan-datadog changed the title fix(dsm): move dsm init to start from bindStart fix(dsm): move dsm plugin init to start from bindStart Feb 17, 2026
robcarlan-datadog and others added 3 commits February 17, 2026 10:31
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robcarlan-datadog robcarlan-datadog marked this pull request as ready for review February 17, 2026 16:52
robcarlan-datadog and others added 6 commits February 19, 2026 12:32
…rrent consumers

When two KafkaJS consumers process messages concurrently and each
produces to a different topic, the DSM (Data Streams Monitoring) context
leaks between them. The first consumer to process loses its DSM context
entirely (null parent), while the second consumer picks up the first's
context instead of its own.

This test forces the interleaving by using promise gates to ensure both
eachMessage handlers have fired before either produces, reliably
reproducing the bug.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mq, rhea, google-cloud-pubsub

Same root cause as the kafkajs test: ctx.currentStore is set by
startSpan before decodeDataStreamsContext/setCheckpoint call enterWith,
so the DSM context is never included in the bound store for async
continuations.

Verified deterministic: 10/10 failures across all plugins and versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rhea's producer DSM checkpoint happens in a separate encode hook, not
in bindStart/start, so the current fix doesn't cover it. Will be
addressed in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robcarlan-datadog and others added 2 commits February 19, 2026 14:36
Moved DSM context fix and regression test for google-cloud-pubsub
to #7582 to fix independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
if (!this.config.dsmEnabled) return
const { topic, messages, clusterId, disableHeaderInjection, currentStore: { span } } = ctx

for (const message of messages) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame we need to iterate the message list twice now.

How many messages are usually in the list? 10? 100? more?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the default is 16kB and the message count is how many messages fits into that limit?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expensive part of a loop is its content, not the loop itself, so I would say having better separation of concerns is better. In the future I'd even want DSM to have separate plugins entirely that are not loaded at all when DSM is disabled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it really depends on the client configuration how many messages would be present. It at least exits early in the vast majority of cases where DSM isn't enabled

@tlhunter
Copy link
Copy Markdown
Member

@codex review

@DataDog DataDog deleted a comment from chatgpt-codex-connector Bot Feb 20, 2026
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@robcarlan-datadog robcarlan-datadog merged commit c639f33 into master Feb 23, 2026
792 checks passed
@robcarlan-datadog robcarlan-datadog deleted the rob.carlan/dsm-context-prop-race-conditions branch February 23, 2026 14:53
dd-octo-sts Bot pushed a commit that referenced this pull request Feb 24, 2026
* fix(kafkajs): sync DSM context to currentStore to prevent context leaking

When processing concurrent Kafka messages, the DSM context was being set
via enterWith() on AsyncLocalStorage but not synced to ctx.currentStore.
Since ctx.currentStore is what gets returned from bindStart and bound to
async continuations via runStores, this caused DSM context to leak between
concurrent message handlers.

The fix syncs the DSM context to ctx.currentStore after DSM operations
complete, ensuring each handler's async continuations maintain the correct
DSM context.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test(dsm): add regression test for context propagation race condition

Add tests that verify DSM context is properly scoped to each handler's
async continuations when using diagnostic channels with runStores.

The tests verify that:
1. ctx.currentStore has dataStreamsContext after setDataStreamsContext
2. Concurrent handlers maintain isolated DSM contexts
3. DSM context persists through multiple async boundaries

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): sync DSM context to currentStore to prevent context leaking

Add syncToStore helper in DSM context module that syncs DSM context
from AsyncLocalStorage to ctx.currentStore after DSM operations.

This fixes a race condition where DSM context was being set via
enterWith() but not synced to ctx.currentStore, which is what gets
bound to async continuations via store.run(). Without syncing, DSM
context would leak between concurrent message handlers.

Updated plugins:
- kafkajs (consumer, producer)
- amqplib (consumer, producer)
- bullmq (consumer, producer)
- rhea (consumer)
- google-cloud-pubsub (consumer, producer)
- aws-sdk (sqs, kinesis)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* remove comment

* chore: remove contrived regression test

The unit test was useful for validating the fix during development
but is contrived and doesn't add value as a permanent test.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test(dsm): add unit tests for syncToStore and integration spy tests

Add comprehensive tests for the new syncToStore helper:
- Unit tests for syncToStore in context.spec.js covering normal
  operation, edge cases, and integration with setDataStreamsContext
- Spy tests in 6 integration test files to verify syncToStore is
  called after produce and consume operations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): call syncToStore via module object for testability

Change all plugins to call DataStreamsContext.syncToStore(ctx)
instead of destructuring syncToStore at import time. This allows
sinon spies to intercept calls during testing.

When functions are destructured at require-time, they bind to the
original function reference. Spies set up later on the module
object don't affect these bindings. Calling via the module object
ensures spies work correctly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): remove invalid producer-side syncToStore tests for AWS SDK

The syncToStore fix only applies to the consumer path where bindStart
is used. AWS SDK plugins (SQS, Kinesis) use requestInject for producers,
which doesn't need context synchronization since:

1. requestInject is called before the request is sent
2. DSM context is encoded directly into the message
3. There's no async continuation where context leaking would occur

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: remove unnecessary comments from test files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): move DSM logic from bindStart to start for correct context binding

By moving DSM checkpoint/encoding logic into start() (which runs after
the child context is bound), the DSM context naturally lands in the
correct async store — eliminating the need for the syncToStore workaround
in kafkajs, amqplib, bullmq, and rhea consumer plugins.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(dsm): remove syncToStore functionality

The actual fix for the context propagation race condition is moving DSM
logic from bindStart to start. The syncToStore workaround is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(dsm): move DSM logic from bindStart to start for google-cloud-pubsub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove extra blank lines in google-cloud-pubsub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: fix comma placement in amqplib producer setCheckpoint call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(kafkajs): add regression test for DSM context leak between concurrent consumers

When two KafkaJS consumers process messages concurrently and each
produces to a different topic, the DSM (Data Streams Monitoring) context
leaks between them. The first consumer to process loses its DSM context
entirely (null parent), while the second consumer picks up the first's
context instead of its own.

This test forces the interleaving by using promise gates to ensure both
eachMessage handlers have fired before either produces, reliably
reproducing the bug.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(kafkajs): remove redundant comments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(dsm): add regression tests for DSM context leak in amqplib, bullmq, rhea, google-cloud-pubsub

Same root cause as the kafkajs test: ctx.currentStore is set by
startSpan before decodeDataStreamsContext/setCheckpoint call enterWith,
so the DSM context is never included in the bound store for async
continuations.

Verified deterministic: 10/10 failures across all plugins and versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(dsm): fix lint and remove redundant comments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(rhea): use const for senderAOut

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(rhea): remove rhea DSM regression test

Rhea's producer DSM checkpoint happens in a separate encode hook, not
in bindStart/start, so the current fix doesn't cover it. Will be
addressed in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(rhea): remove rhea consumer changes, tracked separately in #7581

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(dsm): move google-cloud-pubsub changes to separate PR

Moved DSM context fix and regression test for google-cloud-pubsub
to #7582 to fix independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
@dd-octo-sts dd-octo-sts Bot mentioned this pull request Feb 24, 2026
juan-fernandez pushed a commit that referenced this pull request Mar 5, 2026
* fix(kafkajs): sync DSM context to currentStore to prevent context leaking

When processing concurrent Kafka messages, the DSM context was being set
via enterWith() on AsyncLocalStorage but not synced to ctx.currentStore.
Since ctx.currentStore is what gets returned from bindStart and bound to
async continuations via runStores, this caused DSM context to leak between
concurrent message handlers.

The fix syncs the DSM context to ctx.currentStore after DSM operations
complete, ensuring each handler's async continuations maintain the correct
DSM context.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test(dsm): add regression test for context propagation race condition

Add tests that verify DSM context is properly scoped to each handler's
async continuations when using diagnostic channels with runStores.

The tests verify that:
1. ctx.currentStore has dataStreamsContext after setDataStreamsContext
2. Concurrent handlers maintain isolated DSM contexts
3. DSM context persists through multiple async boundaries

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): sync DSM context to currentStore to prevent context leaking

Add syncToStore helper in DSM context module that syncs DSM context
from AsyncLocalStorage to ctx.currentStore after DSM operations.

This fixes a race condition where DSM context was being set via
enterWith() but not synced to ctx.currentStore, which is what gets
bound to async continuations via store.run(). Without syncing, DSM
context would leak between concurrent message handlers.

Updated plugins:
- kafkajs (consumer, producer)
- amqplib (consumer, producer)
- bullmq (consumer, producer)
- rhea (consumer)
- google-cloud-pubsub (consumer, producer)
- aws-sdk (sqs, kinesis)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* remove comment

* chore: remove contrived regression test

The unit test was useful for validating the fix during development
but is contrived and doesn't add value as a permanent test.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test(dsm): add unit tests for syncToStore and integration spy tests

Add comprehensive tests for the new syncToStore helper:
- Unit tests for syncToStore in context.spec.js covering normal
  operation, edge cases, and integration with setDataStreamsContext
- Spy tests in 6 integration test files to verify syncToStore is
  called after produce and consume operations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): call syncToStore via module object for testability

Change all plugins to call DataStreamsContext.syncToStore(ctx)
instead of destructuring syncToStore at import time. This allows
sinon spies to intercept calls during testing.

When functions are destructured at require-time, they bind to the
original function reference. Spies set up later on the module
object don't affect these bindings. Calling via the module object
ensures spies work correctly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): remove invalid producer-side syncToStore tests for AWS SDK

The syncToStore fix only applies to the consumer path where bindStart
is used. AWS SDK plugins (SQS, Kinesis) use requestInject for producers,
which doesn't need context synchronization since:

1. requestInject is called before the request is sent
2. DSM context is encoded directly into the message
3. There's no async continuation where context leaking would occur

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: remove unnecessary comments from test files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dsm): move DSM logic from bindStart to start for correct context binding

By moving DSM checkpoint/encoding logic into start() (which runs after
the child context is bound), the DSM context naturally lands in the
correct async store — eliminating the need for the syncToStore workaround
in kafkajs, amqplib, bullmq, and rhea consumer plugins.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(dsm): remove syncToStore functionality

The actual fix for the context propagation race condition is moving DSM
logic from bindStart to start. The syncToStore workaround is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(dsm): move DSM logic from bindStart to start for google-cloud-pubsub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove extra blank lines in google-cloud-pubsub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: fix comma placement in amqplib producer setCheckpoint call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(kafkajs): add regression test for DSM context leak between concurrent consumers

When two KafkaJS consumers process messages concurrently and each
produces to a different topic, the DSM (Data Streams Monitoring) context
leaks between them. The first consumer to process loses its DSM context
entirely (null parent), while the second consumer picks up the first's
context instead of its own.

This test forces the interleaving by using promise gates to ensure both
eachMessage handlers have fired before either produces, reliably
reproducing the bug.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(kafkajs): remove redundant comments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(dsm): add regression tests for DSM context leak in amqplib, bullmq, rhea, google-cloud-pubsub

Same root cause as the kafkajs test: ctx.currentStore is set by
startSpan before decodeDataStreamsContext/setCheckpoint call enterWith,
so the DSM context is never included in the bound store for async
continuations.

Verified deterministic: 10/10 failures across all plugins and versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(dsm): fix lint and remove redundant comments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(rhea): use const for senderAOut

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(rhea): remove rhea DSM regression test

Rhea's producer DSM checkpoint happens in a separate encode hook, not
in bindStart/start, so the current fix doesn't cover it. Will be
addressed in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(rhea): remove rhea consumer changes, tracked separately in #7581

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(dsm): move google-cloud-pubsub changes to separate PR

Moved DSM context fix and regression test for google-cloud-pubsub
to #7582 to fix independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants