[system_tests] Fix race conditions in set_and_wait_rc by using unique config IDs #6371

Merged
vlad-scherbich merged 9 commits into main from vlad/fix-dynamic-config-flakes on Feb 26, 2026

Conversation

@vlad-scherbich
Contributor

@vlad-scherbich vlad-scherbich commented Feb 23, 2026

https://datadoghq.atlassian.net/browse/PROF-13796

Motivation

~25 of 32 system test failures on dd-trace-py main in 2026 stem from dynamic configuration tests. Root cause: a race in set_and_wait_rc, which can match stale RC ACKs from a previous config update and return before the new config is actually applied.

Tests fixed

All tests in test_dynamic_configuration.py that use set_and_wait_rc. Top flaky tests by frequency (dd-trace-py main, 2026):

| Rank | Hits | Test |
|------|------|------|
| 1 | 6 | test_remote_sampling_rules_retention |
| 2 | 6 | test_trace_sampling_rate_override_default |
| 3 | 5 | test_capability_tracing_sample_rules |
| 4 | 3 | test_trace_sampling_rules_override_env |

Others: test_apply_state, test_trace_sampling_rate_override_env, test_trace_sampling_rate_with_sampling_rules, test_log_injection_enabled, test_tracing_client_tracing_tags, test_trace_sampling_rules_override_rate, test_trace_sampling_rules_with_tags.

Changes

  • _set_rc: Use uuid.uuid4() for config_id when none is passed. This avoids repeating IDs for identical payloads (a payload hash would reuse the same ID and recreate the stale-ACK race). Return the config_id used.
  • set_and_wait_rc: When config_id is passed (reuse case), clear the agent before set_rc to discard buffered RC requests, so that only responses to this update are seen. Use config_id filtering in wait_for_rc_apply_state so that only ACKs for the config just set are matched.
  • wait_for_rc_apply_state (_test_agent.py): Add optional config_id parameter; when set, only match config_states whose id equals it. Use str() on both sides for robust int/str comparison.
  • test_capability_tracing_sample_rules: Use wait_loops=_RC_WAIT_LOOPS (400, ~4s) so the library has enough time to send its first RC request.
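
The config_id filtering described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `new_config_id` and `find_matching_ack` are hypothetical, and the request layout (`client.state.config_states` entries carrying `id` and `apply_state`) is an assumption made for the example.

```python
import uuid


def new_config_id() -> str:
    """Return a fresh ID per update so identical payloads never share an ID."""
    return str(uuid.uuid4())


def find_matching_ack(rc_requests, config_id, wanted_state):
    """Return the first config state that belongs to `config_id` and has the
    wanted apply state, or None if only stale or unrelated states are present.

    Assumed (hypothetical) request shape:
    {"client": {"state": {"config_states": [{"id": ..., "apply_state": ...}]}}}
    """
    for request in rc_requests:
        states = request.get("client", {}).get("state", {}).get("config_states", [])
        for state in states:
            # Compare as strings so an int ID and a str ID still match.
            if str(state.get("id")) != str(config_id):
                continue
            if state.get("apply_state") == wanted_state:
                return state
    return None
```

With a unique ID per update, an ACK left over from a previous update carries a different ID and is skipped instead of ending the wait early.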

Reviewer checklist

  • Anything but tests/ or manifests/ is modified? I have the approval from R&P team
  • A docker base image is modified?
    • the relevant build-XXX-image label is present
  • A scenario is added, removed or renamed?

@github-actions
Contributor

github-actions Bot commented Feb 23, 2026

CODEOWNERS have been resolved as:

tests/parametric/test_dynamic_configuration.py                          @DataDog/system-tests-core @DataDog/apm-sdk-capabilities
utils/docker_fixtures/_test_agent.py                                    @DataDog/system-tests-core

@datadog-official

datadog-official Bot commented Feb 23, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 236b7e9

@vlad-scherbich vlad-scherbich force-pushed the vlad/fix-dynamic-config-flakes branch 2 times, most recently from 1efc4da to a8cf21a Compare February 23, 2026 21:10
@vlad-scherbich vlad-scherbich changed the title Attempt to fix race conditions in 'set_and_wait_rc' by counting telem… [system_tests] Fix race conditions in 'set_and_wait_rc' by counting telemetry events Feb 24, 2026
@vlad-scherbich vlad-scherbich force-pushed the vlad/fix-dynamic-config-flakes branch 3 times, most recently from 0fff71e to 7885079 Compare February 24, 2026 17:14
@vlad-scherbich vlad-scherbich marked this pull request as ready for review February 24, 2026 21:04
@vlad-scherbich vlad-scherbich requested review from a team as code owners February 24, 2026 21:04
@vlad-scherbich vlad-scherbich requested review from mtoffl01 and removed request for a team February 24, 2026 21:04
@vlad-scherbich
Contributor Author

vlad-scherbich commented Feb 24, 2026

@cbeauchesne, second attempt to generalize the fix for #6342


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 057ab1dc44

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread tests/parametric/test_dynamic_configuration.py Outdated
@vlad-scherbich vlad-scherbich marked this pull request as draft February 24, 2026 22:54
on RC update, so we skip the telemetry wait and use config_id filtering to avoid stale ACKs.
"""
rc_config = _create_rc_config(config_overrides)
resolved_config_id = config_id or str(hash(json.dumps(rc_config)))
Contributor

Does json.dumps provide deterministic ordering of fields if you don't pass sort_fields or whatever the flag is? I think it could mean the resolved_config_id is not deterministic either (and I'm not sure whether it matters in this context)

Contributor Author

@vlad-scherbich vlad-scherbich Feb 25, 2026

> Does json.dumps provide deterministic ordering of fields

No - you would need to do this: json.dumps(data, sort_keys=True)

This ... is an excellent observation! However, it is also OLD code, so I don't know if changing this behavior here is desired. If anything, it should be done separately - @cbeauchesne, what are your thoughts on this?

Contributor Author

Actually, I'm going to go with this fix right now, as it might help with unflaking all runtimes.
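
For reference, a payload-derived ID that is stable across dict orderings and across processes could look roughly like the sketch below. This is an illustration of the point raised in this thread, not code from the PR; the helper name `stable_config_id` and the 16-character truncation are assumptions. Two facts are worth keeping in mind: json.dumps serializes a dict in insertion order unless sort_keys=True is passed, and the built-in hash() of a str is salted per process unless PYTHONHASHSEED is set, so str(hash(...)) can differ between runs even for identical text.

```python
import hashlib
import json


def stable_config_id(rc_config: dict) -> str:
    """Derive an ID from the payload that does not depend on key order
    or on the per-process salt used by the built-in hash()."""
    # sort_keys=True makes the serialization independent of insertion order.
    canonical = json.dumps(rc_config, sort_keys=True, separators=(",", ":"))
    # A cryptographic digest is the same in every process, unlike hash(str).
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Note that any payload-derived ID, stable or not, repeats when the same payload is sent twice, which is the reuse that lets a stale ACK match a later update.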

Comment on lines +219 to +223
for _ in range(_MAX_RC_EVENT_WAIT_LOOPS):
    if test_agent.count_telemetry_events("app-client-configuration-change") > pre_count:
        break
    time.sleep(0.01)
else:
Contributor

for / else really is something I'll never be able to wrap my head around, but good job using it here 😅
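
For readers who find for / else hard to follow: the else branch runs only when the loop finishes without hitting break, which makes it a natural place for the timeout path of a polling loop. A minimal, self-contained sketch of the pattern follows; the helper `wait_until` and its parameters are hypothetical and not part of this PR.

```python
import time


def wait_until(predicate, max_loops=400, delay=0.01):
    """Poll `predicate` until it returns True; return the attempt number.

    Raises TimeoutError when the loop runs out of attempts.
    """
    for attempt in range(1, max_loops + 1):
        if predicate():
            break  # success: the else branch below is skipped
        time.sleep(delay)
    else:
        # Reached only when the loop was exhausted without a break.
        raise TimeoutError(f"condition not met after {max_loops} attempts")
    return attempt
```

The same behaviour can be written with a flag variable checked after the loop; for / else just removes the flag.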

Comment thread utils/docker_fixtures/_test_agent.py Outdated
Comment on lines +725 to +727
if message.get("request_type") == event_name:
    if message.get("application", {}).get("language_version") != "SIDECAR":
        count += 1
Contributor

Could we do guard-style here? like if not: continue instead of if: if: if: do()?

Contributor Author

I like this, another way could be to just chain all the ANDs ... taking a look.

Contributor Author

@KowalskiThomas I made it pretty :)
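
The guard-style shape suggested in this thread might look like the sketch below. The final version in the PR is not shown in this conversation, so this is only an illustration: the standalone function signature and the plain list of message dicts are assumptions, while the two conditions are taken from the snippet under review.

```python
def count_telemetry_events(messages, event_name):
    """Count telemetry messages of type `event_name`, ignoring SIDECAR ones."""
    count = 0
    for message in messages:
        # Guard 1: skip messages of a different request type.
        if message.get("request_type") != event_name:
            continue
        # Guard 2: skip messages reported by the sidecar.
        if message.get("application", {}).get("language_version") == "SIDECAR":
            continue
        count += 1
    return count
```

Each guard rejects one case early, so the counting line stays at a single indentation level.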

@vlad-scherbich vlad-scherbich force-pushed the vlad/fix-dynamic-config-flakes branch 3 times, most recently from 5cba844 to 5cf7369 Compare February 25, 2026 18:53
@vlad-scherbich vlad-scherbich changed the title [system_tests] Fix race conditions in 'set_and_wait_rc' by counting telemetry events [system_tests] Fix a race condition in 'set_and_wait_rc' by waiting on new RC update take effect Feb 25, 2026
@vlad-scherbich vlad-scherbich marked this pull request as ready for review February 25, 2026 21:18
@vlad-scherbich vlad-scherbich enabled auto-merge (squash) February 25, 2026 21:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8e025751e

Comment thread tests/parametric/test_dynamic_configuration.py Outdated
@vlad-scherbich vlad-scherbich changed the title [system_tests] Fix a race condition in 'set_and_wait_rc' by waiting on new RC update take effect [system_tests] Fix race conditions in set_and_wait_rc by using unique config IDs Feb 26, 2026
@vlad-scherbich vlad-scherbich force-pushed the vlad/fix-dynamic-config-flakes branch from f8e0257 to 236b7e9 Compare February 26, 2026 15:15
@vlad-scherbich vlad-scherbich merged commit 7932d7c into main Feb 26, 2026
446 checks passed
@vlad-scherbich vlad-scherbich deleted the vlad/fix-dynamic-config-flakes branch February 26, 2026 16:17

3 participants