[system_tests] Fix race conditions in set_and_wait_rc by using unique config IDs#6371
[system_tests] Fix race conditions in set_and_wait_rc by using unique config IDs#6371vlad-scherbich merged 9 commits intomainfrom
Conversation
|
|
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🔗 Commit SHA: 236b7e9 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
1efc4da to
a8cf21a
Compare
0fff71e to
7885079
Compare
|
@cbeauchesne , second attempt to generalize the fix for #6342 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 057ab1dc44
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| on RC update, so we skip the telemetry wait and use config_id filtering to avoid stale ACKs. | ||
| """ | ||
| rc_config = _create_rc_config(config_overrides) | ||
| resolved_config_id = config_id or str(hash(json.dumps(rc_config))) |
There was a problem hiding this comment.
Does json.dumps provide deterministic ordering of fields if you don't pass sort_fields or whatever the flag is? I think it could mean the resolved_config_id is not deterministic either (and I'm not sure whether it matters in this context)
There was a problem hiding this comment.
Does json.dumps provide deterministic ordering of fields
No - you would need to do this: json.dumps(data, sort_keys=True)
This ... is an excellent observation! However, it also OLD code, so I don't know if changing this behavior here is desired. If anything, it should be done separately - @cbeauchesne for your thoughts on this?
There was a problem hiding this comment.
Actually, I'm going to go with this fix right now, as it might help with unflaking all runtimes.
| for _ in range(_MAX_RC_EVENT_WAIT_LOOPS): | ||
| if test_agent.count_telemetry_events("app-client-configuration-change") > pre_count: | ||
| break | ||
| time.sleep(0.01) | ||
| else: |
There was a problem hiding this comment.
for / else really is something I'll never be able to wrap my head around, but good job using it here 😅
| if message.get("request_type") == event_name: | ||
| if message.get("application", {}).get("language_version") != "SIDECAR": | ||
| count += 1 |
There was a problem hiding this comment.
Could we do guard-style here? like if not: continue instead of if: if: if: do()?
There was a problem hiding this comment.
I like this, another way could be to just chain all the ANDs ... taking a look.
5cba844 to
5cf7369
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8e025751e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
f8e0257 to
236b7e9
Compare
https://datadoghq.atlassian.net/browse/PROF-13796
Motivation
~25 of 32 system test failures on dd-trace-py main in 2026 stem from dynamic configuration tests. Root cause: a race in
set_and_wait_rc—it can match stale RC ACKs from a previous config update and return before the new config is actually applied.Tests fixed
All tests in
test_dynamic_configuration.pythat useset_and_wait_rc. Top flaky tests by frequency (dd-trace-py main, 2026):test_remote_sampling_rules_retentiontest_trace_sampling_rate_override_defaulttest_capability_tracing_sample_rulestest_trace_sampling_rules_override_envOthers:
test_apply_state,test_trace_sampling_rate_override_env,test_trace_sampling_rate_with_sampling_rules,test_log_injection_enabled,test_tracing_client_tracing_tags,test_trace_sampling_rules_override_rate,test_trace_sampling_rules_with_tags.Changes
_set_rc: Useuuid.uuid4()forconfig_idwhen not passed—avoids repeating IDs for identical payloads (hash would recreate the stale-ACK race). Return theconfig_idused.set_and_wait_rc: Whenconfig_idis passed (reuse case), clear the agent beforeset_rcto discard buffered RC requests so we only see responses from our update. Useconfig_idfiltering inwait_for_rc_apply_stateso we only match ACKs for the config we just set.wait_for_rc_apply_state(_test_agent.py): Add optionalconfig_idparameter; when set, only matchconfig_stateswhoseidequals it. Usestr()on both sides for robust int/str comparison.test_capability_tracing_sample_rules: Usewait_loops=_RC_WAIT_LOOPS(400, ~4s) so the library has enough time to send its first RC request.Reviewer checklist
tests/ormanifests/is modified ? I have the approval from R&P teambuild-XXX-imagelabel is present