test: fix flaky MultipleVhdsOverAds by synchronizing on VHDS removal propagation#44556
Closed
test: fix flaky MultipleVhdsOverAds by synchronizing on VHDS removal propagation#44556
Conversation
…s to fix flaky test Before asserting 404 for a VHDS-removed virtual host, wait for the VHDS config_reload counter to advance to 2 (reflecting the removal). Without this wait, the test races against VHDS update propagation: if the removal hasn't been applied yet, the request is forwarded to the upstream (which nobody handles in the 404 path), causing the upstream to timeout and return 504 instead of 404. Fixes #44425 Agent-Logs-Url: https://github.com/envoyproxy/envoy/sessions/0e6e418a-1de6-4dd4-ba31-a34aaab55671 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Fix flaky test AdsIntegrationTest.MultipleVhdsOverAds
test: fix flaky MultipleVhdsOverAds by synchronizing on VHDS removal propagation
Apr 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
AdsIntegrationTest.MultipleVhdsOverAdsraces between sending a VHDS delta removal and asserting a 404: if the removal hasn't reached the worker's route table yet, the request is forwarded to the upstream which nobody reads in the 404 path, causing a timeout and a 504 instead of 404.Fix
After sending the
route_config_0/fooremoval, wait for the VHDSconfig_reloadcounter to reach 2 before sending the request that expects 404:sendDiscoveryResponse<envoy::config::route::v3::VirtualHost>( Config::TestTypeUrl::get().VirtualHost, {}, {}, {"route_config_0/foo"}, "1"); test_server_->waitForCounterGe("http.ads_test.rds.vhds.route_config_0.config_reload", 2); send_request_and_verify("http1", foo_request_headers, true);VhdsSubscription::onConfigUpdate()incrementsconfig_reloadafter updating the route table and immediately before callingroute_config_provider_->onConfigUpdate()to dispatch to workers. Reaching count 2 (1 = initial add batch, 2 = removal batch) guarantees the worker dispatch is queued; the time to establish the next client connection is sufficient for workers to drain it.The stat path
http.ads_test.rds.vhds.route_config_0.config_reloadfollows the existing pattern already used in the same file (e.g.http.ads_test.rds.route_config_0.config_reloadat line 1450).Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
dl.google.com/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/build/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/ -Dfile.encoding=ISO-8859-1 -Duser.country= -Duser.language= -Duser.variant= -Xmx3g -DBAZEL_TRACK_SOURCE_DIRECTORIES=1 -Djavax.net.ssl.trustStore=/tmp/custom-cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=30(dns block)/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/home/REDACTED/.cache/envoy-bazel/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/home/REDACTED/.(dns block)If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
Fix flaky test
AdsIntegrationTest.MultipleVhdsOverAdsintest/integration/ads_integration_test.cc.Tracking issue: #44425
Symptoms
The test is flaky (~1/1000 in deflake testing) and is a release blocker. Two observed failure modes:
Failure mode 1 — 504 instead of expected 404/200
From https://github.com/envoyproxy/envoy/actions/runs/24180584932/job/70572277286#step:19:2057
Failure mode 2 — Timed out waiting for new connection
The 504 response strongly suggests a race where the request is sent and routed before the VHDS-delivered virtual host / upstream cluster membership is fully applied, so the router either selects no healthy upstream or times out establishing an upstream connection. The "Timed out waiting for new connection" mode points to the same underlying race from a slightly different timing angle.
Investigation guidance
test/integration/ads_integration_test.ccaround theMultipleVhdsOverAdstest (line ~3133) and understand the VHDS delta sequencing. Identify what the test waits for before sending each HTTP request.vhds.*update_success,cluster_manager.cluster_added,http.config_test.rds.*.update_success, or usingtest_server_->waitForCounterGe(...)/waitForGaugeEq(...)).waitForNextUpstreamConnectiontiming out means either no upstream was routed to, or the upstream cluster warm-up hadn't completed. Confirm whether the test waits for the cluster to become ACTIVE before sending traffic.404path (request to a virtual host that was just removed via VHDS) has an analogous race where the removal hasn't been applied yet when the request is sent.Fix
Add the missing synchronization between VHDS/CDS/EDS updates and the HTTP request dispatch so that:
404for a removed virtual host, wait for the VHDS update counter to advance to reflect the removal.200, wait for the new virtual host and any associated cluster to be fully applied (e.g.,waitForCounterGeon the appropriatevhds.*.update_success/ cluster warming counters).waitForNextUpstreamConnectionis only called after the relevant cluster is guaranteed to exist and be warmed.Prefer deterministic stat-based waits over sleeps. Do not disable or skip the test.
Run the test under
--runs_per_test=1000(or equivalent) to validate the fix removes the flake. Keep the change minimal and scoped to the test file unless a production bug is uncovered, in which case clearly call it out in the PR description.Reference the issue in the PR body with
Fixes #44425.The following is the prior conversation context from the user's chat exploration (may be truncated):
User: this needs a fix - its kinda infrequent - only triggered 1/1000 in my deflake testing
[----------] 1 test from IpVersionsClientTypeDeltaWildcard/AdsIntegrationTest
[ RUN ] IpVersionsClientTypeDeltaWildcard/AdsIntegrationTest.MultipleVhdsOverAds/IPv6_GoogleGrpc_Delta
[2026-04-21 13:16:30.672][17548][critical][assert] [test/integration/http_integration.cc:573] assert failure: 0. Details: Timed out waiting for new connection.
Error: 4-21 13:16:30.672][17548][error][envoy_bug] [./source/common/common/assert.h:38] stacktrace for envoy bug
Error: 4-21 13:16:30.697][17548][error][envoy_bug] [./source/common/common/assert.h:43] #0 Envoy::HttpIntegrationTest::waitForNextUpstreamRequest() [0x555d2036e749]
Error: 4-21 13:16:30.704][17548][error][envoy_bug] [./source/common/common/assert.h:43] #1 Envoy::HttpIntegrationTest::sendRequestAndWaitForResponse() [0x555d2036d0e0]
Erro...
This pull request was created from Copilot chat.