Fix integration-test hang and stale proxy.conf directive#63
Fix integration-test hang and stale proxy.conf directive#63pawelchcki wants to merge 3 commits intopc/devcontainerfrom
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
e6962bf to
0b3660e
Compare
a38dc0a to
99e8f12
Compare
0b3660e to
8995192
Compare
6a34305 to
9f8d21d
Compare
8995192 to
311f10b
Compare
311f10b to
0058e96
Compare
9f8d21d to
2d7e81a
Compare
2d7e81a to
a3cf24d
Compare
0058e96 to
00ea1b6
Compare
a3cf24d to
6db78f7
Compare
00ea1b6 to
1b4094c
Compare
There was a problem hiding this comment.
Pull request overview
Fixes integration-test shutdown hangs and stale Apache proxy configuration by improving teardown behavior, reducing noisy readiness probes, and ensuring CI runs the full integration suite.
Changes:
- Add context-manager support and safer thread settings for the temporary aiohttp server used in
test_http_proxy. - Fix
TestAgentteardown to properly close aiohttp runner/sites, bound shutdown time, and avoid readiness HTTP probes that create traces. - Remove orphan
/delegateApache config and add CI jobs to run the full integration-test suite.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
test/integration-test/scenarios/test_proxy.py |
Add context-manager lifecycle to prevent test hangs from leaked background threads. |
test/integration-test/scenarios/conf/proxy.conf |
Remove stale <Location "/delegate"> block that breaks apache config validation. |
test/integration-test/conftest.py |
Use TCP readiness probe, fix aiohttp cleanup, bound shutdown, and improve thread behavior for the test agent. |
.gitlab-ci.yml |
Add CI jobs to run the full integration-test suite using the same entrypoint as local devs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def stop(self) -> None: | ||
| self.loop.call_soon_threadsafe(self._stop.set) | ||
| if self._thread is not None: | ||
| self._thread.join(timeout=3.0) |
There was a problem hiding this comment.
AioHTTPServer.internal_run() still calls app.cleanup() and never calls runner.cleanup() / site.stop(). That can leave the listening socket open and may prevent the thread from exiting cleanly (same teardown issue fixed for TestAgent in conftest.py). Consider switching to runner.cleanup() (and/or explicitly stopping the site) before closing the loop so the context-manager shutdown reliably releases the TCP listener.
| def stop(self) -> None: | ||
| self.loop.call_soon_threadsafe(self._stop.set) | ||
| if self._thread is not None: | ||
| self._thread.join(timeout=3.0) |
There was a problem hiding this comment.
stop() joins the server thread with a timeout but then silently proceeds even if the thread is still alive. That can mask teardown failures and leave a background aiohttp server bound to the port, making later tests flaky. After join(timeout=…), consider checking is_alive() and surfacing a clear failure (or at least logging a warning) so a stuck shutdown doesn’t go unnoticed.
| self._thread.join(timeout=3.0) | |
| self._thread.join(timeout=3.0) | |
| if self._thread.is_alive(): | |
| raise RuntimeError( | |
| f"AioHTTPServer failed to stop within 3.0 seconds for " | |
| f"{self._host}:{self._port}; the server thread is still running." | |
| ) |
| def stop(self) -> None: | ||
| self.loop.call_soon_threadsafe(self._stop.set) | ||
| self._thread.join() | ||
| self._thread.join(timeout=3.0) |
There was a problem hiding this comment.
TestAgent.stop() now uses join(timeout=3.0) but doesn’t check whether the thread actually stopped. If the aiohttp loop is stuck (the scenario this change is trying to mitigate), the test run will continue while the agent thread is still running, which can lead to port-binding flakiness or a noisy shutdown. Consider checking self._thread.is_alive() after the timeout and surfacing a clear error/warning, and optionally guard call_soon_threadsafe() in case the loop has already been closed due to an internal_run error.
1b4094c to
f2b235e
Compare
6db78f7 to
b20bf17
Compare
|
🎯 Code Coverage (details) 🔗 Commit SHA: c5d42da | Docs | Datadog PR Page | Give us feedback! |
b20bf17 to
bd7f943
Compare
f2b235e to
846ed8d
Compare
bd7f943 to
ad645a1
Compare
c67a9c1 to
e6f3db7
Compare
ad645a1 to
5ea1662
Compare
e6f3db7 to
c400619
Compare
5ea1662 to
b636ade
Compare
test_http_proxy constructs an AioHTTPServer in a non-daemon background thread. When an assertion failed before http_server.stop() ran, the thread kept the interpreter alive after pytest_sessionfinish and the whole process wedged in __futex_wait until the container was killed. Make AioHTTPServer a context manager (the existing `# TODO: support with syntax` asked for this) so cleanup runs on any exit path, set daemon=True on its thread, and bound the stop-time join. conftest's TestAgent had the same shape — same treatment — plus two real teardown bugs: - internal_run called _app.cleanup() instead of runner.cleanup(), leaving the TCP listener open and the event loop unable to exit. - TCPSite had no shutdown_timeout, so aiohttp waited the default 60s on keep-alive sockets left over from SIGKILL'd Apache workers. Server.load_configuration's readiness probe is also switched from an HTTP GET to a bare TCP connect: probes shouldn't emit traces into the per-test agent session. Finally, remove the orphan <Location "/delegate"> block from proxy.conf - commit 470db65 ("chore: deprecate sampling delegation") removed DatadogDelegateSampling from the module but missed this config, breaking test_http_proxy (which only uses the /http ProxyPass directly above) with "Invalid command" from apachectl.
Adds a GitLab `test-integration:$ARCH` job per arch that exercises the same entrypoint as `make test-integration` locally — the `.devcontainer/run-integration-tests.sh` script, with MODULE_PATH pointed at the build:$ARCH artifact so the cmake build is skipped. This gives the stack three benefits: - Local and CI test invocations run identical commands, so "works on my machine" diverging from "fails in CI" now has one fewer cause. - The existing test-rum:$ARCH job only runs tests tagged requires_rum; the new job covers everything else (configuration, proxy, log injection, smoke, etc.) that was previously only exercised manually. - Snapshot uploads gate on full-suite success rather than just the RUM subset, so a broken non-RUM scenario can't ship a green snapshot. Only makes sense on top of the proxy.conf + teardown-hang fix from this branch — on earlier stack branches the suite fails on test_http_proxy and the teardown hang wedges the pipeline.
Match the narrowing done for build/test-rum in the devcontainer branch: `test-integration:$ARCH` no longer waits for both devcontainer-image matrix slots. It scopes the needs to its own arch slot via `parallel: matrix:`, inheriting the arch-specific DEVCONTAINER_IMAGE without blocking on the other arch's rebuild.
b636ade to
c5d42da
Compare
c400619 to
1b72823
Compare
xlamorlette-datadog
left a comment
There was a problem hiding this comment.
Thanks a lot for fixing these tests and adding the missing tests in the CI, this is great!
| ARCH: arm64 | ||
| tags: ["arch:arm64"] | ||
|
|
||
| test-integration:amd64: |
There was a problem hiding this comment.
This can be factorized:
test-integration:
extends: .test-integration-template
parallel:
matrix:
- ARCH: amd64
TOOLCHAIN_ARCH: x86_64
- ARCH: arm64
TOOLCHAIN_ARCH: aarch64
needs:
- job: devcontainer-image
parallel:
matrix:
- ARCH: $ARCH
TOOLCHAIN_ARCH: $TOOLCHAIN_ARCH
artifacts: true
- job: "build:$ARCH"
tags: ["arch:$ARCH"]
Note that this needs to update the needs in .upload-template (to be tested):
test-integration:amd64→test-integration:[amd64]test-integration:arm64→test-integration:[arm64]
Similarly, we should try to factorize build:amd64 / build:arm64 and test-rum:amd64 / test-rum:arm64.

Three fixes that together unwedge the integration suite and make it CI-visible:
_app.cleanup()→runner.cleanup(), and set aTCPSiteshutdown_timeout. Without these, a failed assertion intest_http_proxykept non-daemon threads alive and wedged the process in__futex_waituntil the container was SIGKILL'd.<Location "/delegate">block left behind by470db65(deprecate sampling delegation); it broke apachectl with "Invalid command" for every proxy test.test-integration:$ARCHjobs running the samerun-integration-tests.shentrypoint asmake test-integration, scoped to the matching devcontainer-image matrix slot. The existingtest-rumjob only coveredrequires_rumtests — the rest of the suite was never run in CI.