Skip to content

Fix integration-test hang and stale proxy.conf directive#63

Open
pawelchcki wants to merge 3 commits intopc/devcontainerfrom
pc/test-hang-fix
Open

Fix integration-test hang and stale proxy.conf directive#63
pawelchcki wants to merge 3 commits intopc/devcontainerfrom
pc/test-hang-fix

Conversation

@pawelchcki
Copy link
Copy Markdown
Contributor

@pawelchcki pawelchcki commented Apr 21, 2026

Three fixes that together unwedge the integration suite and make it CI-visible:

  • AioHTTPServer / TestAgent cleanup: make them context managers, daemonize their background threads, bound the stop-time join, fix _app.cleanup()runner.cleanup(), and set a TCPSite shutdown_timeout. Without these, a failed assertion in test_http_proxy kept non-daemon threads alive and wedged the process in __futex_wait until the container was SIGKILL'd.
  • proxy.conf: drop the orphan <Location "/delegate"> block left behind by 470db65 (deprecate sampling delegation); it broke apachectl with "Invalid command" for every proxy test.
  • CI: add test-integration:$ARCH jobs running the same run-integration-tests.sh entrypoint as make test-integration, scoped to the matching devcontainer-image matrix slot. The existing test-rum job only covered requires_rum tests — the rest of the suite was never run in CI.

Copy link
Copy Markdown
Contributor Author

pawelchcki commented Apr 21, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pawelchcki pawelchcki force-pushed the pc/test-hang-fix branch 2 times, most recently from 6a34305 to 9f8d21d Compare April 21, 2026 09:55
@pawelchcki pawelchcki changed the base branch from pc/tri-state-merge to graphite-base/63 April 21, 2026 13:45
@pawelchcki pawelchcki changed the base branch from graphite-base/63 to pc/devcontainer April 21, 2026 13:45
@pawelchcki pawelchcki marked this pull request as ready for review April 21, 2026 17:18
@pawelchcki pawelchcki requested a review from a team as a code owner April 21, 2026 17:18
@pawelchcki pawelchcki requested review from Copilot and dubloom and removed request for a team April 21, 2026 17:18
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes integration-test shutdown hangs and stale Apache proxy configuration by improving teardown behavior, reducing noisy readiness probes, and ensuring CI runs the full integration suite.

Changes:

  • Add context-manager support and safer thread settings for the temporary aiohttp server used in test_http_proxy.
  • Fix TestAgent teardown to properly close aiohttp runner/sites, bound shutdown time, and avoid readiness HTTP probes that create traces.
  • Remove orphan /delegate Apache config and add CI jobs to run the full integration-test suite.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
test/integration-test/scenarios/test_proxy.py Add context-manager lifecycle to prevent test hangs from leaked background threads.
test/integration-test/scenarios/conf/proxy.conf Remove stale <Location "/delegate"> block that breaks apache config validation.
test/integration-test/conftest.py Use TCP readiness probe, fix aiohttp cleanup, bound shutdown, and improve thread behavior for the test agent.
.gitlab-ci.yml Add CI jobs to run the full integration-test suite using the same entrypoint as local devs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 43 to +46
def stop(self) -> None:
self.loop.call_soon_threadsafe(self._stop.set)
if self._thread is not None:
self._thread.join(timeout=3.0)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AioHTTPServer.internal_run() still calls app.cleanup() and never calls runner.cleanup() / site.stop(). That can leave the listening socket open and may prevent the thread from exiting cleanly (same teardown issue fixed for TestAgent in conftest.py). Consider switching to runner.cleanup() (and/or explicitly stopping the site) before closing the loop so the context-manager shutdown reliably releases the TCP listener.

Copilot uses AI. Check for mistakes.
def stop(self) -> None:
self.loop.call_soon_threadsafe(self._stop.set)
if self._thread is not None:
self._thread.join(timeout=3.0)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop() joins the server thread with a timeout but then silently proceeds even if the thread is still alive. That can mask teardown failures and leave a background aiohttp server bound to the port, making later tests flaky. After join(timeout=…), consider checking is_alive() and surfacing a clear failure (or at least logging a warning) so a stuck shutdown doesn’t go unnoticed.

Suggested change
self._thread.join(timeout=3.0)
self._thread.join(timeout=3.0)
if self._thread.is_alive():
raise RuntimeError(
f"AioHTTPServer failed to stop within 3.0 seconds for "
f"{self._host}:{self._port}; the server thread is still running."
)

Copilot uses AI. Check for mistakes.
Comment on lines 221 to +223
def stop(self) -> None:
self.loop.call_soon_threadsafe(self._stop.set)
self._thread.join()
self._thread.join(timeout=3.0)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestAgent.stop() now uses join(timeout=3.0) but doesn’t check whether the thread actually stopped. If the aiohttp loop is stuck (the scenario this change is trying to mitigate), the test run will continue while the agent thread is still running, which can lead to port-binding flakiness or a noisy shutdown. Consider checking self._thread.is_alive() after the timeout and surfacing a clear error/warning, and optionally guard call_soon_threadsafe() in case the loop has already been closed due to an internal_run error.

Copilot uses AI. Check for mistakes.
@datadog-prod-us1-4
Copy link
Copy Markdown

datadog-prod-us1-4 Bot commented Apr 21, 2026

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 50.93% (+0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: c5d42da | Docs | Datadog PR Page | Give us feedback!

test_http_proxy constructs an AioHTTPServer in a non-daemon background
thread. When an assertion failed before http_server.stop() ran, the
thread kept the interpreter alive after pytest_sessionfinish and the
whole process wedged in __futex_wait until the container was killed.

Make AioHTTPServer a context manager (the existing `# TODO: support
with syntax` asked for this) so cleanup runs on any exit path, set
daemon=True on its thread, and bound the stop-time join.

conftest's TestAgent had the same shape — same treatment — plus two
real teardown bugs:

- internal_run called _app.cleanup() instead of runner.cleanup(),
  leaving the TCP listener open and the event loop unable to exit.
- TCPSite had no shutdown_timeout, so aiohttp waited the default 60s
  on keep-alive sockets left over from SIGKILL'd Apache workers.

Server.load_configuration's readiness probe is also switched from an
HTTP GET to a bare TCP connect: probes shouldn't emit traces into the
per-test agent session.

Finally, remove the orphan <Location "/delegate"> block from proxy.conf
- commit 470db65 ("chore: deprecate sampling delegation") removed
DatadogDelegateSampling from the module but missed this config,
breaking test_http_proxy (which only uses the /http ProxyPass directly
above) with "Invalid command" from apachectl.
Adds a GitLab `test-integration:$ARCH` job per arch that exercises the
same entrypoint as `make test-integration` locally — the
`.devcontainer/run-integration-tests.sh` script, with MODULE_PATH
pointed at the build:$ARCH artifact so the cmake build is skipped.

This gives the stack three benefits:
- Local and CI test invocations run identical commands, so "works on
  my machine" diverging from "fails in CI" now has one fewer cause.
- The existing test-rum:$ARCH job only runs tests tagged requires_rum;
  the new job covers everything else (configuration, proxy, log
  injection, smoke, etc.) that was previously only exercised
  manually.
- Snapshot uploads gate on full-suite success rather than just the
  RUM subset, so a broken non-RUM scenario can't ship a green
  snapshot.

Only makes sense on top of the proxy.conf + teardown-hang fix from
this branch — on earlier stack branches the suite fails on
test_http_proxy and the teardown hang wedges the pipeline.
Match the narrowing done for build/test-rum in the devcontainer branch:
`test-integration:$ARCH` no longer waits for both devcontainer-image
matrix slots. It scopes the needs to its own arch slot via
`parallel: matrix:`, inheriting the arch-specific DEVCONTAINER_IMAGE
without blocking on the other arch's rebuild.
Copy link
Copy Markdown
Contributor

@xlamorlette-datadog xlamorlette-datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for fixing these tests and adding the missing tests in the CI, this is great!

Comment thread .gitlab-ci.yml
ARCH: arm64
tags: ["arch:arm64"]

test-integration:amd64:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be factorized:

test-integration:
  extends: .test-integration-template
  parallel:
    matrix:
      - ARCH: amd64
        TOOLCHAIN_ARCH: x86_64
      - ARCH: arm64
        TOOLCHAIN_ARCH: aarch64
  needs:
    - job: devcontainer-image
      parallel:
        matrix:
          - ARCH: $ARCH
            TOOLCHAIN_ARCH: $TOOLCHAIN_ARCH
      artifacts: true
    - job: "build:$ARCH"
  tags: ["arch:$ARCH"]

Note that this needs to update the needs in .upload-template (to be tested):

  • test-integration:amd64test-integration:[amd64]
  • test-integration:arm64test-integration:[arm64]

Similarly, we should try to factorize build:amd64 / build:arm64 and test-rum:amd64 / test-rum:arm64.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants