[devtools] Setup Circleci by masci · Pull Request #6 · DataDog/datadog-agent

masci · 2016-06-15T14:42:43Z

Run testsuite on CircleCI, fixes #5

To see the coverage report, open Artifacts/coverage.html on the CircleCI build page.

olivielpeau · 2016-06-16T12:53:42Z

👍

# This is the 1st commit message:

[build] Fork rules_multitool to our own extension.

Import the core parts of rules_multitool but with appropriate modifications for our needs.

- Re-root things so this is a local extension, rather than a distinct module.
- remove the ability to use .netrc for authentication.
- remove the :cwd and :workspace_root variations.
  - There are equivalent workarounds using `--run_under="cd <path> &&"`
  - Using `bazel run` is usually not the best practice. Rules should use `$(location <tool>)`. If a tool is so common that people need to call it anywhere, at any time, then it should be in their path.
  - The code is left commented out. Ready to be enabled if a good case is presented.
- Remove the support for WORKSPACE.

Next steps:

- Change download structure so the executable name is the tool, rather than `"executable"`.
- Stop passing the attributes of each binary around as json blobs to be encoded and decoded.
- Make `:path` feature that will print the full execution path.

# This is the commit message #2:

Update bazel/multitool/extension.bzl

Yeah. That's a better name.

Co-authored-by: Joseph Gette <jgettepost@gmail.com>

# This is the commit message #3:

Update bazel/multitool/extension.bzl

Co-authored-by: Joseph Gette <jgettepost@gmail.com>

# This is the commit message #4:

Update bazel/multitool/private/templates.bzl

Co-authored-by: Joseph Gette <jgettepost@gmail.com>

# This is the commit message #5:

render

# This is the commit message #6:

iff

# This is the commit message #7:

commit

# This is the commit message #8:

lintyfresth

# This is the 1st commit message:

Keep our own copy of cacert.pem

- Replace omnibus fetch from upstream with that static copy.
- Include text in the BUILD file about how we check for new upstream versions.
- Add explanation of why we have this.

https://datadoghq.atlassian.net/browse/ABLD-169

# This is the commit message #2:

just use copy for windows

# This is the commit message #3:

qmarks

# This is the commit message #4:

omnibus is to blame

# This is the commit message #5:

maybe

# This is the commit message #6:

add back in default version

# This is the commit message #7:

drop livestream on debug

# This is the commit message #8:

You're kidding, :live_stream?

# This is the commit message #9:

srsly

# This is the commit message #10:

just copy on windows

# This is the commit message #11:

cwd with copy probably does not work

# This is the commit message #12:

just give up on pkg_install for certs

# This is the commit message #13:

drop unneeded pkg_install targets

# This is the commit message #14:

- use cwd to make it a little cleaner
- update cert to 2025-09-09

# This is the commit message #15:

comma

# This is the commit message #16:

Revert use of cwd on copy. It doesn't matter if it is ugly or not. We are going to delete it this quarter anyway.

### What does this PR do?

Skip the SSH session patcher and add a test to illustrate the current issue. In addition, add the possibility to check specific fields in the JSON returned for ssh_session events.

### Motivation

The retry mechanism could cause the agent to send no more than one event per minute if an SSH session was not properly resolved. Previously, the event was not sent and the agent would wait one minute before sending it with the `unknown` type. However, this `authtype` would never be resolved because the session was initialized before the agent started processing events. As a result, every subsequent SSH event would wait one minute for nothing, causing a significant delay in agent events and potentially blocking all the other events.

### Describe how you validated your changes

Added a test that illustrates the issue: `TestSSHUserSessionBlocking`.

With this change, the ssh_session event is now sent immediately, with `authtype` set to `unknown`.

Error without commenting out the patcher:

```
Error: Received unexpected error:
All attempts fail:
#1: not found
#2: not found
#3: not found
#4: not found
#5: not found
#6: not found
#7: not found
#8: not found
#9: not found
#10: not found
#11: not found
#12: not found
#13: not found
#14: not found
#15: not found
#16: not found
#17: not found
#18: not found
#19: not found
#20: not found
#21: not found
#22: not found
#23: not found
#24: not found
#25: not found
#26: not found
#27: not found
#28: not found
#29: not found
#30: not found
Test: TestSSHUserSessionBlocking/second_ssh_no_auth
```

Co-authored-by: theo.putegnat <theo.putegnat@datadoghq.com>


Summary of Changes

HIGH Priority Issues Fixed:

#1: Write lock held across network I/O (impl/delegatedauth.go:270)
- Refactored refreshAndGetAPIKey to release the lock before making network calls (authenticate)
- The lock is now only held briefly to check/update state, not during network I/O

#2: Context not propagated to signer.SignHTTP (aws.go:195)
- Updated generateAwsAuthData to accept a context parameter
- Changed signer.SignHTTP(context.Background(), ...) to signer.SignHTTP(ctx, ...)

#3: Context not propagated to getCredentials IMDS call (aws.go:119)
- Updated getCredentials to accept a context parameter
- Removed ctx := context.Background() and now uses the passed context for IMDS calls

MEDIUM Priority Issues Fixed:

#4: No response body size limit (api/delegated_auth.go:97)
- Added maxResponseBodySize = 1 * 1024 * 1024 constant (1 MB)
- Wrapped response body with io.LimitReader to prevent memory exhaustion

#5: No overall HTTP client timeout (api/delegated_auth.go:82)
- Added httpClientTimeout = 30 * time.Second constant
- Added Timeout: httpClientTimeout to the HTTP client

#6: config.Set called while holding write lock (impl/delegatedauth.go:341)
- Moved updateConfigWithAPIKey call outside the lock in startBackgroundRefresh
- Captured the API key while holding the lock, then released it before calling config.Set

#7: Blocking IMDS calls while holding write lock (impl/delegatedauth.go:127)
- Refactored initializeIfNeeded to perform cloud detection without holding locks
- IMDS calls now happen outside any lock, then state is updated with a brief write lock

#8: Regex fails silently for non-standard formats (api/delegated_auth.go:36)
- Added debug log when endpoint doesn't match known Datadog domain pattern
- Updated function documentation to clarify behavior

#9: Uncached IMDS credential fetch (aws.go:104)
- Added documentation explaining the trade-off (refresh interval is typically 60 minutes, so caching is not critical)

#10: Auth proof format undocumented (aws.go:98)
- Added detailed comment documenting the auth proof format: <base64-body>|<base64-headers>|<method>|<base64-url>

LOW Priority Issues Fixed:

#11: Unnecessarily exported types (aws.go)
- Changed SigningData to signingData (unexported)
- Changed AWSAuth.AwsRegion to AWSAuth.region (unexported)
- Updated all references in aws.go and aws_test.go

#12: Tests exercise copy of goroutine (impl/delegatedauth_test.go:19)
- Added documentation explaining why tests use a simplified goroutine pattern
- Clarified that integration tests cover the actual startBackgroundRefresh function

#13: Subsequent Config param silently ignored (def/delegatedauth.go:24)
- Updated documentation to clearly state that only the first Config is used
- Added warning log when a different Config is passed on subsequent calls

…list

Four persona panels reviewed the harness; 10 cross-cutting findings addressed here. Test count: 101 passing.

# Correctness (Scientist panel)

- Per-scenario F1 σ calibration. Added `ScenarioResult.f1_sigma` and `measure_sigma.py` helper. `scoring.score_against_baseline` now uses `3·σ_s` per scenario when measured, falls back to `CONFIG.tau_default`. Fixes the "scalar τ=0.05 is smaller than observed per-scenario σ" problem: we were gating on noise.
- Rolling "last shipped" reference for regression gates. `db.last_shipped_per_scenario[detector]` updates after every ship; the strict-regression and recall-floor gates compare against THIS (not the original baseline) so a candidate that regresses from the immediately-prior commit is blocked, even when accumulated prior gains would mask the regression vs the original baseline. Cumulative deltas vs baseline stay visible. Review prompt shows both ("vs baseline" + "vs last-ship") so the reviewer can distinguish "added marginal signal" from "inherited".
- Overfit telltale. New `overfit_check.py`: every N ships, Spearman rank-correlation between train-ranking and lockbox-ranking of all shipped candidates. Drift below `CONFIG.overfit_spearman_threshold` emits a `tripwire` coord-out. Lockbox scores never surface in any agent prompt: Python consumes, agents don't.

# Brittleness (panel #2)

- Inbox orphan recovery. `inbox.recover_orphan_reading` called in `driver.main` startup; archives any `inbox.md.reading` left behind by a prior mid-drain crash with an `orphan-recovery` tag so the original message isn't silently lost.
- Dirty-tree check moved BEFORE `sync_from_upstream`. Human-edited working-tree changes under WATCH_PATHS can no longer be silently auto-committed under a candidate id, or wiped by `merge --abort` on upstream conflict.
- PendingValidation persisted BEFORE ssh dispatch (new `dispatching` status). A crash between ssh return and db save can't lose track of the in-flight remote tmux session. Ssh failure flips status to `failed` with an audit trail. Startup reaps any orphaned `dispatching` records as `failed`.
- Upstream-conflict halts cleanly. New `UpstreamConflictHalt` exception propagates out of the --forever loop instead of re-trying the sync every iteration (which would conflict and emit another coord-out every ~10min, burning tokens + spamming the PR).

# Operability (SRE panel)

- Hard token-budget ceiling. `sdk.consume_token_count()` accumulates input+output tokens from `ResultMessage.usage`; driver rolls into `BudgetState.api_tokens_used`. When `CONFIG.api_token_ceiling` is exceeded, `BudgetCeilingHalt` halts the loop with a `budget_halt` coord-out. Default None (no ceiling); set it before multi-day runs or an Opus loop edge case can burn $1-5k/day uncontrolled.
- Liveness heartbeat in `metrics.md`. Shows last journal event, ISO timestamp, and a `⚠ LIVENESS` banner when stale > 30min. Median iter wall-time over the last 10 iterations. Token % of ceiling when set.

# Information flow (Maximalist panel)

- Personas collapsed. Replaced Skeptic + Conservative (both re-deriving booleans from numbers scoring.py already computed) with a single `hack_detector` that focuses on the judgment call rules CAN'T make: "does this look like a real improvement or a metric-hack?" Prior-experiment rationales are now in the prompt so the reviewer can see if this approach was already rejected on a different iteration.
- Implementation agent gets prior-work context. `sdk.implement_candidate` accepts `prior_experiments` (up to 5 most-recent same-family experiments with rationales). Driver populates via `_recent_same_family`. Agent can learn from past rejects instead of re-exploring dead ends.

# Eval-component restructure

- Per-ship eval-component dispatch REMOVED. It was "validate every config" dressed as "validate the component", and it constantly skipped when the workspace was busy.
- New policy: dispatch ONCE per new component, on plateau. When a family iterating on a component hits `CONFIG.stuck_threshold` consecutive non-improving experiments AND has ≥1 ship, dispatch eval-component for each target component not yet in `db.components_eval_dispatched`. Matches "certify this new component" semantics; eval-component is a lagging audit, not a per-config check.
- `components_eval_dispatched` starts EMPTY on `empty_db()`. Historical reports (in eval-results/ + manually imported into db.validations) validated BASELINE versions; after the coordinator modifies scanmw/scanwelch/bocpd their historical data is stale. Re-run on plateau.

# Tests

All 101 existing tests still pass. Updated `test_dispatch_fails_soft_when_workspace_unreachable` and `test_dispatch_ssh_failure_does_not_raise` to expect the persist-before-dispatch audit trail (PendingValidation recorded with status=failed on ssh failure, not silently dropped).

# What's still deferred

- Replicate-on-ship (3-5x rerun before committing; #5 in the panel synthesis). Adds 18-30min per ship but gives seed-sign-consistency before pushing. Straightforward add when desired.
- Pre-revert diff archive for rejected candidates (#6). Small, easy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
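The per-scenario σ calibration described under "Correctness" comes down to a threshold choice: use `3·σ_s` when a σ was measured for the scenario, otherwise fall back to the scalar default (τ=0.05 in the message). A minimal sketch of that rule follows, written in Go purely for illustration, since the harness itself appears to be Python. The names `gateThreshold` and `clearsNoise` are hypothetical, and the strict `>` comparison is an assumption, not something the commit message states.

```go
package main

import "fmt"

// gateThreshold returns the smallest F1 delta treated as signal for one
// scenario. sigma is nil when no per-scenario σ has been measured.
func gateThreshold(sigma *float64, tauDefault float64) float64 {
	if sigma != nil {
		return 3 * *sigma // 3·σ_s for this scenario
	}
	return tauDefault // scalar fallback
}

// clearsNoise reports whether delta exceeds the scenario's threshold.
// Assumption: the gate is a strict "greater than" comparison.
func clearsNoise(delta float64, sigma *float64, tauDefault float64) bool {
	return delta > gateThreshold(sigma, tauDefault)
}

func main() {
	const tauDefault = 0.05
	sigma := 0.04 // hypothetical measured σ, so the threshold is about 0.12

	// The same delta is judged differently once a noisy σ is known.
	fmt.Println(clearsNoise(0.08, &sigma, tauDefault))
	fmt.Println(clearsNoise(0.08, nil, tauDefault))
}
```

With a measured σ of 0.04, a delta of 0.08 sits inside the noise band even though it exceeds the scalar default. This is the "gating on noise" case the message describes.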

masci force-pushed the massi/circleci branch from 6141621 to 0040583 June 15, 2016 17:07

configure circleci for testing

ac2b19f

masci force-pushed the massi/circleci branch from 8414418 to ac2b19f June 16, 2016 12:31

masci changed the title from ~~[WIP] Setup Circleci~~ to [devtools] Setup Circleci on Jun 16, 2016

masci added the dev/tooling label Jun 16, 2016

masci merged commit dc265ea into master Jun 16, 2016

masci deleted the massi/circleci branch June 16, 2016 12:58

kamigerami mentioned this pull request Nov 22, 2018

There was an error querying the ntp host #1532

Closed

masci added a commit that referenced this pull request Apr 1, 2019

Merge pull request #6 from Alien1993/two-check-test

06d9c9d

Implemented Two getCheck and runCheck tests

hush-hush pushed a commit that referenced this pull request Apr 17, 2019

Merge pull request #6 from Alien1993/two-check-test

857b33b

Implemented Two getCheck and runCheck tests

safchain pushed a commit to safchain/datadog-agent that referenced this pull request May 13, 2020

Merge pull request DataDog#6 from safchain/backend-format

079784e

Backend format

safchain added a commit to safchain/datadog-agent that referenced this pull request Jun 4, 2020

Merge pull request DataDog#6 from DataDog/ci-unit-tests

22cc30d

Ci unit tests

yutingcaicyt mentioned this pull request Aug 30, 2022

[BUG] can't build datadog-agent image #13283

Closed

s-alad pushed a commit that referenced this pull request Nov 21, 2025

Merge pull request #6 from rapdev-io/README

059f32e

Start readme

matt-dz mentioned this pull request Feb 26, 2026

Implement Agent Safe Shell — POSIX commands as safe builtins #46945

Closed

4 tasks

ellataira mentioned this pull request Apr 25, 2026

Coordinator run log — observer AD iteration #49678

Draft

masci merged 1 commit into master from massi/circleci
