Add delegatedauth component and AWS creds utility#46272
Add delegatedauth component and AWS creds utility#46272gh-worker-dd-mergequeue-cf854d[bot] merged 20 commits intomainfrom
Conversation
Introduces the comp/core/delegatedauth component for cloud-based API key retrieval using delegated authentication (AWS SigV4). Also adds pkg/util/aws/creds for AWS credential detection and handling. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
|
recheck |
# Conflicts: # go.mod
Module consolidation: - Consolidate 9 separate Go modules into a single comp/core/delegatedauth module - Keep api/cloudauth/aws as a separate module to isolate AWS SDK dependency - This enables future cloud providers (GCP, Azure) to be added as separate modules without bloating the core dependency graph Graceful noop behavior: - Initialize() now succeeds even when no supported cloud provider is detected - AddInstance() becomes a noop on unsupported platforms instead of erroring - Agent gracefully falls back to traditional API key method - Debug logs available for troubleshooting This change makes delegated auth easier to deploy in mixed environments where some agents run on AWS and others run on-prem or other clouds. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
5 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 8d7bcf5 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -1.93 | [-4.99, +1.12] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_metrics | memory utilization | +0.59 | [+0.43, +0.75] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.24 | [+0.01, +0.48] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.22 | [+0.02, +0.42] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.08 | [+0.01, +0.14] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.50, +0.51] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.08, +0.09] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.13, +0.13] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.00 | [-0.38, +0.37] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.01 | [-0.12, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.03 | [-0.44, +0.38] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.13 | [-0.18, -0.08] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.13 | [-0.18, -0.09] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.13 | [-0.17, -0.09] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | -0.16 | [-0.20, -0.12] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.22 | [-0.27, -0.18] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.25 | [-0.47, -0.03] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.26 | [-0.42, -0.10] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.36 | [-0.43, -0.28] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.82 | [-0.92, -0.72] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -1.17 | [-2.67, +0.32] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.89 | [-1.97, -1.80] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -1.93 | [-4.99, +1.12] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -3.14 | [-3.34, -2.94] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
buraizu
left a comment
There was a problem hiding this comment.
Approving with a minor rephrasing requested
…2b.yaml Co-authored-by: Bryce Eadie <bryce.eadie@datadoghq.com>
pgimalac
left a comment
There was a problem hiding this comment.
Not needed for this PR but we'll probably want to have an e2e test for this feature considering how complex it is
There was a problem hiding this comment.
This seems extremely similar to ec2/internal (even the package name and comment, leftover from copypaste i guess 😄), any particular reason to duplicate rather than reuse ?
There was a problem hiding this comment.
Yeah, the idea was to refactor ec2/internal at a later point in time to not have a hard dependency on config. I didn't want to do the refactor now though as it seemed too much for an already large PR.
There was a problem hiding this comment.
Actually since I guess this one won't be merged right away you could do that in parallel, merge the other PR, and pull main in this one 👀
And pkg/util/aws/creds/internal/helpers.go still imports the config so I'm not sure I get what the issue is
| return nil | ||
| } | ||
|
|
||
| // Status Provider implementation for noop |
There was a problem hiding this comment.
You might not need to provide a status provider for the noop, it will only be used for subcommands (which don't need status)
There was a problem hiding this comment.
I was following the pattern for secrets as that is the closest to the delegatedauth component. I also think I ran into some complications if I did not have a noop status provider.
| // domainURLRegexp matches and captures known Datadog domains with optional protocol and trailing characters | ||
| // Captures: protocol (optional), subdomain (ignored), regional prefix + base domain, trailing dot (optional) | ||
| // Examples: https://agent.datad0g.com., http://metrics.us1.datadoghq.com, agent.ddog-gov.com | ||
| var domainURLRegexp = regexp.MustCompile(`^(?:https?://)?[^./]+\.((?:[a-z]{2,}\d{1,2}\.)?)(?:(datadoghq|datad0g)\.(com|eu)|(ddog-gov\.com))(\.)?\/?$`) |
There was a problem hiding this comment.
There is already a similar regex in pkg/config/utils, could we reuse that one to avoid duplication ?
There was a problem hiding this comment.
Potentially modifying it, but at least have a single place with the logic for well known datadog sites
There was a problem hiding this comment.
Yeah, I was debating putting this function in there. Basically the problem I am trying to solve here is that if someone sets the dd_url to something like agent.datadoghq.com. Then GetMainEndpoint(cfg, "https://api.", "dd_url") will return agent.datadoghq.com When we really want to be hitting https://api.datadoghq.com regardless of what they have set in site.
This approach allows us to create a url to the API endpoint regardless of what they have set in site or dd_url. We have the same actual issue in the api/v1/validate endpoint. Except that endpoint does not support the staging domain name (see the following: https://github.com/DataDog/datadog-agent/blob/main/comp/forwarder/defaultforwarder/forwarder_health.go#L204 , https://github.com/DataDog/datadog-agent/blob/main/comp/forwarder/defaultforwarder/forwarder_health.go#L48)
I think generally we should refactor api/v1/validate and this token endpoint resolution to use the same methods but I think that should be for a followup PR.
This reverts commit 34256a3.
rahulkaukuntla
left a comment
There was a problem hiding this comment.
My team just owns the go.mod and go.sum files in pkg/config, so LGTM
| // First, try to get credentials from config | ||
| awsCredentials.AccessKeyID = cfg.GetString(awsAccessKeyIDName) | ||
| awsCredentials.SecretAccessKey = cfg.GetString(awsSecretAccessKeyName) | ||
| awsCredentials.Token = cfg.GetString(awsSessionTokenName) |
There was a problem hiding this comment.
Any setting that you get from the config should be declared with BindEnvAndSetDefault (not done for those keys)
Also it's best to define them in lowercase for the config
There was a problem hiding this comment.
I deferred to just get the creds from the Env or IMDS rather than anything in the config
| // First, try to get credentials from config | ||
| awsCredentials.AccessKeyID = cfg.GetString(awsAccessKeyIDName) | ||
| awsCredentials.SecretAccessKey = cfg.GetString(awsSecretAccessKeyName) | ||
| awsCredentials.Token = cfg.GetString(awsSessionTokenName) | ||
|
|
||
| // Then try environment variables | ||
| if awsCredentials.AccessKeyID == "" { | ||
| awsCredentials.AccessKeyID = os.Getenv(awsAccessKeyIDName) | ||
| } | ||
| if awsCredentials.SecretAccessKey == "" { | ||
| awsCredentials.SecretAccessKey = os.Getenv(awsSecretAccessKeyName) | ||
| } | ||
| if awsCredentials.Token == "" { | ||
| awsCredentials.Token = os.Getenv(awsSessionTokenName) | ||
| } |
There was a problem hiding this comment.
Subjective, but you could use cmp.Or to get the value (eg. cmp.Or(cfg.GetString(awsAccessKeyIDName), os.Getenv(awsAccessKeyIDName))
There was a problem hiding this comment.
I deferred to just get the creds from the Env or IMDS rather than anything in the config
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cced8ccc5e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if updated { | ||
| d.updateConfigWithAPIKey(instance, *lCreds) |
There was a problem hiding this comment.
Skip config writes after refresh context cancellation
When AddInstance replaces an existing entry, the old refresh loop is canceled, but the success path still writes the fetched key unconditionally. If the old goroutine was already in flight, it can return after cancellation and overwrite the new instance’s key for the same config target, causing stale credentials to win during reconfiguration.
Useful? React with 👍 / 👎.
| req, err := http.NewRequest(http.MethodPost, url, bytes.NewBuffer([]byte(""))) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| req.Header.Set(contentTypeHeader, applicationJSON) | ||
| req.Header.Set(authorizationHeader, fmt.Sprintf("%s %s", authorizationType, delegatedAuthProof)) | ||
| resp, err := client.Do(req) |
There was a problem hiding this comment.
Use cancellable request for intake-key exchange
The intake-key HTTP call is built with http.NewRequest and executed without any request context, so refresh cancellation cannot interrupt an in-flight exchange. In this component, refreshAndGetAPIKey holds the component mutex while calling authenticate, so a stuck request can block refreshes for other delegated-auth instances and delay key updates until lower-level transport timeouts.
Useful? React with 👍 / 👎.
Summary of Changes HIGH Priority Issues Fixed: #1: Write lock held across network I/O (impl/delegatedauth.go:270) - Refactored refreshAndGetAPIKey to release the lock before making network calls (authenticate) - The lock is now only held briefly to check/update state, not during network I/O #2: Context not propagated to signer.SignHTTP (aws.go:195) - Updated generateAwsAuthData to accept a context parameter - Changed signer.SignHTTP(context.Background(), ...) to signer.SignHTTP(ctx, ...) #3: Context not propagated to getCredentials IMDS call (aws.go:119) - Updated getCredentials to accept a context parameter - Removed ctx := context.Background() and now uses the passed context for IMDS calls MEDIUM Priority Issues Fixed: #4: No response body size limit (api/delegated_auth.go:97) - Added maxResponseBodySize = 1 * 1024 * 1024 constant (1 MB) - Wrapped response body with io.LimitReader to prevent memory exhaustion #5: No overall HTTP client timeout (api/delegated_auth.go:82) - Added httpClientTimeout = 30 * time.Second constant - Added Timeout: httpClientTimeout to the HTTP client #6: config.Set called while holding write lock (impl/delegatedauth.go:341) - Moved updateConfigWithAPIKey call outside the lock in startBackgroundRefresh - Captured the API key while holding the lock, then released it before calling config.Set #7: Blocking IMDS calls while holding write lock (impl/delegatedauth.go:127) - Refactored initializeIfNeeded to perform cloud detection without holding locks - IMDS calls now happen outside any lock, then state is updated with a brief write lock #8: Regex fails silently for non-standard formats (api/delegated_auth.go:36) - Added debug log when endpoint doesn't match known Datadog domain pattern - Updated function documentation to clarify behavior #9: Uncached IMDS credential fetch (aws.go:104) - Added documentation explaining the trade-off (refresh interval is typically 60 minutes, so caching is not critical) #10: Auth proof format undocumented (aws.go:98) - Added detailed comment documenting the auth proof format: <base64-body>|<base64-headers>|<method>|<base64-url> LOW Priority Issues Fixed: #11: Unnecessarily exported types (aws.go) - Changed SigningData to signingData (unexported) - Changed AWSAuth.AwsRegion to AWSAuth.region (unexported) - Updated all references in aws.go and aws_test.go #12: Tests exercise copy of goroutine (impl/delegatedauth_test.go:19) - Added documentation explaining why tests use a simplified goroutine pattern - Clarified that integration tests cover the actual startBackgroundRefresh function #13: Subsequent Config param silently ignored (def/delegatedauth.go:24) - Updated documentation to clearly state that only the first Config is used - Added warning log when a different Config is passed on subsequent calls
### What does this PR do? Provide secret component implementation in core bundle through a functional option pattern. ### Motivation The core bundle should be self sufficient, importing it and still having to provide some extra modules is surprising and annoying. The delegated auth token (being introduced in #46272) has the exact same issue and will require being included along with secrets, with this change it can just be imported directly within the core bundle like the secret component, requiring no change to any command. ### Describe how you validated your changes CI ### Additional Notes This means all binaries will now import both the "real" and the "noop" implementation of the secret component, potentially resulting in more dependencies, but it should already be the case anyway, the run commands use the real secret component while most other subcommands use the noop, so both are imported. If the size impact is too big, `WithSecrets` can be moved to a different package so that only binaries using it get its dependencies. Co-authored-by: pierre.gimalac <pierre.gimalac@datadoghq.com>
### What does this PR do? Provide secret component implementation in core bundle through a functional option pattern. ### Motivation The core bundle should be self sufficient, importing it and still having to provide some extra modules is surprising and annoying. The delegated auth token (being introduced in #46272) has the exact same issue and will require being included along with secrets, with this change it can just be imported directly within the core bundle like the secret component, requiring no change to any command. ### Describe how you validated your changes CI ### Additional Notes This means all binaries will now import both the "real" and the "noop" implementation of the secret component, potentially resulting in more dependencies, but it should already be the case anyway, the run commands use the real secret component while most other subcommands use the noop, so both are imported. If the size impact is too big, `WithSecrets` can be moved to a different package so that only binaries using it get its dependencies. Co-authored-by: pierre.gimalac <pierre.gimalac@datadoghq.com> 351ba46
…nfig (#46291) ### What does this PR do? This adds a new authentication method for the agent. This give the agent the ability to exchange a AWS Cloud Auth Proof for an API key which is automatically managed and rotated on behalf of the customer. This is essentially extending the https://docs.datadoghq.com/account_management/cloud_provider_authentication into the agent. This PR integrates the following PR into the config #46272 The flow is as so: - Agent get AWS credentials from the environment the agent is running in - Agent signs a request to AWS which will prove it's access to the credentials - Agent passed the signed request to Datadog service - Service validates the signed request against AWS and validates it against Datadog configuration - Service response with a Managed and automatically rotated API key - Agent propagates that API key throughout it's config - If the delegated auth flow fails it will fallback to using the API key provided in the yaml file. The allows customers to onboard with limited risk to their current flow. ### Configuration **Global provider config + per-key-prefix enablement:** ```yaml # Global delegated auth provider settings (used for all instances) delegated_auth: org_uuid: ${YOUR_ORG_UUID} # Optional: explicitly specify provider (auto-detects if omitted) provider: aws aws: region: us-east-1 # Optional: auto-detects from IMDS if omitted # Per-prefix enablement (logs can use a different org) logs_config: delegated_auth: org_uuid: ${LOGS_ORG_UUID} # Can be different org ``` ### Motivation This should enable customers to not need to manage the API used by the agent and instead use the AWS credentials in the AWS environment the agent is deploy to. ### Describe how you validated your changes Ran the agent locally to validate changes and deployed to multiple staging clusters `charcadet` and `entei`. I have deployed it to staging clusters with both the feature enabled and the feature disabled. I have marked this a something for RC testing as it is a large shift in how we can authenticate so it seems like it is more RC related. To test this see the above YAML settings to turn on this method of getting an API key. You will need to ensure the AWS account you are authenticating with is allowed by our configuration. Please reach out to allow us to enable any AWS ARNs to use cloud auth. ### Additional Notes Co-authored-by: wyn.bennett <wyn.bennett@datadoghq.com>
## Summary Add support for AWS delegated authentication, allowing Lambda functions to authenticate with Datadog using their IAM role instead of static API keys. This mirrors the implementation in the main Datadog agent ([PR #46272](DataDog/datadog-agent#46272)). **How it works:** 1. Lambda function's IAM role signs an STS `GetCallerIdentity` request 2. The signed request is sent to Datadog's `/api/v2/intake-key` endpoint as authentication proof 3. If the role is configured in Datadog's intake mapping, a managed API key is returned 4. Falls back to other API key methods (Secrets Manager, KMS, SSM, static) if delegated auth fails **Note**: This function is in preview, customers will currently need to request access to use it. ## Tests - [x] Added an integration test that uses this new auth flow, verifies that we have logs in our Datadog Serverless account for this new lambda. Note that the IAM role needs to be added to a Datadog account mapping so the IAM role we are using for this integ test is hardcoded to be the same regardless of who is running it. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
What does this PR do?
This adds a new authentication method for the agent. This give the agent the ability to exchange a AWS Cloud Auth Proof for an API key which is automatically managed and rotated on behalf of the customer. This is essentially extending the https://docs.datadoghq.com/account_management/cloud_provider_authentication into the agent.
Introduces the
comp/core/delegatedauthcomponent for cloud-based API key retrieval using delegated authentication (AWS SigV4).Also adds
pkg/util/aws/credsfor AWS credential detection and handling.This does not yet integrate the delegated auth with the config. This will be in a subsequent PR here
The flow is as so:
Motivation
This should enable customers to not need to manage the API used by the agent and instead use the AWS credentials in the AWS environment the agent is deploy to.
Describe how you validated your changes
Ran the agent locally to validate changes and deployed to multiple staging clusters.
Additional Notes