Skip to content

Add delegatedauth component and AWS creds utility#46272

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 20 commits intomainfrom
wyn.bennett/delegated_auth_comp
Feb 23, 2026
Merged

Add delegatedauth component and AWS creds utility#46272
gh-worker-dd-mergequeue-cf854d[bot] merged 20 commits intomainfrom
wyn.bennett/delegated_auth_comp

Conversation

@wynbennett
Copy link
Copy Markdown
Contributor

@wynbennett wynbennett commented Feb 11, 2026

What does this PR do?

This adds a new authentication method for the agent. This give the agent the ability to exchange a AWS Cloud Auth Proof for an API key which is automatically managed and rotated on behalf of the customer. This is essentially extending the https://docs.datadoghq.com/account_management/cloud_provider_authentication into the agent.

Introduces the comp/core/delegatedauth component for cloud-based API key retrieval using delegated authentication (AWS SigV4).

Also adds pkg/util/aws/creds for AWS credential detection and handling.

This does not yet integrate the delegated auth with the config. This will be in a subsequent PR here

The flow is as so:

  • Agent get AWS credentials from the environment the agent is running in
  • Agent signs a request to AWS which will prove it's access to the credentials
  • Agent passed the signed request to Datadog service
  • Service validates the signed request against AWS and validates it against Datadog configuration
  • Service response with a Managed and automatically rotated API key
  • Agent propagates that API key throughout it's config
  • If the delegated auth flow fails it will fallback to using the API key provided in the yaml file. The allows customers to onboard with limited risk to their current flow.

Motivation

This should enable customers to not need to manage the API used by the agent and instead use the AWS credentials in the AWS environment the agent is deploy to.

Describe how you validated your changes

Ran the agent locally to validate changes and deployed to multiple staging clusters.

Additional Notes

Introduces the comp/core/delegatedauth component for cloud-based
API key retrieval using delegated authentication (AWS SigV4).

Also adds pkg/util/aws/creds for AWS credential detection and handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Feb 11, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@wynbennett
Copy link
Copy Markdown
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@wynbennett
Copy link
Copy Markdown
Contributor Author

recheck

@github-actions github-actions Bot added component/system-probe long review PR is complex, plan time to review it labels Feb 11, 2026
@dd-octo-sts dd-octo-sts Bot added internal Identify a non-fork PR team/agent-runtimes labels Feb 11, 2026
@wynbennett wynbennett added the qa/rc-required Only for a PR that requires validation on the Release Candidate label Feb 11, 2026
Module consolidation:
- Consolidate 9 separate Go modules into a single comp/core/delegatedauth module
- Keep api/cloudauth/aws as a separate module to isolate AWS SDK dependency
- This enables future cloud providers (GCP, Azure) to be added as separate
  modules without bloating the core dependency graph

Graceful noop behavior:
- Initialize() now succeeds even when no supported cloud provider is detected
- AddInstance() becomes a noop on unsupported platforms instead of erroring
- Agent gracefully falls back to traditional API key method
- Debug logs available for troubleshooting

This change makes delegated auth easier to deploy in mixed environments
where some agents run on AWS and others run on-prem or other clouds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@agent-platform-auto-pr
Copy link
Copy Markdown
Contributor

agent-platform-auto-pr Bot commented Feb 11, 2026

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 01cb749
📊 Static Quality Gates Dashboard
🔗 SQG Job

Successful checks

Info

Quality gate Change Size (prev → curr → max)
agent_deb_amd64 +623.16 KiB (0.08% increase) 756.007 → 756.616 → 759.670
agent_deb_amd64_fips +25.69 KiB (0.00% increase) 715.097 → 715.122 → 720.180
agent_heroku_amd64 +3.71 KiB (0.00% increase) 323.754 → 323.758 → 329.530
agent_msi +640.37 KiB (0.10% increase) 622.075 → 622.700 → 1073.220
agent_rpm_amd64 +623.16 KiB (0.08% increase) 755.991 → 756.600 → 759.640
agent_rpm_amd64_fips +25.69 KiB (0.00% increase) 715.080 → 715.106 → 720.160
agent_rpm_arm64 +593.56 KiB (0.08% increase) 734.142 → 734.722 → 741.840
agent_rpm_arm64_fips +24.09 KiB (0.00% increase) 696.217 → 696.240 → 703.390
agent_suse_amd64 +623.16 KiB (0.08% increase) 755.991 → 756.600 → 759.640
agent_suse_amd64_fips +25.69 KiB (0.00% increase) 715.080 → 715.106 → 720.160
agent_suse_arm64 +593.56 KiB (0.08% increase) 734.142 → 734.722 → 741.840
agent_suse_arm64_fips +24.09 KiB (0.00% increase) 696.217 → 696.240 → 703.390
docker_agent_amd64 +623.16 KiB (0.07% increase) 817.029 → 817.637 → 821.990
docker_agent_arm64 +593.56 KiB (0.07% increase) 819.914 → 820.494 → 828.520
docker_agent_jmx_amd64 +623.16 KiB (0.06% increase) 1007.940 → 1008.548 → 1012.870
docker_agent_jmx_arm64 +593.56 KiB (0.06% increase) 999.608 → 1000.188 → 1008.120
docker_cluster_agent_amd64 +424.8 KiB (0.20% increase) 202.999 → 203.414 → 204.270
docker_cluster_agent_arm64 +388.81 KiB (0.17% increase) 217.454 → 217.834 → 218.000
dogstatsd_deb_amd64 +4.0 KiB (0.01% increase) 29.720 → 29.724 → 30.610
dogstatsd_rpm_amd64 +4.0 KiB (0.01% increase) 29.720 → 29.724 → 30.610
dogstatsd_suse_amd64 +4.0 KiB (0.01% increase) 29.720 → 29.724 → 30.610
iot_agent_deb_amd64 -8.66 KiB (0.02% reduction) 42.617 → 42.609 → 43.290
iot_agent_deb_arm64 -12.66 KiB (0.03% reduction) 39.723 → 39.710 → 40.920
iot_agent_deb_armhf -8.57 KiB (0.02% reduction) 40.447 → 40.439 → 41.030
iot_agent_rpm_amd64 -8.66 KiB (0.02% reduction) 42.618 → 42.609 → 43.290
iot_agent_suse_amd64 -8.66 KiB (0.02% reduction) 42.618 → 42.609 → 43.290
5 successful checks with minimal change (< 2 KiB)
Quality gate Current Size
docker_cws_instrumentation_amd64 7.135 MiB
docker_cws_instrumentation_arm64 6.689 MiB
docker_dogstatsd_amd64 38.500 MiB
docker_dogstatsd_arm64 36.812 MiB
dogstatsd_deb_arm64 27.881 MiB
On-wire sizes (compressed)
Quality gate Change Size (prev → curr → max)
agent_deb_amd64 +255.27 KiB (0.13% increase) 185.440 → 185.689 → 186.090
agent_deb_amd64_fips +12.53 KiB (0.01% increase) 176.297 → 176.309 → 180.330
agent_heroku_amd64 +5.13 KiB (0.01% increase) 87.104 → 87.109 → 88.440
agent_msi +112.0 KiB (0.07% increase) 149.215 → 149.324 → 154.470
agent_rpm_amd64 +250.06 KiB (0.13% increase) 187.340 → 187.585 → 189.170
agent_rpm_amd64_fips -11.91 KiB (0.01% reduction) 178.400 → 178.388 → 181.060
agent_rpm_arm64 +32.93 KiB (0.02% increase) 169.764 → 169.796 → 170.020
agent_rpm_arm64_fips -10.22 KiB (0.01% reduction) 162.486 → 162.476 → 164.130
agent_suse_amd64 +250.06 KiB (0.13% increase) 187.340 → 187.585 → 189.170
agent_suse_amd64_fips -11.91 KiB (0.01% reduction) 178.400 → 178.388 → 181.060
agent_suse_arm64 +32.93 KiB (0.02% increase) 169.764 → 169.796 → 170.020
agent_suse_arm64_fips -10.22 KiB (0.01% reduction) 162.486 → 162.476 → 164.130
docker_agent_amd64 +135.45 KiB (0.05% increase) 277.780 → 277.912 → 279.410
docker_agent_arm64 +138.21 KiB (0.05% increase) 265.053 → 265.188 → 267.960
docker_agent_jmx_amd64 +131.98 KiB (0.04% increase) 346.424 → 346.553 → 348.040
docker_agent_jmx_arm64 +154.5 KiB (0.05% increase) 329.680 → 329.830 → 332.560
docker_cluster_agent_amd64 +64.69 KiB (0.09% increase) 71.131 → 71.194 → 71.920
docker_cluster_agent_arm64 +60.39 KiB (0.09% increase) 66.775 → 66.834 → 67.220
docker_cws_instrumentation_amd64 neutral 2.995 MiB → 3.330
docker_cws_instrumentation_arm64 neutral 2.726 MiB → 3.090
docker_dogstatsd_amd64 neutral 14.901 MiB → 15.820
docker_dogstatsd_arm64 -2.28 KiB (0.02% reduction) 14.239 → 14.237 → 14.830
dogstatsd_deb_amd64 neutral 7.852 MiB → 8.790
dogstatsd_deb_arm64 neutral 6.741 MiB → 7.710
dogstatsd_rpm_amd64 neutral 7.865 MiB → 8.800
dogstatsd_suse_amd64 neutral 7.865 MiB → 8.800
iot_agent_deb_amd64 -5.37 KiB (0.05% reduction) 11.237 → 11.232 → 12.040
iot_agent_deb_arm64 -2.24 KiB (0.02% reduction) 9.602 → 9.600 → 10.450
iot_agent_deb_armhf neutral 9.801 MiB → 10.620
iot_agent_rpm_amd64 -4.75 KiB (0.04% reduction) 11.255 → 11.250 → 12.060
iot_agent_suse_amd64 -4.75 KiB (0.04% reduction) 11.255 → 11.250 → 12.060

@cit-pr-commenter-54b7da
Copy link
Copy Markdown

cit-pr-commenter-54b7da Bot commented Feb 11, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 82a802fd-8002-462c-9a3f-b30007ee4a63

Baseline: 8d7bcf5
Comparison: c44a7f2
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization -1.93 [-4.99, +1.12] 1 Logs

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
otlp_ingest_metrics memory utilization +0.59 [+0.43, +0.75] 1 Logs
ddot_metrics_sum_cumulativetodelta_exporter memory utilization +0.24 [+0.01, +0.48] 1 Logs
ddot_metrics_sum_delta memory utilization +0.22 [+0.02, +0.42] 1 Logs
ddot_logs memory utilization +0.08 [+0.01, +0.14] 1 Logs
file_to_blackhole_0ms_latency egress throughput +0.01 [-0.50, +0.51] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput +0.00 [-0.08, +0.09] 1 Logs
uds_dogstatsd_to_api ingress throughput -0.00 [-0.13, +0.13] 1 Logs
file_to_blackhole_500ms_latency egress throughput -0.00 [-0.38, +0.37] 1 Logs
uds_dogstatsd_to_api_v3 ingress throughput -0.01 [-0.12, +0.11] 1 Logs
file_to_blackhole_1000ms_latency egress throughput -0.03 [-0.44, +0.38] 1 Logs
file_tree memory utilization -0.13 [-0.18, -0.08] 1 Logs
file_to_blackhole_100ms_latency egress throughput -0.13 [-0.18, -0.09] 1 Logs
quality_gate_idle_all_features memory utilization -0.13 [-0.17, -0.09] 1 Logs bounds checks dashboard
quality_gate_idle memory utilization -0.16 [-0.20, -0.12] 1 Logs bounds checks dashboard
uds_dogstatsd_20mb_12k_contexts_20_senders memory utilization -0.22 [-0.27, -0.18] 1 Logs
ddot_metrics memory utilization -0.25 [-0.47, -0.03] 1 Logs
ddot_metrics_sum_cumulative memory utilization -0.26 [-0.42, -0.10] 1 Logs
docker_containers_memory memory utilization -0.36 [-0.43, -0.28] 1 Logs
otlp_ingest_logs memory utilization -0.82 [-0.92, -0.72] 1 Logs
quality_gate_logs % cpu utilization -1.17 [-2.67, +0.32] 1 Logs bounds checks dashboard
tcp_syslog_to_blackhole ingress throughput -1.89 [-1.97, -1.80] 1 Logs
docker_containers_cpu % cpu utilization -1.93 [-4.99, +1.12] 1 Logs
quality_gate_metrics_logs memory utilization -3.14 [-3.34, -2.94] 1 Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed links
docker_containers_cpu simple_check_run 10/10
docker_containers_memory memory_usage 10/10
docker_containers_memory simple_check_run 10/10
file_to_blackhole_0ms_latency lost_bytes 10/10
file_to_blackhole_0ms_latency memory_usage 10/10
file_to_blackhole_1000ms_latency lost_bytes 10/10
file_to_blackhole_1000ms_latency memory_usage 10/10
file_to_blackhole_100ms_latency lost_bytes 10/10
file_to_blackhole_100ms_latency memory_usage 10/10
file_to_blackhole_500ms_latency lost_bytes 10/10
file_to_blackhole_500ms_latency memory_usage 10/10
quality_gate_idle intake_connections 10/10 bounds checks dashboard
quality_gate_idle memory_usage 10/10 bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 bounds checks dashboard
quality_gate_logs intake_connections 10/10 bounds checks dashboard
quality_gate_logs lost_bytes 10/10 bounds checks dashboard
quality_gate_logs memory_usage 10/10 bounds checks dashboard
quality_gate_metrics_logs cpu_usage 10/10 bounds checks dashboard
quality_gate_metrics_logs intake_connections 10/10 bounds checks dashboard
quality_gate_metrics_logs lost_bytes 10/10 bounds checks dashboard
quality_gate_metrics_logs memory_usage 10/10 bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.

@wynbennett wynbennett marked this pull request as ready for review February 11, 2026 20:18
@wynbennett wynbennett requested review from a team as code owners February 11, 2026 20:18
@buraizu buraizu self-assigned this Feb 11, 2026
Copy link
Copy Markdown
Contributor

@buraizu buraizu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with a minor rephrasing requested

Comment thread releasenotes/notes/cloud-auth-to-get-an-API-key-87334c4f96c3222b.yaml Outdated
…2b.yaml

Co-authored-by: Bryce Eadie <bryce.eadie@datadoghq.com>
Copy link
Copy Markdown
Member

@pgimalac pgimalac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed for this PR but we'll probably want to have an e2e test for this feature considering how complex it is

Comment thread pkg/util/aws/creds/detection_test.go
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems extremely similar to ec2/internal (even the package name and comment, leftover from copypaste i guess 😄), any particular reason to duplicate rather than reuse ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the idea was to refactor ec2/internal at a later point in time to not have a hard dependency on config. I didn't want to do the refactor now though as it seemed too much for an already large PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fair enough

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually since I guess this one won't be merged right away you could do that in parallel, merge the other PR, and pull main in this one 👀

And pkg/util/aws/creds/internal/helpers.go still imports the config so I'm not sure I get what the issue is

Comment thread comp/core/delegatedauth/impl/delegatedauth.go Outdated
Comment thread comp/core/delegatedauth/impl/delegatedauth.go Outdated
Comment thread comp/core/delegatedauth/impl/delegatedauth.go Outdated
return nil
}

// Status Provider implementation for noop
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might not need to provide a status provider for the noop, it will only be used for subcommands (which don't need status)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the pattern for secrets as that is the closest to the delegatedauth component. I also think I ran into some complications if I did not have a noop status provider.

Comment thread comp/core/delegatedauth/def/delegatedauth.go
Comment on lines +33 to +36
// domainURLRegexp matches and captures known Datadog domains with optional protocol and trailing characters
// Captures: protocol (optional), subdomain (ignored), regional prefix + base domain, trailing dot (optional)
// Examples: https://agent.datad0g.com., http://metrics.us1.datadoghq.com, agent.ddog-gov.com
var domainURLRegexp = regexp.MustCompile(`^(?:https?://)?[^./]+\.((?:[a-z]{2,}\d{1,2}\.)?)(?:(datadoghq|datad0g)\.(com|eu)|(ddog-gov\.com))(\.)?\/?$`)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a similar regex in pkg/config/utils, could we reuse that one to avoid duplication ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially modifying it, but at least have a single place with the logic for well known datadog sites

Copy link
Copy Markdown
Contributor Author

@wynbennett wynbennett Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was debating putting this function in there. Basically the problem I am trying to solve here is that if someone sets the dd_url to something like agent.datadoghq.com. Then GetMainEndpoint(cfg, "https://api.", "dd_url") will return agent.datadoghq.com When we really want to be hitting https://api.datadoghq.com regardless of what they have set in site.

This approach allows us to create a url to the API endpoint regardless of what they have set in site or dd_url. We have the same actual issue in the api/v1/validate endpoint. Except that endpoint does not support the staging domain name (see the following: https://github.com/DataDog/datadog-agent/blob/main/comp/forwarder/defaultforwarder/forwarder_health.go#L204 , https://github.com/DataDog/datadog-agent/blob/main/comp/forwarder/defaultforwarder/forwarder_health.go#L48)

I think generally we should refactor api/v1/validate and this token endpoint resolution to use the same methods but I think that should be for a followup PR.

Copy link
Copy Markdown
Contributor

@rahulkaukuntla rahulkaukuntla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My team just owns the go.mod and go.sum files in pkg/config, so LGTM

@wynbennett wynbennett removed the request for review from louis-cqrl February 19, 2026 16:53
// First, try to get credentials from config
awsCredentials.AccessKeyID = cfg.GetString(awsAccessKeyIDName)
awsCredentials.SecretAccessKey = cfg.GetString(awsSecretAccessKeyName)
awsCredentials.Token = cfg.GetString(awsSessionTokenName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any setting that you get from the config should be declared with BindEnvAndSetDefault (not done for those keys)
Also it's best to define them in lowercase for the config

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deferred to just get the creds from the Env or IMDS rather than anything in the config

Comment on lines +100 to +114
// First, try to get credentials from config
awsCredentials.AccessKeyID = cfg.GetString(awsAccessKeyIDName)
awsCredentials.SecretAccessKey = cfg.GetString(awsSecretAccessKeyName)
awsCredentials.Token = cfg.GetString(awsSessionTokenName)

// Then try environment variables
if awsCredentials.AccessKeyID == "" {
awsCredentials.AccessKeyID = os.Getenv(awsAccessKeyIDName)
}
if awsCredentials.SecretAccessKey == "" {
awsCredentials.SecretAccessKey = os.Getenv(awsSecretAccessKeyName)
}
if awsCredentials.Token == "" {
awsCredentials.Token = os.Getenv(awsSessionTokenName)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subjective, but you could use cmp.Or to get the value (eg. cmp.Or(cfg.GetString(awsAccessKeyIDName), os.Getenv(awsAccessKeyIDName))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deferred to just get the creds from the Env or IMDS rather than anything in the config

Comment thread comp/core/delegatedauth/api/cloudauth/aws/aws.go Outdated
Comment thread comp/core/delegatedauth/def/delegatedauth.go Outdated
Comment thread comp/core/delegatedauth/api/cloudauth/config/config.go
Comment thread comp/core/delegatedauth/impl/delegatedauth.go Outdated
Comment thread comp/core/delegatedauth/impl/delegatedauth.go Outdated
Comment thread comp/core/delegatedauth/api/delegated_auth.go Outdated
Comment thread comp/core/delegatedauth/api/delegated_auth.go Outdated
Comment thread pkg/util/aws/creds/credentials.go
@pgimalac

This comment was marked as outdated.

@chatgpt-codex-connector

This comment was marked as outdated.

@pgimalac
Copy link
Copy Markdown
Member

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cced8ccc5e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +340 to +341
if updated {
d.updateConfigWithAPIKey(instance, *lCreds)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Skip config writes after refresh context cancellation

When AddInstance replaces an existing entry, the old refresh loop is canceled, but the success path still writes the fetched key unconditionally. If the old goroutine was already in flight, it can return after cancellation and overwrite the new instance’s key for the same config target, causing stale credentials to win during reconfiguration.

Useful? React with 👍 / 👎.

Comment on lines +85 to +91
req, err := http.NewRequest(http.MethodPost, url, bytes.NewBuffer([]byte("")))
if err != nil {
return nil, err
}
req.Header.Set(contentTypeHeader, applicationJSON)
req.Header.Set(authorizationHeader, fmt.Sprintf("%s %s", authorizationType, delegatedAuthProof))
resp, err := client.Do(req)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use cancellable request for intake-key exchange

The intake-key HTTP call is built with http.NewRequest and executed without any request context, so refresh cancellation cannot interrupt an in-flight exchange. In this component, refreshAndGetAPIKey holds the component mutex while calling authenticate, so a stuck request can block refreshes for other delegated-auth instances and delay key updates until lower-level transport timeouts.

Useful? React with 👍 / 👎.

wynbennett and others added 2 commits February 23, 2026 12:09
Summary of Changes

  HIGH Priority Issues Fixed:

  #1: Write lock held across network I/O (impl/delegatedauth.go:270)
  - Refactored refreshAndGetAPIKey to release the lock before making network calls (authenticate)
  - The lock is now only held briefly to check/update state, not during network I/O

  #2: Context not propagated to signer.SignHTTP (aws.go:195)
  - Updated generateAwsAuthData to accept a context parameter
  - Changed signer.SignHTTP(context.Background(), ...) to signer.SignHTTP(ctx, ...)

  #3: Context not propagated to getCredentials IMDS call (aws.go:119)
  - Updated getCredentials to accept a context parameter
  - Removed ctx := context.Background() and now uses the passed context for IMDS calls

  MEDIUM Priority Issues Fixed:

  #4: No response body size limit (api/delegated_auth.go:97)
  - Added maxResponseBodySize = 1 * 1024 * 1024 constant (1 MB)
  - Wrapped response body with io.LimitReader to prevent memory exhaustion

  #5: No overall HTTP client timeout (api/delegated_auth.go:82)
  - Added httpClientTimeout = 30 * time.Second constant
  - Added Timeout: httpClientTimeout to the HTTP client

  #6: config.Set called while holding write lock (impl/delegatedauth.go:341)
  - Moved updateConfigWithAPIKey call outside the lock in startBackgroundRefresh
  - Captured the API key while holding the lock, then released it before calling config.Set

  #7: Blocking IMDS calls while holding write lock (impl/delegatedauth.go:127)
  - Refactored initializeIfNeeded to perform cloud detection without holding locks
  - IMDS calls now happen outside any lock, then state is updated with a brief write lock

  #8: Regex fails silently for non-standard formats (api/delegated_auth.go:36)
  - Added debug log when endpoint doesn't match known Datadog domain pattern
  - Updated function documentation to clarify behavior

  #9: Uncached IMDS credential fetch (aws.go:104)
  - Added documentation explaining the trade-off (refresh interval is typically 60 minutes, so caching is not critical)

  #10: Auth proof format undocumented (aws.go:98)
  - Added detailed comment documenting the auth proof format: <base64-body>|<base64-headers>|<method>|<base64-url>

  LOW Priority Issues Fixed:

  #11: Unnecessarily exported types (aws.go)
  - Changed SigningData to signingData (unexported)
  - Changed AWSAuth.AwsRegion to AWSAuth.region (unexported)
  - Updated all references in aws.go and aws_test.go

  #12: Tests exercise copy of goroutine (impl/delegatedauth_test.go:19)
  - Added documentation explaining why tests use a simplified goroutine pattern
  - Clarified that integration tests cover the actual startBackgroundRefresh function

  #13: Subsequent Config param silently ignored (def/delegatedauth.go:24)
  - Updated documentation to clearly state that only the first Config is used
  - Added warning log when a different Config is passed on subsequent calls
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit 661587a into main Feb 23, 2026
633 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot deleted the wyn.bennett/delegated_auth_comp branch February 23, 2026 20:17
@github-actions github-actions Bot added this to the 7.78.0 milestone Feb 23, 2026
gh-worker-dd-mergequeue-cf854d Bot pushed a commit that referenced this pull request Feb 25, 2026
### What does this PR do?
Provide secret component implementation in core bundle through a functional option pattern.

### Motivation
The core bundle should be self sufficient, importing it and still having to provide some extra modules is surprising and annoying.

The delegated auth token (being introduced in #46272) has the exact same issue and will require being included along with secrets, with this change it can just be imported directly within the core bundle like the secret component, requiring no change to any command.

### Describe how you validated your changes
CI

### Additional Notes
This means all binaries will now import both the "real" and the "noop" implementation of the secret component, potentially resulting in more dependencies, but it should already be the case anyway, the run commands use the real secret component while most other subcommands use the noop, so both are imported.

If the size impact is too big, `WithSecrets` can be moved to a different package so that only binaries using it get its dependencies.

Co-authored-by: pierre.gimalac <pierre.gimalac@datadoghq.com>
github-actions Bot pushed a commit that referenced this pull request Feb 25, 2026
### What does this PR do?
Provide secret component implementation in core bundle through a functional option pattern.

### Motivation
The core bundle should be self sufficient, importing it and still having to provide some extra modules is surprising and annoying.

The delegated auth token (being introduced in #46272) has the exact same issue and will require being included along with secrets, with this change it can just be imported directly within the core bundle like the secret component, requiring no change to any command.

### Describe how you validated your changes
CI

### Additional Notes
This means all binaries will now import both the "real" and the "noop" implementation of the secret component, potentially resulting in more dependencies, but it should already be the case anyway, the run commands use the real secret component while most other subcommands use the noop, so both are imported.

If the size impact is too big, `WithSecrets` can be moved to a different package so that only binaries using it get its dependencies.

Co-authored-by: pierre.gimalac <pierre.gimalac@datadoghq.com> 351ba46
gh-worker-dd-mergequeue-cf854d Bot pushed a commit that referenced this pull request Mar 4, 2026
…nfig (#46291)

### What does this PR do?
This adds a new authentication method for the agent. This give the agent the ability to exchange a AWS Cloud Auth Proof for an API key which is automatically managed and rotated on behalf of the customer. This is essentially extending the https://docs.datadoghq.com/account_management/cloud_provider_authentication into the agent.

This PR integrates the following PR into the config #46272

The flow is as so:

- Agent get AWS credentials from the environment the agent is running in
- Agent signs a request to AWS which will prove it's access to the credentials
- Agent passed the signed request to Datadog service
- Service validates the signed request against AWS and validates it against Datadog configuration
- Service response with a Managed and automatically rotated API key
- Agent propagates that API key throughout it's config
- If the delegated auth flow fails it will fallback to using the API key provided in the yaml file. The allows customers to onboard with limited risk to their current flow.

### Configuration

  **Global provider config + per-key-prefix enablement:**

  ```yaml
  # Global delegated auth provider settings (used for all instances)
  delegated_auth:
    org_uuid: ${YOUR_ORG_UUID}
    # Optional: explicitly specify provider (auto-detects if omitted)
    provider: aws
    aws:
      region: us-east-1  # Optional: auto-detects from IMDS if omitted

  # Per-prefix enablement (logs can use a different org)
  logs_config:
    delegated_auth:
      org_uuid: ${LOGS_ORG_UUID}  # Can be different org
```

### Motivation
This should enable customers to not need to manage the API used by the agent and instead use the AWS credentials in the AWS environment the agent is deploy to.

### Describe how you validated your changes
Ran the agent locally to validate changes and deployed to multiple staging clusters `charcadet` and `entei`. I have deployed it to staging clusters with both the feature enabled and the feature disabled.

I have marked this a something for RC testing as it is a large shift in how we can authenticate so it seems like it is more RC related.

To test this see the above YAML settings to turn on this method of getting an API key. You will need to ensure the AWS account you are authenticating with is allowed by our configuration. Please reach out to allow us to enable any AWS ARNs to use cloud auth.

### Additional Notes


Co-authored-by: wyn.bennett <wyn.bennett@datadoghq.com>
jchrostek-dd added a commit to DataDog/datadog-lambda-extension that referenced this pull request Apr 6, 2026
## Summary

Add support for AWS delegated authentication, allowing Lambda functions
to authenticate with Datadog using their IAM role instead of static API
keys. This mirrors the implementation in the main Datadog agent ([PR
#46272](DataDog/datadog-agent#46272)).

**How it works:**
1. Lambda function's IAM role signs an STS `GetCallerIdentity` request
2. The signed request is sent to Datadog's `/api/v2/intake-key` endpoint
as authentication proof
3. If the role is configured in Datadog's intake mapping, a managed API
key is returned
4. Falls back to other API key methods (Secrets Manager, KMS, SSM,
static) if delegated auth fails

**Note**: This function is in preview, customers will currently need to
request access to use it.

## Tests
- [x] Added an integration test that uses this new auth flow, verifies
that we have logs in our Datadog Serverless account for this new lambda.
Note that the IAM role needs to be added to a Datadog account mapping so
the IAM role we are using for this integ test is hardcoded to be the
same regardless of who is running it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/system-probe internal Identify a non-fork PR long review PR is complex, plan time to review it qa/rc-required Only for a PR that requires validation on the Release Candidate team/agent-configuration team/agent-runtimes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants