
feat: Lazily resolve api key#717

Merged
lym953 merged 1 commit into main from yiming.luo/lazy-api-key-3 on Jul 17, 2025

Conversation

Contributor

@lym953 lym953 commented Jun 24, 2025

Motivation

From @astuyve:

today we basically block/await on that decrypt call before we can call /next
so if we can instead make that async and then resolve the future only when we need to flush data, that can be a big win for many customers.

https://datadoghq.atlassian.net/browse/SVLS-6995

Previous work

DataDog/serverless-components#21 and DataDog/serverless-components#24 created ApiKeyFactory, a utility that enables lazy API key resolution.

This PR

Updates the Bottlecap code to use ApiKeyFactory to lazily resolve the API key: instead of resolving it by querying Secrets Manager or KMS during the init phase, resolve it at flush time, when the API key is actually needed.
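The core idea can be sketched with std's `OnceLock`. The real `ApiKeyFactory` in DataDog/serverless-components is async and tokio-based; the names below mirror it, but the implementation is a simplified synchronous stand-in:

```rust
use std::sync::OnceLock;

/// Simplified, synchronous stand-in for the async ApiKeyFactory:
/// the (potentially slow) Secrets Manager / KMS lookup runs only on
/// the first `get_api_key()` call, not during the init phase.
struct ApiKeyFactory {
    key: OnceLock<String>,
    resolver: Box<dyn Fn() -> String + Send + Sync>,
}

impl ApiKeyFactory {
    fn new(resolver: impl Fn() -> String + Send + Sync + 'static) -> Self {
        Self { key: OnceLock::new(), resolver: Box::new(resolver) }
    }

    /// First call runs the resolver; subsequent calls return the cached key.
    fn get_api_key(&self) -> &str {
        self.key.get_or_init(|| (self.resolver)())
    }
}

fn main() {
    // Init phase: constructing the factory is cheap; no lookup happens yet.
    let factory = ApiKeyFactory::new(|| {
        // Imagine a Secrets Manager / KMS decrypt call here.
        "dummy-api-key".to_string()
    });
    // Flush time: the key is resolved only now, when it is actually needed.
    assert_eq!(factory.get_api_key(), "dummy-api-key");
}
```

This is why the `ready in:` time drops: the decrypt call moves off the init path entirely.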

Note

This PR changes the behavior when key resolution fails, i.e. when resolve_secrets() returns None.

  • Before: run extension_loop_idle(), which does not stop the runtime
  • After: panic, which will stop the runtime (if I understand correctly). That's not ideal, of course. Any better ideas?
    • It's harder now to run extension_loop_idle() because the API key resolution code is no longer in the main loop, but in the various consumers of the API key
    • Is there a way to gracefully shut down the extension without affecting the runtime?

Update: Added a PR to address resolution failure: #732
These two PRs should be merged together. Keeping them separate PRs just to make review easier.

Testing

Setup

  • Runtime: Go1 on Amazon Linux 2
  • Architecture: arm64
  • An app with empty implementation code

Result

Below is the logged `Datadog Next-Gen Extension ready in:` time.

  • Before: (prod extension arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension-ARM:82)

    • 88.6 ± 1.8 (ms)
  • After: (test extension arn:aws:lambda:us-east-1:425362996713:layer:Datadog-Bottlecap-Beta-ARM-yiming:2)

    • 35.4 ± 5.1 (ms)
    • (-60.0%)

Both use 5 samples.

Notes

https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998

@lym953 lym953 force-pushed the yiming.luo/separate-aws-creds branch from f749307 to 11c1c77 on June 27, 2025 17:35
lym953 added a commit that referenced this pull request Jul 2, 2025
# Problem
Right now `AwsConfig` has a lot of fields, including the ones related to
credential:
```
    pub aws_access_key_id: String,
    pub aws_secret_access_key: String,
    pub aws_session_token: String,
    pub aws_container_credentials_full_uri: String,
    pub aws_container_authorization_token: String,
```

The next PR #717
wants to lazily load the API key and the credentials. To do that, the
resolver function `resolve_secrets()` needs its `aws_config` param
changed from `&AwsConfig` to `Arc<RwLock<AwsConfig>>`. Because
`aws_config` is passed to many places, this change would involve
updating lots of functions, which is formidable.

# This PR
Separates these credential-related fields out from `AwsConfig` and
creates a new struct `AwsCredentials`

Thus, the next PR will only need to change the `aws_credentials` param
from `&AwsCredentials` to `Arc<RwLock<AwsCredentials>>`. Because
`aws_credentials` is passed to far fewer places, the next PR becomes
easier.

https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998
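A minimal sketch of the split described in this commit message (the credential field names come from the PR description; `region` is a hypothetical stand-in for the remaining non-credential fields):

```rust
use std::sync::{Arc, RwLock};

// Non-credential fields stay in AwsConfig (contents hypothetical).
#[allow(dead_code)]
struct AwsConfig {
    region: String,
}

// Credential fields move into their own struct, so only this small
// struct needs to become Arc<RwLock<_>> in the next PR.
#[allow(dead_code)]
struct AwsCredentials {
    aws_access_key_id: String,
    aws_secret_access_key: String,
    aws_session_token: String,
}

fn main() {
    let creds = Arc::new(RwLock::new(AwsCredentials {
        aws_access_key_id: String::new(),
        aws_secret_access_key: String::new(),
        aws_session_token: String::new(),
    }));
    // A resolver running later can fill the credentials in behind the
    // lock, while other holders of the Arc read them.
    creds.write().unwrap().aws_access_key_id = "example-key-id".to_string();
    assert_eq!(creds.read().unwrap().aws_access_key_id, "example-key-id");
}
```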
Base automatically changed from yiming.luo/separate-aws-creds to main July 2, 2025 21:11
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from fdd93a4 to 4ac16ad on July 2, 2025 21:31
@lym953 lym953 requested a review from Copilot July 3, 2025 15:52
Contributor

Copilot AI left a comment


Pull Request Overview

This PR integrates ApiKeyFactory across Bottlecap to defer DD-API-KEY resolution until flush/send time, reducing init latency. Key changes include:

  • Replace direct API key strings with Arc<ApiKeyFactory> in all flushers, agents, and tests
  • Refactor trace/stat flusher and log flusher to initialize endpoints and headers lazily via OnceCell
  • Update resolve_secrets to use an async RwLock for AWS credentials and adjust related helper signatures

Reviewed Changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/metrics_integration_test.rs | Switched FlusherConfig to use ApiKeyFactory |
| tests/logs_integration_test.rs | Switched LogsFlusher instantiation to use factory |
| src/traces/trace_processor.rs | Made process_traces async and use ApiKeyFactory |
| src/traces/trace_agent.rs | Replaced stored key with factory; await per request |
| src/traces/stats_flusher.rs | Swapped in factory and lazily build Endpoint |
| src/secrets/decrypt.rs | Converted credentials to Arc<RwLock<_>> and updated calls |
| src/proxy/mod.rs | Changed should_start_proxy to take Arc<AwsConfig> |
| src/proxy/interceptor.rs | Updated interceptor to use Arc<AwsConfig> |
| src/otlp/agent.rs | Updated OTLP agent to await process_traces |
| src/logs/flusher.rs | Introduced factory plus lazy HeaderMap caching |
| src/lifecycle/invocation/span_inferrer.rs | Updated tests to pass Arc<AwsConfig> |
| src/lifecycle/invocation/processor.rs | Refactored processor to use Arc<AwsConfig> |
| Cargo.toml | Bumped dogstatsd revision |

Comment on lines +40 to 42

```
endpoint: OnceCell<Endpoint>,
}
```


Copilot AI Jul 3, 2025


Caching the Endpoint with an initial API key may lead to stale keys if ApiKeyFactory rotates or refreshes credentials. Consider regenerating or invalidating the cell when the underlying key changes.

Suggested change

```
endpoint: OnceCell<Endpoint>,
}

impl ServerlessStatsFlusher {
    async fn construct_endpoint(&self) -> Endpoint {
        let api_key = self.api_key_factory.get_api_key().await.to_string();
        let stats_url = trace_stats_url(&self.config.site);
        Endpoint {
            url: hyper::Uri::from_str(&stats_url)
                .expect("can't make URI from stats url, exiting"),
            api_key: Some(api_key.clone().into()),
            timeout_ms: self.config.flush_timeout * 1_000,
            test_token: None,
        }
    }
}
```

```
config: Arc<config::Config>,
headers: HeaderMap,
api_key_factory: Arc<ApiKeyFactory>,
headers: OnceCell<HeaderMap>,
```

Copilot AI Jul 3, 2025


Storing headers including DD-API-KEY in a OnceCell will cache the first resolved key indefinitely. If the key can change at runtime, you might emit outdated headers; consider refreshing per flush or using a time-to-live.

Suggested change

```
headers: OnceCell<HeaderMap>,
headers: Arc<Mutex<HeaderMap>>,
```

Comment thread bottlecap/src/traces/trace_processor.rs Outdated

```
let received_payload =
    if let TracerPayloadCollection::V07(payload) = tracer_payload.get_payloads() {
    if let TracerPayloadCollection::V07(payload) = tracer_payload.await.get_payloads() {
```

Copilot AI Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Awaiting the future inline in the match expression can reduce readability. It may help to .await process_traces into a local send_data variable first, then call get_payloads() on it.

Copilot uses AI. Check for mistakes.
Comment thread bottlecap/src/bin/bottlecap/main.rs Outdated
Comment on lines +333 to +351
```
let aws_config = Arc::new(aws_config);
let aws_credentials = Arc::new(RwLock::new(aws_credentials));
let api_key_factory = {
    let config = Arc::clone(&config);
    let aws_config = Arc::clone(&aws_config);
    let aws_credentials = Arc::clone(&aws_credentials);

    Arc::new(ApiKeyFactory::new_from_resolver(Arc::new(move || {
        let config = Arc::clone(&config);
        let aws_config = Arc::clone(&aws_config);
        let aws_credentials = Arc::clone(&aws_credentials);

        Box::pin(async move {
            resolve_secrets(config, aws_config, aws_credentials)
                .await
                .unwrap_or_else(|| {
                    error!("Failed to resolve API key");
                    String::new()
                })
        })
    })))
};
```
Contributor Author


During INIT phase, instead of resolving API key, just initialize an API key factory.

```
#[allow(clippy::too_many_lines)]
async fn extension_loop_active(
aws_config: &AwsConfig,
aws_config: Arc<AwsConfig>,
```
Contributor Author


Changing it to Arc so it can be passed to ApiKeyFactory and shared across threads.

```
config: Arc<config::Config>,
headers: HeaderMap,
api_key_factory: Arc<ApiKeyFactory>,
headers: OnceCell<HeaderMap>,
```
Contributor Author


Lazily initialize the headers, which include the API key.
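The pattern can be sketched with std's `OnceLock` (the real flusher uses tokio's async `OnceCell` and http's `HeaderMap`; plain std types stand in here): the headers containing DD-API-KEY are built once, on first flush, rather than during init.

```rust
use std::sync::OnceLock;

// Stand-in for LogsFlusher: a Vec of pairs plays the role of HeaderMap.
struct LogsFlusher {
    headers: OnceLock<Vec<(String, String)>>,
}

impl LogsFlusher {
    // Built lazily on first use; later calls get the cached value.
    fn get_headers(&self, api_key: &str) -> &Vec<(String, String)> {
        self.headers.get_or_init(|| {
            vec![("DD-API-KEY".to_string(), api_key.to_string())]
        })
    }
}

fn main() {
    let flusher = LogsFlusher { headers: OnceLock::new() };
    // First flush resolves and caches the headers...
    assert_eq!(flusher.get_headers("abc")[0].1, "abc");
    // ...and, as Copilot's review notes, a later key is ignored:
    assert_eq!(flusher.get_headers("rotated")[0].1, "abc");
}
```

The second assertion is exactly the staleness concern raised in the review above: once initialized, the cell never observes a rotated key.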

```
aws_config: &AwsConfig,
aws_credentials: &mut AwsCredentials,
aws_config: Arc<AwsConfig>,
aws_credentials: Arc<RwLock<AwsCredentials>>,
```
Contributor Author


A core change: added RwLock

Contributor


ah and we need an RwLock here because the factory will lazily write/update this struct member?

Contributor Author


Yes. The factory fills in aws_access_key_id, aws_secret_access_key and aws_session_token if we are in snap start.

Comment thread bottlecap/src/secrets/decrypt.rs
```
config: Arc<config::Config>,
endpoint: Endpoint,
api_key_factory: Arc<ApiKeyFactory>,
endpoint: OnceCell<Endpoint>,
```
Contributor Author


Lazily resolve the endpoint, which contains the API key.

Comment thread bottlecap/src/bin/bottlecap/main.rs Outdated
```
Box::pin(async move {
    resolve_secrets(config, aws_config, aws_credentials)
        .await
        .expect("Failed to resolve API key")
```
Contributor Author


Any better way to handle this?

Contributor


Mmm ideally, if failing on resolve you'd enter the noop loop, what would happen if you added that here?

Contributor Author


It will panic at flush time and stop the runtime, so I wonder if there's a way for the extension to stop gracefully at that time without stopping the runtime.

@lym953 lym953 marked this pull request as ready for review July 3, 2025 17:09
@lym953 lym953 requested a review from a team as a code owner July 3, 2025 17:09
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from d623d67 to 3724235 on July 7, 2025 17:02
Contributor

@astuyve astuyve left a comment


Nice work Yiming! Let's get this onto self monitoring to test for a few days

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from cec3247 to 727d04f on July 9, 2025 15:47
Comment thread bottlecap/src/bin/bottlecap/main.rs Outdated
Comment on lines +335 to +351
```
let api_key_factory = {
    let config = Arc::clone(&config);
    let aws_config = Arc::clone(&aws_config);
    let aws_credentials = Arc::clone(&aws_credentials);

    Arc::new(ApiKeyFactory::new_from_resolver(Arc::new(move || {
        let config = Arc::clone(&config);
        let aws_config = Arc::clone(&aws_config);
        let aws_credentials = Arc::clone(&aws_credentials);

        Box::pin(async move {
            resolve_secrets(config, aws_config, aws_credentials)
                .await
                .expect("Failed to resolve API key")
        })
    })))
};
```
Contributor


Wondering if we could make this a method here or in the API key resolver. I really don't like the pattern of declaring the same variables three times just because we're nesting; given how much the main code has grown, it would be good to hide this somewhere and document it.

Contributor Author


Done! Extracted a function create_api_key_factory()

Contributor

@duncanista duncanista left a comment


Great PR @lym953 !

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 3be2928 to e820be0 on July 10, 2025 20:18
@lym953 lym953 changed the base branch from main to jordan.gonzalez/trace-agent/aggregation-for-proxy July 10, 2025 20:19
@duncanista duncanista force-pushed the jordan.gonzalez/trace-agent/aggregation-for-proxy branch from 4c1162d to 07aa341 on July 14, 2025 19:11
Base automatically changed from jordan.gonzalez/trace-agent/aggregation-for-proxy to main July 16, 2025 19:47
lym953 added a commit that referenced this pull request Jul 16, 2025
…745)

# Background
Right now `SendData` is passed around across channels.

# This PR

Instead of passing `SendData`, pass `SendDataBuilderInfo`, which bundles
`SendDataBuilder` and payload size. Just before flush, call
`SendDataBuilder.build()` to build `SendData`.

# Motivation
DataDog/libdatadog#1140 (comment)
It is suggested that the function `set_api_key()` shouldn't be added on
`SendData`, but on `SendDataBuilder`. Because we need to call
`set_api_key()` just before flush, we need to make sure the object
is a `SendDataBuilder` instead of `SendData` until flush time.

And because we need payload size in Trace Aggregator, and
`SendDataBuilder` doesn't expose this field, we need to pass it
explicitly along with `SendDataBuilder`.
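The shapes involved might look like this (field names hypothetical; the real `SendData` / `SendDataBuilder` live in libdatadog):

```rust
// Hypothetical shapes illustrating the flow described above.
struct SendData {
    api_key: String,
    payload: Vec<u8>,
}

struct SendDataBuilder {
    payload: Vec<u8>,
    api_key: Option<String>,
}

impl SendDataBuilder {
    // The API key is stamped in just before flush...
    fn set_api_key(mut self, key: &str) -> Self {
        self.api_key = Some(key.to_string());
        self
    }
    // ...and only then is the final SendData built.
    fn build(self) -> SendData {
        SendData {
            api_key: self.api_key.unwrap_or_default(),
            payload: self.payload,
        }
    }
}

// What travels across channels: the builder bundled with the payload
// size, which the aggregator needs but the builder doesn't expose.
struct SendDataBuilderInfo {
    builder: SendDataBuilder,
    size: usize,
}

fn main() {
    let info = SendDataBuilderInfo {
        builder: SendDataBuilder { payload: vec![1, 2, 3], api_key: None },
        size: 3,
    };
    assert_eq!(info.size, 3);
    let send_data = info.builder.set_api_key("resolved-key").build();
    assert_eq!(send_data.api_key, "resolved-key");
    assert_eq!(send_data.payload.len(), 3);
}
```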

# Next steps
Update #717 and #732 so that `get_api_key()` is called just before flush.

# Dependency
DataDog/libdatadog#1140
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 2ef2a9a to 37caca4 on July 16, 2025 20:51
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 37caca4 to ef63759 on July 16, 2025 21:46
@lym953 lym953 merged commit a8d05a1 into main Jul 17, 2025
46 checks passed
@lym953 lym953 deleted the yiming.luo/lazy-api-key-3 branch July 17, 2025 19:53
lym953 added a commit that referenced this pull request Jul 17, 2025
# Context
The previous PR
#717 defers API
key resolution from extension init stage to flush time. However, that PR
doesn't handle the failure case well.
- Before that PR, if resolution fails in init stage, the extension will
run an idle loop.
- After that PR, the extension will crash at flush time, which will kill
the runtime as well, which is not desired.

# What does this PR do?
1. For traces, defer key resolution from
`TraceProcessor.process_traces()` to `TraceFlusher.flush()`.
- (This should ideally be in the previous PR, but since that is already
approved, let me add this change in this new PR.)
2. If resolution fails at flush time, then make flush a no-op, so the
extension can keep running and consume events without crashing.
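The failure handling can be sketched as follows (hypothetical signature; the real flushers are async and send the payloads over HTTP):

```rust
// Sketch of the behavior: if the key could not be resolved at flush
// time, skip the flush instead of panicking, so the extension keeps
// running and consuming events.
fn flush(resolved_key: Option<&str>, pending: &mut Vec<String>) {
    let Some(_api_key) = resolved_key else {
        eprintln!("Failed to resolve API key; skipping flush");
        return; // no-op: keep the payloads, don't crash
    };
    // ...send `pending` with the key attached...
    pending.clear();
}

fn main() {
    let mut pending = vec!["span payload".to_string()];
    flush(None, &mut pending);        // resolution failed: nothing sent
    assert_eq!(pending.len(), 1);
    flush(Some("key"), &mut pending); // resolution succeeded: flushed
    assert!(pending.is_empty());
}
```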

# Dependencies
1. DataDog/serverless-components#25
2. DataDog/libdatadog#1140

# Manual Test

## Steps
1. Create a layer in sandbox
2. Apply the layer to a Lambda function
3. Set the env var `DD_API_KEY_SECRET_ARN` to an invalid value
4. Run the Lambda
5. Then set `DD_API_KEY_SECRET_ARN` to a valid value
6. Run the Lambda

## Result
1. The function was successful
<img width="319" alt="image"
src="https://github.com/user-attachments/assets/f8a5cb36-f678-4643-ba1c-85f41256ffa1"
/>

2. The extension printed some error logs
<img width="737" height="33" alt="image"
src="https://github.com/user-attachments/assets/22553d24-e1f5-4ee5-9a91-0d18e3e2f297"
/>

<img width="603" height="186" alt="image"
src="https://github.com/user-attachments/assets/e797f991-ecba-45f0-8f49-7b7b59dd9e7b"
/>

3. With valid secret ARN, the Lambda runs successfully and reports to
Datadog
<img width="678" height="150" alt="image"
src="https://github.com/user-attachments/assets/073089f8-1e9a-4728-b8d1-1db7aa85d031"
/>

<img width="533" height="96" alt="image"
src="https://github.com/user-attachments/assets/d5f2b81c-5e02-42bc-b3ef-85e611228fc6"
/>


# Automated Test

I didn't add any automated tests because, from what I see in the codebase,
existing tests are usually unit tests for short functions, not for the
long functions this PR touches. Please let me know if you think I
should add automated tests.
lym953 added a commit that referenced this pull request Sep 19, 2025
# Problem
When a Lambda (1) uses snap start, and (2) specifies Datadog API key
using `DD_API_KEY_SECRET_ARN`, the extension will encounter a deadlock.
For a `RwLock`, the extension first gets a read lock:

https://github.com/DataDog/datadog-lambda-extension/blob/daf633dd003447d78261e7c371838b5af21073a1/bottlecap/src/secrets/decrypt.rs#L45
then tries to get a write lock:

https://github.com/DataDog/datadog-lambda-extension/blob/daf633dd003447d78261e7c371838b5af21073a1/bottlecap/src/secrets/decrypt.rs#L65

which never finishes. This causes the function to time out.

This bug was introduced in
#717.

# This PR
Fix this bug by removing the `RwLock` usage. `AwsCredentials` is only
created and used once in `resolve_secrets()`, and `resolve_secrets()` is
only called once, so there's no need to protect this struct with a lock.
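The bug pattern is easy to reproduce in miniature. The extension uses tokio's async `RwLock`, where the second acquisition simply never completes; with std's `RwLock`, `try_write` makes the same conflict observable without hanging:

```rust
use std::sync::RwLock;

fn main() {
    let creds = RwLock::new(String::from("initial"));

    // Step 1 (decrypt.rs:45 in the linked code): take a read lock.
    let read_guard = creds.read().unwrap();

    // Step 2 (decrypt.rs:65): try to take a write lock on the same
    // RwLock while the read guard is still alive. A blocking write()
    // here would never return, which is the deadlock that timed out
    // the Lambda; try_write() shows the conflict without hanging.
    assert!(creds.try_write().is_err());

    drop(read_guard);
    // Once the read guard is gone, the write lock succeeds.
    assert!(creds.try_write().is_ok());
}
```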

# Testing
Tested on a Lambda with:
- Python 3.13 runtime
- snap start
- using `DD_API_KEY_SECRET_ARN`

Before:
- The function timed out.
- Data failed to be sent to Datadog.

After:
- The function finished without timeout.
- Data was sent to Datadog successfully.

# Notes
Jira: https://datadoghq.atlassian.net/browse/SLES-2482
duncanpharvey pushed a commit that referenced this pull request Mar 10, 2026
# Problem
Right now `AwsConfig` has a lot of fields, including the ones related to
credential:
```
    pub aws_access_key_id: String,
    pub aws_secret_access_key: String,
    pub aws_session_token: String,
    pub aws_container_credentials_full_uri: String,
    pub aws_container_authorization_token: String,
```

The next PR #717
wants to lazily load the API key and the credentials. To do that, the
resolver function `resolve_secrets()` needs its `aws_config` param
changed from `&AwsConfig` to `Arc<RwLock<AwsConfig>>`. Because
`aws_config` is passed to many places, this change would involve
updating lots of functions, which is formidable.

# This PR
Separates these credential-related fields out from `AwsConfig` and
creates a new struct `AwsCredentials`

Thus, the next PR will only need to change the `aws_credentials` param
from `&AwsCredentials` to `Arc<RwLock<AwsCredentials>>`. Because
`aws_credentials` is passed to far fewer places, the next PR becomes
easier.

https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998
duncanpharvey added a commit to DataDog/serverless-components that referenced this pull request Mar 11, 2026
* chore(bottlecap): make config a folder module (#242)

* remove `config.rs` file

* create `config/mod.rs`

* move to `config/flush_strategy.rs`

* move to `config/log_level.rs`

* update imports

* fmt

* feat(bottlecap): add logs processing rules (#243)

* add logs processing rules field

* add `regex` crate

* add `processing_rules.rs` config module

* use `processing_rule` module instead

* update logs `processor` to use compiled rules

* update unit test

* Svls 4825 support encrypted keys manual (#258)

* add plumbing for aws secret manager

* strip as much deps as possible

* fix test

* remove unused warning

* reorg runner for bottlecap

* fix overwriting of arch

* add full error to the panic

* avoid building the go agent all the time

* rename module

* speed up build

* add simple scripts to build and publish

* remove deleted call

* remove changes from common scripts

* resolve import conflicts

* wrong file pushed

* make sure permissions are right

* move secret parsing after log activation

* add some stat to build

* add manual req for secret (still broken)

* rebuild after conflict on cargo loc

* automate update and call

* change headers and fix signature

* fix typo and small refactor

* remove useless thread spawn

* small refactors on deploy scripts

* use access key always for signatures

* the secret has to be used to sign

* fix: missing newline in request

* use only manual decrypt

* add timed steps

* add scripts to force restarts

* fix launch script

* refactor decrypt

* cargo format and clippy

* fix clippy error

add formatting/clippy functions

---------

Co-authored-by: AJ Stuyvenberg <astuyve@gmail.com>

* add kms handling (#261)

* add kms handling

* fix return value

* fix test

* fix kms

* remove committed test file

* rename

* format

* fmt after fix

* fix conflicts

* await async stuff

* formatting

* bubble up error converting to sdt

* use box dyn for generic errors

* reformat

* address other comments

* remove old build file added with conflict

* Svls 4978 handle secrets error (#271)

* add kms handling

* fix return value

* fix test

* fix kms

* remove committed test file

* rename

* format

* fmt after fix

* fix conflicts

* await async stuff

* formatting

* bubble up error converting to sdt

* use box dyn for generic errors

* reformat

* address other comments

* remove old build file added with conflict

* do not pass around the whole config for just the secret

* fix scope and just bubble up errors

* reformat

* renaming

* without api key, just call next loop

* fix types and format

* fix folder path

* fix cd and returns

* resolve conflicts

* formatter

* chore(bottlecap): log failover reason (#292)

* print failover reason as json string

* fmt

* update key to be more verbose

* Add APM tracing support (#294)

* wip: tracing

* feat: tracing WIP

* feat: rename mini agent to trace agent

* feat: fmt

* feat: Fix formatting after rename

* fix: remove extra tokio task

* feat: allow tracing

* feat: working v5 traces

* feat: Update to use my branch of libdatadog so we have v5 support

* feat: Update w/ libdatadog to pass trace encoding version

* feat: update w/ merged libdatadog changes

* feat: Refactor trace agent, reduce code duplication, enum for trace version. Pass trace provider. Manual stats flushing. Custom create endpoint until we clean up that code in libdatadog.

* feat: Unify config, remove trace config. Tests pass

* feat: fmt

* feat: fmt

* clippy fixes

* parse time

* feat: clippy again

* feat: revert dockerfile

* feat: no-default-features

* feat: Remove utils, take only what we need

* feat: fmt moves the import

* feat: replace info with debug. Replace log with tracing lib

* feat: more debug

* feat: Remove call to trace utils

* feat: Allow appsec but in a disabled-only state until we add support for the runtime proxy (#296)

* feat: Allow appsec but in a disabled-only state until we add support for the runtime proxy

* feat: Log failover reason

* fix: serverless_appsec_enabled. Also log the reason

* feat: Require DD_EXTENSION_VERSION: next (#302)

* feat: Require DD_EXTENSION_VERSION: next

* feat: add tests, fix metric tests

* feat: revert metrics test byte changes

* feat: fmt

* feat: remove ref

* feat: honor enhanced metrics bool (#307)

* feat: honor enhanced metrics bool

* feat: add test

* feat: refactor to log instead of return result

* fix: clippy

* feat: warn by default (#316)

* chore(bottlecap): fallback on `datadog.yaml` usage (#326)

* fallback on `datadog.yaml` presence

* add comment

* fix(bottlecap): filter debug logs from external crates (#329)

* remove `tracing-log`

instead, use the `tracing-subscriber` `tracing-log` feature

* capitalize debugs

* remove unnecessary file

* update log formatter prefix

* update log filter

* fmt

* chore(bottlecap): switch flushing strategy to race (#318)

* feat: race flush

* refactor: periodic only when configured

* fmt

* when flushing strategy is default, set periodic flush tick to `1s`

* on `End`, never flush until the end of the invocation

* remove `tokio_unstable` feature for building

* remove debug comment

* remove `invocation_times` mod

* update `flush_control.rs`

* use `flush_control` in main

* allow `end,<ms>` strategy

allows to flush periodically over a given amount of seconds and at the end

* update `debug` comment for flushing

* simplify logic for flush strategy parsing

* remove log that could spam debug

* refactor code and add unit test

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>
Co-authored-by: alexgallotta <5581237+alexgallotta@users.noreply.github.com>

* remove log that might confuse customers (#333)

* Fix dogstatsd multiline (#335)

* test: add invalid string and multi line distro test with empty newline

* test: move unit test to appropriate package

* fix: do not error log for empty and new line strings

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>

* add env vars to be ignored (#337)

* feat: Open up more env vars which we don't rely on (#344)

* feat: Allow trace disabled plugins (#348)

* feat: Allow trace disabled plugins

* feat: trace debug

* feat: Allowlist additional env vars (#354)

* feat: Allowlist additional env vars

* fix: fmt

* feat: and repo url

* aj/allow apm replace tags array (#358)

* fix: allow objects to be ignored

* feat: specs

* fix(bottlecap): set explicit deny list and allow yaml usage (#363)

* set explicit deny list

also allow `datadog.yaml` usage

* add unit test for parsing rule from yaml

* remove `object_ignore.rs`

* remove import

* remove logging failover reason when user is not opt-in

* chore(bottlecap): fast failover (#371)

* failover fast

* typo

* failover on `/opt/datadog_wrapper` set

* aj/fix log level casing (#372)

* feat: serde's rename_all isn't working, use a custom deserializer to lowercase loglevels

* feat: default is warn

* feat: Allow repetition to clear up imports

* feat: rebase

* feat: failover on dd proxy (#391)

* feat: support HTTPS_PROXY (#381)

* feat: support DD_HTTP_PROXY and DD_HTTPS_PROXY

* fix: remove import

* fix: fmt

* feat: Revert fqdn changes to enable testing

* feat: Use let instead of repeated instantiation

* feat: Rip out proxy stuff we dont need but make sure we dont proxy the telemetry or runtime APIs with system proxies

* feat: remove debug

* fix: no debugs for hyper/h2

* fix: revert cargo changes

* feat: Pin libdatadog deps to v13.1

* fix: rebase with dogstatsd 13.1

* fix: use main for dsdrs

* fix: remove unwrap

* fix: fmt

* fix: licenses

* increase size boo

* fix: size ugh

* fix: install_default() in tests

* aj/honor both proxies in order (#410)

* feat: Honor priority order of DD_PROXY_HTTPS over HTTPS_PROXY

* feat: fmt

* fix: Prefer Ok over some + ok

* Feat: Use tags for proxy support in libdatadog

* fix: no proxy for tests

* fix: license

* all this for a comma

* accept `datadog_wrapper`

* Revert "accept `datadog_wrapper`"

This reverts commit 9560657582f2f22c8e68af5d0bb9d7d2b0765650.

* accept `datadog_wrapper` (#373)

* feat(bottlecap): create Inferred Spans baseline + infer API Gateway HTTP spans (#405)

* add `Trigger` trait for inferred spans

* add `ApiGatewayHttpEvent` trigger

* add `SpanInferrer`

* make `invocation::processor` to use `SpanInferrer`

* send `aws_config` to `invocation::processor`

* use incoming payload for `invocation::processor` for span inferring

* add `api_gateway_http_event.json` for testing

* add `api_gateway_proxy_event.json` for testing

* fix: Convert tag hashmap to sorted vector of tags

* fix: fmt

---------

Co-authored-by: AJ Stuyvenberg <astuyve@gmail.com>

* feat(bottlecap): Add Composite Trace Propagator (#413)

* add `trace_propagation_style.rs`

* add Trace Propagation to `config.rs`

also updated unit tests, as we have custom behavior, we should check only the fields we care about in the tests

* add `links` to `SpanContext`

* add composite propagator

also known as our internal http propagator, but in reality, http doesnt make any sense to me, its just a composite propagator which we used based on our configuration

* update `TextMapPropagator`s to comply with interface

also updated the naming

* fmt

* add unit testing for `config.rs`

* add `PartialEq` to `SpanContext`

* correct logic from `text_map_propagator.rs`

logic was wrong in some parts, this was discovered through unit tests

* add unit tests for `DatadogCompositePropagator`

also corrected some logic

* feat(bottlecap): add capture lambda payload (#454)

* add `tag_span_from_value`

* add `capture_lambda_payload` config

* add unit testing for `tag_span_from_value`

* update listener `end_invocation_handler`

parsing should not be handled here

* add capture lambda payload feature

also parse body properly, and handle `statusCode`

* feat(bottlecap): add Cold Start Span + Tags (#450)

* add some helper functions to `invocation::lifecycle` mod

* create cold start span on processor

* move `generate_span_id` to father module

* send `platform_init_start` data to processor

* send `PlatformInitStart` to main bus

* update cold start `parent_id`

* fix start time of cold start span

* enhanced metrics now have a `dynamic_value_tags` for tags which we have to calculate at points in time

* `AwsConfig` now has a `sandbox_init_time` value

* add `is_empty` to `ContextBuffer`

* calculate init tags on invoke

also add a method to reset processor invocation state

* restart init tags on set

* set tags properly for proactive init

* fix unit test

* remove debug line

* make sure `cold_start` tag is only set in one place

* feat(bottlecap): support service mapping and `peer.service` tag (#455)

* add some helper functions to `invocation::lifecycle` mod

* create cold start span on processor

* move `generate_span_id` to parent module

* send `platform_init_start` data to processor

* send `PlatformInitStart` to main bus

* update cold start `parent_id`

* fix start time of cold start span

* enhanced metrics now have a `dynamic_value_tags` for tags which we have to calculate at points in time

* `AwsConfig` now has a `sandbox_init_time` value

* add `is_empty` to `ContextBuffer`

* calculate init tags on invoke

also add a method to reset processor invocation state

* restart init tags on set

* set tags properly for proactive init

* fix unit test

* remove debug line

* make sure `cold_start` tag is only set in one place

* add service mapping config serializer

* add `service_mapping.rs`

* add `ServiceNameResolver` interface

for service mapping

* implement interface in every trigger

* send `service_mapping` lookup table to span enricher

* create `SpanInferrer` with `service_mapping` config

* fmt

* rename failover to fallback (#465)

* fix(bottlecap): fallback when otel set (#470)

* fallback on otel

* add unit test

* feat(bottlecap): fallback on opted out only (#473)

* fallback on opted out only

* log on opted out

* fix(bottlecap): fallback on yaml otel config (#474)

* fallback on opted out only

* fallback on yaml otel config

* switch `legacy` to `compatibility`

* feat: honor serverless_logs (#475)

* feat: honor serverless_logs

* fmt

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>

* feat: Flush timeouts (#480)

* fix version parsing for number (#492)

* fix: fallback on intake urls (#495)

* fallback on `dd_url`, and the APM and logs intake URLs

* fix env var for apm url

* grammar

* set dogstatsd timeout (#497)

* set dogstatsd timeout

* add todo for other edge case

* add comment on jitter. Likely not required for lambda

* fmt

* update license

* update sha for dogstatsd

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>

* fix: set right domain and arn by region on secrets manager (#511)

* check whether the region is in China and use the appropriated domain

* correct arn for lambda in chinese regions

* fix: typo in china arn

* fix: reuse function to detect right aws partition and support gov too

* nest and rearrange imports

* fix imports again

* fix: Honor noproxy and skip proxying if ddsite is in the noproxy list (#520)

* fix: Honor noproxy and skip proxying if ddsite is in the noproxy list

* feat: specs

* feat: Oneline check, add comment

* Support proxy yaml config (#523)

* fix: Honor noproxy and skip proxying if ddsite is in the noproxy list

* feat: specs

* feat: yaml proxy had a different format

* feat: Oneline check, add comment

* feat: Support nonstandard proxy config

* feat: specs

* fix: bad merge whoops

* feat: Support snapstart's vended credentials (#532)

* feat: Support snapstart's vended credentials

* feat: Add snapstart events

* fix: specs

* feat: Make config mutable, as it is consumed entirely by the secrets module.

* fix: needless borrow

* feat: add zstd and compress (#558)

* feat: add zstd and compress

* hack: skip clippy for a sec

* feat: Honor logs config settings.

* fix: dont set zstd header unless we compress

* fmt

* clippy

* fmt

* fix: ints

* licenses

* remove debug code

* wtf clippy and fmt, pick one

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>

* Svls 6036 respect timeouts (#537)

* log shipping times

* set flush timeout for traces

* remove retries

* fix conflicts

* address comments

* Fallback on gov regions (#550)

* Aj/support pci and custom endpoints (#585)

* feat: logs_config_logs_dd_url

* feat: apm pci endpoints

* feat: metrics

* feat: support metrics using dogstatsd methods

* fix: use the right var

* tests: use server url override

* feat: refactor into flusher method

* feat: clippy

* Aj/yaml apm replace tags (#602)

* feat: yaml APM replace tags rule parsing

* feat: Custom deserializer for replace tags. yaml -> JSON so we can rely on the same method because ReplaceRule is totally private

* remove aj

* feat: merge w/ libdatadog main

* feat: Parse http obfuscation config from yaml

* feat: licenses

* feat: parse env and service as strings or ints (#608)

* feat: parse env and service as strings or ints

* feat: add service test

* fmt

* Add DSM and Profiling endpoints (#622)

- **feat: Support DSM proxy endpoint**
- **feat: profiling support**
- **feat: add additional tags**

* chore(config): parse config only twice  (#651)

# What?

Removes `FallbackConfig` and `FallbackYamlConfig` in favor of the
existing configurations.

# How?

1. Using only the known places where we are going to fall back from the
available configs.
2. Moved environment variables and yaml config to their own files for
readability.

# Notes

- Added fallbacks for OTLP (in preparation for that PR, allowed some
fields to not fallback).

* fix: Parse DD_APM_REPLACE_TAGS env var (#656)

Fixes an issue where we didn't parse `DD_APM_REPLACE_TAGS` because the
yaml block includes an additional `config` word after APM, which is not
present in the env var.

As usual, env vars override config file settings

* feat: Optionally disable proc enhanced metrics (#663)

Fixes #648

For customers using very very fast/small lambda functions (usually just
rust), there can be a small 1-2ms increase in runtime duration when
collecting metrics like open file descriptors or tmp file usage.

We still enable these by default, but customers can now optionally
disable them

* fix(config): serialize booleans from anything (#657)

# What?

Deserializes any value in `0|1|true|TRUE|False|false` to its
boolean equivalent.

# How?

Using `serde-aux` crate to leverage the unit testing and ownership.

# Motivation

Some products at Datadog allow these values as they coalesce them –
[SVLS-6687](https://datadoghq.atlassian.net/browse/SVLS-6687)

[SVLS-6687]:
https://datadoghq.atlassian.net/browse/SVLS-6687?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
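
The coercion described above can be sketched as follows (a minimal, stdlib-only illustration; the real implementation relies on the `serde-aux` crate, and `coerce_bool` is a hypothetical name, not Bottlecap's API):

```rust
// Hypothetical sketch: map the accepted spellings (0|1|true|TRUE|False|false,
// case-insensitively) to a bool; anything else yields None so the caller can
// fall back to the default instead of failing config parsing.
fn coerce_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" => Some(true),
        "0" | "false" => Some(false),
        _ => None,
    }
}

fn main() {
    assert_eq!(coerce_bool("TRUE"), Some(true));
    assert_eq!(coerce_bool("0"), Some(false));
    assert_eq!(coerce_bool("yes"), None);
    println!("ok");
}
```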

* chore(config): create `aws` module (#659)

# What?

Refactors methods related to AWS config into its own module

# Motivation

Just cleaning and removing stuff from main
– [SVLS-6686](https://datadoghq.atlassian.net/browse/SVLS-6686)

[SVLS-6686]:
https://datadoghq.atlassian.net/browse/SVLS-6686?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

* feat: [SVLS-6242] bottlecap fips builds (#644)

Building bottlecap with fips mode.

This is entirely focused on removing `ring` (and other
non-FIPS-compliant dependencies) from our `fips`-featured builds.

* fix(config): remove `apm_ignore_resources` check in OTEL (#676)

# What?

Removes usage of `DD_APM_IGNORE_RESOURCES` in the OTEL span transform.

# Why?

1. The implementation was incorrect and shouldn't check for resources to
ignore in the transformation step.
2. It was not properly used in the `apm_config` for YAML files.

# Notes:

- Follow up PR to implement `APM_IGNORE_RESOURCES` properly in the Trace
Agent.

# More

Learn about ignoring resources:
https://docs.datadoghq.com/tracing/guide/ignoring_apm_resources/?tab=datadogyaml#ignoring-based-on-resources

`DD_APM_IGNORE_RESOURCES` is specified as:

```
A list of regular expressions can be provided to exclude certain traces based on their resource name.
All entries must be surrounded by double quotes and separated by commas.
```

A correct usage would be:

```env
DD_APM_IGNORE_RESOURCES="(GET|POST) /healthcheck,API::NotesController#index"
```

or in yaml
```yaml
apm_config:
  ignore_resources: ["(GET|POST) /healthcheck","API::NotesController#index"]
```

* feat(proxy): abstract lambda runtime api proxy (#669)

# What?

Abstracts the concept of the `proxy` from the Lambda Web Adapter
implementation.
This will unlock the usage of ASM.

# How?

Using `axum` crate, we create a new server proxy with specific routes
from the Lambda Runtime API which we are interested in proxying.

# Motivation

ASM and [SVLS-6760](https://datadoghq.atlassian.net/browse/SVLS-6760)



[SVLS-6760]:
https://datadoghq.atlassian.net/browse/SVLS-6760?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

* fix(config): fix otlp trace agent to start when right configuration is set (#680)

# What?

Ensures that OTLP agent is only enabled when the
`otlp_config_receiver_protocols_http_endpoint` is set, and when
`otlp_config_traces_enabled` is `true`

 # Motivation

#678 

# Notes

OTEL agent should only spin up when receiver protocols endpoint is set,
so this was a miss on my side.
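
The start condition described above can be sketched as a simple predicate (function and parameter names are illustrative; the field names mirror the config keys mentioned in the PR body):

```rust
// Hypothetical sketch: the OTLP agent should only start when the HTTP
// receiver endpoint is configured AND traces are enabled.
fn should_start_otlp_agent(http_endpoint: Option<&str>, traces_enabled: bool) -> bool {
    http_endpoint.is_some() && traces_enabled
}

fn main() {
    assert!(should_start_otlp_agent(Some("0.0.0.0:4318"), true));
    assert!(!should_start_otlp_agent(None, true));
    assert!(!should_start_otlp_agent(Some("0.0.0.0:4318"), false));
    println!("ok");
}
```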

* feat: continuous flushing strategy for high throughput functions (#684)

This is a heavy refactor and new feature.
- Introduces FlushDecision and separates it from FlushStrategy
- Cleans up FlushControl logic and methods

It also adds the ability to flush telemetry across multiple serial
lambda invocations. This is done using the `continuous` strategy.

This is a huge win for busy functions as seen in our test fleet, where
the p99/max drops precipitously, which also causes the average to
plummet. This also helps reduce the number of cold starts encountered
during scaleup events, which further reduces latency along with costs:

![image](https://github.com/user-attachments/assets/14851e22-327d-43b0-8246-5780cfbf6ef7)

Technical implementation:
We spawn the task and collect the flush handles, then in the two
periodic strategies we check if there were any errors or unresolved
futures in the next flush cycle. If so, we switch to the `periodic`
strategy to ensure flushing completes successfully.

We don't adapt to the continuous strategy unless the last 20 invocations
occurred within the `config.flush_timeout` value, which has been
increased by default. This is a naive implementation. A better one would
be to calculate the first derivative of the invocation periodicity. If
the rate is increasing, we can adapt to the continuous strategy. If the
rate slows, we should fall back to the periodic strategy.
<img width="807" alt="image"
src="https://github.com/user-attachments/assets/d3c25419-f1da-4774-975f-0e254047b9b7"
/>

The existing implementation is cautious in that we could definitely
adapt sooner but don't.


Todo: add a feature flag for continuous flushing?
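
The adaptation rule described above can be sketched as follows (a minimal illustration under the stated "last 20 invocations within the flush timeout" heuristic; `ConcreteStrategy` and `evaluate_strategy` are illustrative names, not Bottlecap's actual types):

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum ConcreteStrategy {
    Continuous,
    Periodic,
}

// Hypothetical sketch: stay on the continuous strategy only when the most
// recent 20 invocation intervals all fall within the flush timeout;
// otherwise fall back to periodic flushing.
fn evaluate_strategy(intervals: &[Duration], flush_timeout: Duration) -> ConcreteStrategy {
    const WINDOW: usize = 20;
    if intervals.len() >= WINDOW
        && intervals.iter().rev().take(WINDOW).all(|d| *d <= flush_timeout)
    {
        ConcreteStrategy::Continuous
    } else {
        ConcreteStrategy::Periodic
    }
}

fn main() {
    // A busy function: invocations every 100ms, well inside a 10s timeout.
    let busy = vec![Duration::from_millis(100); 25];
    assert_eq!(evaluate_strategy(&busy, Duration::from_secs(10)), ConcreteStrategy::Continuous);
    // A sparse function: 30s gaps exceed the timeout, so flush periodically.
    let sparse = vec![Duration::from_secs(30); 25];
    assert_eq!(evaluate_strategy(&sparse, Duration::from_secs(10)), ConcreteStrategy::Periodic);
    println!("ok");
}
```

A first-derivative version, as the PR body suggests, would compare successive intervals instead of a fixed window.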

* fix: bump flush_timeout default (#697)

A little goofy because we use this to determine when/how to move over to
continuous flushing, but the gist is that our invocation context tracks
the start time of each invocation. Because it's all local to a single
sandbox, this means that the time diff between invocations includes post
runtime duration, so it's very common to have 20 invocations greater
than 10s if there are even a couple of periodic/end flushes in there.

This is customizable with `DD_FLUSH_TIMEOUT`, so if people want to set it
to a very short timeout, they are able to.

* feat: Allow users to specify continuous strategy (#701)

https://datadoghq.atlassian.net/browse/SVLS-6994

* feat: Use http2 unless overridden or using a proxy (#706)

We rolled out HTTP/2 support for logs in v81, which seems to have broken
logs for some users relying on proxies which may not support http2.

This change introduces a new configuration option called `use_http1`.

1. If `DD_HTTP_PROTOCOL` is explicitly set to `http1`, we'll use it
2. If `DD_HTTP_PROTOCOL` is not set and the user is using a proxy, we'll
default to http1; setting `DD_HTTP_PROTOCOL` to anything other than
`http1` overrides this.

fixes #705
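
The selection rule described above can be sketched as follows (names like `choose_http_version` and `HttpVersion` are illustrative, not the extension's actual API):

```rust
#[derive(Debug, PartialEq)]
enum HttpVersion {
    Http1,
    Http2,
}

// Hypothetical sketch: explicit `http1` wins; any other explicit value keeps
// http2; with no explicit setting, a proxy forces http1 since the proxy may
// not support http2; otherwise default to http2.
fn choose_http_version(dd_http_protocol: Option<&str>, uses_proxy: bool) -> HttpVersion {
    match dd_http_protocol {
        Some("http1") => HttpVersion::Http1,
        Some(_) => HttpVersion::Http2,
        None if uses_proxy => HttpVersion::Http1,
        None => HttpVersion::Http2,
    }
}

fn main() {
    assert_eq!(choose_http_version(None, true), HttpVersion::Http1);
    assert_eq!(choose_http_version(None, false), HttpVersion::Http2);
    assert_eq!(choose_http_version(Some("http1"), false), HttpVersion::Http1);
    println!("ok");
}
```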

* Dual shipping metrics support (#704)

Adds support for dual shipping metrics to endpoints configured using the
`additional_endpoints` YAML or `DD_ADDITIONAL_ENDPOINTS` env var config.

For each configured endpoint/API key combination, we now create a
separate `MetricsFlusher` to flush the same batch of metrics to multiple
endpoints in parallel. Also, updates the retry logic to retry flushing
for the specific flusher that encountered an error.

Tested dual shipping metrics to 2 additional orgs/endpoints including
eu1.

Depends on dogstatsd changes:
https://github.com/DataDog/serverless-components/pull/20
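
The per-endpoint fan-out and retry described above can be sketched as follows (a minimal illustration; `flush_all` and the closure-based `send` are hypothetical names, not Bottlecap's `MetricsFlusher` API):

```rust
// Hypothetical sketch: the same batch is sent to every configured
// endpoint/API-key pair, and only the endpoints whose flush failed are
// collected so that just those flushers retry.
fn flush_all<F>(endpoints: &[&str], mut send: F) -> Vec<usize>
where
    F: FnMut(&str) -> bool, // true = flush succeeded
{
    endpoints
        .iter()
        .enumerate()
        .filter(|(_, endpoint)| !send(endpoint))
        .map(|(idx, _)| idx) // indices of flushers that need a retry
        .collect()
}

fn main() {
    let endpoints = ["https://api.datadoghq.com", "https://api.datadoghq.eu"];
    // Pretend only the EU endpoint failed; only it should be retried.
    let to_retry = flush_all(&endpoints, |e| !e.ends_with(".eu"));
    assert_eq!(to_retry, vec![1]);
    println!("ok");
}
```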

* chore: Separate AwsCredentials from AwsConfig (#716)

# Problem
Right now `AwsConfig` has a lot of fields, including the ones related to
credential:
```rust
    pub aws_access_key_id: String,
    pub aws_secret_access_key: String,
    pub aws_session_token: String,
    pub aws_container_credentials_full_uri: String,
    pub aws_container_authorization_token: String,
```

The next PR https://github.com/DataDog/datadog-lambda-extension/pull/717
wants to lazily load API key and the credentials. To do that, for the
resolver function `resolve_secrets()`, I need to change the param
`aws_config` from `&AwsConfig` to `Arc<RwLock<AwsConfig>>`. Because
`aws_config` is passed to many places, this change involves updating
lots of functions, which is formidable.

# This PR
Separates these credential-related fields out from `AwsConfig` and
creates a new struct `AwsCredentials`

Thus, the next PR will only need to change the param `aws_credentials`
from `&AwsCredentials` to `Arc<RwLock<AwsCredentials>>`. Because
`aws_credentials` is not used in lots of places, the next PR becomes
easier.

https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998
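
The struct split and the `Arc<RwLock<…>>` signature change described above can be sketched as follows (field names follow the snippet in the PR body; everything else, including `resolve_secrets`'s body, is illustrative):

```rust
use std::sync::{Arc, RwLock};

// Hypothetical sketch: credential-related fields pulled out of AwsConfig
// into their own struct, so only credential consumers need the shared handle.
struct AwsCredentials {
    aws_access_key_id: String,
    aws_secret_access_key: String,
    aws_session_token: String,
}

// After the split, only this signature changes to the shared, mutable handle.
fn resolve_secrets(credentials: Arc<RwLock<AwsCredentials>>) -> usize {
    let creds = credentials.read().expect("lock poisoned");
    creds.aws_access_key_id.len() // stand-in for the real Secrets Manager/KMS call
}

fn main() {
    let creds = Arc::new(RwLock::new(AwsCredentials {
        aws_access_key_id: "AKIA...".to_string(),
        aws_secret_access_key: String::new(),
        aws_session_token: String::new(),
    }));
    assert_eq!(resolve_secrets(Arc::clone(&creds)), 7);
    println!("ok");
}
```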

* chore(config): separate config from sources (#709)

# What?

Separates the configuration from sources, allowing it to be used in more
use cases.

# How?

Creates new default configuration and separates the environment
variables and YAML sources from the default.

# Why?

Make it easier to track changes in every source, as the field names
might differ from what is used at the configuration level.

# Notes

I expect to abstract this even more by providing it as a crate which can
have features; that way customers can use only the sources and
product-specific fields they need.

---------

Co-authored-by: Aleksandr Pasechnik <aleksandr.pasechnik@datadoghq.com>
Co-authored-by: Florentin Labelle <florentin.labelle@outlook.fr>

* Dual Shipping Logs Support (#718)

Adds support for dual shipping logs to endpoints configured using the
`logs_config` YAML or `DD_LOGS_CONFIG_ADDITIONAL_ENDPOINTS` env var
config.

Implemented a `LogsFlusher` as a wrapper around all the `Flusher`
instances that manages flushing to all configured endpoints.

Moved retry logic to `LogsFlusher`, as the retry request contains the
endpoint details and does not have to be tied to a particular flusher.

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>

* chore: upgrade rust version for toolchain to 1.84.1 (#743)

# This PR
1. In `rust-toolchain.toml`, upgrade Rust version from `1.81.0` to
`1.84.1`.
2. Fix/mute clippy errors caused by the upgrade
- some errors require non-trivial code changes, so I muted them for now
and added a TODO to fix them in separate PRs.

# Motivation
`libdatadog` now uses `1.84.1`
https://github.com/DataDog/libdatadog/blame/main/Cargo.toml#L62

To test changes on `libdatadog`, I need to change the Rust version in
`datadog-lambda-extension` to 1.84.1 as well.

Making this a separate PR:
1. so it's easier to test multiple PRs that depend on changes on
`libdatadog` in parallel after I merge this PR to main.
2. because this PR also involves lots of code changes needed to make
clippy happy

* feat: dual shipping APM support (#735)

Adds support for dual shipping traces to endpoints configured using the
`apm_config` YAML or `DD_APM_CONFIG_ADDITIONAL_ENDPOINTS` env var
config.

#### Additional Notes:
- Bumped libdatadog (and serverless-components) to include
https://github.com/DataDog/libdatadog/pull/1139
- Adds configuration option to set compression level for trace payloads

* chore: Add doc and rename function for flushing strategy (#740)

# Motivation

It took me quite some effort to understand flushing strategies. I want
to make it easier to understand for me and future developers.

# This PR
Tries to make flushing strategy code more readable:
1. Add/move comments
2. Create an enum `ConcreteFlushStrategy`, which doesn't contain
`Default` because it is required to be resolved to a concrete strategy
3. Rename `should_adapt` to `evaluate_concrete_strategy()`

# To reviewers
There are still a few things I don't understand, which are marked with
`TODO`. Explanations appreciated! Also, correct me if any of the comments
I added are wrong.

* chore: upgrade to edition 2024 and fix all linter warnings (#754)

Also updates CI to run `clippy` on `--all-targets` so that linter errors
aren't ignored on side targets such as tests.

* fix(apm): Enhance Synthetic Span Service Representation (#751)


### What does this PR do?

Rollout of span naming changes to align serverless product with tracer
to create streamlined Service Representation for Serverless

Key Changes:

- Change service name to match instance name for all managed services
(aws.lambda -> lambda name, etc) (breaking)
- Opt out via `DD_TRACE_AWS_SERVICE_REPRESENTATION_ENABLED`

- Add `span.kind:server` on synthetic spans made via span-inferrer, cold
start and lambda invocation spans

- Remove `_dd.base_service` tags on synthetic spans to avoid
unintentional service override

### Motivation


Improve Service Map for Serverless. This allows for synthetic spans to
have their own service on the map which connects with the inferred spans
from the tracer side.

* feat: port of Serverless AAP from Go to Rust (#755)

# What?

Ports the Serverless App & API Protection feature (AAP, also known as
Serverless AppSec) from the Go extension to Rust.

This is using https://github.com/DataDog/libddwaf-rust to provide
bindings to the in-app WAF.

This provides enhanced support for API Protection (notably, response
schema collection) compared to the Go version.

Tradeoff is that XML request and response security processing is not
currently supported in this version (it was in Go, but likely seldom
used).

This introduces a `bottlecap::appsec::processor::Processor` that is
integrated in the `bottlecap::proxy::Interceptor` (for request &
response acquisition) as well as in the
`bottlecap::trace_processor::TraceProcessor` (to decorate the
`aws.lambda` span with security data).

# Why?

We plan on decommissioning the Go version of the agent and a tracer-side
version of the Serverless AAP feature will not be available across all
supported language runtimes before several weeks/months.

Also [SVLS-5286](https://datadoghq.atlassian.net/browse/SVLS-5286)

# Notes


[SVLS-5286]:
https://datadoghq.atlassian.net/browse/SVLS-5286?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>

* feat: No longer launch Go-based agent for compatibility/OTLP/AAP config (#788)

https://datadoghq.atlassian.net/browse/SVLS-7398

- As part of the coming release, the bottlecap agent no longer launches the
Go-based agent when compatibility/AAP/OTLP features are active
- Emit the same metric when detecting any of the above configurations
- Update corresponding unit tests

Manifests:
- [Test lambda
function](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/ltn1-fullinstrument-bn-cold-python310-lambda?code=&subtab=envVars&tab=testing)
with
[logs](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fltn1-fullinstrument-bn-cold-python310-lambda/log-events/2025$252F08$252F21$252F$255B$2524LATEST$255Df3788d359677452dad162488ff15456f$3FfilterPattern$3Dotel)
showing compatibility/AAP/OTLP are enabled
<img width="2260" height="454" alt="image"
src="https://github.com/user-attachments/assets/5dfd4954-5191-4390-83f5-a8eb3bffb9d3"
/>

-
[Logging](https://app.datadoghq.com/logs/livetail?query=functionname%3Altn1-fullinstrument-bn-cold-python310-lambda%20Metric&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice&fromUser=true&messageDisplay=inline&refresh_mode=paused&storage=driveline&stream_sort=desc&viz=stream&from_ts=1755787655569&to_ts=1755787689060&live=false)
<img width="1058" height="911" alt="image"
src="https://github.com/user-attachments/assets/629f75d1-e115-4478-afac-ad16d9369fa7"
/>

-
[Metric](https://app.datadoghq.com/screen/integration/aws_lambda_enhanced_metrics?fromUser=false&fullscreen_end_ts=1755788220000&fullscreen_paused=true&fullscreen_refresh_mode=paused&fullscreen_section=overview&fullscreen_start_ts=1755787200000&fullscreen_widget=2&graph-explorer__tile_def=N4IgbglgXiBcIBcD2AHANhAzgkAaEAxgK7ZIC2A%2BhgHYDWmcA2gLr4BOApgI5EfYOxGoTphRJqmDhQBmSNmQCGOeJgIK0CtnhA8ObCHyagAJkoUVMSImwIc4IMhwT6CDfNQWP7utgE8AjNo%2BvvaYRGSwpggKxkgA5gB0kmxgemh8mAkcAB4IHBIQ4gnSChBoSKlswAAkCgDumBQKBARW1Ai41ZxxhdSd0kTUBAi9AL4ABABGvuPAA0Mj4h6OowkKja2DCAAUAJTaCnFx3UpyoeEgo6wgsvJEGgJCN3Jk9wrevH6BV-iWbMqgTbtOAAJgADPg5MY9BRpkZEL4UHZ4LdXhptBBqNDsnAISAoXp7NDVJdmKMfiBsL50nBgOSgA&refresh_mode=sliding&from_ts=1755783890661&to_ts=1755787490661&live=true)
<img width="1227" height="1196" alt="image"
src="https://github.com/user-attachments/assets/2922eb54-9853-4512-a902-dfa97916b643"
/>

* Revert "feat: No longer launch Go-based agent for compatibility/OTLP/AAP config (#788)"

This reverts commit 0f5984571eb842e5ce1cbadbec0f92d73befcd08.

* Ignoring Unwanted Resources in APM (#794)

## Task
https://datadoghq.atlassian.net/browse/SVLS-6846

## Overview
We want to allow users to set filter tags, which drop traces whose root
spans match specified span tags. Specifically, users can set
`DD_APM_FILTER_TAGS_REQUIRE` or `DD_APM_FILTER_TAGS_REJECT`.

More info
[here](https://docs.datadoghq.com/tracing/guide/ignoring_apm_resources/?tab=datadogyaml#trace-agent-configuration-options).

## Testing
Deployed changes to Lambda. Invoked Lambda directly and through API
Gateway to check with different root spans. Set the tags to either be
REQUIRE or REJECT with value `name:aws.lambda`. Confirmed in logs and UI
that we were dropping spans.

* feat: Add hierarchical configurable compression levels (#800)

feat: Add hierarchical configurable compression levels

- Add global compression_level config parameter (0-9, default: 6) with
fallback hierarchy
- Support 2-level compression configuration: global level first, then
module-specific
- This makes configuration more convenient - set once globally or
override per module
- Apply compression configuration to metrics flushers and trace
processor
  - Add environment variable DD_COMPRESSION_LEVEL for global setting
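
The fallback hierarchy above can be sketched as follows (a minimal illustration assuming module-specific settings override the global one; `resolve_compression_level` and its parameters are illustrative names, not the extension's config fields):

```rust
// Hypothetical sketch: prefer the module-specific compression level, fall
// back to the global one, and default to 6 when neither is configured.
fn resolve_compression_level(module_level: Option<u32>, global_level: Option<u32>) -> u32 {
    module_level.or(global_level).unwrap_or(6)
}

fn main() {
    assert_eq!(resolve_compression_level(None, None), 6);    // default
    assert_eq!(resolve_compression_level(None, Some(3)), 3); // global only
    assert_eq!(resolve_compression_level(Some(9), Some(3)), 9); // module wins
    println!("ok");
}
```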

Test
- Configuration:
<img width="966" height="742" alt="image"
src="https://github.com/user-attachments/assets/b33c0fd3-2b02-4838-8660-fc9ea9493998"
/>
-
([log](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fltn1-fullinstrument-bn-cold-python310-lambda/log-events/2025$252F08$252F25$252F$255B$2524LATEST$255D9c19719435bc48839f6f005d2b58b552))
Configuration:
<img width="965" height="568" alt="image"
src="https://github.com/user-attachments/assets/dfef594a-549f-4773-879d-549234f03fb7"
/>

* cherry pick: No longer launch Go-based agent for compatibility/OTLP/AAP config (#817)

Cherry pick of previously reverted #788 

https://datadoghq.atlassian.net/browse/SVLS-7398

- As part of the coming release, the bottlecap agent no longer launches the
Go-based agent when compatibility/AAP/OTLP features are active
- Emit the same metric when detecting any of the above configurations
- Update corresponding unit tests

Attention: there is a known issue with .NET
https://github.com/aws/aws-lambda-dotnet/issues/2093

Manifests:
- [Test lambda
function](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/ltn1-fullinstrument-bn-cold-python310-lambda?code=&subtab=envVars&tab=testing)
with

[logs](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fltn1-fullinstrument-bn-cold-python310-lambda/log-events/2025$252F08$252F21$252F$255B$2524LATEST$255Df3788d359677452dad162488ff15456f$3FfilterPattern$3Dotel)
showing compatibility/AAP/OTLP are enabled
<img width="2260" height="454" alt="image"

src="https://github.com/user-attachments/assets/5dfd4954-5191-4390-83f5-a8eb3bffb9d3"
/>

-

[Logging](https://app.datadoghq.com/logs/livetail?query=functionname%3Altn1-fullinstrument-bn-cold-python310-lambda%20Metric&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice&fromUser=true&messageDisplay=inline&refresh_mode=paused&storage=driveline&stream_sort=desc&viz=stream&from_ts=1755787655569&to_ts=1755787689060&live=false)
<img width="1058" height="911" alt="image"

src="https://github.com/user-attachments/assets/629f75d1-e115-4478-afac-ad16d9369fa7"
/>

-

[Metric](https://app.datadoghq.com/screen/integration/aws_lambda_enhanced_metrics?fromUser=false&fullscreen_end_ts=1755788220000&fullscreen_paused=true&fullscreen_refresh_mode=paused&fullscreen_section=overview&fullscreen_start_ts=1755787200000&fullscreen_widget=2&graph-explorer__tile_def=N4IgbglgXiBcIBcD2AHANhAzgkAaEAxgK7ZIC2A%2BhgHYDWmcA2gLr4BOApgI5EfYOxGoTphRJqmDhQBmSNmQCGOeJgIK0CtnhA8ObCHyagAJkoUVMSImwIc4IMhwT6CDfNQWP7utgE8AjNo%2BvvaYRGSwpggKxkgA5gB0kmxgemh8mAkcAB4IHBIQ4gnSChBoSKlswAAkCgDumBQKBARW1Ai41ZxxhdSd0kTUBAi9AL4ABABGvuPAA0Mj4h6OowkKja2DCAAUAJTaCnFx3UpyoeEgo6wgsvJEGgJCN3Jk9wrevH6BV-iWbMqgTbtOAAJgADPg5MY9BRpkZEL4UHZ4LdXhptBBqNDsnAISAoXp7NDVJdmKMfiBsL50nBgOSgA&refresh_mode=sliding&from_ts=1755783890661&to_ts=1755787490661&live=true)
<img width="1227" height="1196" alt="image"

src="https://github.com/user-attachments/assets/2922eb54-9853-4512-a902-dfa97916b643"
/>
====
Another manifest for .Net:
- [Lambda
function](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/ltn1-fullinstrument-bn-cold-dotnet6-lambda?code=&subtab=envVars&tab=testing)
-
[Log](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fltn1-fullinstrument-bn-cold-dotnet6-lambda/log-events/2025$252F08$252F29$252F$255B$2524LATEST$255D15ca867ee94049129ed461283ae46f01$3FfilterPattern$3Dfailover)
- Configuration
<img width="1490" height="902" alt="image"
src="https://github.com/user-attachments/assets/b070e5e1-8335-4494-877f-6475d9959af2"
/>
- Log shows the issue reasons
<img width="990" height="536" alt="image"
src="https://github.com/user-attachments/assets/5503de33-ea92-401c-a595-c165e39b0c6e"
/>
<img width="848" height="410" alt="image"
src="https://github.com/user-attachments/assets/54d1e87c-93e7-4084-8a9a-173cb7d0c4a7"
/>
<img width="938" height="458" alt="image"
src="https://github.com/user-attachments/assets/4f205ec2-d923-47d1-9005-762650798894"
/>

---------

Co-authored-by: Tianning Li <tianning.li@datadoghq.com>

* feat: [Trace Stats] Add feature flag DD_COMPUTE_TRACE_STATS (#841)

## This PR

Adds a feature flag `DD_COMPUTE_TRACE_STATS`.
- If true, trace stats will be computed from the extension side. In this
case, we set `_dd.compute_stats` to `0`, so trace stats won't be
computed on the backend.
- If false, trace stats will NOT be computed from the extension side. In
this case, we set `_dd.compute_stats` to `1`, so trace stats will be
computed on the backend.
- Defaults to false for now, so `_dd.compute_stats` still defaults to
`1`, i.e. default behavior is not changed.
- After we fully support computing trace stats on extension side, I will
change the default to true then delete the flag.

Jira: https://datadoghq.atlassian.net/browse/SVLS-7593
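
The flag-to-tag mapping above can be sketched as follows (the `0`/`1` semantics of `_dd.compute_stats` are from the PR body; `compute_stats_tag` is an illustrative name, not the extension's API):

```rust
// Hypothetical sketch: if the extension computes trace stats, set
// `_dd.compute_stats` to 0 so the backend skips computation; otherwise
// set it to 1 so the backend keeps computing stats (current default).
fn compute_stats_tag(compute_on_extension: bool) -> f64 {
    if compute_on_extension { 0.0 } else { 1.0 }
}

fn main() {
    // Flag defaults to false, so the backend keeps computing stats.
    assert_eq!(compute_stats_tag(false), 1.0);
    assert_eq!(compute_stats_tag(true), 0.0);
    println!("ok");
}
```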

* fix: use tokio time instead of std time because tokio time can be frozen (#846)

Tokio time allows us to sleep without blocking the runtime. It also
allows time to be frozen (mainly for tests). I think we may need the
sleep to force blocking code to yield.

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>

* add support for observability pipeline (#826)

## Task

https://datadoghq.atlassian.net/jira/software/c/projects/SVLS/boards/5420?quickFilter=7573&selectedIssue=SVLS-7525

## Overview
* Add support for sending logs to an Observability Pipeline instead of
directly to Datadog.
* To enable, customers must set
`DD_ENABLE_OBSERVABILITY_PIPELINE_FORWARDING` to true, and
`DD_LOGS_CONFIG_LOGS_DD_URL` to their Observability Pipeline endpoint.
Will fast follow and update docs to reflect this.
* Initially, I was setting up the observability pipeline with
'Datadog Agent' as the source. This required us to format the log
message in a certain way. However, after chatting with the Observability
Pipeline team, they actually recommend we use 'Http Server' as the
source for our pipeline setup instead, since it just accepts any JSON.

## Testing
Created an [observability
pipeline](https://ddserverless.datadoghq.com/observability-pipelines/b15e4a64-880d-11f0-b622-da7ad0900002/view)
and deployed a lambda function with the changes. Triggered the lambda
function and confirmed we see it in our
[logs](https://ddserverless.datadoghq.com/logs?query=function_arn%3A%22arn%3Aaws%3Alambda%3Aus-east-1%3A425362996713%3Afunction%3Aobcdkstackv3-hellofunctionv3ec5a2fbe-l9qvtrowzb5q%22&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice&messageDisplay=inline&refresh_mode=sliding&storage=hot&stream_sort=desc&viz=stream&from_ts=1758196420534&to_ts=1758369220534&live=true).
We know it is going through the observability pipeline because we can
see 'http_server' attached as the source type.

* feat: lower zstd default compression (#867)

A quick test run showed our max duration skews upward on smaller lambda
sizes with lots of data when the zstd compression level is set to 6. Looks
like we start to block the CPU at around this mark.

Gonna default it to 3, as tested below with 3 500k runs.
<img width="1293" height="319" alt="image"
src="https://github.com/user-attachments/assets/d1224676-f14f-4a55-8440-089bb9ff91d0"
/>

* revert(#817): reverts fallback config  (#871)

# What?

This reverts commit 2396c4fe102677179c834c2dd65cb5b2ea79ca8f from #817 

# Why?

Need a release

# Notes

We'll cherry pick and bring it back at some point

* chore: [Trace Stats] Rename env var DD_COMPUTE_TRACE_STATS (#875)

# This PR
As @apiarian-datadog suggested in
https://github.com/DataDog/datadog-lambda-extension/pull/841#discussion_r2376111825,
rename the feature flag `DD_COMPUTE_TRACE_STATS` to
`DD_COMPUTE_TRACE_STATS_ON_EXTENSION` for clarity.

# Notes
Jira: https://datadoghq.atlassian.net/browse/SVLS-7593

* feat: remove failover to go (#882)

Removes the failover to Go. If we can't parse any of the config options,
we log the failing value and move on with the default specified.

* fix: use datadog as default propagation style if supplied version is malformed (#891)

Fixes an issue where config parsing fails if the supplied propagation
style version is malformed

* fix: use None if propagation style is invalid (#895)

After internal discussion we determined that the tracing libraries use
None if the trace propagation style is invalid or malformed.

This brings us into alignment.

* feat: Support periodic reload for api key secret (#893)

# This PR
Supports the env var `DD_API_KEY_SECRET_RELOAD_INTERVAL`, in seconds. It
applies when Datadog API Key is set using `DD_API_KEY_SECRET_ARN`. For
example:
- If it's `120`, then the api key will be reloaded about every 120 seconds.
Note that a reload can only be triggered when the api key is used, usually
when data is being flushed. If there is no invocation and no data needs
to be flushed, then the reload won't happen.
- If it's not set or set to `0`, then the api key will only be loaded once,
the first time it is used, and won't be reloaded.
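
The lazy-reload rule above can be sketched as follows (a minimal illustration; `should_reload` and its parameters are hypothetical names, not `ApiKeyFactory`'s real API):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: the key is only refreshed lazily, at the moment it is
// actually used, and only if the configured interval has elapsed. An interval
// of 0 (or unset) means "load once, never reload".
fn should_reload(last_loaded: Option<Instant>, interval_secs: u64, now: Instant) -> bool {
    match (last_loaded, interval_secs) {
        (None, _) => true, // first use: always load
        (_, 0) => false,   // 0 / unset: never reload
        (Some(t), secs) => now.duration_since(t) >= Duration::from_secs(secs),
    }
}

fn main() {
    let start = Instant::now();
    assert!(should_reload(None, 0, start));
    assert!(!should_reload(Some(start), 0, start + Duration::from_secs(999)));
    assert!(should_reload(Some(start), 120, start + Duration::from_secs(121)));
    assert!(!should_reload(Some(start), 120, start + Duration::from_secs(60)));
    println!("ok");
}
```

This also explains why a reload can lag behind the interval: the check only runs when a flush actually uses the key.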

# Motivation
Some customers regularly rotate their api key in a secret. We need to
provide a way for them to update our cached key.
https://github.com/DataDog/datadog-lambda-extension/issues/834

# Testing
## Steps
1. Set the env var `DD_API_KEY_SECRET_RELOAD_INTERVAL` to `120`

2. Invoke the Lambda every minute

## Result
The reload interval is passed to the `ApiKeyFactory`
<img width="711" height="25" alt="image"
src="https://github.com/user-attachments/assets/6fcc5081-accb-4928-8fa7-094d36aa2fa1"
/>

Reload happens roughly every 120 seconds. It's sometimes longer than 120
seconds due to the reason explained above.
<img width="554" height="252" alt="image"
src="https://github.com/user-attachments/assets/3fa78249-ff98-47d2-a953-f090630bbeb1"
/>

# Notes to Users
When you use this env var, also keep a grace period for the old API key
after you update the secret to the new key, and make the grace period
longer than the reload interval to give the extension sufficient time
to reload the secret.

# Internal Notes
Jira: https://datadoghq.atlassian.net/browse/SVLS-7572

* [SVLS-7885] update tag splitting to allow for ',' and ' ' (#916)

## Overview
We currently split `DD_TAGS` only on `,`. A customer asked whether we
can also split on spaces, since that is common for container images and
Lambda lets you deploy images.
(https://docs.datadoghq.com/getting_started/tagging/assigning_tags/?tab=noncontainerizedenvironments)
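A minimal sketch of splitting on either delimiter (the helper name is illustrative, not the actual config code):

```rust
/// Split a DD_TAGS-style string on commas and/or spaces, dropping
/// empty fragments produced by repeated delimiters.
fn split_tags(raw: &str) -> Vec<String> {
    raw.split(|c| c == ',' || c == ' ')
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect()
}
```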

* [SLES-2547] add metric namespace for DogStatsD (#920)

Follow up from https://github.com/DataDog/serverless-components/pull/48

What does this PR do?
Add support for DD_STATSD_METRIC_NAMESPACE.

Motivation
This was brought up by a customer who noticed issues migrating to
Bottlecap. Our docs show we should support this, but we currently don't
have it implemented:
https://docs.datadoghq.com/serverless/guide/agent_configuration/#dogstatsd-custom-metrics.

Additional Notes
Requires changes in agent/extension. Will follow up with those PRs.

Describe how to test/QA your changes
Deployed changes to extension and tested with and without the custom
namespace env variable. Confirmed that metrics are getting the prefix
attached:
[metrics](https://ddserverless.datadoghq.com/metric/explorer?fromUser=false&graph_layout=stacked&start=1762783238873&end=1762784138873&paused=false#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGABuMMiZeBoIOKoAdPJYTFJNcMRwtRIdmfgiCMAAVDwgfKCR0bmxWABMickIqel4WTl5iIXFIHPlVcgAVjiMIk3TmvIY2U219Y0tbYwdXT0EkucDeEOj4zwAXSornceEwoXCINUYIwMVK8QmFFAUJhcJ0CwmQJA9SwaByoGueIQCE2UBwMCcmXBGggmUSaFEcCcckUynSDKg9MZTnoTGUIjcHjQiKSEHsmCwzIUmwZIiUgJ4fGx8gZCAAwlJhDAUCIwWgeEA)

* refactor: Move metric namespace validation to dogstatsd util (#921)

https://datadoghq.atlassian.net/browse/SLES-2547

- Updates the dependency to use the centralized parse_metric_namespace function.
- Removes duplicate code in favor of the shared implementation.


Test:
- Deploy the extension and configure
[DD_STATSD_METRIC_NAMESPACE](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/ltn-fullinstrument-bn-10bst-node22-lambda?subtab=envVars&tab=configure)
<img width="964" height="290" alt="image"
src="https://github.com/user-attachments/assets/94836a3a-9905-44b4-9565-185745e47981"
/>
- Invoke the function and expect to see the metric using this custom
namespace prefix
<img width="1170" height="516" alt="Screenshot 2025-11-11 at 4 59 57 PM"
src="https://github.com/user-attachments/assets/0bf4ac5e-ac1c-4cfe-817e-89b004717caf"
/>

[Metric
link](https://ddserverless.datadoghq.com/metric/explorer?fromUser=true&graph_layout=stacked&start=1762897808375&end=1762898083375&paused=true#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGABuMMhsaGg4YG5oUAB0WmiCLapS4m6iMMAAVDwgPAC6VBpyaDmg8hgzCAg5STgwTpmYGhoQmYloonBOcorK6QdQ+4dO9EzKIm4eaKP8EPaYWMcKKwciSuM8Pggd7iADCUmEMBQIjwdR4QA)

* [SVLS-7704] add support for SSM Parameter API key (#924)

## Overview
* Add support for customers storing the Datadog API key in SSM Parameter
Store.

## Testing
* Deployed changes and confirmed this works with Parameter Store String
and SecureString.

* feat: Add support for DD_LOGS_ENABLED as alias for DD_SERVERLESS_LOGS_ENABLED (#928)

https://datadoghq.atlassian.net/browse/SVLS-7818

## Overview
Add DD_LOGS_ENABLED environment variable and YAML config option as an
alias for DD_SERVERLESS_LOGS_ENABLED. Both variables now use OR logic,
meaning logs are enabled if either variable is set to true.

Changes:
- Add logs_enabled field to EnvConfig and YamlConfig structs
- Implement OR logic in merge_config functions: logs are enabled if
either DD_LOGS_ENABLED or DD_SERVERLESS_LOGS_ENABLED is true
- Add comprehensive test coverage with 9 test cases covering all
combinations of the two variables
- Maintain backward compatibility with existing configurations
- Default value remains true when neither variable is set
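The merge rule can be sketched as (an illustrative helper, not the actual merge_config code; the false-wins-only-when-explicitly-set behavior is inferred from the testing notes below):

```rust
/// OR-merge of the two enablement flags: if neither variable is set,
/// default to true; otherwise logs are enabled if either one is
/// explicitly set to true.
fn merge_logs_enabled(logs: Option<bool>, serverless_logs: Option<bool>) -> bool {
    match (logs, serverless_logs) {
        (None, None) => true, // default when neither is set
        _ => logs.unwrap_or(false) || serverless_logs.unwrap_or(false),
    }
}
```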


## Testing 
Set DD_LOGS_ENABLED and DD_SERVERLESS_LOGS_ENABLED to false and expect:
- [Log can be found in AWS
console](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fltn-fullinstrument-bn-cold-node22-lambda/log-events/2025$252F11$252F13$252F$255B$2524LATEST$255D455478dcbc944055b5be933e2e099f6a$3FfilterPattern$3DREPORT+RequestId)
- [Log could NOT be found in DD
console](https://ddserverless.datadoghq.com/logs?query=source%3Alambda%20%40lambda.arn%3A%22arn%3Aaws%3Alambda%3Aus-east-1%3A425362996713%3Afunction%3Altn-fullinstrument-bn-cold-node22-lambda%22%20AND%20%22REPORT%20RequestId%22&agg_m=count&agg_m_source=base&agg_t=count&clustering_pattern_field_path=message&cols=host%2Cservice%2C%40lambda.request_id&fromUser=true&messageDisplay=inline&refresh_mode=paused&storage=hot&stream_sort=desc&viz=stream&from_ts=1763063694206&to_ts=1763065424700&live=false)

Otherwise the log should be available in DD console.

* chore: Upgrade libdatadog and construct http client for traces (#917)

Upgrade libdatadog. Including:
- Rename a few crates:
  - `ddcommon` -> `libdd-common`
  - `datadog-trace-protobuf` -> `libdd-trace-protobuf`
  - `datadog-trace-utils` -> `libdd-trace-utils`
  - `datadog-trace-normalization` -> `libdd-trace-normalization`
  - `datadog-trace-stats` -> `libdd-trace-stats`
- Use the new API to send traces, which takes in an http_client object
instead of a proxy URL string

GitHub issue:
https://github.com/DataDog/datadog-lambda-extension/issues/860
Jira: https://datadoghq.atlassian.net/browse/SLES-2499
Slack discussion:
https://dd.slack.com/archives/C01TCF143GB/p1762526199549409

* Merge Lambda Managed Instance feature branch (#947)

https://datadoghq.atlassian.net/browse/SVLS-8080

## Overview
Merge Lambda Managed Instance feature branch

## Testing 
Covered by individual commits

Co-authored-by: shreyamalpani <shreya.malpani@datadoghq.com>
Co-authored-by: duncanista <30836115+duncanista@users.noreply.github.com>
Co-authored-by: astuyve <aj.stuyvenberg@datadoghq.com>
Co-authored-by: jchrostek-dd <john.chrostek@datadoghq.com>
Co-authored-by: tianning.li <tianning.li@datadoghq.com>

* fix(config): support colons in tag values (URLs, etc.) (#953)

https://datadoghq.atlassian.net/browse/SVLS-8095

## Overview
Tag parsing previously used `split(':')`, which broke values containing
colons, like URLs (`git.repository_url:https://...`). Changed to use
`splitn(2, ':')` to split only on the first colon, preserving the rest
as the value.

Changes:
 - Add parse_key_value_tag() helper to centralize parsing logic
 - Refactor deserialize_key_value_pairs to use helper
 - Refactor deserialize_key_value_pair_array_to_hashmap to use helper
 - Add comprehensive test coverage for URL values and edge cases
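The first-colon split can be sketched as follows (the name mirrors the PR's `parse_key_value_tag`, but the signature here is illustrative):

```rust
/// Split a `key:value` tag on the first colon only, so values that
/// themselves contain colons (e.g. URLs) survive intact.
fn parse_key_value_tag(tag: &str) -> Option<(String, String)> {
    let mut parts = tag.splitn(2, ':');
    match (parts.next(), parts.next()) {
        (Some(k), Some(v)) if !k.is_empty() => Some((k.to_string(), v.to_string())),
        _ => None, // no colon, or empty key
    }
}
```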

## Testing 
Unit tests; e2e tests are expected to pass.

Co-authored-by: tianning.li <tianning.li@datadoghq.com>

* [SVLS-7934] feat: Support TLS certificate for trace/stats flusher (#961)

## Problem
A customer reported that their Lambda is behind a proxy, and the
Rust-based extension can't send traces to Datadog via the proxy, while
the previous go-based extension worked.

## This PR
Supports the env var `DD_TLS_CERT_FILE`: the path to a file of
concatenated CA certificates in PEM format, e.g.
`DD_TLS_CERT_FILE=/opt/ca-cert.pem`. When the extension flushes
traces/stats to Datadog, the HTTP client it creates can load and use
this certificate and connect through the proxy properly.

## Testing
### Steps
1. Create a Lambda in a VPC with an NGINX proxy.
2. Add a layer to the Lambda, which includes the CA certificate
`ca-cert.pem`
3. Set env vars:
    - `DD_TLS_CERT_FILE=/opt/ca-cert.pem`
- `DD_PROXY_HTTPS=http://10.0.0.30:3128`, where `10.0.0.30` is the
private IP of the proxy EC2 instance
    - `DD_LOG_LEVEL=debug`
4. Update routing rules of security groups so the Lambda can reach
`http://10.0.0.30:3128`
5. Invoke the Lambda
### Result
**Before**
Trace flush failed with error logs:
> DD_EXTENSION | ERROR | Max retries exceeded, returning request error
error=Network error: client error (Connect) attempts=1
DD_EXTENSION | ERROR | TRACES | Request failed: No requests sent

**After**
Trace flush is successful:
> DD_EXTENSION | DEBUG | TRACES | Flushing 1 traces
DD_EXTENSION | DEBUG | TRACES | Added root certificate from
/opt/ca-cert.pem
DD_EXTENSION | DEBUG | TRACES | Proxy connector created with proxy:
Some("http://10.0.0.30:3128")
DD_EXTENSION | DEBUG | Sending with retry
url=https://trace.agent.datadoghq.com/api/v0.2/traces payload_size=1120
max_retries=1
DD_EXTENSION | DEBUG | Received response status=202 Accepted attempt=1
DD_EXTENSION | DEBUG | Request succeeded status=202 Accepted attempts=1
DD_EXTENSION | DEBUG | TRACES | Flushing took 1609 ms

## Notes
This fix only covers the trace flusher and stats flusher, which use
`ServerlessTraceFlusher::get_http_client()` to create the HTTP client.
It doesn't cover the logs flusher and proxy flusher, which use a
different function (`http.rs:get_client()`) to create the HTTP client.
However, logs flushing was successful in my tests, even with no
certificate added. We can come back to the logs/proxy flushers if
someone reports an error.

* chore: Upgrade libdatadog (#964)

## Overview
The crate `datadog-trace-obfuscation` has been renamed to
`libdd-trace-obfuscation`. This PR updates this dependency.

## Testing

* [SVLS-8211] feat: Add timeout for requests to span_dedup_service (#986)

## Problem
Span dedup service sometimes fails to return the result and thus logs
the error:
> DD_EXTENSION | ERROR | Failed to send check_and_add response: true

I see this error in our Self Monitoring and in a customer's account.
I also believe it causes the extension to fail to receive traces from
the tracer, causing missing traces. This is because the caller of span
dedup is in `process_traces()`, the function that handles the tracer's
HTTP request to send traces. If this function fails to get the span
dedup result and gets stuck, the HTTP request will time out.

## This PR
While I don't yet know what causes the error, this PR adds a patch to
mitigate the impact:
1. Change log level from `error` to `warn`
2. Add a timeout of 5 seconds to the span dedup check, so that if the
caller doesn't get an answer soon, it defaults to treating the trace as
not a duplicate, which is the most common case.
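The real code is async; as a rough synchronous analogue of the patch (with std's `recv_timeout` standing in for `tokio::time::timeout`, and the function name illustrative):

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Ask the dedup service whether a trace is a duplicate, but only wait
/// up to `deadline`. On timeout (or a dropped sender), default to
/// "not a duplicate", the most common case, instead of blocking the
/// tracer's HTTP request.
fn is_duplicate_with_timeout(rx: mpsc::Receiver<bool>, deadline: Duration) -> bool {
    rx.recv_timeout(deadline).unwrap_or(false)
}
```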

## Testing
Merge this PR, then check logs in Self Monitoring, as it's hard to run
high-volume tests in Self Monitoring from a non-main branch.

* [SVLS-8150] fix(config): ensure logs intake URL is correctly prefixed (#1021)

## Overview

Ensures `DD_LOGS_CONFIG_LOGS_DD_URL` is correctly prefixed with
`https://`
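The normalization can be sketched as (the helper name is illustrative; whether `http://` is also passed through is an assumption here):

```rust
/// Ensure a logs intake URL carries an explicit scheme, defaulting
/// to `https://` when none is supplied.
fn ensure_https_prefix(url: &str) -> String {
    if url.starts_with("https://") || url.starts_with("http://") {
        url.to_string() // scheme already present, leave as-is
    } else {
        format!("https://{url}")
    }
}
```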

## Testing 

Manually tested that logs get sent to alternate logs intake

* chore(deps): upgrade dogstatsd (#1020)

## Overview

Continuation of #1018, removing the unnecessary mut lock on callers for
dogstatsd

* chore(deps): upgrade rust to `v1.93.1` (#1034)

## What?

Upgrade rust to latest stable 1.93.1

## Why?

The `time` vulnerability fix is only available on Rust >= 1.88.0

* feat(http): allow skip ssl validation (#1064)

## Overview

Add DD_SKIP_SSL_VALIDATION support, parsed from both env and YAML,
matching the datadog-agent's behavior. It is applied to all outgoing
HTTP clients (reqwest via danger_accept_invalid_certs, hyper via a
custom ServerCertVerifier).

## Motivation

Customers in environments with corporate proxies or custom CA setups
need the ability to disable TLS certificate validation, matching the
existing datadog-agent config option. The Go agent applies
tls.Config{InsecureSkipVerify: true} to all HTTP transports via a
central CreateHTTPTransport(); we mirror this by wiring the config
through to both client builders.

And [SLES-2710](https://datadoghq.atlassian.net/browse/SLES-2710)

## Changes

Config (config/mod.rs, config/env.rs, config/yaml.rs):
- Add skip_ssl_validation: bool to Config, EnvConfig, and YamlConfig
with default false

reqwest client (http.rs):
- .danger_accept_invalid_certs(config.skip_ssl_validation) on the shared
client builder

hyper client (traces/http_client.rs):
- Custom NoVerifier implementing
rustls::client::danger::ServerCertVerifier that accepts all certificates
- Uses CryptoProvider::get_default() (not hardcoded aws_lc_rs) for
FIPS-safe signature scheme reporting
- New skip_ssl_validation parameter on create_client()

## Testing 

Unit tests and self monitoring

[SLES-2710]:
https://datadoghq.atlassian.net/browse/SLES-2710?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

* add Cargo.toml for datadog-agent-config

* update licenses

* remove aws.rs from datadog-agent-config

* chore: upgrade workspace rust edition to 2024 (#96)

* upgrade rust edition to 2024 for workspace

* apply formatting

---------

Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>
Co-authored-by: alexgallotta <5581237+alexgallotta@users.noreply.github.com>
Co-authored-by: AJ Stuyvenberg <astuyve@gmail.com>
Co-authored-by: Nicholas Hulston <nicholashulston@gmail.com>
Co-authored-by: Aleksandr Pasechnik <aleksandr.pasechnik@datadoghq.com>
Co-authored-by: shreyamalpani <shreya.malpani@datadoghq.com>
Co-authored-by: Yiming Luo <yiming.luo@datadoghq.com>
Co-authored-by: Florentin Labelle <florentin.labelle@outlook.fr>
Co-authored-by: Romain Marcadier <romain.muller@telecomnancy.net>
Co-authored-by: Zarir Hamza <zarir.hamza@datadoghq.com>
Co-authored-by: Romain Marcadier <romain.marcadier@datadoghq.com>
Co-authored-by: Tianning Li <tianning.li@datadoghq.com>
Co-authored-by: jchrostek-dd <john.chrostek@datadoghq.com>
Co-authored-by: astuyve <aj.stuyvenberg@datadoghq.com>