
feat: Handle API key resolution failure #732

Merged
lym953 merged 2 commits into main from yiming.luo/lazy-api-key-error on Jul 17, 2025

Conversation

Contributor

@lym953 lym953 commented Jul 7, 2025

Context

The previous PR #717 defers API key resolution from the extension init stage to flush time. However, that PR doesn't handle the failure case well.

  • Before that PR, if resolution failed in the init stage, the extension would run an idle loop.
  • After that PR, the extension crashes at flush time, which also kills the runtime. That is not desired.

What does this PR do?

  1. For traces, defer key resolution from TraceProcessor.process_traces() to TraceFlusher.flush().
    • (This should ideally be in the previous PR, but since that is already approved, let me add this change in this new PR.)
  2. If resolution fails at flush time, then make flush a no-op, so the extension can keep running and consume events without crashing.
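
A minimal sketch of point 2, assuming hypothetical `KeyResolver` and `Flusher` types (neither is the extension's real API). It only illustrates the shape of the change: a failed resolution turns the flush into an early return instead of a panic.

```rust
/// Hypothetical stand-in for the key resolver; not the real `ApiKeyFactory`.
struct KeyResolver {
    key: Option<String>,
}

impl KeyResolver {
    /// Returns `None` when resolution fails (e.g. an invalid secret ARN).
    fn get_api_key(&self) -> Option<&str> {
        self.key.as_deref()
    }
}

/// Hypothetical flusher that buffers payloads until flush time.
struct Flusher {
    resolver: KeyResolver,
    buffered: Vec<String>,
}

impl Flusher {
    /// Returns the number of payloads sent. A failed key resolution makes the
    /// flush a no-op instead of panicking, so the caller keeps running.
    fn flush(&mut self) -> usize {
        let Some(api_key) = self.resolver.get_api_key() else {
            eprintln!("Failed to resolve API key, skipping flush");
            return 0;
        };
        let sent = self.buffered.len();
        for payload in self.buffered.drain(..) {
            // A real flusher would issue an HTTP request here.
            println!("sending {payload} with a key of length {}", api_key.len());
        }
        sent
    }
}
```

In this sketch the buffered payloads are left in place when the key is missing; whether to keep or drop them is a separate design choice.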

Dependencies

  1. feat: Make ApiKeyFactory return Option<String> serverless-components#25
  2. Add functions to SendDataBuilder libdatadog#1140

Manual Test

Steps

  1. Create a layer in sandbox
  2. Apply the layer to a Lambda function
  3. Set the env var DD_API_KEY_SECRET_ARN to an invalid value
  4. Run the Lambda
  5. Then set DD_API_KEY_SECRET_ARN to a valid value
  6. Run the Lambda

Result

  1. The function ran successfully. (screenshot)
  2. The extension printed some error logs. (screenshots)
  3. With a valid secret ARN, the Lambda runs successfully and reports to Datadog. (screenshots)

Automated Test

I didn't add any automated test because from what I see in the codebase, existing tests are usually unit tests for short functions and not for long functions that this PR touches. Please let me know if you think I should add automated tests.

debug!("Failed to send context spans to agent: {e}");
}
} else {
error!("Failed to process traces, skipping send");
Contributor Author

Processor won't send spans to TraceAggregator

Contributor

can't we just avoid the extra allocation by doing if let Some(send_data) = trace_processor... else {} ?

Comment thread bottlecap/src/logs/flusher.rs Outdated
if let Some(req) = self.create_request(batch.clone()).await {
set.spawn(async move { Self::send(req).await });
} else {
error!("Failed to create request");
Contributor Author

Flusher won't create HTTP requests to send data to Datadog at /api/v2/logs

Comment thread bottlecap/src/otlp/agent.rs Outdated
}
}
} else {
error!("Failed to process traces, skipping send");
Contributor Author

OTLP Agent won't send traces to TraceFlusher

Comment thread bottlecap/src/traces/stats_flusher.rs Outdated
}
};
} else {
error!("Failed to create endpoint");
Contributor Author

ServerlessStatsFlusher won't send stats to Datadog's endpoint.

Comment thread bottlecap/src/traces/trace_agent.rs Outdated
),
}
} else {
error!("Failed to process traces, skipping send");
Contributor Author

TraceAgent won't send traces to TraceFlusher

Comment thread bottlecap/src/traces/trace_agent.rs Outdated
),
}
} else {
error_response(
Contributor Author

TraceAgent proxy won't send data to Datadog

Comment thread bottlecap/src/traces/trace_processor.rs Outdated
));
Some(send_data)
} else {
error!("Failed to resolve API key");
Contributor Author

TraceProcessor won't process traces

@lym953 lym953 changed the title from "feat: Properly handle API key resolution failure" to "feat: Handle API key resolution failure" Jul 7, 2025
@lym953 lym953 requested a review from Copilot July 8, 2025 19:03
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances resilience by turning API key resolution failures into no-ops instead of crashing, allowing the extension to continue running. Key changes include:

  • Converting process_traces to return Option<SendData> and guarding all flush/send paths across multiple components.
  • Adding if let Some checks around API key resolution in trace, stats, logs, OTLP, and invocation processors.
  • Updating the dogstatsd dependency revision in Cargo.toml.

Reviewed Changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 2 comments.

Summary per file (File: Description)

  • trace_processor.rs: Changed return type to Option<SendData> and added if let Some(api_key) guard around endpoint construction.
  • trace_agent.rs: Added initial None check for send_data and unified error responses when API key or send_data is missing.
  • stats_flusher.rs: Changed endpoint cell to OnceCell<Option<Endpoint>> and wrapped stats send logic in if let Some for API key and endpoint.
  • otlp/agent.rs: Wrapped process_traces result in if let Some(send_data) to skip sending when API key resolution fails.
  • logs/flusher.rs: Changed cached headers to OnceCell<Option<HeaderMap>> and made create_request return Option<…> to skip sends.
  • lifecycle/invocation/processor.rs: Updated invocation processor to skip sending when process_traces returns None.
  • Cargo.toml: Bumped dogstatsd revision to 0add16260cca1ec01729a3d99f5a40cf246a2c38.

Comments suppressed due to low confidence (2)

bottlecap/src/traces/trace_processor.rs:170

  • The call to to_string().clone() is redundant; to_string() already returns a String. You can simplify to api_key: Some(api_key.into()),.
                api_key: Some(api_key.to_string().into()),

bottlecap/src/traces/trace_processor.rs:130

  • Consider adding a unit test for the new None return path when API key resolution fails, to ensure that process_traces correctly returns None and skips sending.
    ) -> Option<SendData>;

Comment thread bottlecap/src/traces/trace_agent.rs Outdated
Comment on lines +511 to +508
} else {
error!("Failed to process traces, skipping send");
error_response(
StatusCode::INTERNAL_SERVER_ERROR,
format!("Error sending traces to the trace flusher: {err}"),
),
"Failed to process traces, skipping send",
)
Copilot AI Jul 8, 2025

This else branch is unreachable because you return above when send_data is None. Consider removing the second if let/else and unifying error handling for clarity.

Suggested change
} else {
error!("Failed to process traces, skipping send");
error_response(
StatusCode::INTERNAL_SERVER_ERROR,
format!("Error sending traces to the trace flusher: {err}"),
),
"Failed to process traces, skipping send",
)

Comment thread bottlecap/src/traces/stats_flusher.rs Outdated
Some(Endpoint {
url: hyper::Uri::from_str(&stats_url)
.expect("can't make URI from stats url, exiting"),
api_key: Some(api_key.to_string().clone().into()),
Copilot AI Jul 8, 2025

Similar to the other location, the .clone() on the result of to_string() is redundant. Use api_key.into() to simplify.

Suggested change
api_key: Some(api_key.to_string().clone().into()),
api_key: Some(api_key.into()),

Contributor Author

Won't do. I would get an error after doing this:
(screenshot)
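
The screenshot is not reproduced here, so the exact error is unknown. One plausible explanation, sketched under the assumption that the `api_key` field has type `Option<Cow<'static, str>>` and that the key arrives as a short-lived borrow (the `Endpoint` type and `endpoint_for` function below are hypothetical):

```rust
use std::borrow::Cow;

/// Hypothetical endpoint type; the `Cow<'static, str>` field is an assumption.
struct Endpoint {
    api_key: Option<Cow<'static, str>>,
}

/// Builds an endpoint from a key that is only borrowed for a short lifetime.
fn endpoint_for(api_key: &str) -> Endpoint {
    Endpoint {
        // `to_string()` already yields an owned `String`, so a following
        // `.clone()` adds nothing. `String -> Cow<'static, str>` is a
        // conversion provided by the standard library.
        api_key: Some(api_key.to_string().into()),
        // By contrast, `Some(api_key.into())` would not compile here: it
        // would produce a `Cow` borrowing from `api_key`, which is not
        // `'static`.
    }
}
```

Under these assumptions, dropping `.clone()` is safe while dropping `.to_string()` is not.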

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from 1b2e85f to b60bd54 Compare July 8, 2025 19:19
@lym953 lym953 marked this pull request as ready for review July 8, 2025 19:23
@lym953 lym953 requested a review from a team as a code owner July 8, 2025 19:23
Comment thread bottlecap/src/logs/flusher.rs Outdated
Comment on lines +64 to +69
if let Some(req) = self.create_request(batch.clone()).await {
set.spawn(async move { Self::send(req).await });
} else {
error!("Failed to create request");
continue;
}

let-else is the normal way to avoid extra indentation

Suggested change
if let Some(req) = self.create_request(batch.clone()).await {
set.spawn(async move { Self::send(req).await });
} else {
error!("Failed to create request");
continue;
}
let Some(req) = self.create_request(batch.clone()).await else {
error!("Failed to create request");
continue;
};
set.spawn(async move { Self::send(req).await });

Contributor Author

That's the perfect answer!!
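
For reference, a self-contained sketch of the let-else pattern from this thread. `create_request` and `build_requests` are hypothetical stand-ins, not the flusher's real functions.

```rust
/// Hypothetical request builder: returns `None` for batches it cannot handle
/// (here, empty ones).
fn create_request(batch: &[u8]) -> Option<String> {
    if batch.is_empty() {
        None
    } else {
        Some(format!("POST {} bytes", batch.len()))
    }
}

/// Builds one request per batch, skipping any batch that fails.
fn build_requests(batches: &[Vec<u8>]) -> Vec<String> {
    let mut requests = Vec::new();
    for batch in batches {
        // let-else keeps the happy path unindented. The else block must
        // diverge (`continue`, `return`, `break`, or a panic), and the
        // statement ends with a semicolon after the closing brace.
        let Some(req) = create_request(batch) else {
            eprintln!("Failed to create request");
            continue;
        };
        requests.push(req);
    }
    requests
}
```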

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from cec3247 to 727d04f Compare July 9, 2025 15:47
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from e64164e to 29a299d Compare July 9, 2025 15:49
Comment thread bottlecap/src/logs/flusher.rs Outdated
let headers = self.get_headers().await;
self.client
let Some(headers) = self.get_headers().await else {
return None;
Contributor

Mmm, although I understand the code, I find it a little confusing that get_headers is responsible for deciding whether or not we're creating a request.

Would it make more sense to rearchitect this so that whenever we definitely know we are about to flush, let's say in flush() method, we try to get the API Key?

Contributor

@duncanista duncanista left a comment

Left a comment which I would like to see if we can work around. The main idea is, could we rearchitect so that whenever we hit flush we try to resolve the API key and then start doing later work based on it? Instead, we're failing in headers when trying to get an API key, but this looks like they should be separated 🤔

LMK what you think

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch 2 times, most recently from 66676fe to d10fb07 Compare July 9, 2025 17:59
/// These tags are used to capture runtime and initialization.
dynamic_tags: HashMap<String, String>,
/// Function to resolve Datadog API key.
api_key_factory: Arc<ApiKeyFactory>,
Contributor Author

Add it to an outer struct Processor

trace_processor: &Arc<dyn TraceProcessor + Send + Sync>,
trace_agent_tx: &Sender<SendData>,
) {
let Some(api_key) = self.api_key_factory.get_api_key().await else {
Contributor Author

... so we can abort earlier here, without needing to touch many functions

Comment thread bottlecap/src/logs/flusher.rs Outdated
pub async fn flush(&self, batches: Option<Arc<Vec<Vec<u8>>>>) -> Vec<reqwest::RequestBuilder> {
let mut set = JoinSet::new();
let api_key = self.api_key_factory.get_api_key().await;
let Some(api_key) = api_key else {
Contributor Author

Abort early at the beginning of Flusher.flush()

vec![traces],
body_size,
self.inferrer.span_pointers.clone(),
api_key,
Contributor Author

Passing api_key to process_traces(), so process_traces() won't need to handle failure inside.

}

async fn get_headers(&self) -> &HeaderMap {
async fn get_headers(&self, api_key: &str) -> &HeaderMap {
Contributor Author

Passing in api_key, so get_headers() won't need to handle the failure
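
A rough sketch of this arrangement, using `std::sync::OnceLock` and a plain vector of header pairs in place of the real cell and `HeaderMap`. `LogsFlusher`, its fields, and the header values are illustrative assumptions.

```rust
use std::sync::OnceLock;

/// Hypothetical flusher; header pairs stand in for a real `HeaderMap`.
struct LogsFlusher {
    headers: OnceLock<Vec<(String, String)>>,
}

impl LogsFlusher {
    /// Builds the headers on first use and caches them. The key is supplied
    /// by the caller, so this function has no failure path of its own.
    fn get_headers(&self, api_key: &str) -> &Vec<(String, String)> {
        self.headers.get_or_init(|| {
            vec![
                ("DD-API-KEY".to_string(), api_key.to_string()),
                ("Content-Type".to_string(), "application/json".to_string()),
            ]
        })
    }

    /// Decides up front whether there is a key, then does the rest of the
    /// work. Returns how many non-empty batches would be sent.
    fn flush(&self, resolved_key: Option<&str>, batches: &[Vec<u8>]) -> usize {
        let Some(api_key) = resolved_key else {
            return 0; // no key: nothing is sent
        };
        let _headers = self.get_headers(api_key);
        batches.iter().filter(|b| !b.is_empty()).count()
    }
}
```

One consequence of caching in this sketch: the headers keep the key from the first successful call, even if a later call passes a different key.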

Comment thread bottlecap/src/otlp/agent.rs Outdated
OtlpProcessor,
Arc<dyn TraceProcessor + Send + Sync>,
Sender<SendData>,
Arc<ApiKeyFactory>,
Contributor Author

Adding it to AgentState and Agent

Comment thread bottlecap/src/otlp/agent.rs Outdated
State((config, tags_provider, processor, trace_processor, trace_tx, api_key_factory)): State<AgentState>,
request: Request,
) -> Response {
let Some(api_key) = api_key_factory.get_api_key().await else {
Contributor Author

Abort at the beginning of v1_traces API handler

Contributor Author

Is it possible that the customer's code calls /v1/traces api synchronously, and we slow down the customer's Lambda by doing the heavy operation of resolving api key here?
If so, it might be better to further defer key resolution by moving it out of the API handler.

Contributor

This gets called by an exporter, probably at the end of the function, so yeah, it would happen during the function's run time

return;
}

let Some(api_key) = self.api_key_factory.get_api_key().await else {
Contributor Author

Abort at the beginning of StatsFlusher.send()

Comment thread bottlecap/src/traces/trace_agent.rs Outdated
version: ApiVersion,
api_key_factory: Arc<ApiKeyFactory>,
) -> Response {
let Some(api_key) = api_key_factory.get_api_key().await else {
Contributor Author

Abort at the beginning of v0.4 and v0.5 traces API handler

Contributor Author

Ditto: Is it possible that the customer's code calls the traces API synchronously, and we slow down the customer's Lambda by doing the heavy operation of resolving the API key here?
If so, it might be better to further defer key resolution by moving it out of the API handler.

Contributor

yeah we need to defer this to the flusher as this API is called synchronously
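
A sketch of what deferring to the flusher could look like, using a standard-library channel. `PendingTrace`, `handle_traces`, `flush`, and the `resolve` closure are hypothetical names for illustration only.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical trace payload; the key is left unset by the handler.
struct PendingTrace {
    body: String,
    api_key: Option<String>,
}

/// Handler called synchronously by the tracer: it only enqueues, so no
/// secret lookup happens on the request path.
fn handle_traces(tx: &Sender<PendingTrace>, body: &str) {
    let _ = tx.send(PendingTrace { body: body.to_string(), api_key: None });
}

/// Flusher: resolves the key once, then stamps it on every queued payload.
/// `resolve` is a hypothetical stand-in for the real key lookup.
fn flush(rx: &Receiver<PendingTrace>, resolve: impl Fn() -> Option<String>) -> Vec<PendingTrace> {
    let Some(api_key) = resolve() else {
        // Resolution failed: skip this flush and leave the queue untouched.
        return Vec::new();
    };
    rx.try_iter()
        .map(|mut trace| {
            trace.api_key = Some(api_key.clone());
            trace
        })
        .collect()
}
```

The handler stays cheap regardless of how slow the key lookup is, because the lookup only runs inside `flush`.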

Comment thread bottlecap/src/traces/trace_agent.rs Outdated
Err(e) => return error_response(StatusCode::INTERNAL_SERVER_ERROR, e),
};

let Some(api_key) = api_key_factory.get_api_key().await else {
Contributor Author

Abort at the beginning of handle_proxy()

Contributor Author

lym953 commented Jul 9, 2025

> Left a comment which I would like to see if we can work around. The main idea is, could we rearchitect so that whenever we hit flush we try to resolve the API key and then start doing later work based on it? Instead, we're failing in headers when trying to get an API key, but this looks like they should be separated

@duncanista Good point! Made a lot of changes.

One concern is that this PR (and the last one) only defers key resolution from init time to the trace API handler (if the trace API handler is called), not to flush time. Although that shortens cold start time, it can make the invoke phase slower. Is that a problem? (Correct me if my understanding is wrong.)

Comment thread bottlecap/src/otlp/agent.rs Outdated
traces::trace_processor::TraceProcessor,
};

use dogstatsd::api_key::ApiKeyFactory;
Contributor

Question, do we want to move this to its own file? Wondering if all other components should be relying on dogstatsd as a dependency just for an ApiKeyFactory

Contributor

Doesn't need to be done now, but it would be good to not make them dependent on a metrics module

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 3be2928 to e820be0 Compare July 10, 2025 20:18
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch 2 times, most recently from 4ff6353 to aaeefdd Compare July 10, 2025 22:40
if self.config.proxy_https.is_some() {
let site_in_no_proxy = std::env::var("NO_PROXY")
.map_or(false, |no_proxy| no_proxy.contains(&self.config.site))
.is_ok_and(|no_proxy| no_proxy.contains(&self.config.site))
Contributor Author

fixes a new clippy error due to upgrade
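
The two forms behave the same. A small sketch that takes the lookup result as a parameter instead of reading the real environment (the function names are made up for illustration):

```rust
use std::env::VarError;

/// Older form: `map_or(false, ..)` on the `Result`.
fn site_in_no_proxy_old(no_proxy: Result<String, VarError>, site: &str) -> bool {
    no_proxy.map_or(false, |value| value.contains(site))
}

/// Newer form using `Result::is_ok_and` (stable since Rust 1.70).
fn site_in_no_proxy_new(no_proxy: Result<String, VarError>, site: &str) -> bool {
    no_proxy.is_ok_and(|value| value.contains(site))
}
```

Both return `false` when the variable is unset or does not contain the site.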

struct KeyValueVisitor;

impl<'de> serde::de::Visitor<'de> for KeyValueVisitor {
impl serde::de::Visitor<'_> for KeyValueVisitor {
Contributor Author

fixes a new clippy error due to upgrade

use datadog_trace_protobuf::pb::ClientStatsPayload;
use std::collections::VecDeque;

#[allow(clippy::empty_line_after_doc_comments)]
Contributor Author

New clippy error

pub struct ProxyState {
pub config: Arc<config::Config>,
pub proxy_aggregator: Arc<Mutex<proxy_aggregator::Aggregator>>,
pub api_key_factory: Arc<ApiKeyFactory>,
Contributor Author

Moved to trace_flusher.

Comment thread bottlecap/src/traces/trace_flusher.rs Outdated
let mut traces = guard.get_batch();
// Lazily set the API key
for trace in &mut traces {
trace.get_target_mut().api_key = Some(api_key.to_string().into());
Contributor Author

I need to add get_target_mut() in DataDog/libdatadog#1140

lym953 added a commit that referenced this pull request Jul 16, 2025
…745)

# Background
Right now `SendData` is passed around across channels.

# This PR

Instead of passing `SendData`, pass `SendDataBuilderInfo`, which bundles
`SendDataBuilder` and payload size. Just before flush, call
`SendDataBuilder.build()` to build `SendData`.

# Motivation
DataDog/libdatadog#1140 (comment)
It is suggested that the function `set_api_key()` shouldn't be added on
`SendData`, but should be added on `SendDataBuilder`. Because we need to
call `set_api_key()` just before flush, we need to make sure the object
is `SendDataBuilder` instead of `SendData` until flush time.

And because we need payload size in Trace Aggregator, and
`SendDataBuilder` doesn't expose this field, we need to pass it
explicitly along with `SendDataBuilder`.

# Next steps
Update #717 and #732 so that
`get_api_key()` is called just before flush.

# Dependency
DataDog/libdatadog#1140
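
A simplified sketch of the idea in this commit message: pass a builder plus the payload size through the pipeline and only build the final object at flush time. All types here (`Payload`, `PayloadBuilder`, `BuilderInfo`) are hypothetical and do not mirror the real `SendData` or `SendDataBuilder` API.

```rust
/// Hypothetical finished payload, ready to send.
#[derive(Debug, PartialEq)]
struct Payload {
    body: Vec<u8>,
    api_key: String,
}

/// Hypothetical builder: the key can still be set until `build()` is called.
struct PayloadBuilder {
    body: Vec<u8>,
    api_key: Option<String>,
}

impl PayloadBuilder {
    fn with_api_key(mut self, key: &str) -> Self {
        self.api_key = Some(key.to_string());
        self
    }

    /// Returns `None` if no key was ever set.
    fn build(self) -> Option<Payload> {
        Some(Payload { body: self.body, api_key: self.api_key? })
    }
}

/// What travels through the channel: the builder plus the size that the
/// aggregator needs but the builder does not expose.
struct BuilderInfo {
    builder: PayloadBuilder,
    size: usize,
}

fn enqueue(body: Vec<u8>) -> BuilderInfo {
    let size = body.len();
    BuilderInfo { builder: PayloadBuilder { body, api_key: None }, size }
}

/// Just before sending, set the key and build the final payloads.
fn flush(queue: Vec<BuilderInfo>, api_key: &str) -> Vec<Payload> {
    queue
        .into_iter()
        .filter_map(|info| info.builder.with_api_key(api_key).build())
        .collect()
}
```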
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 2ef2a9a to 37caca4 Compare July 16, 2025 20:51
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from 7397582 to 43030bf Compare July 16, 2025 21:39
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 37caca4 to ef63759 Compare July 16, 2025 21:46
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from 13449b8 to d10e087 Compare July 16, 2025 21:47
lym953 added a commit that referenced this pull request Jul 17, 2025
# Motivation
From @astuyve:
> today we basically block/await on that decrypt call before we can call
> /next
> so if we can instead make that async and then resolve the future only
> when we need to flush data, that can be a big win for many customers.

https://datadoghq.atlassian.net/browse/SVLS-6995

# Previous work
DataDog/serverless-components#21,
DataDog/serverless-components#24 created
`ApiKeyFactory`, which is a util to enable lazy API key resolution.

# This PR

Updates Bottlecap code to use `ApiKeyFactory` to lazily resolve API key,
i.e. instead of resolving it by querying Secret Manager or KMS during
init phase, do it at flushing time when api key is actually needed.
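
A minimal sketch of lazy resolution with caching, assuming a hypothetical `LazyApiKey` wrapper around a resolver closure. This is not the real `ApiKeyFactory` API, and caching a failed resolution is a choice made for this sketch only.

```rust
use std::sync::OnceLock;

/// Hypothetical lazy key holder; not the real `ApiKeyFactory` API.
struct LazyApiKey<F: Fn() -> Option<String>> {
    resolver: F,
    cached: OnceLock<Option<String>>,
}

impl<F: Fn() -> Option<String>> LazyApiKey<F> {
    /// Construction is cheap: nothing is resolved during init.
    fn new(resolver: F) -> Self {
        Self { resolver, cached: OnceLock::new() }
    }

    /// Runs the resolver on the first call only. Later calls reuse the
    /// result, including a cached failure (`None`).
    fn get(&self) -> Option<&str> {
        self.cached.get_or_init(|| (self.resolver)()).as_deref()
    }
}
```

The expensive lookup (Secrets Manager or KMS in the real extension) would sit inside the closure, so it runs at the first flush rather than during init.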

# Note

This PR changes the behavior when key resolution fails, i.e. when
`resolve_secrets()` returns `None`.
- Before: run `extension_loop_idle()`, which does not stop the runtime
- After: panic, which will stop the runtime (if I understand correctly).
Of course it's not ideal. Any better idea?
- It's harder now to run `extension_loop_idle()` because api key
resolution code is not in the main loop anymore, but in various consumer
code of api key
- Is there a way to gracefully shut down the extension without affecting
the runtime?

Update: Added a PR to address resolution failure:
#732
These two PRs should be merged together. Keeping them separate PRs just
to make review easier.

# Testing
## Setup
- Runtime: Go1 on Amazon Linux 2
- Architecture: arm64
- An app with empty implementation code

## Result
Below is the `Datadog Next-Gen Extension ready in:` time logged.

- Before: (prod extension
`arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension-ARM:82`)
  - 88.6 ± 1.8 (ms)

- After: (test extension
`arn:aws:lambda:us-east-1:425362996713:layer:Datadog-Bottlecap-Beta-ARM-yiming:2`)
  - 35.4 ± 5.1 (ms)
  - (-60.0%)

<img width="461" alt="image"
src="https://github.com/user-attachments/assets/b2973aae-d8f2-4003-a37f-6af05a42e059"
/>

Both use 5 samples.

# Notes
https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998
Base automatically changed from yiming.luo/lazy-api-key-3 to main July 17, 2025 19:53
lym953 added 2 commits July 17, 2025 15:54
Simplify logic in StatsFlusher

Move api_key_factory out of TraceProcessor

Move some code

Avoid resolving key in trace api and proxy

Apply to proxy flusher

Resolve conflicts

Make trace flusher resolve api key

Fix Clippy lint

Format

Use SendData.set_api_key()

Fix errors

Improve comments
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from 7fe772c to 5ebd135 Compare July 17, 2025 19:54
@lym953 lym953 merged commit 8bdd819 into main Jul 17, 2025
46 checks passed
@lym953 lym953 deleted the yiming.luo/lazy-api-key-error branch July 17, 2025 20:14

Labels

None yet

Projects

None yet

5 participants