Skip to content

feat: Defer DD API key resolution to flushing time#21

Merged
lym953 merged 9 commits intomainfrom
yiming.luo/api-key-future-3
Jun 13, 2025
Merged

feat: Defer DD API key resolution to flushing time#21
lym953 merged 9 commits intomainfrom
yiming.luo/api-key-future-3

Conversation

@lym953
Copy link
Copy Markdown
Contributor

@lym953 lym953 commented Jun 11, 2025

What does this PR do?

Make dogstatsd Flusher accept a future for the API key instead of the resolved API key string.
The API key future is awaited/resolved at flushing time, instead of by the caller when flusher is initialized.

Motivation

From @astuyve:

today we basically block/await on that decrypt call before we can call /next
so if we can instead make that async and then resolve the future only when we need to flush data, that can be a big win for many customers.

https://datadoghq.atlassian.net/browse/SVLS-6995

Additional Notes

This is my first Rust PR. Feel free to throw tons of comments at me!

Describe how to test/QA your changes

Passed the automated tests.
No manual test for now. Will test it after Bottlecap code is updated as well.

@astuyve astuyve requested a review from Copilot June 12, 2025 15:50

This comment was marked as outdated.

Copy link
Copy Markdown
Contributor

@astuyve astuyve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say copilot did well on this CR, but other than those 3 changes this PR is good to go

Comment on lines +52 to 71
async fn get_dd_api(&mut self) -> &DdApi {
if let Some(dd_api) = self.dd_api.get() {
return dd_api;
}

let api_key = (self.api_key_factory)().await;

// Initialize the OnceLock with a new DdApi instance
// If another thread initialized it while we were getting the API key,
// we'll get that instance instead
self.dd_api.get_or_init(|| {
DdApi::new(
api_key,
self.metrics_intake_url_prefix.clone(),
self.https_proxy.clone(),
self.timeout,
self.retry_strategy.clone(),
)
})
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method takes a mutable receiver, so there shouldn't be any need for OnceLock. 🤔

Just store Option<DdApi> on the struct and populate it if need be.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Changed.

@lym953 lym953 requested a review from Copilot June 12, 2025 21:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR defers the DogStatsD API key resolution until flush time to improve performance and responsiveness. Key changes include:

  • Switching from a pre-resolved API key string to a deferred future via an ApiKeyFactory.
  • Implementing lazy initialization for DdApi in Flusher using an async function.
  • Updating tests and serverless compatibility code to utilize the new ApiKeyFactory.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
crates/dogstatsd/tests/integration_test.rs Updated test to use an ApiKeyFactory closure instead of a static api_key string.
crates/dogstatsd/src/flusher.rs Modified Flusher to lazily instantiate DdApi by awaiting a future for the API key.
crates/datadog-serverless-compat/src/main.rs Updated FlusherConfig initialization to use a closure that returns the API key asynchronously.
Comments suppressed due to low confidence (2)

crates/dogstatsd/src/flusher.rs:51

  • [nitpick] Consider expanding the documentation for get_dd_api to clarify the benefits and thread-safety assumptions of lazy initialization, as well as its caching behavior.
async fn get_dd_api(&mut self) -> &DdApi {

crates/dogstatsd/tests/integration_test.rs:43

  • [nitpick] Consider adding a brief comment to explain why the api_key_factory closure is used here, to improve clarity for future maintainers.
let api_key_factory: ApiKeyFactory = Arc::new(|| Box::pin(async move { "mock-api-key".to_string() }));

Comment thread crates/datadog-serverless-compat/src/main.rs Outdated
@lym953 lym953 merged commit fb0a663 into main Jun 13, 2025
26 checks passed
@lym953 lym953 deleted the yiming.luo/api-key-future-3 branch June 13, 2025 14:49
lym953 added a commit that referenced this pull request Jul 10, 2025
lym953 added a commit that referenced this pull request Jul 10, 2025
lym953 added a commit to DataDog/datadog-lambda-extension that referenced this pull request Jul 17, 2025
# Motivation
From @astuyve:
> today we basically block/await on that decrypt call before we can call
/next
so if we can instead make that async and then resolve the future only
when we need to flush data, that can be a big win for many customers.

https://datadoghq.atlassian.net/browse/SVLS-6995

# Previous work
DataDog/serverless-components#21,
DataDog/serverless-components#24 created
`ApiKeyFactory`, which is a util to enable lazy API key resolution.

# This PR

Updates Bottlecap code to use `ApiKeyFactory` to lazily resolve API key,
i.e. instead of resolving it by querying Secret Manager or KMS during
init phase, do it at flushing time when api key is actually needed.

# Note

This PR changes the behavior when key resolution fails, i.e. when
`resolve_secrets()` returns `None`.
- Before: run `extension_loop_idle()`, which does not stop the runtime
- After: panic, which will stop the runtime (if I understand correctly).
Of course it's not ideal. Any better idea?
- It's harder now to run `extension_loop_idle()` because api key
resolution code is not in the main loop anymore, but in various consumer
code of api key
- Is there a way to gracefully shut down the extension without affecting
the runtime?

Update: Added a PR to address resolution failure:
#732
These two PRs should be merged together. Keeping them separate PRs just
to make review easier.

# Testing
## Setup
- Runtime: Go1 on Amazon Linux 2
- Architecture: arm64
- An app with empty implementation code

## Result
Below is the `Datadog Next-Gen Extension ready in:` time logged.

- Before: (prod extension
`arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension-ARM:82`)
  - 88.6 ± 1.8 (ms)

- After: (test extension
`arn:aws:lambda:us-east-1:425362996713:layer:Datadog-Bottlecap-Beta-ARM-yiming:2`)
  - 35.4 ± 5.1 (ms)
  - (-60.0%)

<img width="461" alt="image"
src="https://github.com/user-attachments/assets/b2973aae-d8f2-4003-a37f-6af05a42e059"
/>

Both use 5 samples.

# Notes
https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants