Pull Request Overview
This PR introduces changes to enable optionally redriving failed metrics and trace requests in serverless environments by returning the unsent data from the flush operations for retry.
- Updated the dogstatsd flusher to return failed Series and SketchPayload vectors via a new flush_with_retries function.
- Modified the trace flusher to accept and return failed trace data for potential retransmission.
- Added Clone derivations in the Datadog-related types to support safe data retries.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/dogstatsd/src/flusher.rs | Adjusted flush logic to optionally return failed metrics for retry and improved error logging. |
| crates/dogstatsd/src/datadog.rs | Added Clone trait derivations to resource and metric types to support retry functionality. |
| crates/datadog-trace-agent/src/trace_flusher.rs | Updated send and flush to return unsent trace data for redriving failed requests. |

        // Return the failed metrics for potential retry
        Some((series_failed, sketches_failed))
    } else {
        debug!("Some metrics were not sent but no errors occurred");

[nitpick] Consider adding a comment explaining why failed batches that did not encounter a shipping error are not returned for retry. This clarification would help future maintainers understand the intended behavior.
This reverts commit e943bd1.

    - async fn should_try_next_batch(resp: Result<Response, ShippingError>) -> bool {
    + /// Returns a tuple (continue_to_next_batch, should_retry_this_batch)
    + async fn should_try_next_batch(resp: Result<Response, ShippingError>) -> (bool, bool) {

This should probably return an enum instead of a tuple of bools.
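
A minimal sketch of what the enum suggestion above could look like. Everything here is hypothetical: `BatchOutcome`, `SendError`, and `classify` are illustrative names, the mapping from errors to outcomes is an assumption rather than the PR's actual logic, and the function is synchronous to keep the example self-contained.

```rust
/// Hypothetical replacement for the `(bool, bool)` return value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BatchOutcome {
    /// The batch was accepted; move on to the next one.
    Sent,
    /// The batch was rejected permanently; drop it and move on.
    Dropped,
    /// The batch hit a transient failure; keep it for retry and move on.
    RetryLater,
    /// The destination is unusable; keep the batch for retry and stop flushing.
    Abort,
}

/// Stand-in for the real error type; these variants are assumptions.
#[derive(Debug)]
enum SendError {
    Rejected,
    Transient,
    Unreachable,
}

/// Maps a send result onto a single, named outcome.
fn classify(resp: Result<(), SendError>) -> BatchOutcome {
    match resp {
        Ok(()) => BatchOutcome::Sent,
        Err(SendError::Rejected) => BatchOutcome::Dropped,
        Err(SendError::Transient) => BatchOutcome::RetryLater,
        Err(SendError::Unreachable) => BatchOutcome::Abort,
    }
}

impl BatchOutcome {
    /// Equivalent to the first element of the original tuple.
    fn continue_to_next_batch(self) -> bool {
        !matches!(self, BatchOutcome::Abort)
    }

    /// Equivalent to the second element of the original tuple.
    fn should_retry(self) -> bool {
        matches!(self, BatchOutcome::RetryLater | BatchOutcome::Abort)
    }
}
```

Compared with two positional bools, the variant names document each case at the call site, and a `match` on the enum forces every case to be handled.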

        .last_result
    {
        Ok(_) => debug!("Successfully flushed traces"),
        Err(e) => {

Is it possible to only clone it after it fails here? Or is coalesce directly taking ownership of it?
Yeah, unfortunately `coalesce_and_send` takes ownership. This can be improved inside libdatadog.
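
One way an ownership-taking send function can avoid forcing an up-front clone is to hand the payload back inside the error, as `std::sync::mpsc::Sender::send` does with its `SendError<T>`. The sketch below is hypothetical: `send_owned` and `SendFailed` are illustrative names and do not reflect the libdatadog API, and the `transport_ok` flag merely simulates a failing transport.

```rust
/// Hypothetical error type that returns the unsent payload to the caller.
#[derive(Debug, PartialEq)]
struct SendFailed<T> {
    reason: String,
    payload: Vec<T>,
}

/// Takes ownership of `payload`. On success it reports how many items were
/// sent; on failure it moves the payload back out through the error, so the
/// caller never has to clone before sending.
fn send_owned<T>(payload: Vec<T>, transport_ok: bool) -> Result<usize, SendFailed<T>> {
    if transport_ok {
        Ok(payload.len())
    } else {
        Err(SendFailed {
            reason: "transport unavailable".to_string(),
            payload,
        })
    }
}
```

With this shape, the caller recovers the data from `Err(e)` via `e.payload` and can queue it for the next flush without paying for a clone on the success path.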
What does this PR do?
Modifies the trace flusher and the metric flusher to return the `Vec`s of unsent data, so the caller can resubmit them after an intermittent failure caused by the start/stop behavior of serverless functions.
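
The redrive pattern described here can be sketched as follows. This is an illustration under assumptions, not the PR's code: `Flusher`, its `fail_next` switch (which simulates one failed request), and the `String` payloads are all hypothetical, and the real flushers are async.

```rust
/// Hypothetical flusher that hands unsent data back to its caller.
struct Flusher {
    /// Simulates a single intermittent failure on the next flush.
    fail_next: bool,
    /// Records what was "sent" so the behavior can be observed.
    sent: Vec<String>,
}

impl Flusher {
    /// Sends previously failed data (if any) followed by fresh data.
    /// Returns `None` when everything was sent, or `Some(unsent)` so the
    /// caller can pass it back in on a later flush.
    fn flush(&mut self, retry: Option<Vec<String>>, fresh: Vec<String>) -> Option<Vec<String>> {
        let mut batch = retry.unwrap_or_default();
        batch.extend(fresh);
        if batch.is_empty() {
            return None;
        }
        if self.fail_next {
            // The request failed: return the data instead of dropping it.
            self.fail_next = false;
            return Some(batch);
        }
        self.sent.extend(batch);
        None
    }
}
```

The caller keeps the returned `Option` between invocations and passes it to the next `flush`, which lets the data survive a freeze/thaw cycle instead of being lost inside an in-process retry loop.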
Motivation
Allows us to redrive failed data on a later flush, outside of a typical in-process retry loop.