Add retry logic to trace_utils::SendData#433

Merged
ekump merged 6 commits into main from
ekump/APMSP-1020-sidecar-trace-export-retry-logic
May 29, 2024

Contributor

@ekump ekump commented May 17, 2024

What does this PR do?

Introduces retry logic for sending traces in trace_utils::SendData. The retry strategy is configurable via the RetryStrategy struct. SendData uses a default strategy, but callers can configure it via the SendData.set_retry_strategy() function. TraceExporter, Mini-agent, and the sidecar will use the default retry strategy.
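
To make the description concrete, here is a minimal sketch of how a configurable strategy with a default could hang together. Only `RetryStrategy`, `RetryBackoffType`, `delay_ms`, `backoff_type`, the optional jitter, the `Exponential` name, and `set_retry_strategy()` appear in this PR. The `max_retries` field, the `Constant` and `Linear` variants, the `retry_strategy()` accessor, the default values, and the stripped-down `SendData` are assumptions for illustration, not the merged code.

```rust
use std::time::Duration;

/// How the delay grows between attempts. Only `Exponential` is named in the
/// PR discussion; the other variants are assumed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RetryBackoffType {
    Constant,
    Linear,
    Exponential,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RetryStrategy {
    /// Assumed field: how many times to retry before giving up.
    pub max_retries: u32,
    /// Base delay between retries.
    pub delay_ms: Duration,
    /// The type of backoff to use for the delay between retries.
    pub backoff_type: RetryBackoffType,
    /// An optional jitter to add randomness to the delay.
    pub jitter: Option<Duration>,
}

impl Default for RetryStrategy {
    fn default() -> Self {
        // Placeholder values chosen for the example, not the real defaults.
        RetryStrategy {
            max_retries: 5,
            delay_ms: Duration::from_millis(100),
            backoff_type: RetryBackoffType::Exponential,
            jitter: None,
        }
    }
}

/// Stripped-down stand-in for trace_utils::SendData (payload fields omitted).
pub struct SendData {
    retry_strategy: RetryStrategy,
}

impl SendData {
    pub fn new() -> Self {
        SendData { retry_strategy: RetryStrategy::default() }
    }

    /// Callers that need different behavior override the default here.
    pub fn set_retry_strategy(&mut self, strategy: RetryStrategy) {
        self.retry_strategy = strategy;
    }

    /// Hypothetical accessor, added only so the example can be inspected.
    pub fn retry_strategy(&self) -> &RetryStrategy {
        &self.retry_strategy
    }
}
```

In this shape, callers such as the sidecar would construct `SendData` and never touch the strategy, while a caller with stricter latency needs would call `set_retry_strategy()` once before sending.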

Motivation

In order to turn the sidecar on by default for PHP, there needs to be support for retries when flushing traces fails. Adding the retry logic to trace_utils made the most sense, and it can also be used outside of the sidecar.

Additional Notes

The RetryStrategy struct should live in another file. And some of the test helper functions should move to a common place. But, there are other PRs in flight that will be modifying send_data.rs. So we can do that work in a follow-up PR to minimize disruption. TODO comments with Jira IDs have been added to the code in these spots.

How to test the change?

Unit tests have been added. Longer-term we may want to consider using the test-agent.

codecov-commenter commented May 17, 2024

Codecov Report

❌ Patch coverage is 98.61660% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.56%. Comparing base (b462829) to head (7049170).
⚠️ Report is 1147 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #433      +/-   ##
==========================================
+ Coverage   67.88%   68.56%   +0.68%     
==========================================
  Files         193      193              
  Lines       24643    25086     +443     
==========================================
+ Hits        16728    17201     +473     
+ Misses       7915     7885      -30     
Components Coverage Δ
crashtracker 19.34% <ø> (ø)
datadog-alloc 98.76% <ø> (ø)
data-pipeline 51.45% <ø> (ø)
data-pipeline-ffi 0.00% <ø> (ø)
ddcommon 85.24% <ø> (ø)
ddcommon-ffi 74.93% <ø> (ø)
ddtelemetry 56.09% <ø> (ø)
ipc 81.69% <ø> (ø)
profiling 77.98% <ø> (ø)
profiling-ffi 60.05% <ø> (ø)
serverless 0.00% <ø> (ø)
sidecar 37.89% <100.00%> (ø)
sidecar-ffi 0.00% <ø> (ø)
spawn-worker 54.98% <ø> (ø)
trace-mini-agent 69.12% <ø> (ø)
trace-normalization 97.79% <ø> (ø)
trace-obfuscation 95.74% <ø> (ø)
trace-protobuf 30.76% <ø> (ø)
trace-utils 90.00% <98.59%> (+10.34%) ⬆️

@ekump ekump force-pushed the ekump/APMSP-1020-sidecar-trace-export-retry-logic branch 4 times, most recently from 4e750da to 3c993d6 Compare May 21, 2024 20:18
@ekump ekump force-pushed the ekump/APMSP-1020-sidecar-trace-export-retry-logic branch 4 times, most recently from 7a45559 to 87f7441 Compare May 24, 2024 00:42
@ekump ekump changed the title WIP - Add retry logic to sidecar trace flushing Add retry logic to trace_utils::SendData May 24, 2024
@ekump ekump force-pushed the ekump/APMSP-1020-sidecar-trace-export-retry-logic branch from 87f7441 to 7bca31b Compare May 24, 2024 00:51
@ekump ekump marked this pull request as ready for review May 24, 2024 00:58
@ekump ekump requested review from a team as code owners May 24, 2024 00:58
@ekump ekump force-pushed the ekump/APMSP-1020-sidecar-trace-export-retry-logic branch 2 times, most recently from 9b363c5 to f0dd686 Compare May 24, 2024 01:31
Contributor

@pierotibou pierotibou left a comment

Thanks a lot. Nice and simple solution!

pub delay_ms: Duration,
/// The type of backoff to use for the delay between retries.
pub backoff_type: RetryBackoffType,
/// An optional jitter to add randomness to the delay.
Contributor

Do we really need the jitter? I don't mind leaving it, but I'm a big fan of simplicity and of implementing only what we need at a given time (I have seen too many "vital" features that nobody was using :p)

Contributor Author

I can't say with confidence that we need jitter. Perhaps someone from the trace-agent team would have a better sense of how important it is. It was a suggestion made in a libdatadog weekly sync and was simple enough to implement.

There is nothing blocking the sidecar from continually flushing during connectivity issues, so it is conceivable that there are several SendData objects attempting retries concurrently and having jitter in place may help with this.

I'm happy to defer to others on this and remove it if we think it's overkill.
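
For readers weighing the jitter question, this is roughly what the feature amounts to: a small random amount, bounded by the configured jitter, added to each delay so that concurrent `SendData` retries do not all fire at the same instant. The function names and the use of `RandomState` as a dependency-free randomness source are assumptions for this sketch; the merged code may draw its randomness differently.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::time::Duration;

/// Adds between 0 and `jitter` milliseconds (inclusive) to `base`.
/// `sample` is any random-looking u64; passing it in keeps the function
/// deterministic and easy to test.
pub fn jittered_delay(base: Duration, jitter: Option<Duration>, sample: u64) -> Duration {
    match jitter {
        Some(j) if !j.is_zero() => {
            let span_ms = j.as_millis() as u64;
            // Modulo (span + 1) maps the sample into 0..=span_ms.
            let extra_ms = sample % span_ms.saturating_add(1);
            base + Duration::from_millis(extra_ms)
        }
        // No jitter configured: the delay is exactly the base delay.
        _ => base,
    }
}

/// Stand-in randomness source using only the standard library.
/// Each `RandomState` is seeded with different keys, so hashing nothing
/// still yields a varying value. A real implementation would more likely
/// use a proper RNG.
pub fn random_sample() -> u64 {
    RandomState::new().build_hasher().finish()
}
```

With a 100 ms base and a 50 ms jitter, two senders that fail at the same moment would wait somewhere between 100 ms and 150 ms each, rather than both waiting exactly 100 ms.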

Contributor

@pierotibou pierotibou May 28, 2024

You can leave it for now anyway and we can reconsider later.

As for the trace agent, it has a backoff mechanism that no tracer implements yet, and we have a ticket for it.

Comment thread trace-utils/src/send_data.rs Outdated
retry_strategy.delay(2).await;
let elapsed = start.elapsed();

// For the Exponential strategy, the delay for the second attempt should be double the
Contributor

nitpicking - we should maybe add one other iteration as double is also the case for the double strategy, to make sure we don't mix up anything.

Contributor Author

Good call!
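
The point of the extra iteration can be seen with a little arithmetic. Assuming the other growing strategy scales linearly with the attempt number and the exponential one doubles per attempt (both formulas, and the `Constant` and `Linear` variant names, are assumptions for this sketch, not taken from the merged code), the two coincide on the second attempt and only diverge on the third:

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RetryBackoffType {
    Constant,
    Linear,
    Exponential,
}

/// Delay before retry number `attempt` (1-based), without jitter.
pub fn delay_for_attempt(base: Duration, backoff: RetryBackoffType, attempt: u32) -> Duration {
    match backoff {
        // Same delay every time.
        RetryBackoffType::Constant => base,
        // base * attempt: 1x, 2x, 3x, ...
        RetryBackoffType::Linear => base.saturating_mul(attempt),
        // base * 2^(attempt - 1): 1x, 2x, 4x, ...
        RetryBackoffType::Exponential => {
            let factor = 2u32.saturating_pow(attempt.saturating_sub(1));
            base.saturating_mul(factor)
        }
    }
}
```

With a 100 ms base, attempt 2 yields 200 ms under both growing strategies, while attempt 3 yields 300 ms for the linear one and 400 ms for the exponential one. A test that stops at attempt 2 therefore cannot tell the two apart.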

// result. Otherwise, delay and try again.
match &result {
Ok(response) => {
if response.status().is_client_error() || response.status().is_server_error() {
Contributor

As you asked in standup, this is good for now. We have one story - APMSP-1054 - to improve this down the road (as nobody implemented it yet)
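
For context, the behavior under discussion (retry on any 4xx or 5xx response, as well as on a transport error) can be sketched as a plain loop. This is a synchronous, dependency-free approximation: status codes are bare `u16` values standing in for the HTTP response, `send_with_retry` and `is_retryable_status` are hypothetical names, and treating `max_retries` as the number of retries after the first attempt is an assumption.

```rust
/// Stand-in for `status.is_client_error() || status.is_server_error()`.
fn is_retryable_status(status: u16) -> bool {
    (400..600).contains(&status)
}

/// Calls `send` until it returns a non-error status or retries run out.
/// `delay` is invoked with the attempt number before each retry, so the
/// caller decides how long to wait (and tests do not have to sleep).
pub fn send_with_retry<S, D>(max_retries: u32, mut send: S, mut delay: D) -> Result<u16, String>
where
    S: FnMut() -> Result<u16, String>,
    D: FnMut(u32),
{
    let mut attempt = 0;
    loop {
        attempt += 1;
        let result = send();
        // Retry on transport errors and on any client or server error status.
        let should_retry = match &result {
            Ok(status) => is_retryable_status(*status),
            Err(_) => true,
        };
        // Return the last result once it succeeds or retries are exhausted.
        if !should_retry || attempt > max_retries {
            return result;
        }
        delay(attempt);
    }
}
```

Retrying every 4xx is deliberately coarse: a request rejected for a malformed payload will fail the same way each time, which is the kind of refinement a follow-up could address.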

Contributor

@bantonsson bantonsson left a comment

Some general perf comments. I'll continue looking at it.

Comment thread trace-utils/src/send_data.rs Outdated
Comment thread trace-utils/src/send_data.rs Outdated
Comment thread trace-utils/src/send_data.rs

if target.api_key.is_some() {
req = req.header("Content-type", "application/x-protobuf");
let agent_payload = construct_agent_payload(self.tracer_payloads.clone());
Contributor

I can't figure out a way to avoid this clone 😢 The raw mapping of proto definitions to Rust structs makes this messy.

ekump added 3 commits May 28, 2024 16:21
Adding retry logic as-is would be a bit problematic. Hyper doesn't allow
the use of references and we can't clone request objects so we need to
refactor where we construct the objects to support retries.
Add RetryStrategy to send_data to support retrying requests.
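
The refactor described in these commit messages follows a common pattern: because a request is consumed when it is sent and cannot be cloned, the owned inputs are kept around and a fresh request is built for every attempt. The sketch below uses hypothetical stand-in types (`Request`, `RequestTemplate`, `send`) rather than Hyper's own, so it shows only the shape of the idea.

```rust
/// Hypothetical stand-in for an HTTP request: not `Clone`, and consumed by `send`.
pub struct Request {
    pub uri: String,
    pub headers: Vec<(String, String)>,
    pub body: Vec<u8>,
}

/// Owned inputs that outlive any single attempt.
pub struct RequestTemplate {
    pub uri: String,
    pub headers: Vec<(String, String)>,
    pub payload: Vec<u8>,
}

impl RequestTemplate {
    /// Builds a brand-new request for one attempt. The payload is copied
    /// each time, which is the cost of not being able to reuse a request.
    pub fn build(&self) -> Request {
        Request {
            uri: self.uri.clone(),
            headers: self.headers.clone(),
            body: self.payload.clone(),
        }
    }
}

/// Stand-in for a client call that takes ownership of the request.
/// Returns the number of body bytes "sent".
pub fn send(req: Request) -> usize {
    req.body.len()
}
```

Each retry would then call `template.build()` again instead of trying to reuse the request from the previous attempt.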
@ekump ekump force-pushed the ekump/APMSP-1020-sidecar-trace-export-retry-logic branch from f0dd686 to d37f762 Compare May 28, 2024 20:33
@datadog-datadog-prod-us1
Contributor

Software Composition Analysis

✅ No library vulnerabilities found (compared d37f762 against b462829).

@ekump ekump force-pushed the ekump/APMSP-1020-sidecar-trace-export-retry-logic branch from d37f762 to 7b7225c Compare May 28, 2024 22:52
Contributor

@bantonsson bantonsson left a comment

Thanks for the great tests. This looks good, but I'm confused by the RetryBackoffType naming.

Comment thread trace-utils/src/send_data.rs Outdated
@ekump ekump merged commit de8d3b6 into main May 29, 2024
@ekump ekump deleted the ekump/APMSP-1020-sidecar-trace-export-retry-logic branch May 29, 2024 14:25
4 participants