
feat: [Trace Stats] Move stats generation after trace obfuscation #855

Merged: lym953 merged 12 commits into main from yiming.luo/trace-stats-6 on Sep 22, 2025
Conversation

@lym953 (Contributor) commented Sep 19, 2025

This PR:

  1. Moves stats generation after trace obfuscation, which is the correct order as suggested by the Trace Agent team. Currently, stats generation happens before trace obfuscation.
  2. Generates trace stats for the OTLP agent as well. Currently we only do this for the trace agent.
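The reordering in point 1 can be sketched as follows. This is a hypothetical simplification, not the bottlecap code: `obfuscate`, `generate_stats`, and `process_traces` are stand-in names, and the "obfuscation" is a toy string replacement.

```rust
// Stand-in for real trace obfuscation (e.g. scrubbing sensitive literals).
fn obfuscate(trace: &str) -> String {
    trace.replace("secret", "?")
}

// Stand-in for stats generation: here, just a count over processed traces.
fn generate_stats(processed: &[String]) -> usize {
    processed.len()
}

// The pipeline: obfuscate first, then compute stats from the
// already-obfuscated (processed) traces, per this PR's ordering.
fn process_traces(raw: &[&str]) -> (Vec<String>, usize) {
    let processed: Vec<String> = raw.iter().map(|t| obfuscate(t)).collect();
    let stats = generate_stats(&processed);
    (processed, stats)
}

fn main() {
    let (processed, stats) = process_traces(&["SELECT secret", "GET /home"]);
    assert_eq!(processed[0], "SELECT ?");
    assert_eq!(stats, 2);
    println!("ok");
}
```

The key point is that stats are derived from the processed traces, so they reflect what is actually sent to the backend.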

Architecture

Copied from #842
[image: architecture diagram]

Testing

Tested in the follow-up PR #856, which implements the stats concentrator. Trace stats appeared in Datadog.

[image: screenshot of trace stats in Datadog]

Next steps

  1. Implement StatsConcentrator
  2. Rename for clarity:
     • SendingTraceStatsProcessor -> TraceStatsGenerator
     • stats_sender -> stats_generator
  3. Small refactor: consider passing around stats_sender instead of stats_concentrator_handle. Right now SendingTraceStatsProcessor::new() is called in three places; it might be possible to call it only once and then pass it around.
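The refactor in step 3 can be sketched with a standard-library channel. This is a hypothetical illustration of the pattern only: `StatsSender` here is a toy type, not the bottlecap one, and the real code would wrap the concentrator handle rather than a plain `mpsc::Sender<String>`.

```rust
use std::sync::mpsc;

// Toy stand-in for the stats sender: cheap to clone because the
// underlying mpsc::Sender is cheap to clone.
#[derive(Clone)]
struct StatsSender {
    tx: mpsc::Sender<String>,
}

impl StatsSender {
    fn send(&self, payload: &str) -> Result<(), mpsc::SendError<String>> {
        self.tx.send(payload.to_string())
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Construct the sender once...
    let sender = StatsSender { tx };
    // ...then clone it into each consumer instead of re-constructing it.
    let for_trace_agent = sender.clone();
    let for_otlp_agent = sender.clone();
    for_trace_agent.send("trace stats").unwrap();
    for_otlp_agent.send("otlp stats").unwrap();
    assert_eq!(rx.iter().take(2).count(), 2);
    println!("ok");
}
```

Cloning a channel sender is a pointer-sized copy, so constructing once and cloning avoids repeating the setup in three call sites.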

Notes

Jira: https://datadoghq.atlassian.net/browse/SVLS-7593

Comment thread: bottlecap/src/otlp/agent.rs (outdated), lines +201 to +207
if compute_trace_stats {
    if let Err(err) = stats_sender.send(&processed_traces) {
        error!("OTLP | Error sending traces to the stats concentrator: {err}");
        return (
            StatusCode::INTERNAL_SERVER_ERROR,
            json!({ "message": format!("Error sending traces to the stats concentrator: {err}") }).to_string()
        ).into_response();
    }
}
lym953 (author):
Core change: Add a stats generation hook in OTLP agent.

Contributor:
Same comment as for traces

Comment on lines -526 to -534
if config.compute_trace_stats {
    if let Err(err) = stats_sender.send(&traces) {
        return error_response(
            StatusCode::INTERNAL_SERVER_ERROR,
            format!("Error sending stats to the stats aggregator: {err}"),
        );
    }
}

lym953 (author):
Moving this into send_processed_traces() below

Comment on lines +474 to +479
if config.compute_trace_stats {
    if let Err(err) = self.stats_sender.send(&processed_traces) {
        error!("TRACE_PROCESSOR | Error sending traces to the stats concentrator: {err}");
        return Err(SendingTraceProcessorError::SendStatsError(err));
    }
}
lym953 (author):
Core change: this is moved into send_processed_traces() after obfuscation is done.

Contributor:
Should we allow processing to continue even if trace stats forwarding fails? Wouldn't it be better to keep forwarding some data?

lym953 (author):
I think the fundamental question is whether trace stats are as important as the traces themselves. If yes, then when stats fail to send, we should probably return an error and let the caller handle that case. Otherwise, we can return Ok in that case. It seems you think stats are less important than traces, so let me swallow the error.

Contributor:
I think this would directly impede sending traces. I wouldn't expect this error to happen often, but ideally we'd like to have some data as opposed to none.

I'm not sure if the tracer would send the data back if we respond with an error; is that the case?

Because this error, in theory, is not related to any Datadog logic, just a failed channel forward.

lym953 (author):

> we'd like to have some data as opposed to none

What do you mean by "have some data"? Do you mean traces should still be sent to Datadog even if stats fail to send?

> I'm not sure if the tracer would send the data back if we respond with an error, is that the case?

What do you mean by "send the data back"? I'm not sure how tracers work, but if we think this error is critical, the extension should surface it to the caller.

@duncanista (Contributor) commented Sep 22, 2025:

> What do you mean by "have some data"? Do you mean traces should still be sent to Datadog even if stats fail to send?

Correct, as before. WDYT?

> What do you mean by "send the data back"?

As in: if you reply to the tracer with a 400/500 status, will it re-send the trace payload we failed to process?

> I'm not sure how tracers work, but if we think this error is critical, the extension should surface it to the caller.

I agree, but in this case it's not a processing error; it's more of a critical error in the extension, right? So we could say it's not the tracer's fault, but ours(?)

lym953 (author):

> As in: if you reply to the tracer with a 400/500 status, will it re-send the trace payload we failed to process?

I spot-checked dd-trace-py. It doesn't retry as long as it gets a valid response, even if the status is 400/500.

I'm okay either way. I pushed a commit to swallow and log the error. Could you review it?
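The agreed swallow-and-log behavior can be sketched as follows. This is a hypothetical stand-in, not the pushed commit: `send_stats` and `process` are toy functions, and the log line mimics the format used elsewhere in this PR.

```rust
// Toy stand-in for a stats send that may fail (e.g. a closed channel).
fn send_stats(ok: bool) -> Result<(), String> {
    if ok { Ok(()) } else { Err("channel closed".to_string()) }
}

// Swallow-and-log: a failed stats send is logged, but trace forwarding
// continues and the function still returns success to the caller.
fn process(ok: bool) -> &'static str {
    if let Err(err) = send_stats(ok) {
        eprintln!("TRACE_PROCESSOR | Error sending traces to the stats concentrator: {err}");
    }
    "traces forwarded"
}

fn main() {
    // Even when the stats send fails, traces are still forwarded.
    assert_eq!(process(false), "traces forwarded");
    assert_eq!(process(true), "traces forwarded");
    println!("ok");
}
```

This matches the reviewer's preference: a stats-path failure should degrade to "traces without stats" rather than an error response to the tracer.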

#[allow(clippy::too_many_arguments)]
#[async_trait]
pub trait TraceProcessor {
    async fn process_traces(
lym953 (author):

This doesn't need to be async.
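If the method does no awaiting, the `async_trait` wrapper can be dropped and the method made synchronous. A minimal sketch of what that looks like, with hypothetical names and a toy body (the real `process_traces` has a different signature):

```rust
// A plain (non-async) trait: no #[async_trait] attribute needed when
// the method body never awaits anything.
pub trait TraceProcessor {
    fn process_traces(&self, traces: Vec<String>) -> Vec<String>;
}

struct Obfuscator;

impl TraceProcessor for Obfuscator {
    fn process_traces(&self, traces: Vec<String>) -> Vec<String> {
        // Toy stand-in for trace processing.
        traces.into_iter().map(|t| t.replace("secret", "?")).collect()
    }
}

fn main() {
    let out = Obfuscator.process_traces(vec!["a secret".to_string()]);
    assert_eq!(out[0], "a ?");
    println!("ok");
}
```

Dropping `async` also removes the boxed-future overhead that `#[async_trait]` introduces for each call.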

self.stats_concentrator.add(stats)?;

pub fn send(
    &self,
    traces: &TracerPayloadCollection,
lym953 (author):

Generating stats from processed traces instead of raw traces

};

- let builder = SendDataBuilder::new(body_size, payload, header_tags, &endpoint)
+ let builder = SendDataBuilder::new(body_size, payload.clone(), header_tags, &endpoint)
lym953 (author):

I have to clone the processed trace payload so I can return it to generate stats. Let me know if you have a more efficient approach.
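One possible alternative to cloning the payload is to share it behind an `Arc`, so the builder and the stats path both hold cheap references. Whether `SendDataBuilder` can accept a shared payload depends on its API; this sketch only illustrates the pattern with toy stand-in functions.

```rust
use std::sync::Arc;

// Stand-in for the builder consuming the payload.
fn build_send_data(payload: Arc<Vec<u8>>) -> usize {
    payload.len()
}

// Stand-in for stats generation over the same bytes.
fn generate_stats(payload: &Arc<Vec<u8>>) -> usize {
    payload.len()
}

fn main() {
    let payload = Arc::new(vec![1u8, 2, 3]);
    // Arc::clone copies a pointer, not the payload bytes.
    let size = build_send_data(Arc::clone(&payload));
    let stats = generate_stats(&payload);
    assert_eq!(size, 3);
    assert_eq!(stats, 3);
    println!("ok");
}
```

The trade-off: a deep clone is simple and keeps ownership clear, while `Arc` avoids the copy at the cost of threading a shared pointer through the call sites.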

@lym953 lym953 marked this pull request as ready for review September 19, 2025 20:14
@lym953 lym953 requested a review from a team as a code owner September 19, 2025 20:14
}

// Extracts information from traces related to stats and sends it to the stats concentrator
impl SendingTraceStatsProcessor {
Contributor:

The SendingTraceStatsProcessor name confuses me as to what it does.

@lym953 (author) commented Sep 22, 2025:

It's modified from SendingTraceProcessor. As the PR summary says, I plan to rename it to TraceStatsGenerator. Does this sound good to you?

@lym953 lym953 merged commit 13c8b7b into main Sep 22, 2025
46 checks passed
@lym953 lym953 deleted the yiming.luo/trace-stats-6 branch September 22, 2025 20:02
