feat: [Trace Stats] Move stats generation after trace obfuscation (#855)
Conversation
```rust
if compute_trace_stats {
    if let Err(err) = stats_sender.send(&processed_traces) {
        error!("OTLP | Error sending traces to the stats concentrator: {err}");
        return (
            StatusCode::INTERNAL_SERVER_ERROR,
            json!({ "message": format!("Error sending traces to the stats concentrator: {err}") }).to_string()
        ).into_response();
```
Core change: Add a stats generation hook in OTLP agent.
Same comment as for traces
```rust
if config.compute_trace_stats {
    if let Err(err) = stats_sender.send(&traces) {
        return error_response(
            StatusCode::INTERNAL_SERVER_ERROR,
            format!("Error sending stats to the stats aggregator: {err}"),
        );
    }
}
```
Moving this into `send_processed_traces()` below
```rust
if config.compute_trace_stats {
    if let Err(err) = self.stats_sender.send(&processed_traces) {
        error!("TRACE_PROCESSOR | Error sending traces to the stats concentrator: {err}");
        return Err(SendingTraceProcessorError::SendStatsError(err));
    }
}
```
Core change: this is moved into `send_processed_traces()` after obfuscation is done.
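A minimal sketch of the reordering (types and names here are illustrative stand-ins, not the crate's actual API): obfuscation runs first, and stats are generated from the already-obfuscated payload rather than from raw traces.

```rust
// Hypothetical trace type standing in for the real payload.
#[derive(Clone, Debug)]
struct Trace(String);

// Stand-in for the real obfuscation pass (e.g. scrubbing sensitive literals).
fn obfuscate(traces: Vec<Trace>) -> Vec<Trace> {
    traces
        .into_iter()
        .map(|t| Trace(t.0.replace("secret", "?")))
        .collect()
}

// Stand-in for the stats concentrator: just counts spans here.
fn generate_stats(traces: &[Trace]) -> usize {
    traces.len()
}

// Sketch of the new ordering: stats are computed *after* obfuscation,
// so they never observe un-obfuscated values.
fn send_processed_traces(raw: Vec<Trace>) -> (Vec<Trace>, usize) {
    let processed = obfuscate(raw);
    let stats = generate_stats(&processed);
    (processed, stats)
}

fn main() {
    let (processed, stats) = send_processed_traces(vec![Trace("SELECT secret".to_string())]);
    assert_eq!(processed[0].0, "SELECT ?");
    assert_eq!(stats, 1);
    println!("stats generated from obfuscated traces");
}
```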
Should we allow processing to continue even if trace stats forwarding failed? Wouldn't it be better to keep forwarding at least some data?
I think the fundamental question is whether trace stats are as important as the traces themselves. If yes, then when stats fail to send, we should return an error and let the caller handle that case; otherwise we can return Ok. It seems you think stats are less important than traces, so let me swallow the error then.
As written, I think this would directly impede sending traces. I wouldn't expect this error to happen often, but ideally we'd like to have some data as opposed to none.
I'm not sure if the tracer would send the data back if we respond with an error; is that the case?
Because this error, in theory, is not related to any Datadog logic, just a failed channel forwarding.
> we'd like to have some data as opposed to none

What do you mean by "have some data"? Do you mean traces should still be sent to Datadog even if stats fail to send?

> I'm not sure if the tracer would send the data back if we respond with an error, is that the case?

What do you mean by "send the data back"?
I'm not sure how tracers work, but if we think this error is critical, the extension should surface it to the caller.
> What do you mean by "have some data"? Do you mean traces should still be sent to Datadog even if stats fail to send?

Correct, as before. WDYT?

> What do you mean by "send the data back"?

As in, if you reply back to the tracer with a 400/500 status, will it re-send the tracer payload we failed to process?

> I'm not sure how tracers work, but if we think this error is critical, the extension should surface it to the caller.

I agree, but in this case it's not a processing error; it's more of a critical error in the extension, right? So we could say it's not the tracer's fault, but ours(?)
> As in, if you reply back to the tracer with a 400/500 status, will it re-send the tracer payload we failed to process?
I spot checked dd-trace-py. It doesn't retry as long as it gets a valid response, even if the status is 400/500.
I'm okay either way. I pushed a commit to swallow and log the error. Could you review it?
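A minimal sketch of the swallow-and-log behavior using a plain `std::sync::mpsc` channel (the real code uses the extension's own sender and the `error!` macro; the types here are stand-ins):

```rust
use std::sync::mpsc;

// Hypothetical stand-in for the extension's processed-trace payload.
#[derive(Clone, Debug)]
struct ProcessedTraces(Vec<String>);

// Swallow-and-log: a send failure on the stats channel is logged
// but does NOT abort trace forwarding.
fn send_stats_best_effort(
    stats_sender: &mpsc::Sender<ProcessedTraces>,
    processed_traces: &ProcessedTraces,
) {
    if let Err(err) = stats_sender.send(processed_traces.clone()) {
        // The real code would use the `error!` logging macro here.
        eprintln!("TRACE_PROCESSOR | Error sending traces to the stats concentrator: {err}");
        // No early return: traces are still forwarded to Datadog.
    }
}

fn main() {
    let traces = ProcessedTraces(vec!["span".to_string()]);

    // Dropping the receiver makes every send fail, simulating a dead concentrator.
    let (tx, rx) = mpsc::channel::<ProcessedTraces>();
    drop(rx);

    send_stats_best_effort(&tx, &traces); // logs the error, does not panic
    println!("traces still forwarded: {}", traces.0.len());
}
```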
```rust
#[allow(clippy::too_many_arguments)]
#[async_trait]
pub trait TraceProcessor {
    async fn process_traces(
```
This doesn't need to be async.
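A minimal sketch of what dropping `async` would look like, assuming the method does no awaiting internally (names and signature are illustrative, not the trait's real definition):

```rust
// Plain sync trait: no #[async_trait] attribute or async fn needed
// when the method body never awaits.
trait TraceProcessor {
    fn process_traces(&self, traces: Vec<String>) -> Vec<String>;
}

struct Passthrough;

impl TraceProcessor for Passthrough {
    fn process_traces(&self, traces: Vec<String>) -> Vec<String> {
        // Stand-in for the real processing logic.
        traces
    }
}

fn main() {
    let p = Passthrough;
    let out = p.process_traces(vec!["span".to_string()]);
    assert_eq!(out.len(), 1);
    println!("sync trait method works without async_trait");
}
```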
```rust
self.stats_concentrator.add(stats)?;

pub fn send(
    &self,
    traces: &TracerPayloadCollection,
```
Generating stats from processed traces instead of raw traces
```diff
 };
-let builder = SendDataBuilder::new(body_size, payload, header_tags, &endpoint)
+let builder = SendDataBuilder::new(body_size, payload.clone(), header_tags, &endpoint)
```
I have to clone the processed trace payload so I can return it to generate stats. Let me know if you have a more efficient approach.
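One possible alternative to a deep `clone()`, assuming `SendDataBuilder` could be made to accept a shared handle (an assumption, not the current API): put the payload behind an `Arc`, so the builder and the stats path each hold a cheap reference-counted copy instead of duplicating the data.

```rust
use std::sync::Arc;

// Hypothetical payload type; the real SendDataBuilder takes the payload by value.
#[derive(Debug)]
struct Payload(Vec<u8>);

fn main() {
    let payload = Arc::new(Payload(vec![1, 2, 3]));

    // Arc::clone copies a pointer and bumps a refcount; the bytes are shared.
    let for_builder = Arc::clone(&payload); // would go to SendDataBuilder
    let for_stats = Arc::clone(&payload);   // would go to stats generation

    assert_eq!(for_builder.0.len(), 3);
    assert_eq!(for_stats.0.len(), 3);
    assert_eq!(Arc::strong_count(&payload), 3);
    println!("payload shared without a deep copy");
}
```

Whether this is worth it depends on payload size and how invasive the `Arc` change would be to the builder's signature.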
```rust
}

// Extracts information from traces related to stats and sends it to the stats concentrator
impl SendingTraceStatsProcessor {
```
The name `SendingTraceStatsProcessor` confuses me about what it does.
It's modified from `SendingTraceProcessor`. As the PR summary says, I plan to rename it to `TraceStatsGenerator`. Does this sound good to you?
This PR
Architecture
Copied from #842
Testing
Tested in the next PR #856, which implements stats concentrator. Trace stats appeared in Datadog.
Next steps
- Implement `StatsConcentrator`
- Rename `SendingTraceStatsProcessor` -> `TraceStatsGenerator`
- Rename `stats_sender` -> `stats_generator`
- Use `stats_sender` instead of `stats_concentrator_handle`. Right now `SendingTraceStatsProcessor::new()` is called in three places; it might be possible to call it only once and then pass it around.

Notes
Jira: https://datadoghq.atlassian.net/browse/SVLS-7593