feat: [Trace Stats] Implement stats concentrator #856
Conversation
## This PR

1. Move stats generation after trace obfuscation, which is the correct order as suggested by the Trace Agent team. Right now stats generation runs before trace obfuscation.
2. Also generate trace stats for the OTLP agent. Right now we only do it for the trace agent.

## Architecture

Copied from #842

<img width="1296" height="674" alt="image" src="https://github.com/user-attachments/assets/2d4cb925-6cfc-4581-8ed6-6bd87cf0d87a" />

## Testing

Tested in the next PR #856, which implements the stats concentrator. Trace stats appeared in Datadog.

<img width="538" height="317" alt="image" src="https://github.com/user-attachments/assets/48b849cc-2413-41d5-8576-5ff657c21a0f" />

## Next steps

1. Implement `StatsConcentrator`.
2. Rename for clarity:
   - `SendingTraceStatsProcessor` -> `TraceStatsGenerator`
   - `stats_sender` -> `stats_generator`
3. Small refactor: consider passing around `stats_sender` instead of `stats_concentrator_handle`. Right now `SendingTraceStatsProcessor::new()` is called in three places; it might be possible to call it only once and then pass it around.

## Notes

Jira: https://datadoghq.atlassian.net/browse/SVLS-7593
```rust
    stats: Stats,
) -> pb::ClientStatsPayload {
    pb::ClientStatsPayload {
        // TODO: handle this
```
Marking many fields with TODO so I can keep this PR small, iterate fast, and handle them in future PRs.
Some of them may need code changes, while others may only need investigation and an updated comment.
```rust
use std::{
    collections::HashMap,
    sync::Arc,
    time::{SystemTime, UNIX_EPOCH},
```
Should we switch to the tokio::time module to align more closely with Tokio's scheduling?
Could you elaborate?
For example, what's the problem if we use std::time?
Per tokio-rs/tokio#4633, it is better to use tokio's time package if tokio scheduling is involved.
Talked offline. We will keep using std::time because tokio::time is mainly for handling relative time intervals, while trace stats need to work with absolute timestamps.
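To illustrate the resolution above: the concentrator buckets stats by absolute unix time, which `std::time` provides directly, while `tokio::time::Instant` is monotonic and unrelated to the unix epoch. A minimal sketch (the helper name is hypothetical, not from this PR):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper: current absolute unix timestamp in nanoseconds.
/// tokio::time::Instant is monotonic and has no relation to the unix
/// epoch, so it cannot produce the bucket timestamps the stats payload needs.
fn unix_now_ns() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the unix epoch")
        .as_nanos() as u64
}

fn main() {
    // Any sane clock is well past 2020-01-01 (~1.577e18 ns).
    assert!(unix_now_ns() > 1_577_836_800_000_000_000);
}
```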
```rust
pub struct Stats {
    pub hits: i32,
    pub duration: i64, // in nanoseconds
    pub error: i32,
```
Does this int field stand for an error code or an error count? Can we clarify it in a comment or rename it?
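For illustration, here is a sketch of how the struct could document `error`, assuming it is a count of errored spans rather than an error code (that is the open question above); the `record` helper is hypothetical, not part of the PR:

```rust
/// Sketch of the per-key aggregate, assuming `error` is a count of
/// errored spans rather than an error code (the open review question).
#[derive(Default, Debug)]
pub struct Stats {
    pub hits: i32,
    pub duration: i64, // total span duration, in nanoseconds
    pub error: i32,    // number of spans that ended in error (a count, not a code)
}

impl Stats {
    /// Fold one finished span into the aggregate. (Hypothetical helper.)
    pub fn record(&mut self, duration_ns: i64, is_error: bool) {
        self.hits += 1;
        self.duration += duration_ns;
        if is_error {
            self.error += 1;
        }
    }
}

fn main() {
    let mut s = Stats::default();
    s.record(1_000, false);
    s.record(2_000, true);
    assert_eq!(s.hits, 2);
    assert_eq!(s.duration, 3_000);
    assert_eq!(s.error, 1);
}
```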
```rust
pub name: String,
// e.g. "my-lambda-function-name", "datadog_lambda.handler", "urllib.request"
pub resource: String,
// e.g. "aws.lambda.load", "aws.lambda.import"
```
Suggested change:
```diff
- // e.g. "aws.lambda.load", "aws.lambda.import"
+ // e.g. "serverless"
```
not sure if the comment is correct for `r#type`
You are right. Will fix.
```rust
// This is to reduce the chance of flushing stats that are still being collected, to save some cost.
const NO_FLUSH_BUCKET_COUNT: u64 = 2;

const S_TO_NS: u64 = 1_000_000_000;
```
probably already exists somewhere?
It only exists as an `f64`; I need to define another one as a `u64`.
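For context, a minimal sketch of how a `u64` `S_TO_NS` would be used to align timestamps to bucket starts, assuming 10-second buckets (the trace-agent default); `bucket_start_ns` is a hypothetical name, not from this PR:

```rust
const S_TO_NS: u64 = 1_000_000_000;
// Assumed bucket width; the Datadog trace-agent aggregates stats
// into 10-second buckets by default.
const BUCKET_DURATION_NS: u64 = 10 * S_TO_NS;

/// Align a unix-nanosecond timestamp down to the start of its bucket.
/// (Hypothetical helper, not from the PR.)
fn bucket_start_ns(ts_ns: u64) -> u64 {
    ts_ns - ts_ns % BUCKET_DURATION_NS
}

fn main() {
    assert_eq!(bucket_start_ns(25 * S_TO_NS), 20 * S_TO_NS);
    assert_eq!(bucket_start_ns(20 * S_TO_NS), 20 * S_TO_NS);
    assert_eq!(bucket_start_ns(25 * S_TO_NS + 1), 20 * S_TO_NS);
}
```

An integer modulo on `u64` nanoseconds avoids the rounding concerns an `f64` constant would introduce here.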
## This PR

Implements the trace stats concentrator, which aggregates trace stats by time slots and aggregation keys.

Now we have minimal working support for trace stats. You can enable it by setting the env var `DD_COMPUTE_TRACE_STATS` to `true`.

## Testing

Steps:
- `trace.aws.lambda.hits` trace metric

Result:
The trace metric is accurate (5000) for most of the runtimes, but there's undercounting for some runtimes. Will debug it as a next step.

Thanks @purple4reina for testing.

## Next steps

- … (`hits`) can be reported correctly.

## Note

Jira: https://datadoghq.atlassian.net/browse/SVLS-7593
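The aggregation this PR describes (trace stats grouped by time slot and aggregation key) can be sketched as nested maps. Everything below is an illustrative assumption, not the PR's actual code: `AggKey`, `Concentrator::add`, and `flush` are hypothetical names, and the real key likely carries more dimensions (service, type, status code, ...):

```rust
use std::collections::HashMap;

const S_TO_NS: u64 = 1_000_000_000;
const BUCKET_DURATION_NS: u64 = 10 * S_TO_NS; // assumed 10s buckets
const NO_FLUSH_BUCKET_COUNT: u64 = 2; // newest buckets are held back, per the PR

/// Hypothetical aggregation key.
#[derive(Hash, PartialEq, Eq, Clone)]
struct AggKey {
    name: String,
    resource: String,
}

#[derive(Default)]
struct Stats {
    hits: i32,
    duration: i64, // nanoseconds
    error: i32,
}

/// Sketch of the concentrator: bucket start (unix ns) -> key -> stats.
#[derive(Default)]
struct Concentrator {
    buckets: HashMap<u64, HashMap<AggKey, Stats>>,
}

impl Concentrator {
    fn add(&mut self, start_ns: u64, key: AggKey, duration_ns: i64, is_error: bool) {
        let bucket = start_ns - start_ns % BUCKET_DURATION_NS;
        let s = self.buckets.entry(bucket).or_default().entry(key).or_default();
        s.hits += 1;
        s.duration += duration_ns;
        if is_error {
            s.error += 1;
        }
    }

    /// Drain buckets old enough to flush, keeping the newest
    /// NO_FLUSH_BUCKET_COUNT buckets that may still receive spans.
    fn flush(&mut self, now_ns: u64) -> Vec<(u64, HashMap<AggKey, Stats>)> {
        let cutoff = (now_ns - now_ns % BUCKET_DURATION_NS)
            .saturating_sub(NO_FLUSH_BUCKET_COUNT * BUCKET_DURATION_NS);
        let ready: Vec<u64> = self.buckets.keys().copied().filter(|b| *b < cutoff).collect();
        ready
            .into_iter()
            .map(|b| (b, self.buckets.remove(&b).unwrap()))
            .collect()
    }
}

fn main() {
    let mut c = Concentrator::default();
    let key = AggKey { name: "aws.lambda".into(), resource: "handler".into() };
    c.add(75 * S_TO_NS, key.clone(), 1_000, false);
    c.add(95 * S_TO_NS, key.clone(), 2_000, true);
    // At t=100s the cutoff is 80s: the 70s bucket flushes, the 90s bucket is held.
    let flushed = c.flush(100 * S_TO_NS);
    assert_eq!(flushed.len(), 1);
    assert_eq!(flushed[0].0, 70 * S_TO_NS);
    assert_eq!(flushed[0].1[&key].hits, 1);
    assert_eq!(c.buckets.len(), 1);
}
```

Holding back the newest `NO_FLUSH_BUCKET_COUNT` buckets matches the PR's stated goal of not flushing stats that are still being collected.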