Skip to content

feat: [Trace Stats] Implement stats concentrator#856

Merged
lym953 merged 9 commits intomainfrom
yiming.luo/trace-stats-7
Sep 25, 2025
Merged

feat: [Trace Stats] Implement stats concentrator#856
lym953 merged 9 commits intomainfrom
yiming.luo/trace-stats-7

Conversation

@lym953
Copy link
Copy Markdown
Contributor

@lym953 lym953 commented Sep 19, 2025

This PR

Implements trace stats concentrator, which aggregates trace stats by time slots and aggregation keys.

Now we have minimal working support for trace stats. You can use it by setting env var DD_COMPUTE_TRACE_STATS to true.

Testing

Steps:

  1. Invoke a function 5000 times
  2. Check the sum of trace.aws.lambda.hits trace metric

Result

The trace metric is accurate (5000) for most of the runtimes, but there's undercounting for some runtimes. Will debug it as a next step.

  • dotnet6: 4999
  • dotnet8: 5000
  • golang: 4750
  • java 11: 5000
  • java 17: 5000
  • java21: 5000
  • node18: 5000
  • node20: 5000
  • node22: 5000
  • python39: 5000
  • python310: 5000
  • python311: 5000
  • python312: 5000
  • ruby32: 4753

Thanks @purple4reina for testing.

Next steps

  1. Investigate into the undercounting issue
  2. Handle the data fields marked as TODO right now, so the metrics can be grouped properly, and other metrics (other than hits) can be reported correctly.
  3. At the same time, evaluate whether we can reuse the existing Rust-based solution.

Note

Jira: https://datadoghq.atlassian.net/browse/SVLS-7593

lym953 added a commit that referenced this pull request Sep 22, 2025
## This PR
1. Move stats generation after trace obfuscation, which is the correct
order as suggested by Trace Agent team. Right now stats generation is
before trace obfuscation.
2. Also generate trace stats for OTLP agent. Right now we only do it for
trace agent.

## Architecture
Copied from #842
<img width="1296" height="674" alt="image"
src="https://github.com/user-attachments/assets/2d4cb925-6cfc-4581-8ed6-6bd87cf0d87a"
/>

## Testing
Tested in the next PR
#856, which
implements stats concentrator. Trace stats appeared in Datadog.
<img width="538" height="317" alt="image"
src="https://github.com/user-attachments/assets/48b849cc-2413-41d5-8576-5ff657c21a0f"
/>


## Next steps
1. Implement `StatsConcentrator`
2. Rename for clarity:
  - `SendingTraceStatsProcessor` -> `TraceStatsGenerator`
  - `stats_sender` -> `stats_generator`
3. Small refactor: consider passing around `stats_sender` instead of
`stats_concentrator_handle`. Right now
`SendingTraceStatsProcessor::new()` is called in three places. It might
be possible to call it only once then pass it around.

## Notes
Jira: https://datadoghq.atlassian.net/browse/SVLS-7593
Base automatically changed from yiming.luo/trace-stats-6 to main September 22, 2025 20:02
@lym953 lym953 force-pushed the yiming.luo/trace-stats-7 branch from d6b18a8 to ef07b85 Compare September 23, 2025 17:21
stats: Stats,
) -> pb::ClientStatsPayload {
pb::ClientStatsPayload {
// TODO: handle this
Copy link
Copy Markdown
Contributor Author

@lym953 lym953 Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking many fields with TODO, so I can keep this PR small, iterate fast and handle them in future PRs.
Some of them may need code changes, while some may need just understanding work and updating the comment.

@lym953 lym953 force-pushed the yiming.luo/trace-stats-7 branch from 018242c to c27a768 Compare September 23, 2025 18:46
@lym953 lym953 marked this pull request as ready for review September 23, 2025 18:49
@lym953 lym953 requested a review from a team as a code owner September 23, 2025 18:49
use std::{
collections::HashMap,
sync::Arc,
time::{SystemTime, UNIX_EPOCH},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we switch to the tokio::time module to align more closely with Tokio's scheduling?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, what's the problem if we use std::time?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per tokio-rs/tokio#4633, it is better to use tokio's time package if tokio scheduling is involved.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked offline. We will keep using std::time because tokio::time is mainly used for handling time intervals, but for trace stats we need to work with absolute timestamps.

pub struct Stats {
pub hits: i32,
pub duration: i64, // in nanoseconds
pub error: i32,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does its int type stand for an error code or error count? Can we clarify it in comment or rename it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment

pub name: String,
// e.g. "my-lambda-function-name", "datadog_lambda.handler", "urllib.request"
pub resource: String,
// e.g. "aws.lambda.load", "aws.lambda.import"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// e.g. "aws.lambda.load", "aws.lambda.import"
// e.g. "serverless"

not sure if the comment is correct for r#type

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Will fix.

// This is to reduce the chance of flushing stats that are still being collected to save some cost.
const NO_FLUSH_BUCKET_COUNT: u64 = 2;

const S_TO_NS: u64 = 1_000_000_000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably already exists somewhere?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only exists in f64:

pub const S_TO_NS: f64 = 1_000_000_000.0;

I need to define another one in u64.

@lym953 lym953 force-pushed the yiming.luo/trace-stats-7 branch from c3fcc18 to 0bf2d8b Compare September 25, 2025 17:34
@lym953 lym953 force-pushed the yiming.luo/trace-stats-7 branch from 0bf2d8b to c099094 Compare September 25, 2025 18:59
@lym953 lym953 merged commit 53659e5 into main Sep 25, 2025
46 checks passed
@lym953 lym953 deleted the yiming.luo/trace-stats-7 branch September 25, 2025 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants