
feat(tracer): compute span stats #2915

Merged
mergify[bot] merged 47 commits into DataDog:1.x from Kyle-Verhoog:stats
Apr 8, 2022

Conversation

@Kyle-Verhoog (Member) commented Oct 14, 2021

Introduce span metrics computation to the trace client.

Motivation

Today, metrics computation on traces sent via the /v0.4/traces endpoint is performed in the agent. This means that all traces need to be sent to the agent in order to have metrics computed, even if the trace is unsampled. Trace data is large relative to metric data, so a lot of unnecessary work is done encoding and transmitting traces just to get metrics.

Performing metrics computation in the client enables dropping unsampled traces, saving subsequent client and agent processing.

Datadog Agent PR introducing support for client stats computation: DataDog/datadog-agent#7875 (released in v7.28.0)

Implementation

Stats computation is done by introducing a new SpanProcessor, SpanStatsProcessorV06. This processor handles span finish events, computes the metrics for measured spans and periodically (every 10 seconds) flushes the metrics to the agent via the /v0.6/stats endpoint.

Measured spans are spans that are sampled (either automatically or manually) and are either:

  1. Service entry spans - the highest-level span for a service, or
  2. Marked to have metrics computed (carry the _dd.measured: 1 tag).
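The two conditions above can be sketched as a small predicate. This is a minimal illustration, not the ddtrace implementation; `is_measured` and the dict-based span shape are hypothetical:

```python
# Minimal sketch (not the ddtrace internals): decide whether a finished span
# should contribute to stats. A span qualifies when it is the service entry
# ("top level") span or carries the `_dd.measured: 1` tag.
def is_measured(span, parent=None):
    # Service entry: no parent, or the parent belongs to a different service.
    top_level = parent is None or parent.get("service") != span.get("service")
    return top_level or span.get("metrics", {}).get("_dd.measured") == 1

entry = {"service": "web", "metrics": {}}                      # service entry span
child = {"service": "web", "metrics": {"_dd.measured": 1}}     # explicitly measured
plain = {"service": "web", "metrics": {}}                      # neither
```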

Enabling

The feature is enabled via the DD_TRACE_COMPUTE_STATS environment variable or config._compute_stats. It is disabled by default because the client has no endpoint discovery mechanism yet; once one is implemented, it should be enabled by default.

  • done
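A sketch of how the environment toggle might be read; the real option plumbing lives in ddtrace's settings module, so `compute_stats_enabled` and the accepted values here are illustrative only:

```python
import os

# Illustrative helper (not the ddtrace implementation): treat "1" or "true"
# in DD_TRACE_COMPUTE_STATS as enabling stats computation.
def compute_stats_enabled():
    return os.environ.get("DD_TRACE_COMPUTE_STATS", "false").lower() in ("1", "true")
```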

Indicating that metrics computation has been performed

The agent needs to know whether or not to compute metrics for a given payload so that it doesn't compute the metrics again. This is done by sending the Datadog-Client-Computed-Stats header with a value of yes (it is implicitly no in the agent) with requests to the /v0.4/traces and /v0.5/traces endpoints.

Aggregation

The computed metrics are aggregated, which saves considerable resources both in the client and the backend. The aggregation is performed over the following span attributes: name, service, resource, type, http.status_code, and synthetics (whether or not the span is from synthetics).

  • done
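The aggregation key can be pictured as a tuple of those six attributes; spans sharing all six fold into one aggregated entry. `aggregation_key` and the dict-based span shape are illustrative, not the ddtrace internals:

```python
# Illustrative aggregation key: two spans contribute to the same aggregate
# exactly when all six attributes match.
def aggregation_key(span, synthetics=False):
    return (
        span.get("name"),
        span.get("service"),
        span.get("resource"),
        span.get("type"),
        span.get("meta", {}).get("http.status_code"),
        synthetics,
    )

a = {"name": "flask.request", "service": "web", "resource": "GET /users",
     "type": "web", "meta": {"http.status_code": "200"}}
b = dict(a)                        # identical attributes -> same aggregate
c = dict(a, resource="GET /home")  # differs -> separate aggregate
```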

Normalization

As the metrics are aggregated by the following attributes it is crucial that they effectively slice the data into meaningful, low cardinality segments. For example, if the resource name used is the raw URL of each request of a service that handles /users/<user_id> requests then the computed metrics would be aggregated by /users/1234, /users/23, etc. This isn't useful for users and is expensive to compute.

name, service, type, http.status_code and synthetics don't require normalization as they should already be low cardinality. The one that needs addressing is resource.

  • done*: this logic is performed in the agent; a client implementation would be required to go agentless.
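To make the cardinality problem concrete, a hypothetical normalizer might collapse numeric path segments so /users/1234 and /users/23 aggregate under one resource. The real logic runs in the agent; `normalize_resource` and the regex are only an illustration:

```python
import re

# Hypothetical sketch: replace numeric path segments with "?" so raw URLs
# collapse into a single low-cardinality resource name.
def normalize_resource(resource):
    return re.sub(r"/\d+", "/?", resource)
```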

Errors

The Java client and Datadog agent perform logic to sample all error spans even if they are marked as unsampled by the sampler or user.

  • done*: opted to not implement this logic for now.

Computed metrics

For each measured span the following aggregated metrics are computed:

  • hits (count): how many times the aggregated span has been created.
  • top level hits (count): if the aggregated span is top level, how many times it has been created.
  • errors (count): how many times the aggregated span finished with an error.
  • duration (count): the cumulative duration of the aggregated span.
  • ok summary (DDSketch distribution): a distribution of the aggregated span duration, if successful.
  • error summary (DDSketch distribution): a distribution of the aggregated span duration, if erroneous.

DDSketch

The ok summary and error summary distributions are DDSketches configured as LogCollapsingLowestDenseDDSketch instances with a relative error rate of 0.775% and a max bin size of 2048. This configuration matches the implementation used on the backend, which avoids the error that would otherwise be introduced by converting between sketch configurations. The sketches are serialized to protobuf and included in the msgpack payload as raw bytes, which the agent forwards to the backend.

  • done
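As a rough illustration of the log-bucketing idea behind DDSketch (the tracer uses the real `ddsketch` package, not this toy class), values are indexed by the ceiling of their logarithm in base gamma, which bounds the relative error of any quantile estimate:

```python
import math

# Toy log-bucketed sketch illustrating DDSketch's relative-accuracy guarantee.
# This is NOT the ddsketch implementation; it only shows why a logarithmic
# bucket index keeps quantile estimates within the configured relative error.
class ToySketch:
    def __init__(self, relative_accuracy=0.00775):
        # Two values fall in the same bucket only if their ratio is <= gamma.
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.bins = {}
        self.count = 0

    def add(self, value):  # value must be > 0
        idx = math.ceil(math.log(value, self.gamma))
        self.bins[idx] = self.bins.get(idx, 0) + 1
        self.count += 1

    def quantile(self, q):
        rank = q * (self.count - 1)
        seen = 0
        for idx in sorted(self.bins):
            seen += self.bins[idx]
            if seen > rank:
                # Midpoint of bucket (gamma^(idx-1), gamma^idx], within
                # relative_accuracy of every value that landed in it.
                return 2 * self.gamma ** idx / (self.gamma + 1)
```

With the 0.00775 accuracy used here, an estimated median over the values 1..1000 lands within 1% of the true median.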

v0.6/stats payload

The computed metrics are placed into time buckets. A bucket is a start time and a duration. Each bucket is 10 seconds in size and metrics are placed into buckets according to the measured span's end time. The 10-second size is somewhat arbitrary, balancing timeliness (smaller is better) against aggregation (larger is better). The idea is to aggregate the metrics into buckets and flush them every bucket duration.
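Bucket assignment reduces to aligning the span's end time down to a multiple of the bucket size; `bucket_start` is an illustrative helper, not the ddtrace implementation:

```python
BUCKET_SIZE_NS = 10_000_000_000  # 10 seconds in nanoseconds

# A finished span lands in the bucket whose window covers its end time;
# aligning the start to a multiple of the bucket size makes the start a
# convenient dict key for the aggregator.
def bucket_start(span_end_ns, size_ns=BUCKET_SIZE_NS):
    return span_end_ns - (span_end_ns % size_ns)
```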

The payload sent to v0.6/stats is a list of buckets.

headers:

"Datadog-Meta-Lang": "python",
"Datadog-Meta-Tracer-Version": ddtrace.__version__,
"Content-Type": "application/msgpack",

body (msgpack):

{
  "Env": ...
  "Hostname": ...
  "Version": ...
  "Stats": [   // list of buckets
     {
        "Start": ...   // start time in ns
        "Duration": ...  // duration in ns
        "Stats": [  // list of aggregated metrics
             {
                 "Name": ... // span name
                 "Resource": ... // span resource
                 "Service": ... // span service
                 "Type": ... // span type
                 "Synthetics": ... // whether the span was generated from synthetics
                 "Hits": ...
                 "TopLevelHits": ...
                 "Duration": ...
                 "Errors": ...
                 "OkSummary": ... // protobuf-serialized (bytes) distribution
                 "ErrorSummary": ... // protobuf-serialized (bytes) distribution
             },
             ...
        ]
     },
     ...
  ]
}

  • done
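Assembling that body before msgpack encoding can be sketched as a plain dict; `build_payload` and its bucket shape are illustrative helpers (the field names mirror the structure above, which in turn matches the agent's Go struct):

```python
# Illustrative sketch of building the /v0.6/stats body prior to msgpack
# encoding; not the ddtrace implementation.
def build_payload(env, hostname, version, buckets):
    return {
        "Env": env,
        "Hostname": hostname,
        "Version": version,
        "Stats": [
            {"Start": b["start"], "Duration": b["duration"], "Stats": b["stats"]}
            for b in buckets
        ],
    }
```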

Testing/verification

Performance

Manual performance testing showed negligible improvement in the trace client but substantial improvement in the Datadog Agent.

System tests/test agent snapshots

  • Metrics should be serialized properly in msgpack format with the required fields

  • Metrics should consist of an error count, a hit count (all hits, including errors), separate ok and error latency distributions, and the sum of all durations (error and ok) for each unique aggregation key.

  • Metrics should be reported on a configurable interval.

    • TODO: is there a specified env variable for this?
    • Also use this to report stats payloads more quickly for system tests.
  • Metrics should be computed for each distinct combination of [service, resource, operation, type, http status code] (non-HTTP spans ignore the status code)

    • Create endpoint that supports custom values for each of the fields
    • Send variety of requests to test web app and verify the stats
  • Metrics should be computed for measured spans

    • Confirm: create two span traces with distinct operation names and mark one as measured. Pass these traces through the metrics aggregation system.
      Verify that no metrics are computed for the unmeasured span's operation name.
      Verify that metrics are computed for the measured span's operation name.

  • Metrics should be computed for top level spans

    • Test endpoint: create two span traces with distinct operation names and mark one as top level. Pass these traces through the metrics aggregation system.
      Verify that no metrics are computed for the non-top-level span's operation name.
      Verify that metrics are computed for the top level span's operation name.
  • Some metrics should be recorded separately for errors and successful spans

    • Error span count, ok latency sketch, and error latency sketch should be tracked separately.
    • Total duration and total hit count should include latencies from all spans.
    • Confirm: pass traces with a known mix of ok and error spans, and verify the counts are tracked separately.

  • Metrics must be computed before traces can be dropped or sampled.

    • Confirm: this can be verified by inspection, and how it is achieved depends on tracer architecture, but a tracer-level test with a trace reporter that always drops can be verified to produce metrics. Test that metrics are produced with sample rates set to 0%.

  • Metrics must be computed after spans are finished, otherwise components of the aggregation key may change after contribution to aggregates.

  • The latency at each quantile of [p50, p75, p95, p99, max] should be within 1% of the actual latency at that quantile.

  • Test endpoint: generate 1000 non-error spans in the same bucket with latencies of various distributions, and compute the empirical quantiles from these distributions (so sort the latencies, take the 500th for p50, the 990th for p99, the last for max, and so on) then pass these spans through the aggregator. Force publication of the aggregates, intercept the publication and verify the statistics of the ok latencies sketch.
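The empirical quantile computation described above (sort the latencies, take the 500th of 1000 for p50, the 990th for p99, the last for max) can be sketched directly; `empirical_quantiles` is an illustrative test helper:

```python
# Illustrative test helper: compute empirical quantiles by sorting and
# indexing, as described in the verification step above.
def empirical_quantiles(latencies, quantiles=(0.50, 0.75, 0.95, 0.99)):
    ordered = sorted(latencies)
    result = {}
    for q in quantiles:
        # The q-quantile of n values is the round(q*n)-th smallest one.
        result["p%d" % round(q * 100)] = ordered[round(q * len(ordered)) - 1]
    result["max"] = ordered[-1]
    return result
```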

  • Metrics storage should be finite

    • Test endpoint: present enough distinct combinations of [service, resource, operation, type, http status code] to the aggregator within a reporting interval, and verify that one of them is dropped. The Java tracer uses an LRU eviction policy, so verify that the first metric is dropped. The Java tracer tracks up to 1000 buckets.
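Bounded storage with LRU eviction, as the Java tracer does, can be sketched with an ordered dict; `BoundedAggregates`, its 1000-entry default, and the per-entry fields are illustrative:

```python
from collections import OrderedDict

# Illustrative bounded aggregate store with LRU eviction; not the actual
# tracer implementation.
class BoundedAggregates:
    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
        else:
            if len(self.entries) >= self.max_entries:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[key] = {"hits": 0, "errors": 0, "duration": 0}
        return self.entries[key]
```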
  • Metrics tracking should have low CPU overhead.

    • Confirm: run the tracer with and without metrics enabled for a range of applications in the reliability environment and track the CPU overhead. Aim for at most a 5% overhead over disabled metrics.
  • Metrics should be buffered for short periods of time if the agent is unavailable.

    • Confirm: this can easily be simulated with a mock agent which responds slowly and an accelerated reporting interval.
  • It should be possible to disable metrics aggregation.

    • Confirm: the environment variable DD_TRACE_TRACER_METRICS_ENABLED is used to toggle the feature.

  • Obfuscation occurs in the client

Manual testing

Testing notebook: https://app.datadoghq.com/notebook/2036264/client-trace-stats-testing

  • Metrics reporting should work end to end and be accurate

Reliability Env

  • Metrics storage should have reasonable footprint in RAM

    • Confirm: construct spans with various cardinalities of combinations of [service, resource, operation, type, http status code], vary latency distributions to stress sketch size, vary error rates, and use a memory footprint analysis tool (Java has JOL) to measure the size of the aggregates storage in memory. 10MB is a reasonable target to aim for.
  • Metrics should be automatically disabled if an incompatible agent is detected

    • Confirm: unit test with a mock agent which yields 404 for the stats endpoint.

Links and references

Follow up

  • Disable retrying in trace writer when stats computation is enabled.
  • Obfuscation: 1.x...Kyle-Verhoog:obfuscators
  • Always sampling errors.
  • Enable stats computation by default in 1.0 when an agent minimum version is defined.


mergify Bot commented Nov 12, 2021

@Kyle-Verhoog this pull request is now in conflict 😩

mergify Bot added the conflict label Nov 12, 2021
thankfully this isn't needed, but would be for agentless with the
current intake.
Having a bucket class really wasn't providing much other than some
added indirection and extra memory usage.
Turns out all the fields in the payload match the go struct fields of
the data structure in the agent...
Kyle-Verhoog added a commit to Kyle-Verhoog/dd-trace-py that referenced this pull request Nov 24, 2021
It's handy to have a general msgpack encoder that can be used for
encoding arbitrary payloads of primitive Python types (see DataDog#2915).

It might be useful to use as a fallback for encoding traces as well if
an issue with the custom msgpack encoder is suspected.
mergify Bot pushed a commit that referenced this pull request Nov 24, 2021
It's handy to have a general msgpack encoder that can be used for
encoding arbitrary payloads of primitive Python types (see #2915).

The encoder added here is based off the one added in #1491.

It might be useful to use as a fallback for encoding traces as well if
an issue with the custom msgpack encoder is suspected.

The relevant tests from the msgpack-python implementation are included
as well to ensure that the implementation is correct.
mabdinur previously approved these changes Apr 5, 2022
This avoids having the module (and dependencies) be imported which
should avoid any errors that might exist when installing them
(especially when stats is not enabled).

Should also give a bit of a performance improvement.
majorgreys previously approved these changes Apr 6, 2022
The msgpack encoder under Python 2 requires everything to be unicode
else strings are interpreted as raw bytes which the agent doesn't like.
@Kyle-Verhoog (Member, Author) commented:

The profiling tests will fail until #3559 and DataDog/sketches-py#48 are merged and ddsketch 2.1 is released.


Labels

changelog/no-changelog A changelog entry is not required for this PR.


5 participants