feat(tracer): compute span stats #2915
Merged
mergify[bot] merged 47 commits into DataDog:1.x on Apr 8, 2022
Conversation
Contributor
@Kyle-Verhoog this pull request is now in conflict 😩
Kyle-Verhoog
commented
Nov 12, 2021
Thankfully this isn't needed, but it would be for agentless with the current intake.
Having a bucket class wasn't providing much beyond some added indirection and extra memory usage.
Turns out all the fields in the payload match the Go struct fields of the data structure in the agent...
Kyle-Verhoog added a commit to Kyle-Verhoog/dd-trace-py that referenced this pull request (Nov 24, 2021)
It's handy to have a general msgpack encoder that can be used for encoding arbitrary payloads of primitive Python types (see DataDog#2915). It might be useful to use as a fallback for encoding traces as well if an issue with the custom msgpack encoder is suspected.
mergify Bot pushed a commit that referenced this pull request (Nov 24, 2021)
It's handy to have a general msgpack encoder that can be used for encoding arbitrary payloads of primitive Python types (see #2915). The encoder added here is based off the one added in #1491. It might be useful to use as a fallback for encoding traces as well if an issue with the custom msgpack encoder is suspected. The relevant tests from the msgpack-python implementation are included as well to ensure that the implementation is correct.
mabdinur
previously approved these changes
Apr 5, 2022
This avoids having the module (and its dependencies) be imported, which should avoid any errors that might occur when installing them (especially when stats is not enabled). It should also give a bit of a performance improvement.
majorgreys
previously approved these changes
Apr 6, 2022
The msgpack encoder under Python 2 requires everything to be unicode, else strings are interpreted as raw bytes, which the agent doesn't like.
This is the minimum version that ddsketch supports.
Member, Author
The profiling tests will fail until #3559 and DataDog/sketches-py#48 are merged and ddsketch 2.1 is released.
P403n1x87
reviewed
Apr 8, 2022
P403n1x87
reviewed
Apr 8, 2022
majorgreys
approved these changes
Apr 8, 2022
mabdinur
approved these changes
Apr 8, 2022
Introduce span metrics computation to the trace client.
Motivation
Today, metrics computation on traces sent via the `/v0.4/traces` endpoint is performed in the agent. This means that all traces need to be sent to the agent in order to have metrics computed, even if the trace is unsampled. Trace data is big relative to metric data, so a lot of unnecessary work is done to encode and transmit traces just to get metrics. Performing metrics computation in the client enables dropping unsampled traces, saving subsequent client and agent processing.
Datadog Agent PR introducing support for client stats computation: DataDog/datadog-agent#7875 (released in v7.28.0)
Implementation
Stats computation is done by introducing a new SpanProcessor, `SpanStatsProcessorV06`. This processor handles span finish events, computes the metrics for measured spans, and periodically (every 10 seconds) flushes the metrics to the agent via the `/v0.6/stats` endpoint. Measured spans are spans that are sampled (either automatically or manually) and are tagged with `_dd.measured: 1`.
Enabling
The feature is enabled via the `DD_TRACE_COMPUTE_STATS` environment variable or `config._compute_stats`. The feature is disabled by default since we have no endpoint discovery mechanism implemented in the client; otherwise it would be enabled by default.
Indicating that metrics computation has been performed
The agent needs to know whether or not to compute metrics for a given payload so that it doesn't compute the metrics again. This is done by sending the `Datadog-Client-Computed-Stats` header with a value of `yes` (it is implicitly `no` in the agent) with requests to the `/v0.4/traces` and `/v0.5/traces` endpoints.
Aggregation
The computed metrics are aggregated. This saves considerable resources both in the client and the backend. The aggregation is performed over the following span attributes: `name`, `service`, `resource`, `type`, `http.status_code` and `synthetics` (whether or not the span is from synthetics).
Normalization
As the metrics are aggregated by the attributes above, it is crucial that these attributes effectively slice the data into meaningful, low-cardinality segments. For example, if the resource name used is the raw URL of each request of a service that handles `/users/<user_id>` requests, then the computed metrics would be aggregated by `/users/1234`, `/users/23`, etc. This isn't useful for users and is expensive to compute. `name`, `service`, `type`, `http.status_code` and `synthetics` don't require normalization as they should already be low cardinality. The one that needs addressing is `resource`.
Errors
The Java client and Datadog agent perform logic to sample all error spans even if they are marked as unsampled by the sampler or user.
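As a hedged illustration of that behavior (a toy sketch, not the actual dd-trace-py or Java client code; `Span` and `keep_trace` are stand-ins), retaining error spans regardless of the sampling decision might look like:

```python
# Hypothetical sketch: force-keep traces containing error spans even when
# the sampler (or the user) marked the trace as unsampled.
class Span:
    def __init__(self, name, error=False):
        self.name = name
        self.error = error


def keep_trace(trace, sampled):
    # Traces with at least one error span are kept regardless of sampling.
    if any(span.error for span in trace):
        return True
    return sampled
```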
Computed metrics
For each measured span the following aggregated metrics are computed:
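The list of metrics did not survive the page capture; as a hedged reconstruction based on the field names used in the agent's `v0.6/stats` payload (the names and shape here are assumptions, not the tracer's actual internals), the per-aggregation-key metrics can be sketched as:

```python
from dataclasses import dataclass, field


@dataclass
class SpanAggrStats:
    # Hypothetical per-aggregation-key stats; lists stand in for DDSketches.
    hits: int = 0
    top_level_hits: int = 0
    errors: int = 0
    duration: int = 0  # sum of span durations, in nanoseconds
    ok_distribution: list = field(default_factory=list)
    err_distribution: list = field(default_factory=list)

    def record(self, duration_ns, error, top_level):
        self.hits += 1
        self.duration += duration_ns
        if top_level:
            self.top_level_hits += 1
        if error:
            self.errors += 1
            self.err_distribution.append(duration_ns)
        else:
            self.ok_distribution.append(duration_ns)
```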
DDSketch
The `ok summary` and `error summary` distributions are DDSketches configured as `LogCollapsingLowestDenseDDSketch`es with a relative error rate of 0.775% and a max bin size of 2048. This configuration matches the implementation used on the backend, which avoids the error introduced when converting the sketches. The sketches are serialized to protobuf and included in the msgpack payload as raw bytes, which are forwarded by the agent to the backend.
v0.6/stats payload
The computed metrics are placed into time buckets. A bucket is a start time and a duration. Each bucket is 10 seconds in size and metrics are placed into buckets according to the measured span's end time. 10 seconds is chosen somewhat arbitrarily, with the rough motivation of balancing timeliness (less is better) against aggregation (more is better). The idea is to aggregate the metrics into buckets and flush them every bucket duration.
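The bucket alignment described above can be sketched as follows (a minimal illustration, not the tracer's actual code):

```python
BUCKET_SIZE_NS = 10 * 10**9  # 10 second buckets, in nanoseconds


def bucket_start(span_end_ns):
    # Align a span's end time down to the start of its stats bucket.
    return span_end_ns - (span_end_ns % BUCKET_SIZE_NS)
```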
The payload sent to `v0.6/stats` is a list of buckets.
headers:
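The headers block was not captured in this page; a plausible set (an assumption, not verbatim from the PR) would be:

```
Content-Type: application/msgpack
Datadog-Meta-Lang: python
Datadog-Meta-Tracer-Version: <tracer version>
```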
body (msgpack):
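The body example was also lost in the capture; as a hedged sketch of the structure before msgpack encoding (key names assume the agent's client stats payload and may not match the actual wire format exactly):

```python
# Hypothetical v0.6/stats body prior to msgpack encoding.
payload = {
    "Hostname": "myhost",
    "Env": "prod",
    "Version": "1.0.0",
    "Stats": [  # one entry per 10s time bucket
        {
            "Start": 1640995200000000000,  # bucket start, unix nanoseconds
            "Duration": 10_000_000_000,    # bucket size, nanoseconds
            "Stats": [  # one entry per aggregation key
                {
                    "Name": "flask.request",
                    "Service": "my-svc",
                    "Resource": "GET /users/{id}",
                    "Type": "web",
                    "HTTPStatusCode": 200,
                    "Synthetics": False,
                    "Hits": 10,
                    "TopLevelHits": 10,
                    "Errors": 1,
                    "Duration": 123456789,  # sum of span durations, ns
                    "OkSummary": b"...",    # protobuf-encoded DDSketch bytes
                    "ErrorSummary": b"...",
                }
            ],
        }
    ],
}
```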
Testing/verification
Performance
Manual performance testing showed negligible improvements to the trace client but substantial improvements to the Datadog Agent.
System tests/test agent snapshots
Metrics should be serialized properly in msgpack format with the required fields
Metrics should consist of an error count, a hit count (all hits, including errors), separate ok and error latency distributions, and the sum of all durations (error and ok) for each unique aggregation key.
Metrics should be reported on a configurable interval.
Metrics should be computed for each distinct combination of [service, resource, operation, type, http status code] (non-HTTP spans ignore the status code)
Metrics should be computed for measured spans
Confirm: create two single-span traces with distinct operation names and mark only one as measured. Pass these traces through the metrics aggregation system.
Verify that no metrics are computed for the unmeasured span's operation name.
Verify that metrics are computed for the measured span's operation name.
Metrics should be computed for top level spans
Verify that no metrics are computed for a non-top-level (and unmeasured) span's operation name.
Verify that metrics are computed for the top level span's operation name.
Some metrics should be recorded separately for errors and successful spans
Should track the error count, ok latency sketch, and error latency sketch separately
Total duration and total hit count should include latencies from all spans
Confirm: pass traces with a known mix of ok and error spans and verify the counts are tracked separately
Metrics must be computed before traces can be dropped or sampled.
Confirm: this can be verified by inspection; how it is achieved depends on tracer architecture, but a tracer-level test with a trace reporter that always drops can be verified to produce metrics. Test that metrics are produced with sample rates set to 0%.
Metrics must be computed after spans are finished, otherwise components of the aggregation key may change after contribution to aggregates.
The latency at each quantile of [p50, p75, p95, p99, max] should be within 1% of the actual latency at each quantile.
Test endpoint: generate 1000 non-error spans in the same bucket with latencies drawn from various distributions and compute the empirical quantiles (sort the latencies, take the 500th for p50, the 990th for p99, the last for max, and so on), then pass these spans through the aggregator. Force publication of the aggregates, intercept the publication, and verify the statistics of the ok latencies sketch.
Metrics storage should be finite
Metrics tracking should have low CPU overhead.
Confirm: run the tracer with and without metrics enabled for a range of applications in the reliability environment and track the CPU overhead. Aim for at most a 5% overhead over disabled metrics.
Metrics should be buffered for short periods of time if the agent is unavailable
Confirm: this can easily be simulated with a mock agent which responds slowly and an accelerated reporting interval.
It should be possible to disable metrics aggregation
Confirm the environment variable DD_TRACE_TRACER_METRICS_ENABLED is used to toggle the feature.
Obfuscation occurs in the client
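The quantile check above can be sketched end to end with a toy relative-accuracy sketch (illustrative only; `MiniSketch` is a stand-in, and the real implementation uses the ddsketch library's `LogCollapsingLowestDenseDDSketch`):

```python
import math
import random


class MiniSketch:
    # Toy log-binned sketch with relative accuracy `alpha` (not ddsketch).
    def __init__(self, alpha=0.005):
        self.gamma = (1 + alpha) / (1 - alpha)
        self.bins = {}
        self.count = 0

    def add(self, value):
        key = math.ceil(math.log(value, self.gamma))
        self.bins[key] = self.bins.get(key, 0) + 1
        self.count += 1

    def quantile(self, q):
        rank = int(q * (self.count - 1))
        seen = 0
        for key in sorted(self.bins):
            seen += self.bins[key]
            if seen > rank:
                # Midpoint estimate for the bin (gamma^(key-1), gamma^key].
                return 2 * self.gamma ** key / (self.gamma + 1)


random.seed(0)
latencies = sorted(random.uniform(1.0, 1000.0) for _ in range(1000))
sketch = MiniSketch()
for lat in latencies:
    sketch.add(lat)

# Compare sketch quantiles against the empirical quantiles: within 1%.
for q in (0.5, 0.75, 0.95, 0.99, 1.0):
    actual = latencies[int(q * (len(latencies) - 1))]
    estimated = sketch.quantile(q)
    assert abs(estimated - actual) / actual <= 0.01
```

With alpha = 0.5%, every stored value is reconstructed within 0.5% of its true latency, so each quantile estimate lands comfortably inside the 1% tolerance the checklist requires.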
Manual testing
Testing notebook: https://app.datadoghq.com/notebook/2036264/client-trace-stats-testing
Reliability Env
Metrics storage should have a reasonable footprint in RAM
Metrics should be automatically disabled if an incompatible agent is detected
Confirm with a unit test using a mock agent that yields a 404 for the stats endpoint
Links and references
Follow up