Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
Right now our metrics pipeline consists of the various DogStatsD client libraries, the Datadog Agent, and then goes straight to Datadog's backend.
We would like to introduce Vector as an aggregation layer because metric costs are growing for services with a large number of host tags. Vector would let us aggregate those tags away (replacing them with some kind of Vector instance ID tag) without risking data loss in Datadog.
The problem is that when using DogStatsD, all metrics submitted in-app as a count are converted to a rate inside the agent's process when it flushes the data. Vector interprets these rate metrics as count metrics with the `interval_ms` field set to the agent's flush interval, and when they reach the Datadog sink it submits them as rates.
All of that works fine until you try to aggregate the rate metrics. Vector appears to handle aggregation of multiple points by summing all the count values and setting `interval_ms` to the total interval of all points included in the window. This makes the `interval_ms` value inconsistent within a given timeseries. Datadog cannot handle that inconsistency: it assumes all datapoints for a given metric name always have the same interval (though you can change what that interval is in the backend).
The inconsistency arises because hosts' submissions are offset from each other, and because the Datadog Agent has an unconfigurable 10s flush interval and a 15s reporting interval, each host alternates between sending 1 and 2 datapoints per timeseries per window. We also need to ensure no two points in a timeseries are sent with the same timestamp, or Datadog will drop them.
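To illustrate the arithmetic, here is a minimal sketch (not Vector's actual code; the function and field names are illustrative) of how summing `interval_ms` across a window produces inconsistent intervals when a host alternates between landing 1 and 2 points per window:

```python
# Sketch of the aggregation behavior described above: count values are
# summed and interval_ms values are summed for all points in a window.
FLUSH_INTERVAL_MS = 10_000  # Datadog Agent flush interval (fixed)


def aggregate(points):
    """Mimic summing both the value and interval_ms of every point."""
    return {
        "value": sum(p["value"] for p in points),
        "interval_ms": sum(p["interval_ms"] for p in points),
    }


# With a 15s reporting cadence and a 10s agent flush, a host lands
# 2 points in one aggregation window and 1 in the next:
window_a = [{"value": 5, "interval_ms": FLUSH_INTERVAL_MS},
            {"value": 3, "interval_ms": FLUSH_INTERVAL_MS}]
window_b = [{"value": 4, "interval_ms": FLUSH_INTERVAL_MS}]

print(aggregate(window_a)["interval_ms"])  # 20000
print(aggregate(window_b)["interval_ms"])  # 10000 -> same series, different interval
```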
Attempted Solutions
We have tried a number of things that somewhat work for other metric types but not rates.
Changing the timestamp to the Vector clock time does help with misattribution and data duplication for non-rate types, but for rates you still end up with an inconsistent reported interval.
We have also tried modifying the `interval_ms` field with VRL, but it seems this field cannot be edited because it is not part of the VRL metric object model.
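For reference, the attempted remap looked roughly like this (a sketch with an illustrative input name; the assignment fails because metric events do not expose `interval_ms` in VRL):

```toml
[transforms.fix_interval]
type = "remap"
inputs = ["datadog_agent_in"]  # illustrative input name
source = '''
# Rejected: interval_ms is not part of the VRL metric object model.
.interval_ms = 10000
'''
```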
Proposal
There are a few things that would allow us to get the outcome we want:
- A setting on the Aggregate transform to always assign `interval_ms` to the same value as the transform's interval parameter
- The ability to modify this field via VRL
- A timestamp-based bucketing transform for aggregation which also allows a window to stay open for some set duration (for example, aggregate events based on their timestamp into a 30s interval, but hold the window open for 60s)
- The same as the above as a standalone transform, similar to `window` but purely time based, might help too
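To make the first proposal concrete, a hypothetical configuration might look like the following; the `aggregate` transform and its `interval_ms` flush-window option exist today, while the normalization flag is invented here for illustration:

```toml
[transforms.aggregate_metrics]
type = "aggregate"
inputs = ["dogstatsd_in"]  # illustrative input name
interval_ms = 30000        # existing flush-window option

# Proposed (does not exist today): force every emitted count's
# interval_ms to match the flush window above instead of summing
# the intervals of the aggregated points.
# normalize_interval_ms = true
```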
I am curious what workarounds might already exist, whether people have dealt with aggregating rate metrics from DogStatsD before, or whether there is some other recommended pattern to accomplish the same goal.
References
No response
Version
No response