Add sample rate field to the spec #76
a870bc8 to
a31680e
| ### sampled-rate | ||
| * Type: Positive Integer | ||
| * Description: The rate at which this event has already been sampled. Represents | ||
| the denominator of a fraction: when `1/n` events are sent, this field holds |
would we need to say what the time-frame of this sampling is? seconds? minutes?
I haven't needed the timeframe in the past. The way I've used this field in Scuba (the technology Honeycomb is based on), the sample rate effectively means "weigh this data point N times heavier". The meaning of that weight depends on whether the summary is summing, averaging, etc.
Sorry for being dense, but "N times heavier" than what? Got a link to Scuba so I can see how they use it?
In the most general sense, a sample rate of 5 means that this event represents 5 events - that the system originating the events chose not to send 4 of them and is sending this one saying "there were actually 5, but I'm just giving you this one." It doesn't say anything about the time range across which those other events may have occurred. When taking an action based on this event, the target system then has enough information to extrapolate based on the received event and take an action appropriate for 5 events, despite having received only one.
Presenting a graph based on the content of the event is an easy way to describe what this looks like in practice - if one were to ask "how many events happened", the total count would have to be scaled according to the sample rate of each event received. How this calculation plays out depends on the action to take (a count of events is obvious, an average or percentile is a little more complicated, but not terribly so) as well as the goals of the receiving system. The description "weight this data point N times heavier" is clear in the case of calculating an average - the sampled data point should effectively count N times towards the average instead of just once.
The Scuba whitepaper does mention sampling, but doesn't cover it in great detail. Here's a link to the paper: https://research.fb.com/publications/scuba-diving-into-data-at-facebook/
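To make the "weigh this data point N times heavier" reading concrete, here is a minimal Python sketch of how a receiving system might extrapolate a count and an average from sampled events. The `SampledEvent` class and its `duration_ms` field are hypothetical names used only for illustration; they are not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class SampledEvent:
    duration_ms: float      # hypothetical payload field, for illustration only
    sampled_rate: int = 1   # 1 means the event stands only for itself


def estimated_count(events):
    """Estimate how many occurrences happened, including those never sent."""
    return sum(e.sampled_rate for e in events)


def weighted_mean_duration(events):
    """Average duration in which each event counts sampled_rate times."""
    total_weight = estimated_count(events)
    if total_weight == 0:
        raise ValueError("cannot average an empty set of events")
    weighted_sum = sum(e.duration_ms * e.sampled_rate for e in events)
    return weighted_sum / total_weight


events = [
    SampledEvent(duration_ms=100.0, sampled_rate=5),
    SampledEvent(duration_ms=200.0, sampled_rate=1),
    SampledEvent(duration_ms=40.0, sampled_rate=10),
]
print(estimated_count(events))         # 16 occurrences represented by 3 events
print(weighted_mean_duration(events))  # 68.75
```

Nothing here depends on a time window: the weights alone are enough to scale a count or an average, which matches the point that the rate says nothing about when the unsent events occurred.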
The prose sounds fairly good. I think we can improve the description a bit further if we describe what scope is being sampled. E.g. if I ask for sampled events from MySQL, is the sampling based on row, table, or database? That would certainly impact any sort of visualization or metrics I wanted to generate.
@duglin Yeah, that wording change LGTM. I'll update the diff.
@inlined what sort of events are you imagining you're getting from a MySQL database? That seems to be a larger influence on how sampling is done. The events I usually expect to get from my DB are "a query happened" (whether that was a read or write request) and the sampling would clearly be "I'm sending you 1 out of the 5 queries that happened", and not anything about rows or tables or databases (beyond that a specific query will be on a database/table). You can always choose a sample rate based on something more specific, of course, and count up reads / writes differently, choosing to emit different numbers of each (all writes and 1/10 reads?) by setting the sampled-rate field appropriately on each event you emit.
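As a rough illustration of the "all writes and 1/10 reads" idea, the following Python sketch tags each emitted event with the rate used for its kind. The `SAMPLE_RATES` table, the `maybe_emit` function, and the dictionary event shape are assumptions made for this example; only the `sampled-rate` name comes from the diff.

```python
import random

# Hypothetical per-kind rates: keep every write, keep roughly 1 in 10 reads.
SAMPLE_RATES = {"write": 1, "read": 10}


def maybe_emit(kind, payload, rng=random):
    """Return an event tagged with its sample rate, or None if it is dropped."""
    rate = SAMPLE_RATES.get(kind, 1)
    # randrange(rate) is 0 with probability 1/rate, so a rate of 1 always emits.
    if rng.randrange(rate) != 0:
        return None
    return {"kind": kind, "data": payload, "sampled-rate": rate}


rng = random.Random(0)  # seeded only so the example is repeatable
queries = [("write", "INSERT ..."), ("read", "SELECT ...")] * 50
emitted = [e for e in (maybe_emit(k, q, rng) for k, q in queries) if e]
writes = [e for e in emitted if e["kind"] == "write"]
print(len(writes))  # 50: every write is kept
```

Because the rate travels with each event, a receiver can mix reads and writes in one stream and still scale each kind correctly.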
hmm... thinking about this a bit more, "This value represents the number of times this event occurred." sounds more like a counter that one might imagine only increments. Maybe "This value represents the number of similar events that were not sent"?
|
I'm really glad to have Honeycomb onboard & helping us see how to make Cloud Events work well with observability systems. In your use-case, would you want the sampled event's data or just the context? |
|
Should this be part of |
Unquestionably both. If a sampled event is intended to communicate both that a thing happened and that a number of statistically similar things also happened, the data and context are both equally valuable in deciding what action to take based on that event (whether its goal is observability or any other action). Thanks for the clarifying question. |
While my examples are based on using events in an observability context to understand the operation of a system, I don't think the concept of sampling in order to manage resources is tied to that specific use case. I chose those examples because they're what I'm currently closest to, but one could imagine adjusting routing tables based on sampled data from a switch, or DNS cache contents based on sampled network traffic, and so on. Any event-based system that takes actions based on events emitted from a high throughput environment can benefit from understanding any sampling done by the emitting system. To put it another way, while using a sampled-rate attribute to create graphs of the data is a specific use case, the act of including that sampled-rate attribute in the event itself is (in my opinion) not. |
|
Personally, I think this is more use-case specific. |
|
@oritnm are you suggesting a rewording? not adding the field? something else? If a rewording, perhaps you could suggest something that we can agree to before the call today. |
|
I don't think this field belongs at the envelope level, which is really all we're discussing in this initiative. This is more a quality of the data, not the delivery. |
|
@djrosanova do you think it would be part of the |
|
@duglin Personally I'd prefer to not specify the data field. Within Google I'm trying to reuse the types in request/response APIs as much as possible without modification because I think this will simplify the learning curve for developers. RE envelope vs extension: I don't think the |
|
This is something that is data specific, not event envelope specific. If your event was a 'something happening average event' then it would make a lot more sense - in the data section of that event type. If we keep adding edges like this to what is meant to be a very light and basic envelope it will become unwieldy. Also, why couldn't this just go in a generic property? This also leads down the slippery IoT slope towards telemetry. |
|
@djrosanova re:
The spec is clear that intermediary systems passing along the event may modify the envelope but not the data. When an intermediary system imposes additional sampling (by, say, dropping 50% of transmitted events), it makes sense that it may adjust the sampled-rate in the envelope to indicate that this happened. In order to enable this behavior, it must be in the envelope and not the data. |
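A minimal Python sketch of the intermediary behavior described here, assuming the event is a dictionary with a `sampled-rate` key alongside its `data` (the `resample` function and the event shape are hypothetical). A relay that keeps 1 in N events multiplies the existing rate by N and leaves the data untouched.

```python
import random


def resample(event, keep_one_in, rng=random):
    """Relay-side sampling: drop most events, rescale the rate on survivors."""
    if keep_one_in < 1:
        raise ValueError("keep_one_in must be a positive integer")
    if rng.randrange(keep_one_in) != 0:
        return None  # dropped by the intermediary
    forwarded = dict(event)  # shallow copy: the data object itself is not modified
    # An absent rate is treated as 1 (no earlier sampling), an assumption here.
    forwarded["sampled-rate"] = event.get("sampled-rate", 1) * keep_one_in
    return forwarded


rng = random.Random(1)  # seeded only so the example is repeatable
original = {"sampled-rate": 30, "data": {"url": "/", "status": 200}}
survivor = None
while survivor is None:  # retry until one event passes the 50% relay
    survivor = resample(original, 2, rng)
print(survivor["sampled-rate"])  # 60: 1-in-30 at the source, then 1-in-2 at the relay
```

Only the envelope value changes on the forwarded copy, which is the property the comment relies on.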
|
@djrosanova re:
It's specifically because it's not an average that this makes sense. A sampled event contains the complete detail for that specific occurrence, along with the information that there were other events in the population that were not reported. The purpose of a sample is to get that complete detail of a specific occurrence, not an average of many. Sampling is inherently not part of the data itself but something external to the data. It is information about how the data is being collected and transferred, rather than a property of the content of the data. The event includes both the occurrence as well as context about the data - and while a recipient of the event may choose to combine the data with a sample rate in order to decide upon an action, that's no different from combining the source or event-type with the data to decide upon an action. Imagine approaching a group of 5 people and giving one of them a survey. You wouldn't take averaged answers from all 5 people and call it a sample - that's something different. You ask that one person all the questions in the survey and consider that your sampled data. The important part is that you keep all the answers from that one person together, and indicate how many other people were there that could have taken the survey but didn't - sample size vs. population size (on a per-collection basis). Clearly there are differences between polling human populations and sampling machine events, but there are clearly similarities as well. Consider an HTTP server responding to requests sending along one out of every 30 requests that it receives, along with a sample rate of 30. You wouldn't want an average of all the source IP addresses as part of that event - you'd expect that one event to be a representative sample - one specific request that has a single requesting IP address, a URL requested, a return status code, a duration, etc. 
Nothing about adding a sampled-rate field implies there should be aggregation or averaging of the data somewhere down the line - it is adding context to better understand the occurrence being emitted under conditions in which not every event is sent. It might be reasonable to also report averages or other metrics about the service as well, but that is independent from indicating that a specific event is one of many and that some of them were not passed along, and unrelated to this spec. |
Signed-off-by: Ben Hartshorne <ben@honeycomb.io>
ec848d0 to
3e5a940
3e5a940 to
a8cf97d
|
(push history a little weird because I |
| ### sampled-rate | ||
| * Type: Positive Integer | ||
| * Description: The rate at which this event has already been sampled. Represents | ||
| the number of similar events that happened but were not sent. |
I like this much better. One nit... it's the # of events not sent, minus 1, right? Meaning, in your example below, wouldn't the rate be 29, not 30, if this is the # of events NOT sent? I'm counting the current event as "the one".
|
@maplebed Your suggestion for middleware/a relay that applies sampling is interesting and actually sways me to thinking this is another "well known" extension. In the very early days of this working group (pre-CNCF), extensions was explicitly created to be a stomping ground where the source, a relay/broker, or even software frameworks in the event handler could add fields. I personally even believe (controversially) that the existing AWS Lambda context param should be moved into I've always had an assumption that because |
|
"It's specifically because it's not an average that this makes sense" doesn't convince me. This IS a special event, it's a decision, by the sender, for some reason, to limit the output. So it is "data" plus "context" and neither of those is the envelope. |
|
Whether you want it inside the data or whether you want it in the context section, having a sample rate isn't a universal enough requirement to make it a property that all implementers must consider. A sample rate makes sense if you are transferring telemetry information on an ongoing basis in a structured data stream. Whether a structured telemetry data stream is a stream of "events" is actually debatable. I personally don't consider the observation of a temperature sensor value a noteworthy "occurrence". I do consider sustained (e.g. average value over some observation period) passing of a defined threshold for that temperature sensor value observation an occurrence. The resulting event is raised either from the edge (device) or from a central analytics pipeline on behalf of the device depending on where you choose to place that logic. The proposed metadata value makes no sense whatsoever for discrete events that report state changes or other occurrences and that don't stem from a correlated data stream. "record created", "file changed", "power emergency, shutting down", "VM started", "deployment completed", "message available", "user locked out", "package delivered at door", "media transcoding complete", and "laundry drying finished" are all immediately actionable discrete events for which a "sample rate" would be nonsensical. The core properties we are currently aiming to focus on are certainly also useful for describing records or record groups of structured telemetry data streams, irrespective of whether one would consider those records as being events. I do believe that with so many companies in the effort, there's a great opportunity here to define a companion specification for telemetry that builds on the core events spec and then reflects telemetry specific aspects like the sample rate. 
In such a spec there might also be an explicit notion of sequencing/clocks and partitioning, sparse data support (omission of unchanged or undefined values) with key frames, data record batching, and appropriate data encoding and transport binding choices for such streams (e.g. Apache Avro with MQTT/AMQP/Kafka). Singling out just the sample rate to include in the core spec were odd, and pulling in all the telemetry aspects would be overloading the core with considerations that only apply to data streams and have the potential to confuse the many implementers focusing on discrete and immediately actionable events that lend themselves to be easily handled in stateless serverless infrastructures. |
Isn't that why it's marked as "Optional"? |
... by the sender or any intermediary system to limit the output. In any sufficiently high throughput system, there are plenty of reasons to limit output. Averaging (as previously discussed as "not an event") is one of them, and not what this PR is about. Sampling is another, and is directly applicable to events. |
Ah, now this is interesting - there's nothing in the current spec that says occurrences should be "noteworthy" or that only occurrences that the creating system believes should be acted upon are worth creating. Even your open PR to reword occurrence uses the language "the capture of a statement of fact during the operation of a software system." - and reading a temperature sensor is absolutely "a statement of fact", and even one that could be because of "a state change" (eg the temperature changing from 27 to 28 degrees). Ok, so there's something going on here that some kinds of events are worthy of being CloudEvents and some kinds of events are not. Is that our job to say? Where does an event representing a Lambda function handling an HTTP request fall in that spectrum? Is it only reasonable to emit events for Lambda functions that change the state of some back end storage? Or, if every invocation of the function is worthy of creating an event, must every event be passed along to every recipient in every possible use case? (Though now we're talking about systems receiving events and we've already lost the idea that it's the calling system's responsibility to decide whether an event is actionable.) Maybe the sending system does have some internal knowledge of importance on a per-event basis and chooses to emit events at different frequencies depending on that importance? Or maybe an intermediary system can impose importance on a stream of events and only pass through some of them? It feels like the root of this discussion is that there's some kind of difference between a "telemetry event" (that you'd like to mark as out of scope) and all other types of events. Except that a telemetry event vs. a normal event is a false dichotomy - you can certainly create graphs from events, but that doesn't make them telemetry events - they're just events that can be used by multiple recipients, one of whom might choose to create graphs about what they've seen. 
And sampling (and including evidence of that sampling) is about collection, not graphing (though obviously the sample rates can be used for graphing, as for other actions taken based on the event). Perhaps a better distinction would be "events for which it does not make sense to ever sample" vs. "events that it's ok to sample"? I don't think that division maps cleanly to any of the examples you or I gave either, since it is highly dependent on the systems involved and what sort of meaning they might put on each event to make such a decision. But should we prohibit that decision entirely? Because it is insufficiently event-like from our "all events must be kept" perspective? Except we've already declared in the spec that that is also not true, as "Events are used to notify other systems that something has happened" - and it's up to the system receiving that event to decide whether to take action on that thing that happened. Or maybe it's that we're only supposed to create events for rare occurrences? That doesn't seem to take us to any more reasonable specification either. I know I'm asking a lot of rhetorical questions in this comment, and I don't actually expect them answered. But there's some context about allowing some kinds of events and avoiding other kinds of events that seems to be driving this conversation that I don't understand. Trying to follow that to its logical conclusion in either direction leads me to places that are obviously wrong, so I'm not sure how to reconcile that with the idea that this specification should seek to "ease event declaration and delivery across services, platforms and beyond." Whether sampling has been applied and how severely seems to me to be exactly the kind of thing you would want to be able to clearly communicate across services, platforms, and beyond. This spec is clear that it's not the job of the emitting service to define what actions should happen as a result of that event or even where it should be routed. 
There might be no consumers or many consumers for specific types of events. However, if the producer of the event does not include sufficient information in the event and its context to understand what that event represents, consumers will be unable to respond appropriately. That is why the |
|
IMO the spec shouldn't get into the business of defining noteworthy-ness. It's there to define how events look - and it's up to some other mechanism (e.g. the act of subscribing for certain types of events in a pub/sub system) to decide whether or not they should be included in the stream of events. To me this PR comes down to whether or not the notion of "sampling" is common enough to include this property in the core spec, or whether we should consider it as a well-known extension. Do others have a need for this property at this time? |
|
@duglin thanks for linking this with your comment in #101. It brings up a good point. Event-based observability systems are not new, but have so far really only been used at the very large companies - one reason being that they are expensive to run when observing a high throughput system. But cost models are changing and technology is changing and they are becoming more prevalent. You can see announcements from companies all over the map that are reaching into the event space. One way that these cost models are changing is through sampling - it has the power to change the throughput of a system by orders of magnitude rather than mere percentages. Sampling traffic is a core part of managing the volume of these event streams in many current production systems (including the Dapper paper from Google, Facebook's Scuba paper, the Jaeger system from Uber, and of course Honeycomb's product). More products and companies are understanding that there are only a few ways to manage an observability system's growth - by preaggregation (this is a metrics system), by measuring fewer things (which nobody wants) or by sampling the events getting sent. This problem is actually exacerbated in the cloud and SaaS world, since within a corporate data center, it is less common to be as concerned about network throughput as other costs. However, with cloud interop, transit is an important source of cost to consider. Nobody wants to "send everything and decide what's important later" because the effect is that you're paying for transit for 99% or 99.99% of the traffic you're then going to drop on the floor. If this spec wants to help unify the industry and where it's going, then event-based observability systems are an obvious candidate to include. Event-based observability systems rely on sampling to manage their volume and remain in a viable place on the cost/benefit curve. |
| * Constraints: | ||
| * OPTIONAL | ||
| * If present, MUST be a positive integer | ||
| * When absent means that no sampling has yet been applied |
nit: "yet been applied" is a bit confusing... maybe it will never be applied.
When events are used for business logic or ETLs, sampling is not really applicable. For example, user sign-up event or file upload.
My intent with that language is to reinforce that intermediary systems may apply additional sampling. How about “When absent, a default value of 1 MAY be assumed, indicating no sampling”? Would love additional phrasing suggestions.
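For illustration, a receiver following the constraints in the diff plus the "default value of 1" phrasing proposed in this comment might read the field as sketched below. The `effective_sample_rate` function and the dictionary-shaped context are assumptions for this example, and the default of 1 is only the wording under discussion, not agreed spec text.

```python
def effective_sample_rate(context):
    """Return the sample rate a receiver should apply to this event."""
    value = context.get("sampled-rate")
    if value is None:
        # Absent attribute: assume 1, meaning no sampling has been indicated.
        return 1
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError("sampled-rate MUST be a positive integer")
    return value


print(effective_sample_rate({}))                    # 1
print(effective_sample_rate({"sampled-rate": 30}))  # 30
```

Treating absence as 1 lets an intermediary multiply rates without special-casing events that were never sampled.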
My distinction is between discrete events and event streams. Discrete events are individually actionable. Event streams are typically actionable after you look at them in context. That difference has immediate architectural consequences, because discrete events are distributed and handled differently than event streams in practice. In industrial systems, as you will know, alarms (discrete and actionable) are often first-class concepts clearly delineated from the remainder of data that flows into historians. In our cloud infrastructure, we clearly delineate between discrete events (via Azure Event Grid) and system logs and metrics (via Azure Monitoring), which is nearly the exact same split. I lay out the differences here: http://redmonk.com/videos/events-data-points-and-messages-clemens-vasters-thingmonk-2017/ |
Signed-off-by: Ben Hartshorne <ben@honeycomb.io>
6af04fb to
fe5bb11
|
We briefly discussed this on the call today. We currently have 4 buckets into which an attribute can go:
The PR itself isn't asking for 1) so that wasn't an option. No one on the call chose to speak up in favor of option 2). There was a proposal for option 3) but we got side-tracked by a process discussion so we didn't vote on it. Instead we chose to wait a week and ask for input from the WG/community on this proposal. Please add comments to this PR expressing your opinion as to which of the options above you'd like to see us go with. Ideally, any opinion should also include some explanation of your reasoning (e.g. use cases) to help the WG members decide which category it should go into. We might discuss (and try to resolve) it next week... depending on where it lands in the agenda, so please speak up if you have an opinion. |
|
Actually, I think this should be #2, rather than being relegated to extension. |
|
Any other comments on this one? |
|
at 6/15 f2f, we agreed to have @inlined write this up as an extension |
This pull request supersedes cloudevents#76 and builds upon cloudevents#242. Per agreement at the f2f meeting 2018-06-15, this will be ratified as an extension but not currently as part of the core spec. The early inclusion of the spec and the ability for side-cars to impose sampling early in the pipeline make it reasonable to start with this as an extension. I know the original author hoped to make this a core feature, though there are two upsides:
1. Due to other decisions on 2018-06-15, this will have zero impact on JSON encoding (but will in Proto and may in SDKs)
2. Extensions now get a whole document. This lets us pitch sampling in more detail.
Necessary additional changes:
* Added `Integer` to our type system
* Added convention for how scalar extensions should be documented. I'm not really excited about this proposal so others are welcome.
Special thanks to @maplebed for kicking this effort off.
Signed-off-by: Thomas Bouldin <inlined@google.com>
|
Closing due to #243 |
* Add sampling extension. This pull request supercedes #76 and builds upon #242. Per agreement at the f2f meeting 2018-06-15, this will be ratified as an extension but not currently as part of the core spec. The early inclusion of the spec and the ability for side-cars to impose sampling early in the pipeline make it reasonable to start with this as an extension. I know the original author hoped to make this a core feature, though there are two upsides: 1. Due to other decisions on 2018-06-15, this will have zero impact on JSON encoding (but will in Proto and may in SDKs) 2. Extensions now get a whole document. This lets us pitch sampling in more detail. Necessary additional changes: * Added `Integer` to our type system * Added convention for how scalar extensions should be documented. I'm not really excited about this proposal so others are welcome. Special thanks to @maplebed for kicking this effort off. Signed-off-by: Thomas Bouldin <inlined@google.com> * PR feedback; also add conttent to extensions.md Signed-off-by: Thomas Bouldin <inlined@google.com> * @douglin doesn't like the world 'may' 😉 Signed-off-by: Thomas Bouldin <inlined@google.com> * Add 'e.g.' to list of scalar types in conventions Signed-off-by: Thomas Bouldin <inlined@google.com>
There are many cases in an Event's life when a system (either the system creating the event or a system transporting the event) may wish to only emit a portion of the events that actually happened. In a high throughput system where creating the event is costly, a system may wish to only create an event for 1/100 of the times that something happened. Additionally, during the transmission of an event from the source to the eventual recipient, any step along the way may choose to only pass along a fraction of the events it receives.
In order for the system receiving the event to understand what is actually happening in the system that generated the event, information about how many similar events may have happened must be included in the event itself. This field provides a place for a system generating an event to indicate that the emitted event represents a given number of other similar events. It also provides a place for intermediary transport systems to modify the event when they impose additional sampling.
Two examples for context:
Every time our API server at Honeycomb accepts an incoming HTTP request, it generates an Event identifying the request. It is not necessary to actually transmit every one of these events, so to save on server and network resources as well as minimize the necessary size for the receiving cluster, we only send on average 1 event for every 3000 requests received. The actual sample rate varies according to attributes of the incoming traffic, so the sample rate is necessarily a part of the context of the Event.
Deep in the storage engine for Honeycomb, there is an operation that generates an event. Much of the metadata that is necessary to give the data in that event necessary context is expensive to compute. In order to avoid spending precious computational time on generating metadata that will then be thrown out, we only generate the contextual data for events that will actually get sent. It is necessary to indicate the sample rate in the event emitted to get an accurate reflection of the actual operation of the service.
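The second example can be sketched in Python as follows: the sampling decision is made first, and the expensive context is built only for events that will actually be sent. The `emit_if_sampled` function, the `build_context` callback, and the metadata it returns are hypothetical stand-ins, not Honeycomb's implementation.

```python
import random


def emit_if_sampled(build_context, sample_rate, rng=random):
    """Decide whether to send before paying for expensive metadata."""
    if rng.randrange(sample_rate) != 0:
        return None  # dropped: build_context is never called
    event = build_context()  # expensive work happens only for kept events
    event["sampled-rate"] = sample_rate
    return event


built = []


def expensive_context():
    # Hypothetical stand-in for costly metadata generation.
    built.append(1)
    return {"data": {"operation": "segment-merge"}}


rng = random.Random(42)  # seeded only so the example is repeatable
sent = [e for e in (emit_if_sampled(expensive_context, 100, rng)
                    for _ in range(10_000)) if e is not None]
print(len(built) == len(sent))  # True: metadata was built only for sent events
```

Since the receiver never sees the dropped operations, the rate on each sent event is the only record that they happened.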
Signed-off-by: Ben Hartshorne <ben@honeycomb.io>