We've talked about correlation in a few PRs (#128, #138, and other discussions around sequence numbers that I can't find).
I'd like to move those discussions here, where we can debate scope and approach in one place and make sure every proposal is scoped appropriately.
I propose that there are two kinds of correlation possible:
- Sequence correlation. Events may be related to others by ordering or causality. Ordered sequence correlation may be achieved using the `eventTimestamp` and `source` fields, though limited precision and clock skew may introduce error; a vector clock would fix this if we wanted to officially support the use case. Causality sequencing would require a new context attribute like "causedBy", or a weaker-sounding property like "precededBy" that would handle sequence correlation as well. The gotcha with these attributes is that they cause head-of-queue blocking, and I'm not sure what a system should do if the precededBy event were never received.
- Attribute correlation. Events could expose data, possibly redundant with fields inside `data`, that is explicitly transparent to routing software. A subscription could choose to receive a limited event stream by filtering for matching IDs only, or could enforce affinity based on that ID.
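
To make the head-of-queue concern concrete, here is a minimal sketch of what a consumer honoring "precededBy" might have to do. Everything in it is an assumption for illustration: `precededBy` is the hypothetical attribute from the first bullet, I'm assuming it carries the `eventID` of the predecessor, and `PrecededByBuffer` is a made-up helper, not anything in the spec.

```python
from collections import defaultdict


class PrecededByBuffer:
    """Holds events until the event named in their precededBy has been released.

    Hypothetical helper: assumes precededBy carries the predecessor's eventID.
    """

    def __init__(self):
        self.delivered = set()            # eventIDs already released downstream
        self.waiting = defaultdict(list)  # predecessor eventID -> blocked events

    def receive(self, event):
        """Return the events that can be released now, predecessors first."""
        predecessor = event.get("precededBy")
        if predecessor is not None and predecessor not in self.delivered:
            # Head-of-queue blocking: park the event until its predecessor arrives.
            self.waiting[predecessor].append(event)
            return []
        return self._release(event)

    def _release(self, event):
        released = []
        stack = [event]
        while stack:
            current = stack.pop()
            self.delivered.add(current["eventID"])
            released.append(current)
            # Anything that was waiting on this event is now unblocked.
            stack.extend(reversed(self.waiting.pop(current["eventID"], [])))
        return released

    def blocked(self):
        """Events still waiting on a predecessor that has not been seen."""
        return [e for events in self.waiting.values() for e in events]
```

If the predecessor never arrives, the dependent event stays in `blocked()` indefinitely. The sketch deliberately has no timeout or dead-letter policy, because choosing one is exactly the open question.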
In my mental model, if one were to use SQL over a stream of CloudEvents, case (1), sequence correlation, is an ORDER BY clause, and case (2), attribute correlation, allows a WHERE or GROUP BY clause. An individual query could compose (1) and multiple instances of (2).
Note that I think it is inappropriate for (2) to actually pre-determine the correlation. The actual GROUP BY or WHERE clause should be part of the query, not part of the data structure.
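
As a rough illustration of that mental model, the sketch below plays the role of the SQL query over an in-memory list of events. The `query` function, the `orderId` attribute, and the sample values are all made up for the example; the only thing taken from the discussion above is ordering on `eventTimestamp` and `source`, with the filter and grouping key supplied by the caller.

```python
from collections import defaultdict
from datetime import datetime


def query(events, where=None, group_by=None):
    """ORDER BY (eventTimestamp, source), with optional WHERE and GROUP BY."""
    selected = [e for e in events if where is None or where(e)]
    # Case (1): order by timestamp, using source as a tie-breaker.
    selected.sort(
        key=lambda e: (datetime.fromisoformat(e["eventTimestamp"]), e["source"])
    )
    if group_by is None:
        return selected
    # Case (2): the grouping key is picked by the caller at query time,
    # not baked into the event itself.
    groups = defaultdict(list)
    for e in selected:
        groups[e.get(group_by)].append(e)
    return dict(groups)


# Hypothetical events; "orderId" stands in for any exposed correlation attribute.
events = [
    {"eventID": "3", "source": "/orders", "eventTimestamp": "2018-04-05T17:31:02+00:00", "orderId": "A"},
    {"eventID": "1", "source": "/orders", "eventTimestamp": "2018-04-05T17:31:00+00:00", "orderId": "A"},
    {"eventID": "4", "source": "/billing", "eventTimestamp": "2018-04-05T17:31:01+00:00", "orderId": "A"},
    {"eventID": "2", "source": "/orders", "eventTimestamp": "2018-04-05T17:31:01+00:00", "orderId": "B"},
]

ordered = query(events)
per_order = query(events, where=lambda e: e["source"] == "/orders", group_by="orderId")
```

The same events can be grouped by `orderId` in one subscription and by some other attribute in another, which is why I'd keep the grouping decision out of the event.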