Refine bus monitor #88

Merged

google-prow-robot merged 22 commits into knative:master from scothis:bus-monitor on Jun 25, 2018

Conversation

@scothis (Contributor) commented Jun 13, 2018

Proposed Changes

  • Allow the monitor handler to return an error so the event is requeued (see the sketch below)
  • Filter handler calls to resources for the monitored bus
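
A minimal sketch of the requeue-on-error behavior from the first bullet, assuming the standard client-go workqueue pattern; the field and method names here are illustrative, not necessarily the PR's actual code:

package buses

import "k8s.io/client-go/util/workqueue"

// Monitor sketch: only the fields needed to show the requeue behavior.
type Monitor struct {
	workqueue   workqueue.RateLimitingInterface
	syncHandler func(key string) error
}

// A non-nil error from the handler puts the key back on the rate-limited
// queue so the event is retried; success clears its rate-limit history.
func (m *Monitor) processWorkItem(key string) {
	if err := m.syncHandler(key); err != nil {
		m.workqueue.AddRateLimited(key)
		return
	}
	m.workqueue.Forget(key)
}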

@google-prow-robot google-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 13, 2018
Comment thread pkg/buses/monitor.go Outdated
}

func (m *Monitor) removeBus(namespace string, name string) error {
	// nothing to do
Contributor:

Should this mark this monitor helper as "unusable"?

Comment thread pkg/buses/monitor.go
func (m *Monitor) isChannelKnown(subscription channelsv1alpha1.Subscription) bool {
	channelKey := makeChannelKeyFromSubscription(subscription)
	summary := m.getChannelSummary(channelKey)
	return summary != nil && summary.Channel != nil
Contributor:

Why the need for summary.Channel != nil? (Similar checks appear in other places.) It seems this should already be satisfied by the map lookup.

Contributor Author (@scothis):

In the case where a subscription resource is loaded before the channel resource, there won't be a channel in the map yet.

Contributor:

Isn't that the summary != nil part? My point is: can summary be non-nil but with its Channel member nil? I would expect not.
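
A minimal sketch of the situation @scothis describes, reusing the channelSummary, channelKey, and subscriptionKey names that appear elsewhere in this thread (the helper below is illustrative): when a Subscription is observed before its Channel, the cache entry exists but its Channel field is still nil, so both checks are needed.

// The cache entry can exist before the Channel resource itself is seen.
type channelSummary struct {
	Channel       *channelsv1alpha1.Channel
	Subscriptions map[subscriptionKey]*channelsv1alpha1.Subscription
}

// Illustrative helper: when a Subscription arrives first, an entry is
// created so the subscription has somewhere to live, but Channel stays
// nil until the Channel resource is observed.
func (m *Monitor) getOrCreateChannelSummary(key channelKey) *channelSummary {
	summary, ok := m.cache[key]
	if !ok {
		summary = &channelSummary{
			Subscriptions: map[subscriptionKey]*channelsv1alpha1.Subscription{},
		}
		m.cache[key] = summary
	}
	return summary
}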

@scothis (Contributor Author) commented Jun 19, 2018

@vaikas-google We think this is about ready to go. Can you take a look?

@evankanderson (Member):

> > who do you see provisioning a new Bus, a cluster operator or an application developer?
>
> If I had to pick either the developer or operator persona as the bus deployer, I'd hedge and say it falls squarely in the middle of the two, but ultimately, if forced, I'd lean towards developer (though that may be my own bias as a developer).
>
> I expect most clusters will have a single bus that handles the vast majority of channels, with maybe one or two other exotic buses to utilize some special capability. For example, I don't expect a cluster to have both the Pub/Sub and Kafka buses deployed, as their capabilities are so similar.

Interesting -- I'd have thought that provisioning a bus would fall squarely on the operator. Would you expect the developer or the operator to provision the Kafka cluster in the case of the Kafka bus?

@scothis (Contributor Author) commented Jun 21, 2018

> Interesting -- I'd have thought that provisioning a bus would fall squarely on the operator. Would you expect the developer or the operator to provision the Kafka cluster in the case of the Kafka bus?

That's a fair point. I'd expect the operator to be actively involved with deploying Kafka since that's certainly not set-and-forget.

@scothis (Contributor Author) commented Jun 21, 2018

@vaikas-google @evankanderson what are the next steps here?

@vaikas (Contributor) commented Jun 21, 2018

I think, in the name of continuing the prototyping work, figuring out the rough shape of the APIs involved, and getting the data plane wired into bindings, I'd be fine with getting this merged. However, in parallel we should probably set aside some time to think through what the control plane API looks like and what belongs there and what doesn't. I could certainly see separate steps:

  1. Create the bus [say, deploy kafka using helm, or whatever]
  2. There's a controller that then knows about this specific bus
  3. CRDs then only represent control interfaces necessary for the wiring

Seems like there are some concerns with the current definitions, so we should have some design discussions about the direction we all feel comfortable with, but I'm also sensitive to making some progress with the understanding that these things might have to change. @evankanderson thoughts?

@evankanderson (Member):

@grantr pointed me to https://kubernetes.io/docs/concepts/storage/storage-classes/, which seems to be addressing a similar problem around many possible implementations of storage for persistent volumes in kubernetes.

Since my concern is actually about #66, rather than this PR, I don't necessarily want to block getting this in. But I'd like to open a discussion about how to refactor the existing bus definitions towards a CRD which might look more like the StorageClass style, e.g.

apiVersion: eventing.knative.dev/v1alpha1
kind: Bus
metadata:
  name: prod-kafka
provisioner: kafka
parameters:  # A map[string]string
  connectString: "mykafka..."
  reclaimAfter: "20d"
  ...

@evankanderson (Member):

@mattmoor also had some good observations this morning that a Bus feels like it is more likely to be a cluster-wide resource (thinking about Kafka or the equivalent), rather than a per-namespace resource.

If a Bus is cluster-wide, that also suggests provisioning by an operator. It seems like Channels should probably be per-namespace, which means that the Channel reference to the bus should include both a namespace and a name, unless the names are explicitly cluster-wide (i.e. scope: Cluster on the Bus CRD definition).

@evankanderson (Member):

Thanks for the explanation of the monitor -- it wasn't clear that it was actually a controller framework.

I think my main question/point was that it should be possible to have multiple controllers managing resources of the same type, as long as ownership is clearly defined between controllers. (I.e. you could have a Kafka controller and a PubSub controller both watching Channels or Buses, as long as it's clear from the resource definitions which one should act on that particular resource.) We'd probably also want a webhook to keep resources from changing from one controller regime to another without a delete.
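
A minimal sketch of the webhook check suggested above, assuming a Channel spec with a Bus field naming its controller regime; the function and the import path are illustrative, not the actual eventing API:

package buses

import (
	"fmt"

	channelsv1alpha1 "github.com/knative/eventing/pkg/apis/channels/v1alpha1" // assumed path
)

// A Channel may not move between controller regimes (buses); users must
// delete and recreate it instead.
func validateChannelUpdate(old, updated *channelsv1alpha1.Channel) error {
	if old.Spec.Bus != updated.Spec.Bus {
		return fmt.Errorf("channel %q cannot change bus from %q to %q; delete and recreate it instead",
			updated.Name, old.Spec.Bus, updated.Spec.Bus)
	}
	return nil
}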

Comment thread pkg/buses/monitor.go
	subscriptionsSynced      cache.InformerSynced
	cache                    map[channelKey]*channelSummary
	provisionedChannels      map[channelKey]*channelsv1alpha1.Channel
	provisionedSubscriptions map[subscriptionKey]*channelsv1alpha1.Subscription
Contributor:

What are the cache, provisionedChannels, and provisionedSubscriptions maps used for? Is the information in them recoverable if the monitor pod is restarted?

Contributor Author (@scothis):

Yes, these caches are idempotent and will repopulate from the informer when the pod restarts.

@grantr (Contributor) commented Jun 21, 2018

> keeps a queryable cache for fast runtime lookups in the dispatcher.

Is this cache different from an informer? Informers are already caches that automatically stay current with Kubernetes state and can be queried by name and selector (see the generated Listers).
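
For illustration, the generated Lister query @grantr refers to might look like the following; the import paths are assumed, but the Channels(namespace).List(selector) shape is the standard client-go codegen convention:

package buses

import (
	"k8s.io/apimachinery/pkg/labels"

	channelsv1alpha1 "github.com/knative/eventing/pkg/apis/channels/v1alpha1"  // assumed path
	listers "github.com/knative/eventing/pkg/client/listers/channels/v1alpha1" // assumed path
)

// Listers read from the informer's in-memory cache, which watches keep
// current; there is no API-server round trip per query.
func listChannels(lister listers.ChannelLister, namespace string) ([]*channelsv1alpha1.Channel, error) {
	return lister.Channels(namespace).List(labels.Everything())
}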

@scothis (Contributor Author) commented Jun 21, 2018

I'm in full agreement that the model proposed here needs further iteration. #66 was a greenfield spike; this PR applies the lessons from implementing the PubSub (#98) and Kafka (#99) buses. With these PRs I feel much more comfortable that we can iterate on the model quickly while testing our assumptions with running code.

The notion of StorageClasses as a model for a bus is something we had talked about in the context of riff; it's interesting to see the same idea arise organically. One possible downside of following that model is that storage is often a tradeoff between speed and cost; many of the classes are simply named fast and slow. Messaging is a much richer domain with several other dimensions and tradeoffs. I need to read more about how storage classes are implemented.

The more I think about it, the more I agree that a Bus is primarily an operator construct. There is a direct analogy between a message broker and a database. In the service broker world, the operator decides which services and configurations are available, while the developer consumes those services.

> I think my main question/point was that it should be possible to have multiple controllers managing resources of the same type, as long as ownership is clearly defined between controllers. (I.e. you could have a Kafka controller and a PubSub controller both watching Channels or Buses, as long as it's clear from the resource definitions which one should act on that particular resource.) We'd probably also want a webhook to keep resources from changing from one controller regime to another without a delete.

I completely agree. The monitor is configured for a single bus, but receives all channels and subscriptions via the informers. The first thing the monitor does is filter out channels that are for other buses and subscriptions that are for other channels (see the sketch below). The handler funcs on the monitor are only fired for types that the bus needs to handle. There's a bit of extra complexity in this implementation to handle the case of a channel switching buses, or a subscription switching channels. I agree that's a case we should disallow, and we can simplify the monitor accordingly.
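
A minimal sketch of that filtering, reusing isChannelKnown from the snippet above; the Spec.Bus field and the monitor's busName field are assumptions about the resource shape, not confirmed by this PR:

// Resources for other buses are dropped before any handler func fires.
func (m *Monitor) shouldHandleChannel(channel *channelsv1alpha1.Channel) bool {
	// Only channels that name this monitor's bus are handled.
	return channel.Spec.Bus == m.busName
}

func (m *Monitor) shouldHandleSubscription(subscription *channelsv1alpha1.Subscription) bool {
	// Only subscriptions whose channel belongs to this bus are handled.
	return m.isChannelKnown(*subscription)
}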

@scothis (Contributor Author) commented Jun 22, 2018

> Is this cache different from an informer? Informers are already caches that automatically stay current with Kubernetes state and can be queried by name and selector (see the generated Listers).

There are two key purposes behind these caches:

  1. allow a bus to look up all active subscribers for a channel (see the sketch after this list). Since a channel can have 0..N subscriptions, reading from the informer cache is more than an identity lookup. We don't have labels on the resources that would make for an effective selector, but even if we did, I doubt we'd want to run that query for every request in the data plane.

  2. tracking provisioned resources helps with cleanup after a resource is deleted. Since there are potentially multiple buses operating on channels and subscriptions, the monitor figures out whether the deleted resource was one that it needs to care about. This gets even more useful if there are params/arguments for a channel that are applicable during deletion (like preserving messages in a queue instead of discarding them).

    Based off the conversation we had in slack today, an alternate approach would be to set a key on the bus' status for each channel that is provisioned (and the channel-subscription relationship). This would require that we ensure the master resource would never be deleted until all dependent resources have been deleted, otherwise we could leak resources.
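
A minimal sketch of the first purpose, reusing getChannelSummary and the key types from the snippets above; the method name is illustrative. The dispatcher resolves a channel's subscribers with a single map lookup per request:

// Subscriptions returns the active subscriptions for a channel without
// scanning the informer cache on every data-plane request.
func (m *Monitor) Subscriptions(key channelKey) []channelsv1alpha1.Subscription {
	summary := m.getChannelSummary(key)
	if summary == nil || summary.Channel == nil {
		return nil
	}
	subscriptions := make([]channelsv1alpha1.Subscription, 0, len(summary.Subscriptions))
	for _, subscription := range summary.Subscriptions {
		subscriptions = append(subscriptions, *subscription)
	}
	return subscriptions
}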

Comment thread pkg/buses/monitor.go Outdated
// apply arguments
if arguments != nil {
	for _, arg := range *arguments {
		if _, ok := known[arg.Name]; ok {
Contributor:

The condition is inverted. It should read if !ok.

Contributor Author (@scothis):

fixed
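
For reference, the corrected form of the excerpt with the condition flipped as suggested:

// apply arguments
if arguments != nil {
	for _, arg := range *arguments {
		if _, ok := known[arg.Name]; !ok {
			// handle the argument (body as in the original diff, not shown here)
		}
	}
}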

Comment thread pkg/buses/monitor.go
m.bus = bus

glog.Info("Starting workers")
// Launch two workers to process Bus resources
Contributor:

The comment shouldn't say "two"; the number of workers depends on the value of threadiness.


@markfisher (Contributor):

/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2018
@ericbottard (Contributor):

> @mattmoor also had some good observations this morning that a Bus feels like it is more likely to be a cluster-wide resource (thinking about Kafka or the equivalent), rather than a per-namespace resource.
>
> If a Bus is cluster-wide, that also suggests provisioning by an operator. It seems like Channels should probably be per-namespace, which means that the Channel reference to the bus should include both a namespace and a name, unless the names are explicitly cluster-wide (i.e. scope: Cluster on the Bus CRD definition).

Not sure this is exactly what you're referring to, but indeed in the Kafka Bus implementation, for example, I craft the Kafka topic name (which can be seen as an implementation detail from the channel user's perspective) as a combination of the channel name and namespace.

This way, assuming there is one Bus Custom Resource instance for one Kafka broker instance (certainly both the responsibility of an operator), channels from any namespace can use it without clashing.

If, however, we envision that many Bus Custom Resource instances can share the same backing broker, then we'll want to factor in the name (and possibly the namespace, if we keep it namespaced) of the Bus itself.
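
A minimal sketch of that naming scheme; the separator and the function name are illustrative, not the actual Kafka bus code:

package buses

import "fmt"

// topicName combines a channel's namespace and name so that same-named
// channels in different namespaces map to distinct Kafka topics.
func topicName(namespace, channel string) string {
	return fmt.Sprintf("%s.%s", namespace, channel)
}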

IMO the "compromise" that makes the most sense is:

  • Bus is cluster scoped
  • Channels and Subscriptions are namespaced
  • There is the implicit assumption that there is one backing broker per Bus

This is certainly something we had to consider working with riff, and there is a whole range of scenarios to cover.

@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 25, 2018
@vaikas (Contributor) commented Jun 25, 2018

/lgtm
/approve

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 25, 2018
@google-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: scothis, vaikas-google

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-prow-robot google-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 25, 2018
@google-prow-robot google-prow-robot merged commit ea71285 into knative:master Jun 25, 2018
@scothis scothis deleted the bus-monitor branch June 25, 2018 22:57

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
