Skip to content

Workqueues are not used to handle all informer notifications #823

@grantr

Description

@grantr

/area API
/kind bug

Expected Behavior

Controllers that reconcile multiple resources should reconcile all resources without blocking other controllers.

Actual Behavior

Secondary resources block other controllers during reconcile and retry.

Detailed Explanation

Currently each controller reconciles the primary resource plus one or more secondary resources. For example, the configuration controller reconciles configurations with enqueueConfiguration, plus revisions with addRevisionEvent (which updates the configuration status with the latest revision names).

Each primary reconcile loop enqueues update notifications from the informer into a workqueue, then runs 2 worker goroutines processing those updates. The workqueue contains retry logic to allow for retries when reconcile fails.

None of the secondary reconcile loops use the above strategy. Each reconcile loop processes update notifications directly without enqueueing them in a workqueue. If I'm understanding https://github.com/kubernetes/client-go correctly, there are several issues with this:

  • Retries are done by the informer rather than the workqueue, with different retry behavior for each.
  • Reconcile blocks informers. Informers are global per process and all controllers run in the same process, so secondary reconcile handlers block all other controllers receiving events from that informer, and could possibly cause informers to miss watch updates (I don't know enough about the internals of watches and informers to say for sure).
  • Informer retries block the informer until the handler processes successfully, which could be never. In this case, no controller would receive another update from that informer.

Proposed fix(es)

Add a workqueue for each secondary resource reconciled by a controller. Each resource type needs a separate workqueue with separate queue workers, unless the workqueue setup is changed to allow different resource types in a single workqueue.

Alternately, move all secondary resource logic into the primary controller. For example, the code in addRevisionEvent in configuration controller could be moved to revision controller, where it could take advantage of the existing workqueue for revision updates.

A third alternative is to move secondary resource logic to separate special-purpose controllers. In the configuration controller example, addRevisionEvent would move to a new revision-readiness controller (name TBD). This helps preserve controller simplicity by defining them as a single reconciliation loop instead of a single controlled resource. In this solution, the 4 current Elafros controllers would be split into 10 controllers, but IMO the intent of each controller would be easier to understand.

Plea for expert help

This issue is deeper than my Kubernetes knowledge extends, so I'd like to get some advice from Kubernetes maintainers before deciding its importance and correct solution.

/cc @mattmoor @tcnghia

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/APIAPI objects and controllerskind/bugCategorizes issue or PR as related to a bug.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions