Skip to content

Allow extension services to be discovered#12222

Closed
paul-rogers wants to merge 7 commits intoapache:masterfrom
paul-rogers:220201-role
Closed

Allow extension services to be discovered#12222
paul-rogers wants to merge 7 commits intoapache:masterfrom
paul-rogers:220201-role

Conversation

@paul-rogers
Copy link
Copy Markdown
Contributor

@paul-rogers paul-rogers commented Feb 1, 2022

Description

Druid provides extensive extension support. Extension can define new services, but those services cannot yet be discovered. This PR ensures that they operate like native services. See this ticket for details.

The current code has a static list of node roles. This PR moves the list into a Guice multi-binder so that extensions can add to the list.

Next, the SQL system servers table code is revised to use the Guice-provided list of node roles rather than the hard-coded list.

Then, the /druid/coordinator/v1/cluster is added to include extension roles after the Druid-defined roles. The code imposed a specific order on the roles; those rules are preserved.

Finally, a new endpoint /druid/router/v1/cluster is added. The logic here is that clients start with the router endpoint. To get the list of services, they must first get the list of services to get the coordinator. But, they can't get the list of services without first haven gotten that list. To avoid this Catch-22, the new endpoint provides the list of services from the router itself.

Of course, the SQL servers system table also provides the list of services. The Catch-22 in this case is that if the cluster is broken, SQL is unavailable to help figure out which services are down. Having the native endpoint provides a reliable fallback: as long as the Router and ZK are up, we can learn about other services and see what is missing.

The PR includes some refactoring to make bits of role-related code usable in multiple contexts. Most of the code is not unit testable (we can't run servers in unit tests), but where it is testable, tests are added or modified.


Key changed/added classes in this PR
  • NodeRoles - A Guice multi-binder to hold the list of Druid- and extension-defined node roles.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster, in the context of the actual extension service which this PR supports.

@paul-rogers
Copy link
Copy Markdown
Contributor Author

Tests mostly pass. There is one IT failure, but it looks like a flake rather than an actual failure. Would some committer please rerun that one test?

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concept looks good to me, but please let me know what you think of the line-by-line comments.

And, on testing: the integration test ITHighAvailabilityTest has a case that exercises discovery of an extended node role (CliCustomNodeRole) via DruidNodeDiscoveryProvider. If you add a case that verifies this through the new router API and the sys.servers table, that would constitute end-to-end testing of the new functionality. Which would be great.

Comment thread docs/operations/api-reference.md Outdated

* `/druid/v2/router/cluster`

Returns a list of the servers registered within the cluster. Similar to
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the response format of this API be similar to, or exactly the same as, /druid/coordinator/v1/cluster? Ideally, if it's exactly the same as, we should say that; if it's merely similar we should outline the differences.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the docs state, the preferred solution is to query the system tables. Yet, as I tinker with clusters, I find it very easy to screw things up so that the the cluster is too broken for SQL. This API is meant to be a light layer on top of ZK to diagnose such issues without having to fire up the ZK client and come up with a way to decode node payloads. Reworded the docs to highlight this idea.

There are slight differences between the formats of the two endpoints. The Coordinator one appears to be tailored to the needs of the Druid Console (maybe?)

The coordinator one is more heavily formatted to put services in some preferred order:

{'coordinator': [{'service': 'druid/coordinator',
   'plaintextPort': 8081,
   'host': 'coordinator-one'},
  {'service': 'druid/coordinator',
   'plaintextPort': 8081,
   'host': 'coordinator-two'}],
 ...
 'broker': [{'service': 'druid/broker',
   'plaintextPort': 8082,
   'host': 'broker'}],
 'historical': [],

This one lists services alphabetically, including only those services actually running:

{'broker': [{'service': 'druid/broker',
   'host': 'broker',
   'plaintextPort': 8082}],
 'coordinator': [{'service': 'druid/coordinator',
   'host': 'coordinator-one',
   'plaintextPort': 8081},
  {'service': 'druid/coordinator',
   'host': 'coordinator-two',
   'plaintextPort': 8081}],
...

Comment thread server/src/main/java/org/apache/druid/discovery/NodeRoles.java Outdated
Comment thread server/src/main/java/org/apache/druid/server/Node.java
Comment thread server/src/test/java/org/apache/druid/server/NodeTest.java
@paul-rogers
Copy link
Copy Markdown
Contributor Author

Applied requested changes. Added integration tests. These are quite hard to test, so let's wait for the build to tell us what fixes may be needed before doing another review.

Comment thread server/src/main/java/org/apache/druid/server/Node.java Outdated
Comment thread server/src/main/java/org/apache/druid/server/Node.java Outdated
@paul-rogers
Copy link
Copy Markdown
Contributor Author

Having fun with ITs. In the previous commit, the one I added passed, multiple others failed. When restarted, the one I added failed, all others passed. Trying to find the cause of the intermittent failure. As it turns out, our auto-retry loop does not seem to provide the actual TestNG error so I'm flying blind. Will add logging and try the whole build again.

@paul-rogers paul-rogers marked this pull request as draft April 7, 2022 17:49
@paul-rogers
Copy link
Copy Markdown
Contributor Author

Converting to draft. Hard to debug the IT failures. Will return later when the ITs are more usable.

@paul-rogers
Copy link
Copy Markdown
Contributor Author

Rebased on latest master.

Comment thread docs/operations/api-reference.md Outdated
@gianm
Copy link
Copy Markdown
Contributor

gianm commented Jun 13, 2022

Just wanted to make sure you saw this comment too, on the docs: #12222 (comment)

@paul-rogers
Copy link
Copy Markdown
Contributor Author

Failed in an ARM 64 supervisor test which appears to be flaky.

@paul-rogers
Copy link
Copy Markdown
Contributor Author

Just wanted to make sure you saw this comment too, on the docs: #12222 (comment)

Thanks for the reminder. Fixed the issue and added a note to explain the purpose of the API. This will trigger a new build which may resolve the ARM 64 test noted above.

@paul-rogers paul-rogers marked this pull request as ready for review June 14, 2022 16:21
@paul-rogers
Copy link
Copy Markdown
Contributor Author

BTW: the change in this PR should be tested in an IT. At present, adding such an IT is quite a chore. Waiting for the IT revision PR to pass tests and merge, then it will be easy to add the required test.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! LGTM % minor comments

Comment thread server/src/main/java/org/apache/druid/initialization/Initialization.java Outdated
Comment thread services/src/main/java/org/apache/druid/server/http/RouterResource.java Outdated
Comment thread sql/src/main/java/org/apache/druid/sql/calcite/schema/SystemSchema.java Outdated
Comment thread server/src/main/java/org/apache/druid/discovery/NodeRoles.java Outdated
Comment thread server/src/main/java/org/apache/druid/discovery/NodeRoles.java Outdated
@paul-rogers
Copy link
Copy Markdown
Contributor Author

paul-rogers commented Jun 17, 2022

No good deed goes unpunished: the changes from review comments broke something that shows up only in ITs (or an IT is flaky). Am investigating the failure.

Turns out the IT in question won't run on the Mac. I'd convert to the new format, but the new ITs are blocked on failures on the old ITs. Can't spend more time on this now, converting to Draft and will revisit later.

@paul-rogers paul-rogers marked this pull request as draft June 17, 2022 20:21
@paul-rogers
Copy link
Copy Markdown
Contributor Author

Closing for now; too difficult to alter the ITs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants