Allow extension services to be discovered #12222
paul-rogers wants to merge 7 commits into apache:master
|
Tests mostly pass. There is one IT failure, but it looks like a flake rather than an actual failure. Would some committer please rerun that one test? |
gianm
left a comment
The concept looks good to me, but please let me know what you think of the line-by-line comments.
And, on testing: the integration test ITHighAvailabilityTest has a case that exercises discovery of an extended node role (CliCustomNodeRole) via DruidNodeDiscoveryProvider. If you add a case that verifies this through the new router API and the sys.servers table, that would constitute end-to-end testing of the new functionality. Which would be great.
* `/druid/v2/router/cluster`
Returns a list of the servers registered within the cluster. Similar to
There was a problem hiding this comment.
Would the response format of this API be similar to, or exactly the same as, /druid/coordinator/v1/cluster? Ideally, if it's exactly the same as, we should say that; if it's merely similar we should outline the differences.
There was a problem hiding this comment.
As the docs state, the preferred solution is to query the system tables. Yet, as I tinker with clusters, I find it very easy to screw things up so that the cluster is too broken for SQL. This API is meant to be a light layer on top of ZK to diagnose such issues without having to fire up the ZK client and come up with a way to decode node payloads. Reworded the docs to highlight this idea.
There are slight differences between the formats of the two endpoints. The Coordinator one appears to be tailored to the needs of the Druid Console (maybe?)
The coordinator one is more heavily formatted to put services in some preferred order:
{'coordinator': [{'service': 'druid/coordinator',
                  'plaintextPort': 8081,
                  'host': 'coordinator-one'},
                 {'service': 'druid/coordinator',
                  'plaintextPort': 8081,
                  'host': 'coordinator-two'}],
 ...
 'broker': [{'service': 'druid/broker',
             'plaintextPort': 8082,
             'host': 'broker'}],
 'historical': [],
This one lists services alphabetically, including only those services actually running:
{'broker': [{'service': 'druid/broker',
             'host': 'broker',
             'plaintextPort': 8082}],
 'coordinator': [{'service': 'druid/coordinator',
                  'host': 'coordinator-one',
                  'plaintextPort': 8081},
                 {'service': 'druid/coordinator',
                  'host': 'coordinator-two',
                  'plaintextPort': 8081}],
 ...
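To make the difference between the two layouts concrete, here is a minimal sketch that checks whether the two excerpts describe the same running nodes once ordering and empty roles are ignored. It uses only the entries shown in the excerpts (the elided `...` portions are left out), and the helper name `flatten` is made up for illustration; it is not part of Druid.

```python
def flatten(cluster):
    """Reduce a role -> [node, ...] mapping to a set of (role, host, port) tuples."""
    return {
        (role, node["host"], node["plaintextPort"])
        for role, nodes in cluster.items()
        for node in nodes
    }

# Excerpt of the Coordinator-style response: fixed role order, empty roles kept.
coordinator_view = {
    "coordinator": [
        {"service": "druid/coordinator", "plaintextPort": 8081, "host": "coordinator-one"},
        {"service": "druid/coordinator", "plaintextPort": 8081, "host": "coordinator-two"},
    ],
    "broker": [{"service": "druid/broker", "plaintextPort": 8082, "host": "broker"}],
    "historical": [],
}

# Excerpt of the Router-style response: alphabetical, running roles only.
router_view = {
    "broker": [{"service": "druid/broker", "host": "broker", "plaintextPort": 8082}],
    "coordinator": [
        {"service": "druid/coordinator", "host": "coordinator-one", "plaintextPort": 8081},
        {"service": "druid/coordinator", "host": "coordinator-two", "plaintextPort": 8081},
    ],
}

# Same nodes in both views; only the presentation differs.
same_nodes = flatten(coordinator_view) == flatten(router_view)

# Roles the Coordinator-style view reports even though nothing is running.
empty_roles = sorted(role for role, nodes in coordinator_view.items() if not nodes)
```

For these excerpts the node sets match, and the only role present in one view but absent from the other is the empty `historical` entry.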
|
Applied requested changes. Added integration tests. These are quite hard to test, so let's wait for the build to tell us what fixes may be needed before doing another review. |
|
Having fun with ITs. In the previous commit, the one I added passed, multiple others failed. When restarted, the one I added failed, all others passed. Trying to find the cause of the intermittent failure. As it turns out, our auto-retry loop does not seem to provide the actual TestNG error so I'm flying blind. Will add logging and try the whole build again. |
|
Converting to draft. Hard to debug the IT failures. Will return later when the ITs are more usable. |
|
Rebased on latest master. |
|
Just wanted to make sure you saw this comment too, on the docs: #12222 (comment) |
|
Failed in an ARM 64 supervisor test which appears to be flaky. |
Thanks for the reminder. Fixed the issue and added a note to explain the purpose of the API. This will trigger a new build which may resolve the ARM 64 test noted above. |
|
BTW: the change in this PR should be tested in an IT. At present, adding such an IT is quite a chore. Waiting for the IT revision PR to pass tests and merge, then it will be easy to add the required test. |
rohangarg
left a comment
Thanks for the changes! LGTM % minor comments
|
No good deed goes unpunished: the changes from review comments broke something that shows up only in ITs (or an IT is flaky). Am investigating the failure. Turns out the IT in question won't run on the Mac. I'd convert to the new format, but the new ITs are blocked on failures on the old ITs. Can't spend more time on this now, converting to Draft and will revisit later. |
|
Closing for now; too difficult to alter the ITs. |
Description
Druid provides extensive extension support. Extensions can define new services, but those services cannot yet be discovered. This PR ensures that they operate like native services. See this ticket for details.
The current code has a static list of node roles. This PR moves the list into a Guice multi-binder so that extensions can add to the list.
Next, the SQL system servers table code is revised to use the Guice-provided list of node roles rather than the hard-coded list.
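The idea of replacing a static list with a set that extensions can contribute to can be sketched conceptually as follows. This is a plain-Python analogue, not the Guice multi-binder used in the PR; the class `NodeRoleRegistry`, the function `servers_table_roles`, and the role name `custom-node-role` are all hypothetical names chosen for illustration.

```python
class NodeRoleRegistry:
    """Holds built-in roles plus any roles contributed by extensions (hypothetical)."""

    def __init__(self, built_in):
        self._roles = list(built_in)

    def register(self, role):
        # Extensions call this at startup; duplicates are ignored.
        if role not in self._roles:
            self._roles.append(role)

    def roles(self):
        # Return an immutable snapshot so consumers cannot mutate the registry.
        return tuple(self._roles)


def servers_table_roles(registry):
    """A consumer (standing in for the servers table) reads the registry
    instead of a hard-coded list, so extension roles show up automatically."""
    return registry.roles()


registry = NodeRoleRegistry(["coordinator", "broker", "historical"])
registry.register("custom-node-role")  # contributed by a hypothetical extension
registry.register("broker")            # already present, so no effect

visible_roles = servers_table_roles(registry)
```

The point of the design is that consumers never name the roles themselves, so adding a role requires no change on the consumer side.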
Then, the `/druid/coordinator/v1/cluster` endpoint is extended to include extension roles after the Druid-defined roles. The code imposed a specific order on the roles; those rules are preserved.
Finally, a new endpoint `/druid/router/v1/cluster` is added. The logic here is that clients start with the router endpoint. To get the list of services from the Coordinator, they must first locate the Coordinator, which itself requires the list of services. To avoid this Catch-22, the new endpoint provides the list of services from the router itself.
Of course, the SQL servers system table also provides the list of services. The Catch-22 in this case is that if the cluster is broken, SQL is unavailable to help figure out which services are down. Having the native endpoint provides a reliable fallback: as long as the Router and ZK are up, we can learn about other services and see what is missing.
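The ordering rule for the Coordinator endpoint (Druid-defined roles first, in their existing order, then extension roles) might look roughly like the sketch below. The contents of `BUILT_IN_ORDER` and the choice to sort extension roles alphabetically are assumptions made for illustration; the description above does not specify either. The function name `coordinator_style` is likewise hypothetical.

```python
# Assumed preferred order for Druid-defined roles (illustrative only).
BUILT_IN_ORDER = ["coordinator", "overlord", "broker", "historical"]


def coordinator_style(discovered, built_in_order=BUILT_IN_ORDER):
    """Order a role -> nodes mapping: built-in roles first in fixed order
    (present even when empty), then extension roles, sorted by name."""
    result = {role: discovered.get(role, []) for role in built_in_order}
    for role in sorted(discovered):
        if role not in result:
            result[role] = discovered[role]
    return result


# Discovery results in arbitrary order, including one extension-defined role.
discovered = {
    "custom-node-role": [{"host": "custom", "plaintextPort": 9301}],  # hypothetical
    "broker": [{"host": "broker", "plaintextPort": 8082}],
    "coordinator": [{"host": "coordinator-one", "plaintextPort": 8081}],
}

ordered = coordinator_style(discovered)
role_order = list(ordered)
```

Because Python dicts preserve insertion order, `role_order` reflects the order in which the roles would be serialized.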
The PR includes some refactoring to make bits of role-related code usable in multiple contexts. Most of the code is not unit testable (we can't run servers in unit tests), but where it is testable, tests are added or modified.
Key changed/added classes in this PR
`NodeRoles` - A Guice multi-binder to hold the list of Druid- and extension-defined node roles.
This PR has: