SOLR-18094 Support running embedded-zk in "ensemble" mode with new node role #2391
…found" with an out-dated ClusterState (apache#2363)" This reverts commit 5c399dd.
This commit augments our embedded-ZK code to support running embedded-ZK
in "quorum" or ensemble mode. Multiple Solr nodes can now all have
their embedded-ZK's join a multi-node quorum upon startup. Other than
Solr and ZK sharing a process, the embedded-ZK ensemble behaves
identically to one formed of independent processes: nodes can join or
leave the cluster, etc.
Embedded-ensemble-ZK is enabled any time the `zkQuorumRun` system
property is present, along with an explicitly specified ZK host string.
On startup, Solr will identify which host in the zk-conn-string it
should be (based on admittedly hacky heuristics), and then spin up a
'ZooKeeperServerEmbedded' instance in-process to join the ensemble. e.g.
```
export LH="localhost"
bin/solr start -p 8983 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
bin/solr start -p 8984 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
bin/solr start -p 8985 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
```
Some notes:
- this doesn't (yet) work with ZK's dynamic-ensemble feature, so all
ZK nodes must be specified in a static ZK conn string provided at
startup
- this appears to run best when the security-manager is disabled.
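
The host-matching step is only described loosely here, so the following is a rough sketch rather than the PR's actual heuristic. It assumes a node can recognise its own entry by port alone, using Solr's usual embedded-ZK convention of "Solr port + 1000" (which matches the 8983/9983 pairs in the example above). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the heuristic used in this PR.
public class ZkQuorumHostSketch {

  /** Splits a ZK connection string such as "h1:9983,h2:9984/chroot" into host:port entries. */
  public static List<String> parseHosts(String zkHost) {
    int chroot = zkHost.indexOf('/');
    String hostsPart = chroot >= 0 ? zkHost.substring(0, chroot) : zkHost;
    List<String> hosts = new ArrayList<>();
    for (String h : hostsPart.split(",")) {
      if (!h.trim().isEmpty()) {
        hosts.add(h.trim());
      }
    }
    return hosts;
  }

  /**
   * Returns the index of the entry whose port equals solrPort + 1000, or -1 if none matches.
   * Assumption: matching on port only, which is enough for a single-machine demo but would
   * be ambiguous when several hosts share the same ZK port.
   */
  public static int findMyIndex(String zkHost, int solrPort) {
    int expectedZkPort = solrPort + 1000;
    List<String> hosts = parseHosts(zkHost);
    for (int i = 0; i < hosts.size(); i++) {
      String entry = hosts.get(i);
      int colon = entry.lastIndexOf(':');
      if (colon < 0) {
        continue; // entry without an explicit port cannot be matched by this sketch
      }
      if (Integer.parseInt(entry.substring(colon + 1)) == expectedZkPort) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    String zkHost = "localhost:9983,localhost:9984,localhost:9985";
    System.out.println(findMyIndex(zkHost, 8984)); // prints 1
  }
}
```

A real implementation would also have to compare hostnames or addresses, which is presumably where the "hacky" part comes in.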
FYI: https://cwiki.apache.org/confluence/display/SOLR/SIP-14+Embedded+Zookeeper for context on others' interest in moving this direction.
Great work. I think we should rip out the existing embedded ZK, which is a hack. Do you want to proceed in a PR or perhaps in a central feature branch?
…ion not found" with an out-dated ClusterState (apache#2363)"" This reverts commit 1fb376b.
88c8da4 to
de59401
Compare
Spent a few minutes pulling in the latest 'main'. Hoping to give it a bit more cleanup in the next few days if I can, but I've been hoping that for more than a year at this point (😬), so if anyone else is interested please feel free to move this forward!
@gerlowskija I took a stab at getting in the use of the solr roles to determine if we should run embedded zk quorum mode, and it worked! Here is my start script, where each runs with
# Conflicts: # solr/core/src/java/org/apache/solr/core/ZkContainer.java # solr/packaging/build.gradle
FYI, the |
Uses a base class SolrCloudWithEmbeddedZkQuorumTestCase New MiniSolrCloudCluster constructor that spins up a quorum cluster
…ing flag to not do the same thing twice Our code is just single threaded at this point...
I asked Claude to summarize the last seven commits and he said: Recent work on the I think my summary is: I discovered that we have TWO ways of starting ZK, not including our independent We now have:
I am now wondering could we convert SolrZkServer to using a Secondly, now seeing the use of Thirdly, why is I did take a stab at the |
Thanks for chipping away on this. At some point we'll have something that can be released in 10.1 or 10.2 as an experimental feature for folks to try out, without dynamic reconfig. Then we can tackle the dynamic stuff...
Thanks for taking a look at this, @janhoy. I think I've stalled a bit on this... It feels "so close", yet I'm not sure how to get it over the hump.
Yep, I got it up to speed with main, hardened the quorum test, and tried to add a test for stopping a quorum node and resuming operation. I think we need to focus on the startup modes / params to have a clear story for the three modes as you mentioned
For the managed embedded quorum, I believe we have proved that setting the node role explicitly and passing
Phase 1: Ship an experimental version with a working zk node-role that starts a quorum that is somewhat stable.
Phase 2: Make it easier to set up a small test cluster with some convenience flags (see above). Integrate into the Solr operator.
Phase 3: Good support for SSL between Solr and the embedded ZK quorum, using either self-provided certs or certs self-signed during startup. Note that we can have a non-SSL Solr API while still having SSL between Solr and ZooKeeper.
Phase 4: Support dynamic reconfiguration. I don't know how important that really is. In k8s, service names are stable and you don't really need to reconfigure ZK, except when changing the size of your quorum. But in self-managed infrastructure, you may want to retire some servers and move ZK over to another set of nodes by reconfiguring in two phases.
I wonder if the -c mode and the
I don't like that kind of magic. I hope we can introduce this as an additional mode, then once it proves itself and is stable, we can make this mode the default when just starting a single node. But then we're at phase 4, where we have some degree of dynamic configuration and support for growing a quorum. I don't think that is important in the first few releases. I come to value stability and some degree of static non-mutable config. So it is not a crisis if you need to specify the
that makes sense... I suspect you have a lot more experience in running clusters at scale than I do!
}
// TODO - should this code go in SolrZkServer to augment or replace its current
// capabilities? Doing so
// would definitely keep ZkContainer cleaner...
@gerlowskija I removed NOCOMMIT in this comment to unblock other precommit tests. Making this review comment to make sure we don't forget about it :)
I created SOLR-18094 as jira for this issue (sub-task of the SIP JIRA). Let's focus on getting this PR in shape for releasing the building blocks of the Zk code and the node-role. Target 10.1 as an experimental feature, can document it as such and solicit feedback from users. I made Precommit run, will spend some time some day to try to do a more thorough review. Let's make review comments for all known rough edges and things lacking.
Description
Prior to this commit, Solr only supported running embedded-ZK in "standalone" mode (which cannot take part in any larger ZK ensemble or quorum). But there are use cases that would benefit from being able to do this, both on the development and testing side, and even for adventurous users who might want the benefits of a small multi-node SolrCloud cluster without the headache of also deploying ZK.
See SIP-14, SOLR-15636 for context
Solution
This commit augments our embedded-ZK code to support running embedded-ZK
in "quorum" or ensemble mode. Multiple Solr nodes can now all have
their embedded-ZK's join a multi-node quorum upon startup. Other than
Solr and ZK sharing a process, the embedded-ZK ensemble behaves
identically to one formed of independent processes: nodes can join or
leave the cluster, etc.
Embedded-ensemble-ZK is enabled any time the `zkQuorumRun` system
property is present, along with an explicitly specified ZK host string.
On startup, Solr will identify which host in the zk-conn-string it
should be (based on admittedly hacky heuristics), and then spin up a
'ZooKeeperServerEmbedded' instance in-process to join the ensemble.
Some notes:
- this doesn't (yet) work with ZK's dynamic-ensemble feature, so all
ZK nodes must be specified in a static ZK conn string provided at
startup
- this appears to run best when the security-manager is disabled.
- the interface in particular for how we expose this is pretty rough, and there's a lot of room for improvement.
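
To illustrate why a static connection string is sufficient for a fixed ensemble, here is a hedged sketch of how such a string could be expanded into ZooKeeper `server.N=host:peerPort:electionPort;clientPort` entries. The class name, the 1-based ID assignment by position, and the peer/election port offsets are all assumptions made for illustration. This PR may derive them quite differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: port offsets and ID assignment are assumptions, not taken from the PR.
public class StaticEnsembleConfigSketch {

  /**
   * Turns "h1:9983,h2:9984" into ZooKeeper server.N lines.
   * Assumes every entry has an explicit client port and that there is no chroot suffix.
   */
  public static List<String> serverLines(String zkHost, int peerOffset, int electionOffset) {
    List<String> lines = new ArrayList<>();
    String[] entries = zkHost.split(",");
    for (int i = 0; i < entries.length; i++) {
      String[] hostPort = entries[i].trim().split(":");
      String host = hostPort[0];
      int clientPort = Integer.parseInt(hostPort[1]);
      int serverId = i + 1; // assumption: IDs follow position in the connection string
      lines.add("server." + serverId + "=" + host
          + ":" + (clientPort + peerOffset)
          + ":" + (clientPort + electionOffset)
          + ";" + clientPort);
    }
    return lines;
  }

  public static void main(String[] args) {
    String zkHost = "localhost:9983,localhost:9984,localhost:9985";
    for (String line : serverLines(zkHost, 1000, 2000)) {
      System.out.println(line);
    }
  }
}
```

Because every node is started with the same string, each one can compute the same member list independently, which is what makes the static approach workable without dynamic reconfiguration.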
Tests
Checklist
Please review the following and check all that apply:
`main` branch. `./gradlew check`.