Add ensemble relocation command which adheres to placement policy #2931
Conversation
Force-pushed cb41b1e to e3d3d00.
eolivelli left a comment:
This is great work! I did a first pass; it is a big patch, so we'll need more eyes.
```java
if (!excludeBookies.contains(currentNode) && predicate.apply(currentNode, ensemble)) {
    if (ensemble.addNode(currentNode)) {
        // add the candidate to exclude set
        excludeBookies.add(currentNode);
```
We should not modify the excludeBookies set passed as an argument; please create a copy at the beginning of the method, otherwise this method will have unwanted side effects.
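For illustration, a minimal sketch of the suggested defensive copy (the variable names are assumed, not taken from the patch):

```java
// Sketch only: copy the incoming set once, then mutate the local copy.
Set<BookieNode> localExcludes = new HashSet<>(excludeBookies);
// ... candidate selection reads and updates localExcludes;
// the caller's excludeBookies set is left untouched.
```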
The excludeBookies set is modified to notify the calling method of excluded bookies, like below:

https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L671-L675

However, as you said, it might have unwanted side effects. Therefore, I'll modify the code so that the calling method adds to excludeBookies.
```diff
diff --git a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
index d9167c321..fa3e59cd6 100644
--- a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
+++ b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
@@ -1131,6 +1131,13 @@ public class RackawareEnsemblePlacementPolicyImpl extends TopologyAwareEnsembleP
                 prevNode = replaceToAdherePlacementPolicyInternal(
                     curRack, excludeNodes, ensemble, ensemble,
                     provisionalEnsembleNodes, i, ensembleSize, minNumRacksPerWriteQuorumForThisEnsemble);
+                // got a good candidate
+                if (ensemble.addNode(prevNode)) {
+                    // add the candidate to exclude set
+                    excludeNodes.add(prevNode);
+                } else {
+                    throw new BKNotEnoughBookiesException();
+                }
                 // replace to newer node
                 provisionalEnsembleNodes.set(i, prevNode);
             } catch (BKNotEnoughBookiesException e) {
@@ -1159,10 +1166,6 @@ public class RackawareEnsemblePlacementPolicyImpl extends TopologyAwareEnsembleP
         final BookieNode currentNode = provisionalEnsembleNodes.get(ensembleIndex);
         // if the current bookie could be applied to the ensemble, apply it to minify the number of bookies replaced
         if (!excludeBookies.contains(currentNode) && predicate.apply(currentNode, ensemble)) {
-            if (ensemble.addNode(currentNode)) {
-                // add the candidate to exclude set
-                excludeBookies.add(currentNode);
-            }
             return currentNode;
         }
@@ -1234,11 +1237,6 @@ public class RackawareEnsemblePlacementPolicyImpl extends TopologyAwareEnsembleP
                 continue;
             }
             BookieNode bn = (BookieNode) n;
-            // got a good candidate
-            if (ensemble.addNode(bn)) {
-                // add the candidate to exclude set
-                excludeBookies.add(bn);
-            }
             return bn;
         }
     }
```
```java
writableBookies.add(addr9.toBookieId());

// add bookie node to resolver
StaticDNSResolver.reset();
```
We should do this also in an "After" method, in order to ensure that we do not pollute the test environment.
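For example, a JUnit teardown along these lines (a sketch; the enclosing test class isn't shown here):

```java
@After
public void tearDown() {
    // Clear the static rack mappings registered during the test so that
    // subsequent tests start from a clean resolver state.
    StaticDNSResolver.reset();
}
```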
Force-pushed e3d3d00 to 6bd16cf.
I've created the issue about the failing test.
Force-pushed 6bd16cf to 5dc71e2, then 5dc71e2 to e6e4deb.
@eolivelli Addressed your first comments. PTAL.
```java
final LedgerUnderreplicationManager lum = lmf.newLedgerUnderreplicationManager();

final List<Long> targetLedgers =
        flags.ledgerIds.stream().parallel().distinct().filter(ledgerId -> {
```
What's the reason for 'parallel'? This loop is doing metadata operations; you can kill ZooKeeper if you don't throttle the requests properly.
> What's the reason for 'parallel'?

There was no strong reason.

> This loop is doing metadata operations; you can kill ZooKeeper if you don't throttle the requests properly.

I missed that point. I'll remove it.
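A sketch of the fix: dropping `parallel()` makes the metadata reads sequential, so ZooKeeper sees at most one lookup at a time. The names follow the quoted snippet above; the filter body is elided:

```java
// Sequential stream: one metadata read at a time, no thundering herd on ZooKeeper.
final List<Long> targetLedgers =
        flags.ledgerIds.stream().distinct().filter(ledgerId -> {
            // ... per-ledger metadata lookup goes here ...
            return true;
        }).collect(Collectors.toList());
```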
```java
final NavigableSet<Pair<Long, Long>> failedTargets = new ConcurrentSkipListSet<>();
final CountDownLatch latch = new CountDownLatch(targetLedgers.size());
for (long ledgerId : targetLedgers) {
    if (!flags.dryRun) {
```
What about logging the ledger id here?
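Something like the following would do (a sketch; the logger field name is assumed):

```java
// Record which ledger is being processed, and whether this is a dry run.
LOG.info("Relocating ensemble of ledger {} (dryRun={})", ledgerId, flags.dryRun);
```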
```java
        continue;
    }
}
admin.asyncOpenLedger(ledgerId, (rc, lh, ctx) -> {
```
We should move this whole block into a BookKeeperAdmin method. It will be easier to write tests and also to use this feature programmatically.
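For instance, the block could be extracted into a method of roughly this shape (a hypothetical signature for illustration, not the actual API; the return type follows the failedTargets set used in the command):

```java
// Hypothetical: relocate one ledger's ensembles so they adhere to the
// placement policy; returns the entry ranges that could not be relocated.
public NavigableSet<Pair<Long, Long>> relocateToAdherePlacementPolicy(long ledgerId)
        throws BKException, InterruptedException {
    // open the ledger, compute replacement ensembles, rewrite metadata ...
}
```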
Force-pushed d49826f to 1555e54, then 1555e54 to 62970b3.
rerun failure checks
@eolivelli Addressed your comments. PTAL.
@eolivelli ping
I have rather generic comments.

As I understand, there is the enforceMinNumRacksPerWriteQuorum option to prevent that (and enforceMinNumFaultDomainsForWrite and enforceStrictZoneawarePlacement for the other placement policies). As I understand, the Auditor's placementPolicyCheck just detects the problem. Maybe it makes sense to make the Auditor (optionally) mark the ledgers with bad placement for re-replication and let AutoRecovery handle that?

Another note is that the test only covers the rack-aware policy. What happens in the case of the region-aware or zone-aware policies?
Also: #3359

Thank you for sharing. I'll check it first.
```java
}

@Override
public PlacementResult<List<BookieId>> replaceToAdherePlacementPolicy(
```
In some cases, the replace policy will replace more bookies than necessary. Suppose there are two racks (rack1, rack2):

rack1: bookie1, bookie2, bookie3, bookie4
rack2: bookie5, bookie6, bookie7, bookie8

With E:6, WQ:2, AQ:2, the ensemble is (5,6,1,7,2,8), i.e. (rack2,rack2,rack1,rack2,rack1,rack2). Replacing only the first bookie5 with bookie3 would make it adhere to the placement policy. But the current implementation doesn't replace the first one; it starts replacing from the second element (bookie6), so it replaces more bookies.
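To spell the example out: each sliding write quorum of size WQ=2 (wrapping around the ensemble) must span at least two racks. A quick check over (5,6,1,7,2,8) shows only the quorums touching slot 0 violate this, so one swap suffices:

```java
// Rack of each ensemble slot for (b5,b6,b1,b7,b2,b8).
String[] racks = {"rack2", "rack2", "rack1", "rack2", "rack1", "rack2"};
for (int i = 0; i < racks.length; i++) {
    int j = (i + 1) % racks.length; // write quorum of size 2, wrapping around
    boolean ok = !racks[i].equals(racks[j]);
    System.out.printf("quorum (%d,%d): %s%n", i, j, ok ? "ok" : "single rack");
}
// Only quorums (0,1) and (5,0) fail; replacing slot 0 (bookie5) with a
// rack1 bookie such as bookie3 fixes both in a single swap.
```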
You're right. We can't "minimize" the number of replaced bookies in some cases with the current approach.
Maybe we can enumerate the possible replacement results and pick the "minimal" one. We can pick different start indices for the replacement, such as starting with index 0, starting with index 1, and so on. Finally, we pick the "minimal" replacement result.
> Maybe we can enumerate the possible replacement results and pick the "minimal" one.

Yeah. Alternatively, I think that if the minimization is too costly to calculate, we should reduce replacements as much as possible within a realistic amount of time, like the current approach does.
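A sketch of that idea; simulateReplacement is a hypothetical helper that runs the current algorithm from a given start slot and reports how many bookies it swapped:

```java
// Try every rotation start and keep the cheapest replacement plan.
int bestStart = -1;
int bestCount = Integer.MAX_VALUE;
for (int start = 0; start < ensembleSize; start++) {
    int count = simulateReplacement(ensemble, start); // hypothetical helper
    if (count < bestCount) {
        bestCount = count;
        bestStart = start;
    }
}
// Apply the plan computed from bestStart. This is at most E simulations,
// so the extra cost stays linear in the ensemble size.
```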
horizonzy left a comment:
I notice that the implementation doesn't handle bookies which have already shut down. A shut-down bookie will be resolved to the default rack; can we replace it?
Sorry for the late reply. I've checked #3359 (review).

In my understanding, if we use

I think so too.

When using a non-Rackaware policy, it throws an Exception. I'll add the test.
I want to use EnsemblePlacementPolicy#replaceToAdherePlacementPolicy in #3359; it's more graceful, and I will enhance it in some cases.
Currently, maybe we can't do it. This feature expects that all of the bookies have already been recovered to a normal state.

It's okay.
Improve it in both #3359 and this PR. I think the improvement is common to both.
I mean the

Yes, if we already know which bookie has shut down, I can add it to excludeNodes to ignore it.
In my understanding, currently, the defaultRack of Rackaware is

Could you tell me more about the shutdown case? I think the resolver doesn't return
Yes, you are right. In the test case, we changed the default rack to
If we didn't change it, the default rack is
Fix old workflow; please see #3455 for details.
Part of the code was ported to #3359, and it was merged.
Motivation

Currently, we can't recover an ensemble which isn't adhering to the placement policy directly. The `recover` command or the autorecovery process is one of the alternative approaches; unfortunately, however, neither works for this.

For example, suppose a cluster has two racks (`/region/rack1`, `/region/rack2`) and a ledger with the ensemble `[bookie1(/region/rack1), bookie2(/region/rack1)]`. When `bookie1` or `bookie2` goes down and the autorecovery process runs, then if the bookie cluster still has bookies placed in `/region/rack1`, the ledger is recovered to a `/region/rack1` bookie again:

https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L225-L246
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeperAdmin.java#L1055-L1103
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L474-L481

Therefore, we can't recover an ensemble which isn't adhering to the placement policy directly. The `recover` command behaves similarly. I want to add a relocation feature that makes ensembles adhere to the placement policy.
Changes

Add a relocation feature to the shell command, so we can run it synchronously instead of executing it asynchronously like the autorecovery process. In this specification, the new ensemble arrangement is optimized to minimize the number of bookies replaced. Therefore, we should implement a new method that returns the new ensemble for each placement policy. First, this PR introduces the feature to RackawareEnsemblePlacementPolicy, as sketched below.
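Concretely, the new hook quoted earlier in the review has roughly this shape (the parameter list below is assumed for illustration; see the patch for the actual signature):

```java
// Returns a (possibly) modified ensemble that adheres to the placement
// policy, replacing as few of the current bookies as it can.
PlacementResult<List<BookieId>> replaceToAdherePlacementPolicy(
        int ensembleSize,
        int writeQuorumSize,
        int ackQuorumSize,
        Set<BookieId> excludeBookies,
        List<BookieId> currentEnsemble);
```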