Conversation

@equanz
Contributor

@equanz equanz commented Dec 7, 2021

Motivation

Currently, there is no direct way to recover an ensemble that does not adhere to the placement policy. The recover command and the autorecovery process look like alternatives, but unfortunately neither of them solves this.

For example, suppose

  • the bookie cluster has two racks( /region/rack1 , /region/rack2 )
  • target ledger has the ensemble ( [bookie1(/region/rack1), bookie2(/region/rack1)] )
  • the replicator uses RackawareEnsemblePlacementPolicy

Suppose bookie1 or bookie2 goes down and the autorecovery process runs. If the cluster still has bookies placed in /region/rack1, the ledger is recovered onto a /region/rack1 bookie again.
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L225-L246
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeperAdmin.java#L1055-L1103
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L474-L481

Therefore, we can't directly recover an ensemble that doesn't adhere to the placement policy. The recover command behaves similarly.

I want to add a relocation feature that makes ensembles adhere to the placement policy.

Changes

Add a relocation feature to the shell command, so we can run it synchronously instead of relying on asynchronous execution like the autorecovery process.

In this design, the new ensemble arrangement is optimized to minimize the number of bookies replaced. To do that, we need a new method that returns the new ensemble according to the policy. As a first step, this PR introduces the feature for RackawareEnsemblePlacementPolicy.

@equanz equanz force-pushed the add_relocate_placement_command_pr branch 2 times, most recently from cb41b1e to e3d3d00 on December 7, 2021 05:29
Contributor

@eolivelli eolivelli left a comment


This is great work!

I did a first pass; it is a big patch, so we'll need more eyes.

@Ghatage @dlg99 @merlimat @ivankelly

if (!excludeBookies.contains(currentNode) && predicate.apply(currentNode, ensemble)) {
if (ensemble.addNode(currentNode)) {
// add the candidate to exclude set
excludeBookies.add(currentNode);
Contributor


We should not modify the excludeBookies set passed as an argument; please create a copy at the beginning of the method. Otherwise this method will have unwanted side effects.
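The defensive-copy suggestion can be illustrated with a minimal, self-contained sketch. The class and method names here are hypothetical (not BookKeeper's real API); the point is only that the method mutates a private copy, so the caller's exclude set is never changed as a side effect.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class DefensiveCopyExample {

    // Hypothetical candidate selection: returns the first candidate not in the
    // exclude set, recording it only in a local copy of that set.
    static String selectCandidate(Set<String> excludeBookies, Set<String> candidates) {
        // defensive copy: side effects stay local to this method
        Set<String> exclude = new HashSet<>(excludeBookies);
        for (String candidate : candidates) {
            if (!exclude.contains(candidate)) {
                exclude.add(candidate); // only the copy is modified
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> exclude = new HashSet<>();
        exclude.add("bookie1");
        Set<String> candidates = new LinkedHashSet<>(); // insertion order for a deterministic pick
        candidates.add("bookie1");
        candidates.add("bookie2");
        String picked = selectCandidate(exclude, candidates);
        System.out.println(picked);         // bookie2
        System.out.println(exclude.size()); // still 1: caller's set untouched
    }
}
```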

Contributor Author


excludeBookies is modified in order to notify the calling method of the excluded bookies, like below.
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L671-L675

However, as you said, it might have unwanted side effects. Therefore, I'll change the code so that the calling method adds to excludeBookies.

diff --git a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
index d9167c321..fa3e59cd6 100644
--- a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
+++ b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
@@ -1131,6 +1131,13 @@ public class RackawareEnsemblePlacementPolicyImpl extends TopologyAwareEnsembleP
                     prevNode = replaceToAdherePlacementPolicyInternal(
                             curRack, excludeNodes, ensemble, ensemble,
                             provisionalEnsembleNodes, i, ensembleSize, minNumRacksPerWriteQuorumForThisEnsemble);
+                    // got a good candidate
+                    if (ensemble.addNode(prevNode)) {
+                        // add the candidate to exclude set
+                        excludeNodes.add(prevNode);
+                    } else {
+                        throw new BKNotEnoughBookiesException();
+                    }
                     // replace to newer node
                     provisionalEnsembleNodes.set(i, prevNode);
                 } catch (BKNotEnoughBookiesException e) {
@@ -1159,10 +1166,6 @@ public class RackawareEnsemblePlacementPolicyImpl extends TopologyAwareEnsembleP
         final BookieNode currentNode = provisionalEnsembleNodes.get(ensembleIndex);
         // if the current bookie could be applied to the ensemble, apply it to minify the number of bookies replaced
         if (!excludeBookies.contains(currentNode) && predicate.apply(currentNode, ensemble)) {
-            if (ensemble.addNode(currentNode)) {
-                // add the candidate to exclude set
-                excludeBookies.add(currentNode);
-            }
             return currentNode;
         }

@@ -1234,11 +1237,6 @@ public class RackawareEnsemblePlacementPolicyImpl extends TopologyAwareEnsembleP
                     continue;
                 }
                 BookieNode bn = (BookieNode) n;
-                // got a good candidate
-                if (ensemble.addNode(bn)) {
-                    // add the candidate to exclude set
-                    excludeBookies.add(bn);
-                }
                 return bn;
             }
         }

writableBookies.add(addr9.toBookieId());

// add bookie node to resolver
StaticDNSResolver.reset();
Contributor


We should do this also in an @After method, in order to ensure that we do not pollute the test environment.
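Why a teardown reset matters can be shown with a toy stand-in for the resolver's static state. The names below are hypothetical, not BookKeeper's real StaticDNSResolver API; the sketch only demonstrates that clearing static state in teardown keeps one test's registrations from leaking into the next.

```java
import java.util.HashMap;
import java.util.Map;

public class TeardownSketch {
    // Toy stand-in for a resolver's static rack mapping.
    static final Map<String, String> RACK_BY_BOOKIE = new HashMap<>();

    static void reset() {
        RACK_BY_BOOKIE.clear();
    }

    // In a JUnit test this body would live in an @After/@AfterEach method,
    // so state registered by one test never leaks into the next test.
    static void tearDown() {
        reset();
    }

    public static void main(String[] args) {
        RACK_BY_BOOKIE.put("bookie1", "/region/rack1"); // test body mutates static state
        tearDown();                                     // teardown restores a clean slate
        System.out.println(RACK_BY_BOOKIE.isEmpty());   // true
    }
}
```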

@equanz equanz force-pushed the add_relocate_placement_command_pr branch from e3d3d00 to 6bd16cf on March 17, 2022 07:57
@equanz
Contributor Author

equanz commented Mar 23, 2022

I've created an issue about the failing OWASP Dependency Check: #3134

@equanz equanz force-pushed the add_relocate_placement_command_pr branch from 6bd16cf to 5dc71e2 on March 28, 2022 04:54
@equanz equanz force-pushed the add_relocate_placement_command_pr branch from 5dc71e2 to e6e4deb on March 30, 2022 09:00
@equanz
Contributor Author

equanz commented Mar 30, 2022

@eolivelli Addressed your first comments. PTAL.
(Also rebased onto the current master to follow #3140 and #3149.)

@equanz equanz requested a review from eolivelli April 21, 2022 04:59
final LedgerUnderreplicationManager lum = lmf.newLedgerUnderreplicationManager();

final List<Long> targetLedgers =
flags.ledgerIds.stream().parallel().distinct().filter(ledgerId -> {
Contributor


What's the reason for 'parallel'?
This loop is doing metadata operations; you can overwhelm ZooKeeper if you don't throttle the requests properly.

Contributor Author


What's the reason for 'parallel'?

There was no strong reason.

This loop is doing metadata operations; you can overwhelm ZooKeeper if you don't throttle the requests properly.

I missed that point. I'll remove it.
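Dropping `.parallel()` gives a sequential variant of the filtering loop above, roughly as sketched here. `selectTargets` and the `ledgerExists` predicate are hypothetical stand-ins for the real flags field and metadata lookup; the point is that each check runs one at a time, so the metadata store sees a bounded load.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongPredicate;
import java.util.stream.Collectors;

public class SequentialFilterSketch {

    // Deduplicate and filter ledger ids sequentially: no .parallel(), so
    // metadata lookups (simulated by ledgerExists) never run concurrently.
    static List<Long> selectTargets(List<Long> ledgerIds, LongPredicate ledgerExists) {
        return ledgerIds.stream()   // sequential by default
                .distinct()
                .filter(ledgerExists::test)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, 2L, 2L, 3L);
        List<Long> targets = selectTargets(ids, id -> id != 3L);
        System.out.println(targets); // [1, 2]
    }
}
```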

final NavigableSet<Pair<Long, Long>> failedTargets = new ConcurrentSkipListSet<>();
final CountDownLatch latch = new CountDownLatch(targetLedgers.size());
for (long ledgerId : targetLedgers) {
if (!flags.dryRun) {
Contributor


What about logging the ledger id here?

continue;
}
}
admin.asyncOpenLedger(ledgerId, (rc, lh, ctx) -> {
Contributor


We should move all this block into a BookKeeperAdmin method.
It will be easier to write tests, and also to use this feature programmatically.
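The async-open-then-await pattern under discussion has roughly the following shape once extracted into a single method. The callback interface and `asyncOpen` below are simulated stand-ins, not the real `BookKeeperAdmin#asyncOpenLedger` API; the sketch shows why a self-contained method (latch per ledger, count down on every completion) is easier to test and reuse.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncOpenSketch {

    interface OpenCallback {
        void openComplete(int rc, String handle);
    }

    // Simulated asyncOpenLedger: completes the callback on a pool thread.
    static void asyncOpen(long ledgerId, OpenCallback cb, ExecutorService pool) {
        pool.submit(() -> cb.openComplete(0, "ledger-" + ledgerId));
    }

    // Extracted method: kicks off all opens, waits for every callback.
    static Set<String> openAll(List<Long> ledgerIds) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(ledgerIds.size());
        Set<String> opened = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            for (long id : ledgerIds) {
                asyncOpen(id, (rc, handle) -> {
                    if (rc == 0) {
                        opened.add(handle);
                    }
                    latch.countDown(); // count down on every completion, success or failure
                }, pool);
            }
            latch.await(); // block until all callbacks have fired
        } finally {
            pool.shutdown();
        }
        return opened;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(openAll(Arrays.asList(1L, 2L, 3L)).size()); // 3
    }
}
```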

@equanz equanz force-pushed the add_relocate_placement_command_pr branch from d49826f to 1555e54 on May 24, 2022 06:03
@equanz equanz force-pushed the add_relocate_placement_command_pr branch from 1555e54 to 62970b3 on May 24, 2022 06:04
@equanz
Contributor Author

equanz commented May 24, 2022

rerun failure checks

@equanz
Contributor Author

equanz commented May 24, 2022

@eolivelli Addressed your comments. PTAL.

@equanz equanz requested a review from eolivelli July 13, 2022 04:11
@equanz
Contributor Author

equanz commented Jul 13, 2022

@eolivelli ping

@dlg99
Contributor

dlg99 commented Jul 20, 2022

I have rather generic comments.

If the cluster still has bookies placed in /region/rack1, the ledger is recovered onto a /region/rack1 bookie again.

As I understand it, there is the enforceMinNumRacksPerWriteQuorum option to prevent that (and enforceMinNumFaultDomainsForWrite / enforceStrictZoneawarePlacement for the other placement policies).

As I understand it, the Auditor's placementPolicyCheck just detects the problem. Maybe it makes sense to have the Auditor (optionally) mark ledgers with bad placement for re-replication and let AutoRecovery handle them?
In that case the CLI command already exists: triggeraudit.

Another note: the test only covers the rack-aware policy. What happens in the case of the region-aware or zone-aware policies?

@dlg99
Contributor

dlg99 commented Jul 20, 2022

Also: #3359

@equanz
Contributor Author

equanz commented Jul 21, 2022

Also: #3359

Thank you for sharing. I'll check it first.

}

@Override
public PlacementResult<List<BookieId>> replaceToAdherePlacementPolicy(
Member


In some cases, the replace policy will replace more bookies than necessary. Suppose there are two racks (rack1, rack2):
rack1: bookie1, bookie2, bookie3, bookie4
rack2: bookie5, bookie6, bookie7, bookie8
With E:6, WQ:2, AQ:2, the ensemble is (5,6,1,7,2,8), i.e. (rack2, rack2, rack1, rack2, rack1, rack2). Replacing only the first bookie (bookie5 with bookie3) would already make it adhere to the placement policy. But the current implementation doesn't start the replacement from the first element; it starts from the second element (bookie6), so it replaces more bookies.

Contributor Author

@equanz equanz Aug 4, 2022


You're right. With the current approach, we can't "minimize" the replaced bookies in some cases.

Member


Maybe we can enumerate the possible replacement results and pick the "minimized" one: try each start index for the replacement (start with index 0, start with index 1, ...), and finally pick the result that replaces the fewest bookies.
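The enumerate-and-pick-minimal idea could be sketched as follows. This is a toy model with bookies as strings and candidate ensembles supplied directly (in the real policy they would be produced per start index); the helper names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class MinimalReplacementSketch {

    // Count how many positions differ between the original and a candidate ensemble.
    static int replacedCount(List<String> original, List<String> candidate) {
        int n = 0;
        for (int i = 0; i < original.size(); i++) {
            if (!original.get(i).equals(candidate.get(i))) {
                n++;
            }
        }
        return n;
    }

    // Among several candidate ensembles (e.g. one per start index),
    // pick the one that replaces the fewest bookies.
    static List<String> pickMinimal(List<String> original, List<List<String>> candidates) {
        List<String> best = null;
        int bestCount = Integer.MAX_VALUE;
        for (List<String> c : candidates) {
            int count = replacedCount(original, c);
            if (count < bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Ensemble (5,6,1,7,2,8) from the example above.
        List<String> original = Arrays.asList("b5", "b6", "b1", "b7", "b2", "b8");
        // Candidate from start index 0: only b5 -> b3 is replaced.
        List<String> fromIndex0 = Arrays.asList("b3", "b6", "b1", "b7", "b2", "b8");
        // Candidate from start index 1: two replacements.
        List<String> fromIndex1 = Arrays.asList("b5", "b3", "b4", "b7", "b2", "b8");
        List<String> best = pickMinimal(original, Arrays.asList(fromIndex1, fromIndex0));
        System.out.println(best.equals(fromIndex0)); // true: one replacement beats two
    }
}
```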

Contributor Author

@equanz equanz Aug 5, 2022


Maybe we can enumerate the possible replacement results and pick the "minimized" one.

Yeah.
Alternatively, if true minimization is too costly to compute, we could reduce the replacements as much as possible within a realistic amount of time, as the current approach does.

Member

@horizonzy horizonzy left a comment


I notice that the implementation doesn't handle bookies that have already shut down. A shut-down bookie will be placed in the default rack; can we replace it?

@equanz
Contributor Author

equanz commented Aug 5, 2022

@dlg99 #2931 (comment)

Sorry for the late reply. I've checked #3359 (review).
The approach (using autorecovery) sounds better. If it is merged and this issue still remains, I'll rebase to follow the #3359 interfaces.

As I understand it, there is the enforceMinNumRacksPerWriteQuorum option to prevent that (and enforceMinNumFaultDomainsForWrite / enforceStrictZoneawarePlacement for the other placement policies).

In my understanding, if we use enforceMinNumRacksPerWriteQuorum, we can't temporarily recover the quorum for redundancy when a rack goes down. I want to avoid that.

As I understand it, the Auditor's placementPolicyCheck just detects the problem. Maybe it makes sense to have the Auditor (optionally) mark ledgers with bad placement for re-replication and let AutoRecovery handle them?
In that case the CLI command already exists: triggeraudit.

I think so too.
I thought the next step would be asynchronous relocation, like the auto-recovery feature (mentioned here). However, that was implemented in #3359.

Another note: the test only covers the rack-aware policy. What happens in the case of the region-aware or zone-aware policies?

When a non-rack-aware policy is used, it throws an exception. I'll add a test for that.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/EnsemblePlacementPolicy.java#L460-L467

@horizonzy
Member

horizonzy commented Aug 5, 2022

I want to use EnsemblePlacementPolicy#replaceToAdherePlacementPolicy in #3359; it's more graceful, and I will enhance it in some cases.

@equanz
Contributor Author

equanz commented Aug 5, 2022

@horizonzy

I notice that the implementation doesn't handle bookies that have already shut down. A shut-down bookie will be placed in the default rack; can we replace it?

Currently, we probably can't. This feature expects that all bookies have already been recovered to a normal state.
However, your point is correct.

I want to use EnsemblePlacementPolicy#replaceToAdherePlacementPolicy in #3359; it's more graceful, and I will enhance it in some cases.

That's okay.
Just to confirm: do you mean "bring this feature into #3359 and improve it myself"?
Or improve it in this PR?

@horizonzy
Member

Improve it in both #3359 and this PR. I think the improvement is common to both.

@equanz
Contributor Author

equanz commented Aug 5, 2022

I notice that the implementation doesn't handle bookies that have already shut down. A shut-down bookie will be placed in the default rack; can we replace it?

Currently, we probably can't. This feature expects that all bookies have already been recovered to a normal state.
However, your point is correct.

I've checked the current code and found the handling. If enforceMinNumRacksPerWriteQuorum is true and the default rack name is /default-region/default-rack, it is handled below.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L1089-L1090

If the default rack name is /default-rack, then we can't add it.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L138
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/net/NetworkTopologyImpl.java#L416-L420

However, that is not what I expect. I'll modify the code to add such bookies to excludeNodes regardless of the enforceMinNumRacksPerWriteQuorum value.

Moreover, EnsemblePlacementPolicy#isEnsembleAdheringToPlacementPolicy has a similar process, like below.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L1043

@horizonzy
Member

I notice that the implementation doesn't handle bookies that have already shut down. A shut-down bookie will be placed in the default rack; can we replace it?

If the default rack name is /default-rack, then we can't add it.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/net/NetworkTopologyImpl.java#L416-L420

I didn't mean the literal name; I meant that the default rack is /default-region/default-rack. 😂

@horizonzy
Member

However, that is not what I expect. I'll modify the code to add such bookies to excludeNodes regardless of the enforceMinNumRacksPerWriteQuorum value.

Yes. If we already know which bookies have shut down, we can add them to excludeNodes to ignore them.

@equanz
Contributor Author

equanz commented Aug 5, 2022

In my understanding, the current default rack of the rack-aware policy is /default-rack, so we should take that into account if we use it.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L138
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L234-L248

However, that is not what I expect. I'll modify the code to add such bookies to excludeNodes regardless of the enforceMinNumRacksPerWriteQuorum value.

Yes. If we already know which bookies have shut down, we can add them to excludeNodes to ignore them.

Could you tell me more about the shutdown case? I think the resolver doesn't return /default-region/default-rack just by calling TopologyAwareEnsemblePlacementPolicy#handleBookiesThatJoined.

@horizonzy
Member

In my understanding, the current default rack of the rack-aware policy is /default-rack.

Yes, you are right. In the test case, we changed the default rack to /default-region/default-rack.

repp.withDefaultRack(NetworkTopology.DEFAULT_REGION_AND_RACK);

If we don't change it, the default rack is /default-rack.

@StevenLuMT
Member

Fixed the old workflow; please see #3455 for details.

@ivankelly ivankelly removed their request for review March 28, 2023 08:37
@equanz
Contributor Author

equanz commented Apr 5, 2023

Part of the code was ported to #3359, which has been merged.
So, I'll close this PR.

@equanz equanz closed this Apr 5, 2023
@equanz equanz deleted the add_relocate_placement_command_pr branch April 5, 2023 01:44