Contributor

@hrsakai hrsakai commented Sep 17, 2021

Motivation

In a single day, the ZooKeeper servers hit high CPU usage and ran out of disk space.
The cause was the bookies' GC of overreplicated ledgers.
The GC created and deleted ZK nodes under /ledgers/underreplication/locks very frequently, and several bookies ran GC at the same time.
As a result, the ZooKeeper servers created a large number of snapshots and filled their disks.

I want to be able to set the maximum number of concurrent ZK requests lower than the default of 1000, to avoid heavy traffic at a specific time.

Changes

  • Make max zk concurrent requests for garbage collection of overreplicated ledgers configurable.
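
For readers unfamiliar with this kind of limit, here is a minimal sketch of capping in-flight requests with a semaphore. It is not the code from this PR: the class `BoundedRequests`, the method `runWithLimit`, and the simulated request are all hypothetical, and only the idea of a configurable maximum comes from the description above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration, not BookKeeper code.
class BoundedRequests {

    // Runs `total` simulated metadata requests, never more than
    // `maxConcurrentRequests` at once, and returns the observed peak concurrency.
    static int runWithLimit(int total, int maxConcurrentRequests) {
        Semaphore permits = new Semaphore(maxConcurrentRequests);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        try {
            for (int i = 0; i < total; i++) {
                // Blocks the submitting thread while the limit is reached.
                permits.acquireUninterruptibly();
                pool.execute(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(1); // stand-in for one metadata-store request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        // Decrement before releasing so the counter never exceeds the limit.
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                });
            }
            // Holding every permit means all submitted requests have finished.
            permits.acquireUninterruptibly(maxConcurrentRequests);
            permits.release(maxConcurrentRequests);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }
}
```

A lower limit stretches the same amount of work over a longer period, which is the trade-off this change is after: GC takes longer, but the metadata store sees a smaller burst.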

Contributor

@nicoloboschi nicoloboschi left a comment

Good job. I think we should rename the property to avoid mentioning ZooKeeper.

protected static final String GC_WAIT_TIME = "gcWaitTime";
protected static final String IS_FORCE_GC_ALLOW_WHEN_NO_SPACE = "isForceGCAllowWhenNoSpace";
protected static final String GC_OVERREPLICATED_LEDGER_WAIT_TIME = "gcOverreplicatedLedgerWaitTime";
protected static final String GC_OVERREPLICATED_LEDGER_MAX_CONCURRENT_ZK_REQUESTS =
Contributor

Since this property is not tightly coupled to ZK and is more of a logical option, IMHO we should not mention ZK in its name.
GC_OVERREPLICATED_LEDGER_MAX_CONCURRENT_REQUESTS?

Contributor Author

hrsakai commented Sep 17, 2021

@nicoloboschi
Thank you for your review.
I renamed it to GC_OVERREPLICATED_LEDGER_MAX_CONCURRENT_REQUESTS.

Contributor

@nicoloboschi nicoloboschi left a comment

LGTM

Contributor

@eolivelli eolivelli left a comment

Lgtm

I left a comment about a small typo.

this.zkLedgersRootPath = ZKMetadataDriverBase.resolveZkLedgersRootPath(conf);
LOG.info("Over Replicated Ledger Deletion : enabled=" + enableGcOverReplicatedLedger + ", interval="
+ gcOverReplicatedLedgerIntervalMillis);
LOG.info("Over Replicated Ledger Deletion : enabled={}, interval={}, maxConcurrentRequest={}",
Contributor

Typo: maxConcurrentRequest > maxConcurrentRequests

Contributor Author

Fixed

Contributor Author

@eolivelli
Could you please review again and merge this PR?

Contributor Author

hrsakai commented Sep 20, 2021

rerun failure checks

Contributor Author

hrsakai commented Sep 21, 2021

rerun failure checks

zymap pushed a commit that referenced this pull request Oct 22, 2021
### Motivation
- Issue is as described in [PR#2797](#2797).
> In a single day, the ZooKeeper servers hit high CPU usage and ran out of disk space.
> The cause was the bookies' GC of overreplicated ledgers.
> The GC created and deleted ZK nodes under /ledgers/underreplication/locks very frequently, and several bookies ran GC at the same time.
> As a result, the ZooKeeper servers created a large number of snapshots and filled their disks.

- I want to reduce the number of lock node creations and deletions in ZK.

### Changes
- Add an ensemble check before creating the lock node.
This is to reduce the number of lock node creations and deletions in ZK.

- ~~If [PR#2797](#2797) was merged, this PR needs to be fixed.~~
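
The ensemble check described in the commit message above can be sketched roughly as follows. This is an illustration only, under the assumption that a ledger is a deletion candidate for a bookie only when that bookie appears in none of the ledger's ensembles. The class `EnsembleCheckSketch`, its methods, and the string bookie ids are hypothetical and are not taken from the referenced commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not BookKeeper code.
class EnsembleCheckSketch {

    // True if the bookie appears in at least one ensemble of the ledger.
    static boolean isInAnyEnsemble(String bookieId, List<List<String>> ensembles) {
        for (List<String> ensemble : ensembles) {
            if (ensemble.contains(bookieId)) {
                return true;
            }
        }
        return false;
    }

    // Returns the ids of ledgers for which a lock node would still be created.
    // Ledgers whose ensembles contain the bookie are skipped before any lock is
    // taken (assumption: such ledgers are not overreplicated on this bookie).
    static List<Long> ledgersNeedingLock(String bookieId,
                                         Map<Long, List<List<String>>> ledgers) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, List<List<String>>> entry : ledgers.entrySet()) {
            if (!isInAnyEnsemble(bookieId, entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Doing the cheap membership check first means that lock nodes are created and deleted only for ledgers that might actually be removed, which is how the commit aims to cut the write traffic on ZK.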
zymap pushed a commit that referenced this pull request Oct 26, 2021
(cherry picked from commit 53954ca)
@hrsakai hrsakai force-pushed the make_max_zk_requests_configurable branch from 6fd01be to 24c1dd1 Compare November 11, 2021 02:45
@hrsakai hrsakai force-pushed the make_max_zk_requests_configurable branch from 24c1dd1 to 0118a27 Compare November 11, 2021 02:46
Contributor Author

hrsakai commented Nov 11, 2021

PTAL

Rebased onto a newer version of master and resolved conflicts.

Contributor Author

hrsakai commented Nov 11, 2021

rerun failure checks

Contributor

@eolivelli eolivelli left a comment

Lgtm

@eolivelli eolivelli merged commit 4124f1d into apache:master Nov 11, 2021
@hrsakai hrsakai deleted the make_max_zk_requests_configurable branch November 20, 2023 23:08
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024