BP-46: Data integrity check for running without journal #2936

Vanlightly · 2021-12-13T10:24:01Z

Motivation

To be able to run without the journal without compromising the replication protocol.

Changes

Introduction of a data integrity check that prevents data loss due to
running without the journal and auto-repairs the data.

Data integrity check

The integrity check is comprised two parts: a preboot check that
is triggered by either an unclean shutdown or an invalid cookie.
The preboot check marks any open ledgers as both fenced and in
limbo to prevent ledgers affected by potential data loss from
being written to. The limbo status prevents NoSuchEntry and
NoSuchLedger responses from being sent which avoid ledger
truncation from any ledger recovery operations. Finally it sets
a storage flag that a full check is required.

A new data integrity check service has been to run the full
integrity check once the bookie is running. If the service
sees that the full check storage flag is set then it runs
a full check. This involves scanning the index and comparing
it against metadata to discover missing entries. Any
missing entries are sourced from peer bookies by the
EntryCopier and written to ledger storage.

The data integrity check also has a different cookie validation
implementation.

The following configurations have been added to the conf
file:

dataIntegrityCheckingEnabled=true/false. False by default.
This config enables or disables data integrity checking.
When set to false the legacy cookie validation is used.
dataIntegrityStampMissingCookies=true/false. False by
default. This config allows the data integrity process
to stamp new cookies if a cookie is missing from a
directory. The full check will repair any lost data if
the directory data was lost.

Cookie verification for data integrity checking

The algorithm differs from that of LegacyCookieValidation in the
following ways:

A empty directory isn't considered a fatal condition. It just means
that the preboot phase of the data integrity checker must run. Once
the preboot phase runs, it should be safe to stamp the cookies
again.
Bookies are not allowed to change their identity. If they do, manual
operator intervention is required (which is ok as it is expected
that an operator would have to intervene to change the identity in
the first place).
A missing cookie in zookeeper is only valid, if there are no cookies
in any of the directories, as this is considered a new
boot. Otherwise, manual operator intervention is required.

Async iterator for ledger metadata

Common code for iterating over ledger metadata. There is already
asyncProcessLedgers in LedgerManager, but that only gives the ledgerId
and the API is nasty (it even uses ZK specific callbacks).

This change adds a more modern iterator, which takes a function which
returns a CompletableFuture. The iterator has rudimentary rate
limiting, by limiting the number of ledgers which can be processed at
a time. We should add something more advanced later, which takes into
account the response time from ZK.

Add limbo state to bookie ledger representation

Limbo state for a ledger means that we don't know whether we should
have an entry for the ledger or not, which can happen when a bookie is
started after having its disk wiped. We cannot response with a
NoSuchEntryException or NoSuchLedgerException as this tell the client
that we never had the requested entry, which may or may not be true,
but if we tell it to the client, the client will act like it's true
and possible mark the end of the ledger at an incorrect point.

This change also adds locking to LedgerMetadataIndex. Previously it
relied on the good graces to the calling code to avoid modifying the
same ledger concurrently. Now that we are also using the index to
store limbo state, we can't be so blasé.

Add entryExists call to ledgerStorage

Currently the only way to check if an entry exists in the storage is
to try to read the entry. However, this means pulling data out of the
entrylog, which it should be sufficient to check that the entry exists
in the index.

This change adds the entryExists call to ledgerStorage. This has only
been implemented for DbLedgerStorage. The implementation for the
others should be trivial, but it needs to be tested.

Pregenerate the writeset from ledger metadata

The bookkeeper client uses DistributionSchedule (of which
RoundRobinDistributionSchedule is the only impl) to decide which
members of the ensemble it writes an entry to. This writeset is
generated for each entry. However, there is only |ensemble| possible
writesets, so we should pregenerate them for the ledger and stop
trashing memory.

WriteSets represent a set of pregenerated writesets as would be
otherwise generated from the distribution schedule. The constructor
takes a list of indices (which should be generated based on the list
of bookies in the ensemble), which specifies the preferred order that
bookies should be tried for reads.

Storage state flags for LedgerStorage

If a bookie crashes in the middle of a full integrity check, it needs
to know to start it again when it reboots. For this, we need to
persist some flag to persistent storage.

This change adds persistent flags to the ledger storage
interface. Multiple flags can be added in future.

Flags are only implemented for DbLedgerStorage. The flags are stored
in the metadata index, with a negative ledger id as key. The key of
the storage selected for ledger 0 is used. This does mean flags will
be lost if there is a change in the storage disk configuration, but
data integrity checks will run in this case regardless.

EUNKNOWNLEDGERSTATE code

A new response code has been added to communicate that the state
of an entry is unknown due the ledger being in limbo.

Added bookie unclean shutdown detection

Adds unclean shutdown detection. When running with journal
writes disabled and data integrity checking enabled, if
the prior shutdown was unclean (not a graceful shutdown)
then the data integrity checks are triggered. These checks
avoid additional data loss scenarios and repair any lost
data caused by the loss of unflushed data at the time
of the unclean shutdown.

The BookieServer registers start-up and shutdown with the
UncleanShutdownDetection class. This class adds a dirty
file to each ledger dir on registering start-up and clears
all these files on registering shutdown. The presence of
any of these files on boot-up indicates the prior shutdown
was unclean.

Master Issue: #2705

dlg99

I didn't get very far in the review (didn't look at the tests, how upgrade/downgrade is handled etc). Realistically I'll have time for the second pass only in 2022.
Basic "happy path" that I looked makes sense; I submitted a couple of comments.

dlg99 · 2021-12-17T23:13:35Z

...r/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/SingleDirectoryDbLedgerStorage.java

+        ledgerIndex.clearLimbo(ledgerId);
+    }
+
+    private void throwIfLimbo(long ledgerId) throws IOException, BookieException {


maybe do throws .. DataUnknownException (instead of BookieException)? with similar changes wherever it bubbles up (readLastAddConfirmed/...), to make sure that some other Bookie Exception is not handled as a DataUnknown case.

A DataUnknownException is thrown which is a subclass of BookieException.

that's exactly what I wanted to avoid - some other subclassed exception thrown but handled the same because IDE generated catch for BookieException.

I used a subclass of BookieException as I try to follow the existing approaches to things. So I guess the question is why do we have a BookieException with subclasses? Is there a guideline for how to use it correctly or is the use of BookieException in general something we need to start refactoring out?

this is mostly for the convenience of the IDE code generation (+ documentation) and a rather generic discussion:
if the method throws BookieException, the IDE-generated catch will put one catch block and developer may need to chase code down to understand what specifically can happen.
If the exceptions in throws are specified as "throws DataUnknownException, OperationRejectedException, MetadataStoreException" one can easily see that exception handling might need to be different (or use catch BookieException if it does not).
This one is a private method with limited scope of use but, if its exception is rethrown, more specific exception in throws lets one define more specific exception in the throws of method that uses it.

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/UncleanShutdownDetectionImpl.java

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/datainteg/DataIntegrityCheck.java

dlg99 · 2021-12-18T00:30:09Z

bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/BookieProtocol.java

+    /**
+     * Ledger in unknown state.
+     */
+    int EUNKNOWN = 107;


do we need to bump up CURRENT_PROTOCOL_VERSION?

What are the current guidelines for modifying the CURRENT_PROTOCOL_VERSION?

I don't think we have it documented (@sijie and @merlimat can correct me) but the rule of thumb was around old client not handling new responses correctly/doing something dangerously wrong when it gets the new response.

dlg99 · 2021-12-18T00:40:34Z

...eper-server/src/main/java/org/apache/bookkeeper/bookie/datainteg/DataIntegrityCheckImpl.java

+        MetadataAsyncIterator iter = new MetadataAsyncIterator(scheduler,
+                ledgerManager, MAX_INFLIGHT, ZK_TIMEOUT_S, TimeUnit.SECONDS);
+        CompletableFuture<Void> promise = new CompletableFuture<>();
+        Map<Long, LedgerMetadata> ledgersCache =


should we be concerned about memory requirements here in case of large disks (to many ledgers/metadata with many segments)? See #1949

we can keep this here, and address the problem in #1949

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/InterleavedLedgerStorage.java

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/UncleanShutdownDetectionImpl.java

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/datainteg/DataIntegrityCheck.java

eolivelli · 2021-12-18T10:55:52Z

...eper-server/src/main/java/org/apache/bookkeeper/bookie/datainteg/DataIntegrityCheckImpl.java

+        MetadataAsyncIterator iter = new MetadataAsyncIterator(scheduler,
+                ledgerManager, MAX_INFLIGHT, ZK_TIMEOUT_S, TimeUnit.SECONDS);
+        CompletableFuture<Void> promise = new CompletableFuture<>();
+        Map<Long, LedgerMetadata> ledgersCache =


we can keep this here, and address the problem in #1949

...eper-server/src/main/java/org/apache/bookkeeper/bookie/datainteg/DataIntegrityCheckImpl.java

...rver/src/main/java/org/apache/bookkeeper/bookie/datainteg/DataIntegrityCookieValidation.java

bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java

eolivelli

+1

I left some minor comments

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/InterleavedLedgerStorage.java

bookkeeper-server/src/main/java/org/apache/bookkeeper/server/Main.java

bookkeeper-server/src/test/java/org/apache/bookkeeper/bookie/MockLedgerStorage.java

dlg99

LGTM

eolivelli · 2022-01-09T19:46:19Z

@Vanlightly this patch has been around for a while, I believe it is time to commit it.

Can you please resolve the conflicts?
I will be happy to merge

eolivelli · 2022-01-10T10:24:33Z

The conflict is about an ExceptionCode:

int DataUnknownException = -110;
vs
int EntryLogMetadataMapException = -110;

How can we solve the conflict ?
if DataUnknownException is already running in production somewhere I would keep DataUnknownException as -110

* Data integrity check The integrity check is comprised two parts: a preboot check that is triggered by either an unclean shutdown or an invalid cookie. The preboot check marks any open ledgers as both fenced and in limbo to prevent ledgers affected by potential data loss from being written to. The limbo status prevents NoSuchEntry and NoSuchLedger responses from being sent which avoid ledger truncation from any ledger recovery operations. Finally it sets a storage flag that a full check is required. A new data integrity check service has been to run the full integrity check once the bookie is running. If the service sees that the full check storage flag is set then it runs a full check. This involves scanning the index and comparing it against metadata to discover missing entries. Any missing entries are sourced from peer bookies by the EntryCopier and written to ledger storage. The data integrity check also has a different cookie validation implementation. The following configurations have been added to the conf file: - dataIntegrityCheckingEnabled=true/false. False by default. This config enables or disables data integrity checking. When set to false the legacy cookie validation is used. - dataIntegrityStampMissingCookies=true/false. False by default. This config allows the data integrity process to stamp new cookies if a cookie is missing from a directory. The full check will repair any lost data if the directory data was lost. * Cookie verification for data integrity checking The algorithm differs from that of LegacyCookieValidation in the following ways: - A empty directory isn't considered a fatal condition. It just means that the preboot phase of the data integrity checker must run. Once the preboot phase runs, it should be safe to stamp the cookies again. - Bookies are not allowed to change their identity. If they do, manual operator intervention is required (which is ok as it is expected that an operator would have to intervene to change the identity in the first place). - A missing cookie in zookeeper is only valid, if there are no cookies in any of the directories, as this is considered a new boot. Otherwise, manual operator intervention is required. * Async iterator for ledger metadata Common code for iterating over ledger metadata. There is already asyncProcessLedgers in LedgerManager, but that only gives the ledgerId and the API is nasty (it even uses ZK specific callbacks). This change adds a more modern iterator, which takes a function which returns a CompletableFuture. The iterator has rudimentary rate limiting, by limiting the number of ledgers which can be processed at a time. We should add something more advanced later, which takes into account the response time from ZK. * Add limbo state to bookie ledger representation Limbo state for a ledger means that we don't know whether we should have an entry for the ledger or not, which can happen when a bookie is started after having its disk wiped. We cannot response with a NoSuchEntryException or NoSuchLedgerException as this tell the client that we never had the requested entry, which may or may not be true, but if we tell it to the client, the client will act like it's true and possible mark the end of the ledger at an incorrect point. This change also adds locking to LedgerMetadataIndex. Previously it relied on the good graces to the calling code to avoid modifying the same ledger concurrently. Now that we are also using the index to store limbo state, we can't be so blasé. * Add entryExists call to ledgerStorage Currently the only way to check if an entry exists in the storage is to try to read the entry. However, this means pulling data out of the entrylog, which it should be sufficient to check that the entry exists in the index. This change adds the entryExists call to ledgerStorage. This has only been implemented for DbLedgerStorage. The implementation for the others should be trivial, but it needs to be tested. * Pregenerate the writeset from ledger metadata The bookkeeper client uses DistributionSchedule (of which RoundRobinDistributionSchedule is the only impl) to decide which members of the ensemble it writes an entry to. This writeset is generated for each entry. However, there is only |ensemble| possible writesets, so we should pregenerate them for the ledger and stop trashing memory. WriteSets represent a set of pregenerated writesets as would be otherwise generated from the distribution schedule. The constructor takes a list of indices (which should be generated based on the list of bookies in the ensemble), which specifies the preferred order that bookies should be tried for reads. * Storage state flags for LedgerStorage If a bookie crashes in the middle of a full integrity check, it needs to know to start it again when it reboots. For this, we need to persist some flag to persistent storage. This change adds persistent flags to the ledger storage interface. Multiple flags can be added in future, for example to mark the storage as dirty on boot, so we can detect non-clean shutdown. Flags are only implemented for DbLedgerStorage. The flags are stored in the metadata index, with a negative ledger id as key. The key of the storage selected for ledger 0 is used. This does mean flags will be lost if there is a change in the storage disk configuration, but data integrity checks will run in this case regardless. * EUNKNOWN code A new response code has been added to communicate that the state of an entry is unknown due the ledger being in limbo. * Added bookie unclean shutdown detection Adds unclean shutdown detection. When running with journal writes disabled and data integrity checking enabled, if the prior shutdown was unclean (not a graceful shutdown) then the data integrity checks are triggered. These checks avoid additional data loss scenarios and repair any lost data caused by the loss of unflushed data at the time of the unclean shutdown. The BookieServer registers start-up and shutdown with the UncleanShutdownDetection class. This class adds a dirty file to each ledger dir on registering start-up and clears all these files on registering shutdown. The presence of any of these files on boot-up indicates the prior shutdown was unclean.

* Data integrity check The integrity check is comprised two parts: a preboot check that is triggered by either an unclean shutdown or an invalid cookie. The preboot check marks any open ledgers as both fenced and in limbo to prevent ledgers affected by potential data loss from being written to. The limbo status prevents NoSuchEntry and NoSuchLedger responses from being sent which avoid ledger truncation from any ledger recovery operations. Finally it sets a storage flag that a full check is required. A new data integrity check service has been to run the full integrity check once the bookie is running. If the service sees that the full check storage flag is set then it runs a full check. This involves scanning the index and comparing it against metadata to discover missing entries. Any missing entries are sourced from peer bookies by the EntryCopier and written to ledger storage. The data integrity check also has a different cookie validation implementation. The following configurations have been added to the conf file: - dataIntegrityCheckingEnabled=true/false. False by default. This config enables or disables data integrity checking. When set to false the legacy cookie validation is used. - dataIntegrityStampMissingCookies=true/false. False by default. This config allows the data integrity process to stamp new cookies if a cookie is missing from a directory. The full check will repair any lost data if the directory data was lost. * Cookie verification for data integrity checking The algorithm differs from that of LegacyCookieValidation in the following ways: - A empty directory isn't considered a fatal condition. It just means that the preboot phase of the data integrity checker must run. Once the preboot phase runs, it should be safe to stamp the cookies again. - Bookies are not allowed to change their identity. If they do, manual operator intervention is required (which is ok as it is expected that an operator would have to intervene to change the identity in the first place). - A missing cookie in zookeeper is only valid, if there are no cookies in any of the directories, as this is considered a new boot. Otherwise, manual operator intervention is required. * Async iterator for ledger metadata Common code for iterating over ledger metadata. There is already asyncProcessLedgers in LedgerManager, but that only gives the ledgerId and the API is nasty (it even uses ZK specific callbacks). This change adds a more modern iterator, which takes a function which returns a CompletableFuture. The iterator has rudimentary rate limiting, by limiting the number of ledgers which can be processed at a time. We should add something more advanced later, which takes into account the response time from ZK. * Add limbo state to bookie ledger representation Limbo state for a ledger means that we don't know whether we should have an entry for the ledger or not, which can happen when a bookie is started after having its disk wiped. We cannot response with a NoSuchEntryException or NoSuchLedgerException as this tell the client that we never had the requested entry, which may or may not be true, but if we tell it to the client, the client will act like it's true and possible mark the end of the ledger at an incorrect point. This change also adds locking to LedgerMetadataIndex. Previously it relied on the good graces to the calling code to avoid modifying the same ledger concurrently. Now that we are also using the index to store limbo state, we can't be so blasé. * Add entryExists call to ledgerStorage Currently the only way to check if an entry exists in the storage is to try to read the entry. However, this means pulling data out of the entrylog, which it should be sufficient to check that the entry exists in the index. This change adds the entryExists call to ledgerStorage. This has only been implemented for DbLedgerStorage. The implementation for the others should be trivial, but it needs to be tested. * Pregenerate the writeset from ledger metadata The bookkeeper client uses DistributionSchedule (of which RoundRobinDistributionSchedule is the only impl) to decide which members of the ensemble it writes an entry to. This writeset is generated for each entry. However, there is only |ensemble| possible writesets, so we should pregenerate them for the ledger and stop trashing memory. WriteSets represent a set of pregenerated writesets as would be otherwise generated from the distribution schedule. The constructor takes a list of indices (which should be generated based on the list of bookies in the ensemble), which specifies the preferred order that bookies should be tried for reads. * Storage state flags for LedgerStorage If a bookie crashes in the middle of a full integrity check, it needs to know to start it again when it reboots. For this, we need to persist some flag to persistent storage. This change adds persistent flags to the ledger storage interface. Multiple flags can be added in future, for example to mark the storage as dirty on boot, so we can detect non-clean shutdown. Flags are only implemented for DbLedgerStorage. The flags are stored in the metadata index, with a negative ledger id as key. The key of the storage selected for ledger 0 is used. This does mean flags will be lost if there is a change in the storage disk configuration, but data integrity checks will run in this case regardless. * EUNKNOWN code A new response code has been added to communicate that the state of an entry is unknown due the ledger being in limbo. * Added bookie unclean shutdown detection Adds unclean shutdown detection. When running with journal writes disabled and data integrity checking enabled, if the prior shutdown was unclean (not a graceful shutdown) then the data integrity checks are triggered. These checks avoid additional data loss scenarios and repair any lost data caused by the loss of unflushed data at the time of the unclean shutdown. The BookieServer registers start-up and shutdown with the UncleanShutdownDetection class. This class adds a dirty file to each ledger dir on registering start-up and clears all these files on registering shutdown. The presence of any of these files on boot-up indicates the prior shutdown was unclean. Co-authored-by: Ivan Kelly <ikelly@splunk.com>

Vanlightly force-pushed the without-journal branch 3 times, most recently from 0a94476 to a0e4253 Compare December 13, 2021 16:38

Vanlightly requested review from eolivelli, ivankelly and zymap December 14, 2021 08:38

Vanlightly force-pushed the without-journal branch from a0e4253 to e977381 Compare December 14, 2021 16:58

Vanlightly requested review from dlg99 and merlimat December 15, 2021 10:56

dlg99 requested a review from jvrao December 17, 2021 17:35

dlg99 reviewed Dec 18, 2021

View reviewed changes

eolivelli reviewed Dec 18, 2021

View reviewed changes

eolivelli approved these changes Dec 21, 2021

View reviewed changes

dlg99 approved these changes Jan 5, 2022

View reviewed changes

Ivan Kelly and others added 9 commits January 12, 2022 09:57

Failing to create dirty files prevents boot

7f15265

Consistent naming of DataIntegrityCheck methods

f312cf8

Capitalised dirty filename variable

a6e2a08

More checks on deleting unclean shutdown dirty file

b9008ad

Renamed EUNKNOWN to EUNKNOWNLEDGERSTATE

0d1479f

Log all dirty files found on boot

add4624

Added more context to error in preboot check

17722b4

RxSchedulerLifecycleComponent takes ownership of rxExecutor

8e4bcd6

Vanlightly force-pushed the without-journal branch from 98101b9 to 8e4bcd6 Compare January 12, 2022 09:07

ivankelly approved these changes Jan 12, 2022

View reviewed changes

eolivelli merged commit a65e888 into apache:master Jan 12, 2022

StevenLuMT mentioned this pull request Jan 14, 2022

rxjava: add maven dependency #2994

Closed

RaulGracia mentioned this pull request Mar 30, 2022

Validate Apache BookKeeper 4.15.0 release - DO NOT MERGE pravega/pravega#6669

Closed

dlg99 mentioned this pull request May 5, 2022

Pulsar with Bk 4.15.0 apache/pulsar#15142

Merged

5 tasks

StevenLuMT mentioned this pull request Aug 19, 2022

add async or sync for :ledger metadata index's rocksdb write #3368

Closed

hezhangjian mentioned this pull request Apr 29, 2024

BP-44: Running without the journal #2705

Closed

BP-46: Data integrity check for running without journal #2936

BP-46: Data integrity check for running without journal #2936

Uh oh!

Conversation

Vanlightly commented Dec 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Uh oh!

dlg99 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

eolivelli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dlg99 left a comment

Choose a reason for hiding this comment

Uh oh!

eolivelli commented Jan 9, 2022

Uh oh!

eolivelli commented Jan 10, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Vanlightly commented Dec 13, 2021 •

edited

Loading