BP-38: Bypass journal ledger #1944
Glad to see this is being worked on. Will review the proposal.
| To guarantee high durability, BK write journal before flush data to persistent device which will cause two write of data. | ||
| At the presence of replicating and auto-recovery mechanism, the two-write is a bit waste of the persistent device bandwidth, | ||
| especially on the [scenarios](https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability) which prefer weak durability guarantee. |
I don't believe our auto-recovery and replication alone can address our persistence needs. I would request that you reword it to say something like: "we may not need this level of persistence in scenarios such as weak durability..."
done
| This proposal is aimed at providing bypass journal ledger, this feature includes these parts work: | ||
| - add new write flag `BYPASS_JOURNAL` to existing protocol |
I assume BYPASS_JOURNAL is applicable only in the weak durability case. We already have that in the protocol, so can't BYPASS_JOURNAL just be a configuration parameter on the bookie, which will be used (if enabled) only in the weak durability case?
When a user creates a LedgerHandle, they can construct it with WriteFlag.BYPASS_JOURNAL.
@jvrao I think we cannot have a bookie configuration parameter because the client must be aware of relaxed durability and advance LAC accordingly (like we do with DEFERRED_SYNC)
@jvrao If I remember this correctly, we were thinking of adding BYPASS_JOURNAL as a WriteFlag, no? Applications can decide whether to use journal or not.
| - impl the newly write flag at the client side and server side |
Do you mean a client-visible API change? Or where exactly is this flag?
Yes. What I mean is extending WriteFlag: a user who wants this feature only needs to pass the write flag when creating the WriteHandle. It looks like this:

    newCreateLedgerOp()
        .withEnsembleSize(3)
        .withWriteQuorumSize(3)
        .withAckQuorumSize(3)
        .withPassword(PASSWORD)
        .withWriteFlags(WriteFlag.BYPASS_JOURNAL)
        .execute()

@ArvinDevel it would be good if you can put the example code in the BP.
will do that
| Modify server side code mostly and don't change legerHandle's LAC advance logic, if the write flag is `BYPASS_JOURNAL`, after write to `LegerStorage`(the data maybe in the memTable, or the buffer of File, or the os cache), | ||
| bookie return result to the client directly. | ||
| This impl is like [disable syncData](https://github.com/apache/bookkeeper/issues/753), once all the replica fails, the BK cluster can't recovery from it. |
With this the above will be obsolete?
The bookie-level syncData config is only mentioned for comparison. We still need to modify the server side to recognize the client's WriteFlag.BYPASS_JOURNAL option, and extend WriteFlag to give users this choice.
| 1. Relax LAC protocol | ||
| Modify server side code mostly and don't change legerHandle's LAC advance logic, if the write flag is `BYPASS_JOURNAL`, after write to `LegerStorage`(the data maybe in the memTable, or the buffer of File, or the os cache), |
> don't change legerHandle's LAC advance logic

I think we should handle LAC the DEFERRED_SYNC way: a regular write does not advance LAC; you need a "force" for that (like you state below).
The difficulty is how to execute "force": whether to leave it to the application or to the client library, and how often to schedule it if the WriteHandle is responsible for that. Do you have any ideas?
The force() RPC may carry a flag on the wire telling the bookie that the client wants to wait for a flush of the memtable for the given ledger:
- force() with DEFERRED_SYNC -> wait for/force a write to the journal
- force() with BYPASS_JOURNAL -> wait for/force a flush on the EntryLogger
It makes sense only in conjunction with "multiple entry loggers"/"one entry logger per ledger" feature.
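The force() mapping above could be sketched roughly as follows. This is a hypothetical model, not BookKeeper code: `SketchWriteFlag`, `ForceTarget` and `ForceDispatch` are made-up names, and giving BYPASS_JOURNAL precedence when both flags are set is an assumption of this sketch rather than something decided in the thread.

```java
import java.util.EnumSet;

// Hypothetical stand-in for the write flags carried on the wire (not the real WriteFlag enum).
enum SketchWriteFlag { DEFERRED_SYNC, BYPASS_JOURNAL }

// What a force() request would have to wait for on the bookie.
enum ForceTarget { JOURNAL, ENTRY_LOGGER, NONE }

final class ForceDispatch {
    private ForceDispatch() {}

    // Map the flags carried by force() to the component that must be flushed.
    static ForceTarget targetFor(EnumSet<SketchWriteFlag> flags) {
        if (flags.contains(SketchWriteFlag.BYPASS_JOURNAL)) {
            // Entries never went through the journal, so flush ledger storage instead.
            return ForceTarget.ENTRY_LOGGER;
        }
        if (flags.contains(SketchWriteFlag.DEFERRED_SYNC)) {
            // Entries were written to the journal without waiting for sync; sync it now.
            return ForceTarget.JOURNAL;
        }
        // Assumption: with no relaxed-durability flag there is nothing extra to wait for.
        return ForceTarget.NONE;
    }
}
```

For example, `ForceDispatch.targetFor(EnumSet.of(SketchWriteFlag.DEFERRED_SYNC))` yields `ForceTarget.JOURNAL`.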
Yes, but I'm more worried about whether a synchronous "force" is suitable for bypass-journal weak durability. In the "deferred_sync" scenario, the consumer/ReadHandle can tolerate reading only after "force" or close. If we still restrict the reader to the point where the writer last called "force", will this limit the scenarios where bypass-journal weak durability can be applied?
I agree with @eolivelli. BYPASS_JOURNAL should have the same LAC semantics as DEFERRED_SYNC: we only advance LAC when a force happens. The force request carries the write flag, which decides whether to force a write to the journal or a flush on ledger storage.
For bypass-journal use cases, I think applications usually read only once a ledger is closed/forced, so the above assumption would make the implementation easier.
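A minimal writer-side sketch of that rule, assuming in-order acknowledgements: acks only move `pendingAddsSequenceHead`, and LAC moves only when a force completes. `RelaxedLacWriter` and its methods are hypothetical names for illustration, not the actual client implementation.

```java
// Hypothetical writer-side model of "LAC advances only on force"; not the BookKeeper client.
final class RelaxedLacWriter {
    private long pendingAddsSequenceHead = -1L; // highest contiguously acked entry id
    private long lastAddConfirmed = -1L;        // what readers are allowed to rely on

    // An add reached the ack quorum, but may still sit only in bookie memory.
    void onAddAcked(long entryId) {
        if (entryId == pendingAddsSequenceHead + 1) {
            pendingAddsSequenceHead = entryId;
        }
    }

    // Start a force: remember which entries this force is going to cover.
    long beginForce() {
        return pendingAddsSequenceHead;
    }

    // The force completed on the bookies: entries up to the snapshot are durable.
    void completeForce(long snapshot) {
        lastAddConfirmed = Math.max(lastAddConfirmed, snapshot);
    }

    long lastAddConfirmed() {
        return lastAddConfirmed;
    }

    long pendingAddsSequenceHead() {
        return pendingAddsSequenceHead;
    }
}
```

Entries acknowledged after `beginForce()` are not covered by that force, so they stay above LAC until the next one.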
| - Add persistent callback to LedgerStorage | ||
| Maintain non-persistent entry list and `maxPersistentEntryId` on `LedgerDescriptor`. |
I tried to have a 'maxPersistentEntryId' while working on DEFERRED_SYNC, but it was not actually possible: if you use "LedgerHandleAdv", entries won't reach the bookie in any specific order, so you would have to keep track of holes.
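To illustrate what "keeping track of holes" could involve, here is a small sketch, assuming entry ids start at 0: it remembers entries persisted ahead of a gap and only advances the bound once the gap closes. `PersistedEntryTracker` is a hypothetical helper, not part of BookKeeper.

```java
import java.util.TreeSet;

// Hypothetical helper: tracks the highest entry id below which everything is persisted.
final class PersistedEntryTracker {
    private long maxContiguousPersisted = -1L;                    // -1 means nothing persisted yet
    private final TreeSet<Long> persistedAhead = new TreeSet<>(); // persisted entries beyond a hole

    // Record that entryId is persisted; returns the updated contiguous bound.
    synchronized long markPersisted(long entryId) {
        if (entryId <= maxContiguousPersisted) {
            return maxContiguousPersisted; // duplicate notification, nothing to do
        }
        persistedAhead.add(entryId);
        // Close holes: consume entries as long as they directly follow the bound.
        while (!persistedAhead.isEmpty()
                && persistedAhead.first() == maxContiguousPersisted + 1) {
            persistedAhead.pollFirst();
            maxContiguousPersisted++;
        }
        return maxContiguousPersisted;
    }

    // Number of persisted entries still waiting behind a hole.
    synchronized int pendingBehindHole() {
        return persistedAhead.size();
    }
}
```

The cost is the extra per-ledger state: the set grows with the number of out-of-order entries, which is part of why a single `maxPersistentEntryId` field is not enough.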
you're right, thanks
| - Client side changes | ||
| Add `nonPersistentLAC` to WriteHandle. WriteHandle with 'bypass-journal' option updates the LAC using `maxPersistentEntryId`, and update `nonPersistentLAC` if receives enough ack. |
nonPersistentLAC is more like 'pendingAddsSequenceHead' that we introduced for DEFERRED_SYNC
Yes, and I forgot to state a few things needed to complete `nonPersistentLAC`:
- carry `nonPersistentLAC` to the bookie, and have LedgerDescriptor record it
- extend readEntry of ReadHandle, so that it can read up to `nonPersistentLAC`
Why persist a non-persistent value?
The value makes sense only on the client (writer); you won't ever read that value from bookies.
What I want is to let the reader read up to the latest entry as early as possible.
eolivelli
left a comment
I really like this proposal, and I think we should go the 'WriteFlag' way.
Thank you for working on this @ArvinDevel
When you get to the implementation, remember to create patches for the server side first; then we will be able to work on the client side, because client-side changes are "public API" and we cannot release an API that is missing its server-side counterpart.
@ArvinDevel this doc does not deal with fencing. With DEFERRED_SYNC, for instance, we are not changing how recovery/fencing works.
To be honest, I'm not familiar with the fencing logic. To simplify the design, can we keep recovery/fencing unchanged? Since the fence info is small, storing it in the journal has little cost.
@jvrao @eolivelli @sijie @merlimat
047d868 to
611ed44
Compare
Descriptions of the changes in this PR:
Motivation
To guarantee high durability, BK write journal before flush data to persistent device which will cause two write of data.
At the presence of replicating and auto-recovery mechanism, the two-write is a bit waste of the persistent device bandwidth,
especially on the scenarios which prefer weak durability guarantee.
This proposal is aimed at providing bypass journal ledger, this feature includes these parts work:
- add new write flag `BYPASS_JOURNAL` to existing protocol

Changes
(Describe: what changes you have made)
Master Issue: #1945