
Conversation

@merlimat merlimat (Contributor) commented Mar 1, 2023

Motivation

In #3545 we switched the `ForceWriteThread` to take advantage of the `BlockingQueue.drainTo()` method to reduce contention, though the core force-write logic was not touched at the time.

The force-write logic is quite complicated because it tries to group multiple force-write requests in the queue: it sends a marker down the queue and groups the requests once the marker is received. This also adds a bit of lag when many requests are coming in and the IO is stressed, since we wait a bit longer before issuing the fsync.

Instead, with the `drainTo()` approach we can greatly simplify the logic and maintain strict fsync grouping (a minimal sketch in Java follows the list):

  1. drain all the force-write-requests available in the queue into a local array list
  2. perform the fsync
  3. update the journal log mark to the position of the last fw request
  4. trigger send-responses for all the requests
  5. go back to read from the queue
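
To make the flow concrete, here is a minimal sketch of the new loop, assuming simplified names (`running`, `sendResponses()`, the queue capacity); the actual `Journal` force-write code differs in its details:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ForceWriteLoop {
    private final BlockingQueue<ForceWriteRequest> forceWriteRequests =
            new ArrayBlockingQueue<>(10_000); // capacity is illustrative

    private volatile boolean running = true; // hypothetical shutdown flag

    void run() throws InterruptedException, IOException {
        List<ForceWriteRequest> localRequests = new ArrayList<>();
        while (running) {
            // 1. Block for the first request, then drain whatever else is queued
            localRequests.add(forceWriteRequests.take());
            forceWriteRequests.drainTo(localRequests);

            // 2 + 3. One fsync covers the whole batch; syncing up to the last
            // request also moves the journal log mark to its position
            ForceWriteRequest lastRequest = localRequests.get(localRequests.size() - 1);
            syncJournal(lastRequest);

            // 4. Only after the fsync are the client responses triggered
            for (ForceWriteRequest req : localRequests) {
                req.sendResponses();
            }
            localRequests.clear();
            // 5. Loop back to read from the queue
        }
    }

    private void syncJournal(ForceWriteRequest lastRequest) throws IOException {
        // fsync the journal and update the log mark (omitted in this sketch)
    }

    interface ForceWriteRequest {
        void sendResponses(); // stands in for the real response trigger
    }
}
```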

This refactoring also enables further improvements to how the send-responses are prepared, since we now have a list of responses ready to send.

@merlimat merlimat added this to the 4.16.0 milestone Mar 1, 2023
@merlimat merlimat self-assigned this Mar 1, 2023
Comment on lines +489 to +491
```java
// Sync and mark the journal up to the position of the last entry in the batch
ForceWriteRequest lastRequest = localRequests.get(requestsCount - 1);
syncJournal(lastRequest);
```
Member

Is it possible to have two log files in the batch?

Contributor

+1
If the localRequests queue contains multiple journal files and we only sync the lastRequest's journal file, the sync for the other journal files will be skipped.

Contributor Author

Yes, it's possible, though the Journal thread would have already closed the previous file, so we wouldn't need to either fsync or close it.

Contributor Author

Actually, you're correct. We need to ensure all the files are closed before the response is triggered. Fixed it.

Member

The original logic executed forceWrite before close, which also runs bestEffortRemoveFromPageCache. If there are two different journal files in the batch and we only force-write the last file, do we need to force-write the other journal file as well? Does the close do that?

Contributor

+1
It has two bugs:

  • The non-last journal files in the batch won't be removed from the OS page cache
  • The channels of the non-last journal files in the batch will just call close() instead of force(false), which can't ensure the data is flushed to disk

Contributor Author

All good points. I'll fix it.

Contributor Author

I was actually 100% sure that close() implied an fsync, but that's not really the case.
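
For context, `FileChannel.close()` only releases the descriptor and gives no durability guarantee; the data must be forced to disk explicitly. A minimal sketch of a durable close for the non-last journal files in a batch (the class and helper names are illustrative, not the actual `Journal` methods):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;

final class JournalFileUtil {
    static void closeDurably(FileChannel channel) throws IOException {
        // force(false) flushes the file data to disk; `false` means file
        // metadata (e.g. timestamps) is not required to be written as well
        channel.force(false);
        channel.close(); // safe to close only after the force
        // the actual fix also drops the synced file from the OS page cache
        // via bestEffortRemoveFromPageCache before discarding it
    }
}
```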

@hangc0276 (Contributor)

Please rebase onto master after #3836 is merged, to trigger the CI.

@codecov-commenter commented Mar 4, 2023

Codecov Report

Merging #3830 (1879a97) into master (b4112df) will increase coverage by 0.00%.
The diff coverage is 75.86%.

```
@@            Coverage Diff            @@
##             master    #3830   +/-   ##
=========================================
  Coverage     68.21%   68.22%
+ Complexity     6761     6751   -10
=========================================
  Files           473      473
  Lines         40950    40889   -61
  Branches       5240     5229   -11
=========================================
- Hits          27935    27896   -39
+ Misses        10762    10734   -28
- Partials       2253     2259    +6
```
| Flag | Coverage Δ |
|---|---|
| bookie | 39.78% <72.41%> (-0.05%) ⬇️ |
| client | 44.09% <72.41%> (-0.11%) ⬇️ |
| remaining | 29.50% <68.96%> (-0.01%) ⬇️ |
| replication | 41.33% <72.41%> (+0.03%) ⬆️ |
| tls | 20.96% <68.96%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| ...g/apache/bookkeeper/bookie/stats/JournalStats.java | 85.71% <ø> (-3.76%) ⬇️ |
| ...ain/java/org/apache/bookkeeper/bookie/Journal.java | 79.81% <75.86%> (-0.96%) ⬇️ |
| ...g/apache/bookkeeper/proto/WriteEntryProcessor.java | 76.05% <0.00%> (-4.23%) ⬇️ |
| ...apache/bookkeeper/bookie/LedgerDescriptorImpl.java | 68.42% <0.00%> (-3.51%) ⬇️ |
| .../apache/bookkeeper/proto/ReadEntryProcessorV3.java | 62.50% <0.00%> (-2.09%) ⬇️ |
| ...he/bookkeeper/bookie/InterleavedLedgerStorage.java | 77.44% <0.00%> (-1.88%) ⬇️ |
| ...org/apache/bookkeeper/client/PendingReadLacOp.java | 73.68% <0.00%> (-1.76%) ⬇️ |
| ...ava/org/apache/bookkeeper/client/PendingAddOp.java | 87.12% <0.00%> (-1.49%) ⬇️ |
| .../main/java/org/apache/bookkeeper/util/ZkUtils.java | 82.47% <0.00%> (-1.04%) ⬇️ |

... and 19 more


@zymap zymap (Member) left a comment

LGTM

@hangc0276 hangc0276 merged commit 128c52e into apache:master Mar 6, 2023
hangc0276 pushed a commit that referenced this pull request Mar 7, 2023
### Motivation

Note: this is stacked on top of #3830 & #3835

This change improves the way the AddRequest responses are sent to the client.

The current flow is: 
 * The journal-force-thread issues the fsync on the journal file
 * We iterate over all the entries that were just synced and for each of them:
     1. Trigger `channel.writeAndFlush()`
     2. This will jump on the connection IO thread (Netty will use a `write()` to `eventfd` to post the task and wake the epoll)
     3. Write the object to the connection and trigger the serialization logic
     4. Grab a `ByteBuf` from the pool and write ~20 bytes with the response
     5. Write and flush the buffer on the channel
     6. With the flush consolidator we try to group multiple buffers into a single `writev()` syscall, though each call will have a long list of buffers, making the memcpy inefficient.
     7. Release all the buffers and return them to the pool

All these steps are quite expensive when the bookie is receiving a lot of small requests. 

This PR changes the flow into: 

1. journal fsync
2. go through each request and prepare the response into a per-connection `ByteBuf`, which is not yet written to the channel
3. after preparing all the responses, flush them at once: trigger an event on each connection that writes the accumulated buffer (see the sketch after the advantages list)

The advantages are:
 1. 1 `ByteBuf` allocated per connection instead of 1 per request
    1. Fewer allocations and less stress on the buffer pool
    2. More efficient socket `write()` operations
 2. 1 task per connection posted on the Netty IO threads, instead of 1 per request.
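
A minimal sketch of the per-connection batching idea, assuming an illustrative `BatchedResponseWriter` class and a simplified 20-byte response layout (the real bookie response serialization differs):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.channel.Channel;

class BatchedResponseWriter {
    private final Channel channel;    // Netty channel for one client connection
    private ByteBuf pendingResponses; // accumulated responses, not yet written

    BatchedResponseWriter(Channel channel) {
        this.channel = channel;
        this.pendingResponses = ByteBufAllocator.DEFAULT.directBuffer();
    }

    // Step 2: serialize each response into the shared per-connection buffer,
    // instead of allocating one ByteBuf per request
    void prepareAddResponse(long ledgerId, long entryId, int rc) {
        pendingResponses.writeLong(ledgerId);
        pendingResponses.writeLong(entryId);
        pendingResponses.writeInt(rc);
    }

    // Step 3: after all responses in the fsync batch are prepared, post a
    // single task on the connection's IO thread to write the whole buffer
    void flush() {
        ByteBuf toSend = pendingResponses;
        pendingResponses = ByteBufAllocator.DEFAULT.directBuffer();
        channel.eventLoop().execute(() -> channel.writeAndFlush(toSend));
    }
}
```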
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
…rainTo() (apache#3830)

Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024