
Conversation

@merlimat merlimat (Contributor) commented Mar 1, 2023

Motivation

In #3545 we switched the `ForceWriteThread` to take advantage of the `BlockingQueue.drainTo()` method to reduce contention, though the core force-write logic was not touched at the time.

The force-write logic is quite complicated because it tries to group multiple force-write requests in the queue: it sends a marker down the queue and groups the requests once the marker is received. This also adds a bit of lag when many requests are coming in and the IO is stressed, since we wait a bit longer before issuing the fsync.

Instead, with the `drainTo()` approach we can greatly simplify the logic and maintain strict fsync grouping (a minimal sketch in Java follows the list):

  1. drain all the force-write-requests available in the queue into a local array list
  2. perform the fsync
  3. update the journal log mark to the position of the last fw request
  4. trigger send-responses for all the requests
  5. go back to read from the queue
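
To make the flow concrete, here is a minimal sketch of the new loop, assuming simplified names (`running`, `sendResponses()`, the queue capacity); the actual `Journal` force-write code differs in its details:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ForceWriteLoop {
    private final BlockingQueue<ForceWriteRequest> forceWriteRequests =
            new ArrayBlockingQueue<>(10_000); // capacity is illustrative

    private volatile boolean running = true; // hypothetical shutdown flag

    void run() throws InterruptedException, IOException {
        List<ForceWriteRequest> localRequests = new ArrayList<>();
        while (running) {
            // 1. Block for the first request, then drain whatever else is queued
            localRequests.add(forceWriteRequests.take());
            forceWriteRequests.drainTo(localRequests);

            // 2 + 3. One fsync covers the whole batch; syncing up to the last
            // request also moves the journal log mark to its position
            ForceWriteRequest lastRequest = localRequests.get(localRequests.size() - 1);
            syncJournal(lastRequest);

            // 4. Only after the fsync are the client responses triggered
            for (ForceWriteRequest req : localRequests) {
                req.sendResponses();
            }
            localRequests.clear();
            // 5. Loop back to read from the queue
        }
    }

    private void syncJournal(ForceWriteRequest lastRequest) throws IOException {
        // fsync the journal and update the log mark (omitted in this sketch)
    }

    interface ForceWriteRequest {
        void sendResponses(); // stands in for the real response trigger
    }
}
```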

This refactoring also enables further improvements to how the send-responses are prepared, since we now have a list of responses ready to send.

@merlimat merlimat added this to the 4.16.0 milestone Mar 1, 2023
@merlimat merlimat self-assigned this Mar 1, 2023
Comment on lines +489 to +491
```java
// Sync and mark the journal up to the position of the last entry in the batch
ForceWriteRequest lastRequest = localRequests.get(requestsCount - 1);
syncJournal(lastRequest);
```
Member

Is it possible to have two log files in the batch?

Contributor

+1
If the localRequests queue contains multiple journal files and we only sync the lastRequest's journal file, the sync for the other journal files will be skipped.

Contributor Author

Yes, it's possible, though the Journal thread would have already closed the previous file, so we wouldn't need to either fsync or close it.

Contributor Author

Actually, you're correct. We need to ensure all the files are closed before the response is triggered. Fixed it.

Member

The original logic executed forceWrite before close, which also runs bestEffortRemoveFromPageCache. If there are two different journal files in the batch and we only force-write the last file, do we need to force-write the other journal file as well? Does the close do that?

Contributor

+1
It has two bugs:

  • The non-last journal files in the batch won't be removed from the OS page cache
  • The channels of the non-last journal files in the batch will just call close() instead of force(false), which can't ensure the data is flushed to disk

Contributor Author

All good points. I'll fix it.

Contributor Author

I was actually 100% sure that close() implied an fsync, but that's not really the case.
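
For context, `FileChannel.close()` only releases the descriptor and gives no durability guarantee; the data must be forced to disk explicitly. A minimal sketch of a durable close for the non-last journal files in a batch (the class and helper names are illustrative, not the actual `Journal` methods):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;

final class JournalFileUtil {
    static void closeDurably(FileChannel channel) throws IOException {
        // force(false) flushes the file data to disk; `false` means file
        // metadata (e.g. timestamps) is not required to be written as well
        channel.force(false);
        channel.close(); // safe to close only after the force
        // the actual fix also drops the synced file from the OS page cache
        // via bestEffortRemoveFromPageCache before discarding it
    }
}
```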

@hangc0276 (Contributor)

Please rebase onto master after #3836 is merged, to trigger the CI.

@codecov-commenter commented Mar 4, 2023

Codecov Report

Merging #3830 (1879a97) into master (b4112df) will increase coverage by 0.00%.
The diff coverage is 75.86%.

```
@@            Coverage Diff            @@
##             master    #3830   +/-   ##
=========================================
  Coverage     68.21%   68.22%
+ Complexity     6761     6751   -10
=========================================
  Files           473      473
  Lines         40950    40889   -61
  Branches       5240     5229   -11
=========================================
- Hits          27935    27896   -39
+ Misses        10762    10734   -28
- Partials       2253     2259    +6
```
| Flag | Coverage Δ |
|---|---|
| bookie | 39.78% <72.41%> (-0.05%) ⬇️ |
| client | 44.09% <72.41%> (-0.11%) ⬇️ |
| remaining | 29.50% <68.96%> (-0.01%) ⬇️ |
| replication | 41.33% <72.41%> (+0.03%) ⬆️ |
| tls | 20.96% <68.96%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| ...g/apache/bookkeeper/bookie/stats/JournalStats.java | 85.71% <ø> (-3.76%) ⬇️ |
| ...ain/java/org/apache/bookkeeper/bookie/Journal.java | 79.81% <75.86%> (-0.96%) ⬇️ |
| ...g/apache/bookkeeper/proto/WriteEntryProcessor.java | 76.05% <0.00%> (-4.23%) ⬇️ |
| ...apache/bookkeeper/bookie/LedgerDescriptorImpl.java | 68.42% <0.00%> (-3.51%) ⬇️ |
| .../apache/bookkeeper/proto/ReadEntryProcessorV3.java | 62.50% <0.00%> (-2.09%) ⬇️ |
| ...he/bookkeeper/bookie/InterleavedLedgerStorage.java | 77.44% <0.00%> (-1.88%) ⬇️ |
| ...org/apache/bookkeeper/client/PendingReadLacOp.java | 73.68% <0.00%> (-1.76%) ⬇️ |
| ...ava/org/apache/bookkeeper/client/PendingAddOp.java | 87.12% <0.00%> (-1.49%) ⬇️ |
| .../main/java/org/apache/bookkeeper/util/ZkUtils.java | 82.47% <0.00%> (-1.04%) ⬇️ |

... and 19 more


@zymap zymap (Member) left a comment

LGTM

@hangc0276 hangc0276 merged commit 128c52e into apache:master Mar 6, 2023
hangc0276 pushed a commit that referenced this pull request Mar 7, 2023
### Motivation

Note: this is stacked on top of #3830 & #3835

This change improves the way the AddRequest responses are sent to the client.

The current flow is: 
 * The journal-force-thread issues the fsync on the journal file
 * We iterate over all the entries that were just synced and for each of them:
     1. Trigger `channel.writeAndFlush()`
     2. This will jump on the connection IO thread (Netty will use a `write()` to `eventfd` to post the task and wake the epoll)
     3. Write the object to the connection and trigger the serialization logic
     4. Grab a `ByteBuf` from the pool and write ~20 bytes with the response
     5. Write and flush the buffer on the channel
     6. With the flush consolidator we try to group multiple buffers into a single `writev()` syscall, though each call will have a long list of buffers, making the memcpy inefficient.
     7. Release all the buffers and return them to the pool

All these steps are quite expensive when the bookie is receiving a lot of small requests. 

This PR changes the flow into: 

1. journal fsync
2. go through each request and prepare the response into a per-connection `ByteBuf`, which is not yet written to the channel
3. after preparing all the responses, flush them at once: trigger an event on each connection that writes the accumulated buffer (see the sketch after the advantages list)

The advantages are:
 1. 1 `ByteBuf` allocated per connection instead of 1 per request
    1. Fewer allocations and less stress on the buffer pool
    2. More efficient socket `write()` operations
 2. 1 task per connection posted on the Netty IO threads, instead of 1 per request.
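
A minimal sketch of the per-connection batching idea, assuming an illustrative `BatchedResponseWriter` class and a simplified 20-byte response layout (the real bookie response serialization differs):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.channel.Channel;

class BatchedResponseWriter {
    private final Channel channel;    // Netty channel for one client connection
    private ByteBuf pendingResponses; // accumulated responses, not yet written

    BatchedResponseWriter(Channel channel) {
        this.channel = channel;
        this.pendingResponses = ByteBufAllocator.DEFAULT.directBuffer();
    }

    // Step 2: serialize each response into the shared per-connection buffer,
    // instead of allocating one ByteBuf per request
    void prepareAddResponse(long ledgerId, long entryId, int rc) {
        pendingResponses.writeLong(ledgerId);
        pendingResponses.writeLong(entryId);
        pendingResponses.writeInt(rc);
    }

    // Step 3: after all responses in the fsync batch are prepared, post a
    // single task on the connection's IO thread to write the whole buffer
    void flush() {
        ByteBuf toSend = pendingResponses;
        pendingResponses = ByteBufAllocator.DEFAULT.directBuffer();
        channel.eventLoop().execute(() -> channel.writeAndFlush(toSend));
    }
}
```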
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
…rainTo() (apache#3830)

Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024