Skip to content

RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream.#964

Merged
szetszwo merged 9 commits intoapache:masterfrom
szetszwo:RATIS-589
Nov 16, 2023
Merged

RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream.#964
szetszwo merged 9 commits intoapache:masterfrom
szetszwo:RATIS-589

Conversation

@szetszwo
Copy link
Contributor

See RATIS-589

@szetszwo szetszwo requested a review from SzyWilliam November 13, 2023 17:38
Copy link
Member

@SzyWilliam SzyWilliam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo Thanks for the patch! the idea of zero copy is exciting!
However, the current implementation seems to have an issue. While avoiding one internal heap copy operation, we are now facing an additional direct memory to heap copy operation.

final ByteBuffer duplicated = buf.duplicate();
duplicated.position(pos).limit(protoEndPos);
checksum.reset();
checksum.update(duplicated);
Copy link
Member

@SzyWilliam SzyWilliam Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using DirectByteBuffer here, the checksum operation essentially reads all data from direct memory back to heap memory. Even worse, the UsafeOperation will do a lot of validations before each ByteBuffer.get, leading to significant performance issues.

// from DirectByteBuffer.java
public byte get(int i) {
        try {
            return ((SCOPED_MEMORY_ACCESS.getByte(session(), null, ix(checkIndex(i)))));
        } finally {
            Reference.reachabilityFence(this);
        }
    }

Maybe we can found a way to compute checksum directly on LogEntryProto, since it's a heap object?

Copy link
Contributor Author

@szetszwo szetszwo Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SzyWilliam , thanks for reviewing this!

Since we are using DirectByteBuffer here, the checksum operation essentially reads all data from direct memory back to heap memory.

You are right that it has to read the direct buffer for CRC computation. Even for byte[], it has to read the array. There seems no ways to avoid reading it for CRC computation.

Even worse, the UsafeOperation will do a lot of validations before each ByteBuffer.get, leading to significant performance issues.

Good point! We may use getLong(..) to reduce the number of calls.

According to the result from https://stackoverflow.com/questions/59008751/direct-java-nio-bytebuffer-vs-java-array-performance-test , the performance for byte[] is better than direct ByteBuffer for memory access. However, we will gain back the performance when including the buffer copying.

@szetszwo
Copy link
Contributor Author

... While avoiding one internal heap copy operation, we are now facing an additional direct memory to heap copy operation.

Yes, this change avoids

  • one copy (i.e. one heap read and one heap write)
  • one more heap read for crc

but adds

  • one direct buffer read for crc

Maybe we can found a way to compute checksum directly on LogEntryProto, since it's a heap object?

This is a good suggestion! Let me see if there is a way to do it.

@szetszwo
Copy link
Contributor Author

... Let me see if there is a way to do it.

One idea (similar to DigestOutputStream) is to create a new class CrcByteBuffer extending ByteBuffer. When ever something is written to it, it will update CRC. Then, it avoid reading from the buffer.

@szetszwo
Copy link
Contributor Author

... create a new class CrcByteBuffer extending ByteBuffer.

Since there are package private method CrcByteBuffer have to be in java.nio. Everything seems to work until getting

  • java.lang.SecurityException: Prohibited package name: java.nio

when running unit tests.

@SzyWilliam
Copy link
Member

We may use getLong(..) to reduce the number of calls.

You are right! getLong() seems good enough to amortize one index validation to 8 bytes.

Since there are package private method CrcByteBuffer have to be in java.nio. Everything seems to work until getting
java.lang.SecurityException: Prohibited package name: java.nio

Thanks for taking time on this exploration! That's really unfortunate, we are almost there. Still, I think getLong is good enough!

@szetszwo szetszwo requested a review from SzyWilliam November 15, 2023 19:18
Copy link
Member

@SzyWilliam SzyWilliam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good to me. Thanks @szetszwo for the big patch! It's exciting to see the zero-copy features!

@szetszwo szetszwo merged commit 2e89dad into apache:master Nov 16, 2023
@szetszwo
Copy link
Contributor Author

@SzyWilliam , thanks a lot for reviewing this and the suggestions!

RexXiong pushed a commit to apache/celeborn that referenced this pull request May 30, 2024
### What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

- RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream.(apache/ratis#964)
- RATIS-1677. Do not auto format RaftStorage in RECOVER.(apache/ratis#718)
- RATIS-1710. Refactor metrics api and implementation to separated modules. (apache/ratis#749)

### Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0, v3.0.1, which release note refers to [3.0.0](https://ratis.apache.org/post/3.0.0.html), [3.0.1](https://ratis.apache.org/post/3.0.1.html). The 3.0.x version include new features like pluggable metrics and lease read, etc, some improvements and bugfixes including:

- 3.0.0: Change list of ratis 3.0.0 In total, there are roughly 100 commits diffing from 2.5.1 including:
   - Incompatible Changes
      - RaftStorage Auto-Format
      - RATIS-1677. Do not auto format RaftStorage in RECOVER. (apache/ratis#718)
      - RATIS-1694. Fix the compatibility issue of RATIS-1677. (apache/ratis#731)
      - RATIS-1871. Auto format RaftStorage when there is only one directory configured. (apache/ratis#903)
      - Pluggable Ratis-Metrics (RATIS-1688)
      - RATIS-1689. Remove the use of the thirdparty Gauge. (apache/ratis#728)
      - RATIS-1692. Remove the use of the thirdparty Counter. (apache/ratis#732)
      - RATIS-1693. Remove the use of the thirdparty Timer. (apache/ratis#734)
      - RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (apache/ratis#741)
      - RATIS-1704. Fix SuppressWarnings(“VisibilityModifier”) in RatisMetrics. (apache/ratis#742)
      - RATIS-1710. Refactor metrics api and implementation to separated modules. (apache/ratis#749)
      - RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (apache/ratis#751)
      - RATIS-1391. Update library dropwizard.metrics version to 4.x (apache/ratis#632)
      - RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (apache/ratis#671)
      - Streaming Protocol Change
      - RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (apache/ratis#635)
   - New Features
      - Leader Lease (RATIS-1864)
      - RATIS-1865. Add leader lease bound ratio configuration (apache/ratis#897)
      - RATIS-1866. Maintain leader lease after AppendEntries (apache/ratis#898)
      - RATIS-1894. Implement ReadOnly based on leader lease (apache/ratis#925)
      - RATIS-1882. Support read-after-write consistency (apache/ratis#913)
      - StateMachine API
      - RATIS-1874. Add notifyLeaderReady function in IStateMachine (apache/ratis#906)
      - RATIS-1897. Make TransactionContext available in DataApi.write(..). (apache/ratis#930)
      - New Configuration Properties
      - RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (apache/ratis#896)
      - RATIS-1930. Add a conf for enable/disable majority-add. (apache/ratis#961)
      - RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (apache/ratis#950)
      - RATIS-1636. Support re-config ratis properties (apache/ratis#800)
      - RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (apache/ratis#901)
   - Improvements & Bug Fixes
      - Netty
         - RATIS-1898. Netty should use EpollEventLoopGroup by default (apache/ratis#931)
         - RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (apache/ratis#932)
         - RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (apache/ratis#955)
         - RATIS-1923. Netty: atomic operations require side-effect-free functions. (apache/ratis#956)
      - RaftServer
         - RATIS-1924. Increase the default of raft.server.log.segment.size.max. (apache/ratis#957)
         - RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (apache/ratis#923)
         - RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge apache/ratis#922 (apache/ratis#922)
         - RATIS-761. Handle writeStateMachineData failure in leader. (apache/ratis#927)
         - RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (apache/ratis#933)
         - RATIS-1912. Fix infinity election when perform membership change. (apache/ratis#954)
         - RATIS-1858. Follower keeps logging first election timeout. (apache/ratis#894)

- 3.0.1:This is a bugfix release. See the [changes between 3.0.0 and 3.0.1](apache/ratis@ratis-3.0.0...ratis-3.0.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster manual test.

Closes #2480 from SteNicholas/CELEBORN-1400.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants