Conversation

@Cyrill (Contributor) commented Sep 26, 2023

What changes were proposed in this pull request?

Cache byte buffers for checksum calculation to reduce GC pressure.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7228

How was this patch tested?

Performance tests.

@adoroszlai (Contributor)

Thanks @Cyrill for the patch. Please fix this checkstyle issue:

hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/common/CRC32ByteBufferImpl.java
 27: Missing a Javadoc comment.

https://github.com/Cyrill/ozone/actions/runs/6313525138/job/17141782688#step:7:11

if (buffer.hasArray()) {
checksum.update(buffer.array(), buffer.position() + buffer.arrayOffset(),
buffer.remaining());
} else {
@sodonnel (Contributor) commented Sep 26, 2023

As checksum.update() accepts a buffer, why do you need to copy the incoming buffer contents into a new (cached) buffer and then pass that new buffer to checksum.update()? Could we not pass buffer directly to checksum.update()?

@Cyrill (Contributor, Author)

Unfortunately not. The code here and in ChecksumByteBufferImpl is merely a copy of checksum.update(), and the latter is prone to the same issue: buffer.hasArray() returns false when the buffer is read-only (which is our case).
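To see this behaviour concretely, here is a minimal JDK-only sketch (ReadOnlyBufferDemo is a made-up name for illustration, not Ozone code): a heap buffer exposes its backing array, but a read-only view of the same data hides it, which is exactly why the fast array-based checksum path is skipped.

```java
import java.nio.ByteBuffer;

public class ReadOnlyBufferDemo {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap(new byte[]{1, 2, 3, 4});
        // A plain heap buffer exposes its backing array.
        System.out.println("heap.hasArray()     = " + heap.hasArray());      // true
        // A read-only view of the same data hides the array, so hasArray()
        // turns false and code falls through to the allocate-and-copy branch.
        ByteBuffer readOnly = heap.asReadOnlyBuffer();
        System.out.println("readOnly.hasArray() = " + readOnly.hasArray()); // false
    }
}
```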

@sodonnel (Contributor) commented Sep 26, 2023

I recall an earlier conversation we had on Slack, where you pointed out that the reason the current code does not work well is that it does this:

  public void update(ByteBuffer buffer) {
    if (buffer.hasArray()) {
      checksum.update(buffer.array(), buffer.position() + buffer.arrayOffset(),
          buffer.remaining());
    } else {
      byte[] b = new byte[buffer.remaining()];
      buffer.get(b);
      checksum.update(b, 0, b.length);
    }
  }

buffer.hasArray() always returns false if the buffer is read-only, and in our case the buffer is created by a call to writeChunk.getData().asReadOnlyByteBufferList(), which is ByteString.asReadOnlyByteBufferList().

Looking inside KeyValueHandler.java where the chunk writes happen on the Datanode and where the checksum validation happens, the code looks like:

      WriteChunkRequestProto writeChunk = request.getWriteChunk();
      BlockID blockID = BlockID.getFromProtobuf(writeChunk.getBlockID());
      ContainerProtos.ChunkInfo chunkInfoProto = writeChunk.getChunkData();
      ChunkInfo chunkInfo = ChunkInfo.getFromProtoBuf(chunkInfoProto);
      Preconditions.checkNotNull(chunkInfo);

      ChunkBuffer data = null;
      if (dispatcherContext == null) {
        dispatcherContext = new DispatcherContext.Builder().build();
      }
      WriteChunkStage stage = dispatcherContext.getStage();
      if (stage == WriteChunkStage.WRITE_DATA ||
          stage == WriteChunkStage.COMBINED) {
        data =
            ChunkBuffer.wrap(writeChunk.getData().asReadOnlyByteBufferList());
        validateChunkChecksumData(data, chunkInfo);

We can see that it pulls the data from the proto as a ByteString and then creates the asReadOnlyByteBufferList, which basically copies the proto into a ByteBuffer.

Then inside validateChunkChecksumData, the code looks like:

  private void validateChunkChecksumData(ChunkBuffer data, ChunkInfo info)
      throws StorageContainerException {
    if (validateChunkChecksumData) {
      try {
        Checksum.verifyChecksum(data.toByteString(byteBufferToByteString),
            info.getChecksumData(), 0);
      } catch (OzoneChecksumException ex) {
        throw ChunkUtils.wrapInStorageContainerException(ex);
      }
    }
  }

And then:

  public static boolean verifyChecksum(ByteString byteString,
      ChecksumData checksumData, int startIndex) throws OzoneChecksumException {
    final ByteBuffer buffer = byteString.asReadOnlyByteBuffer();
    return verifyChecksum(buffer, checksumData, startIndex);
  }

So we start with a ByteString in the protobuf. Then we convert that into a ByteBuffer, back to a ByteString, back to a ByteBuffer, and then, in this PR, a final copy from ByteBuffer to ByteBuffer. (Edit: this PR actually makes things better, as it avoids re-allocating the final byte buffer each time, but it feels like there are other improvements needed in the flow to the code changed here!)

Were you seeing excessive memory usage on the DN side when writing chunks, and with chunk.data.validation.check=true? (The default is false, which is frankly crazy and will result in data loss some day.)

@sodonnel (Contributor)

> So we start with a byteString in the protobuf. Then we convert that into a ByteBuffer, back to a byteString and back to a ByteBuffer and then in this PR a final copy from ByteBuffer to ByteBuffer!
>
> Were you seeing excessive memory usage on the DN side when writing chunks and with chunk.data.validation.check=true (the default is false, which is frankly crazy and will result in data loss some day).

Ok - so it seems that ByteString.asReadOnlyByteBufferList() simply wraps the internal ByteBuffer inside the ByteString.

Then data.toByteString(byteBufferToByteString) uses this function:

    byteBufferToByteString =
        ByteStringConversion.createByteBufferConversion(isUnsafeByteBufferConversionEnabled);

isUnsafeByteBufferConversionEnabled defaults to true, so this again creates a ByteString which wraps the ByteBuffer and doesn't copy. So it seems like all these copies are not copies at all.
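The wrap-rather-than-copy behaviour can be illustrated with plain ByteBuffers (a JDK-only analogy, not the actual ByteString internals; WrapVsCopyDemo is a made-up name): a read-only view shares storage with its source, so a mutation through the original array is visible through the view, proving no copy was made.

```java
import java.nio.ByteBuffer;

public class WrapVsCopyDemo {
    public static void main(String[] args) {
        byte[] backing = new byte[]{1, 2, 3};
        ByteBuffer original = ByteBuffer.wrap(backing);
        // A read-only view shares the backing array -- analogous to how the
        // unsafe ByteString conversion wraps the buffer instead of copying it.
        ByteBuffer view = original.asReadOnlyBuffer();
        backing[0] = 42;                  // mutate through the backing array
        System.out.println(view.get(0));  // 42: the view saw the change, so no copy happened
    }
}
```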

@sodonnel (Contributor)

I think it would be worth benchmarking this code:

      while (buffer.hasRemaining()) {
        checksum.update(buffer.get());
      }

vs

      withCachedBuffer(buffer.capacity(), cachedBuffer -> {
        cachedBuffer.put(buffer);
        cachedBuffer.flip();
        checksum.update(cachedBuffer);
      });

The former may be as fast as the latter, because:

  1. The checksum has to be updated byte by byte anyway.
  2. It avoids copying the source bytes into a new (cached) buffer and then applying each byte in turn to the checksum.

If both are the same speed, the former avoids any new allocations.
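A rough harness for that comparison could look like the sketch below (ChecksumCopyBenchmark and the 4 MB size are made up for illustration; this is not the benchmark actually run in this thread, and a serious measurement should use JMH to avoid JIT warm-up artifacts):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class ChecksumCopyBenchmark {
    private static final int SIZE = 4 * 1024 * 1024; // 4 MB per op (assumption)
    private static final ByteBuffer SOURCE =
        ByteBuffer.wrap(new byte[SIZE]).asReadOnlyBuffer(); // read-only, like our case
    private static final ByteBuffer CACHED = ByteBuffer.allocate(SIZE);

    static long byteByByte() {
        CRC32 crc = new CRC32();
        ByteBuffer b = SOURCE.duplicate();
        while (b.hasRemaining()) {
            crc.update(b.get());          // one update call per byte
        }
        return crc.getValue();
    }

    static long viaCachedBuffer() {
        CRC32 crc = new CRC32();
        ByteBuffer b = SOURCE.duplicate();
        CACHED.clear();
        CACHED.put(b);                    // one bulk copy into the reused buffer
        CACHED.flip();
        crc.update(CACHED.array(), 0, CACHED.remaining()); // one bulk update
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Warm up, then time one iteration of each variant.
        for (int i = 0; i < 3; i++) { byteByByte(); viaCachedBuffer(); }
        long t0 = System.nanoTime();
        long c1 = byteByByte();
        long t1 = System.nanoTime();
        long c2 = viaCachedBuffer();
        long t2 = System.nanoTime();
        System.out.printf("byte-by-byte: %d ms, cached copy: %d ms, same crc: %b%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, c1 == c2);
    }
}
```

Both variants must produce the same CRC value; only the timings differ.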

@sodonnel (Contributor)

I benchmarked something similar to the above with some code I added in #1910 but has since been removed.

The byte by byte approach is very slow. In my test it was at about 70 ops per second.

Using the original code, assuming hasArray() returns true, gives about 2000 ops per second.

Then I changed my code to have a second byteBuffer. Copied the data into it and then ran the test and it dropped to about 1850 ops per second. This was with an Indirect buffer.

Finally I tried again with a "Direct" buffer (private static ByteBuffer data3 = DirectByteBufferAllocator.INSTANCE.allocate(1024 * 1024)); this was actually slower, at 1400 ops per second. I have no idea why the direct buffer is slower than the indirect one. This was run on my laptop (Apple M1, Java 11), so different OS/Java combinations may vary.

@jojochuang Suggested there may be some reflection hacks to get access to the underlying byte array inside the ByteString, which could allow us to avoid copying the data at all.
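The general shape of such a reflection hack can be sketched against a stand-in class (FakeByteString below is made up; the real protobuf ByteString keeps its data in a private byte[] field, but the field name, access rules, and the actual approach taken in #3759 are implementation details not reproduced here):

```java
import java.lang.reflect.Field;

public class ReflectionUnwrapSketch {
    // Stand-in for a ByteString-like class that hides its backing array.
    static final class FakeByteString {
        private final byte[] bytes;
        FakeByteString(byte[] bytes) { this.bytes = bytes; }
    }

    static byte[] unwrap(FakeByteString bs) throws ReflectiveOperationException {
        Field f = FakeByteString.class.getDeclaredField("bytes");
        f.setAccessible(true);    // may be blocked by the module system for foreign classes
        return (byte[]) f.get(bs); // returns the same array, zero copies
    }

    public static void main(String[] args) throws Exception {
        FakeByteString bs = new FakeByteString(new byte[]{10, 20, 30});
        byte[] raw = unwrap(bs);
        System.out.println(raw.length); // 3 -- no defensive copy was made
    }
}
```

The appeal is that the checksum can then run over the raw array at full speed; the cost is a dependency on private internals that may break across protobuf versions.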

@jojochuang (Contributor)

It's a really silly hack though... #3759

@sodonnel (Contributor)

@jojochuang You mentioned you still noticed some overhead on an NVMe cluster; do you recall what that was? I guess the checksum calculation itself would still have overhead, but this "hack" you mentioned avoids the memory allocation and an extra copy, so it makes the checksum calculation as fast as the algorithm allows.

I guess the write rate onto NVMe may be so high that the checksum overhead is significant. In my benchmarks, I was getting about 2000 ops per second on a single thread, calculating 1M bytes per checksum on 4MB of data, which is about 8000 MB/s per thread.

The hack you suggested may not be perfect, but it's a lot better than what we have currently with the extra copy!

@sodonnel (Contributor)

@Cyrill Could you test with the change in #3759 and see how performance looks?

Also, are you seeing this allocation problem on the DN and have you turned on DN side checksum validation (chunk.data.validation.check)?

@Cyrill (Contributor, Author) commented Sep 28, 2023

So I tested with and without the chunk.data.validation.check property, and I clearly see the effect.

This is the test run without the DN checksum check:
[screenshot: allocation profile without the DN checksum check]

This is the test run with the DN checksum check:
[screenshot: allocation profile with the DN checksum check]

Total allocated byte[] memory is nearly twice as large (665 GB) as when the property is turned off (381 GB).

@Cyrill (Contributor, Author) commented Sep 28, 2023

Also I checked the reflection fix:
[screenshot: allocation profile with the reflection fix]

The results are similar to the cached-buffer fix, though they look more stable.

Here is what we had with the cached buffers:
[screenshot: allocation profile with the cached buffers]

@Cyrill (Contributor, Author) commented Sep 28, 2023

@sodonnel ^^ added test results

@sodonnel (Contributor)

So with either the cached buffers or the reflection fix, the GC is much better. I'd expect the write performance to be better with the reflection fix, as it avoids copying all the data from one buffer to another. Did you notice any impact there?

@Cyrill (Contributor, Author) commented Sep 29, 2023

The observed behaviour is not significantly different when performing writes, since the number of concurrent write operations is lower than for reads.
But GC is still hitting more frequently without the fix:
[screenshot: GC activity without the fix]

Now with the reflection fix:
[screenshot: GC activity with the reflection fix]

@Cyrill (Contributor, Author) commented Oct 2, 2023

@sodonnel
As a summary of all tests that we ran with and without the fix:

  • stable +3% total throughput
  • DN CPU usage during writes decreased by 20-30%
  • S3G CPU usage during reads decreased by 30%
  • fewer newgen cleanups
  • almost got rid of Full GC

Bottom line, the performance looks way better with the fix.
We can go with the reflection one, since it is way simpler than the cached one (this PR).
What would be the next steps to get this change into master?

@sodonnel (Contributor) commented Oct 2, 2023

@Cyrill I've kicked off the build on #3759. @jojochuang Any reason not to commit your change? I see you mentioned there was still some overhead on NVMe, but it makes things a lot better than they are currently, so I think it's worth committing, and we can see what else can be done in the future.

@sodonnel (Contributor) commented Oct 3, 2023

There are some test failures in #3759 - see my last couple of comments on that PR.

@sodonnel (Contributor)

#3759 is now merged, so I think we can close this one?

@Cyrill Cyrill closed this Oct 11, 2023
@Cyrill Cyrill deleted the HDDS-7228 branch October 11, 2023 10:42