Conversation

@Cyrill (Contributor) commented Sep 26, 2023

What changes were proposed in this pull request?

Cache byte buffers for checksum calculation to reduce GC pressure.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7228

How was this patch tested?

Performance tests.

@adoroszlai (Contributor)

Thanks @Cyrill for the patch. Please fix this checkstyle issue:

hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/common/CRC32ByteBufferImpl.java
 27: Missing a Javadoc comment.

https://github.com/Cyrill/ozone/actions/runs/6313525138/job/17141782688#step:7:11

if (buffer.hasArray()) {
checksum.update(buffer.array(), buffer.position() + buffer.arrayOffset(),
buffer.remaining());
} else {
@sodonnel (Contributor) commented Sep 26, 2023

As checksum.update() accepts a buffer, why do you need to copy the incoming buffer contents into a new (cached) buffer and then pass that new buffer to checksum.update()? Could we not pass buffer directly to checksum.update()?

@Cyrill (Contributor, Author)

Unfortunately not. The code here and in ChecksumByteBufferImpl is merely a copy of checksum.update(), and the latter is prone to the same issue: buffer.hasArray() returns false when the buffer is read-only (which is our case).
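To see this behaviour concretely, here is a minimal JDK-only sketch (ReadOnlyBufferDemo is a made-up name for illustration, not Ozone code): a heap buffer exposes its backing array, but a read-only view of the same data hides it, which is exactly why the fast array-based checksum path is skipped.

```java
import java.nio.ByteBuffer;

public class ReadOnlyBufferDemo {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap(new byte[]{1, 2, 3, 4});
        // A plain heap buffer exposes its backing array.
        System.out.println("heap.hasArray()     = " + heap.hasArray());      // true
        // A read-only view of the same data hides the array, so hasArray()
        // turns false and code falls through to the allocate-and-copy branch.
        ByteBuffer readOnly = heap.asReadOnlyBuffer();
        System.out.println("readOnly.hasArray() = " + readOnly.hasArray()); // false
    }
}
```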

@sodonnel (Contributor) commented Sep 26, 2023

I recall an earlier conversation we had on Slack, where you pointed out that the reason the current code does not work well is that it does this:

  public void update(ByteBuffer buffer) {
    if (buffer.hasArray()) {
      checksum.update(buffer.array(), buffer.position() + buffer.arrayOffset(),
          buffer.remaining());
    } else {
      byte[] b = new byte[buffer.remaining()];
      buffer.get(b);
      checksum.update(b, 0, b.length);
    }
  }

buffer.hasArray() always returns false if the buffer is read-only, and in our case the buffer is created by a call to writeChunk.getData().asReadOnlyByteBufferList(), which is ByteString.asReadOnlyByteBufferList().

Looking inside KeyValueHandler.java where the chunk writes happen on the Datanode and where the checksum validation happens, the code looks like:

      WriteChunkRequestProto writeChunk = request.getWriteChunk();
      BlockID blockID = BlockID.getFromProtobuf(writeChunk.getBlockID());
      ContainerProtos.ChunkInfo chunkInfoProto = writeChunk.getChunkData();
      ChunkInfo chunkInfo = ChunkInfo.getFromProtoBuf(chunkInfoProto);
      Preconditions.checkNotNull(chunkInfo);

      ChunkBuffer data = null;
      if (dispatcherContext == null) {
        dispatcherContext = new DispatcherContext.Builder().build();
      }
      WriteChunkStage stage = dispatcherContext.getStage();
      if (stage == WriteChunkStage.WRITE_DATA ||
          stage == WriteChunkStage.COMBINED) {
        data =
            ChunkBuffer.wrap(writeChunk.getData().asReadOnlyByteBufferList());
        validateChunkChecksumData(data, chunkInfo);

We can see that it pulls the data from the proto as a ByteString and then creates the asReadOnlyByteBufferList, which basically copies the proto into a ByteBuffer.

Then inside validateChunkChecksumData, the code looks like:

  private void validateChunkChecksumData(ChunkBuffer data, ChunkInfo info)
      throws StorageContainerException {
    if (validateChunkChecksumData) {
      try {
        Checksum.verifyChecksum(data.toByteString(byteBufferToByteString),
            info.getChecksumData(), 0);
      } catch (OzoneChecksumException ex) {
        throw ChunkUtils.wrapInStorageContainerException(ex);
      }
    }
  }

And then:

  public static boolean verifyChecksum(ByteString byteString,
      ChecksumData checksumData, int startIndex) throws OzoneChecksumException {
    final ByteBuffer buffer = byteString.asReadOnlyByteBuffer();
    return verifyChecksum(buffer, checksumData, startIndex);
  }

So we start with a ByteString in the protobuf. Then we convert that into a ByteBuffer, back to a ByteString, back to a ByteBuffer, and then, in this PR, a final copy from ByteBuffer to ByteBuffer. (Edit: this PR actually makes things better, as it avoids re-allocating the final byte buffer each time, but it feels like there are other improvements needed in the flow to the code changed here!)

Were you seeing excessive memory usage on the DN side when writing chunks, and with chunk.data.validation.check=true? (The default is false, which is frankly crazy and will result in data loss some day.)

@sodonnel (Contributor)

> So we start with a byteString in the protobuf. Then we convert that into a ByteBuffer, back to a byteString and back to a ByteBuffer and then in this PR a final copy from ByteBuffer to ByteBuffer!
>
> Were you seeing excessive memory usage on the DN side when writing chunks and with chunk.data.validation.check=true (the default is false, which is frankly crazy and will result in data loss some day).

Ok - so it seems that ByteString.asReadOnlyByteBufferList() simply wraps the internal ByteBuffer inside the ByteString.

Then data.toByteString(byteBufferToByteString) uses this function:

    byteBufferToByteString =
        ByteStringConversion.createByteBufferConversion(isUnsafeByteBufferConversionEnabled);

isUnsafeByteBufferConversionEnabled defaults to true, so this again creates a ByteString which wraps the ByteBuffer and doesn't copy. So it seems like all these copies are not copies at all.
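The wrap-rather-than-copy behaviour can be illustrated with plain ByteBuffers (a JDK-only analogy, not the actual ByteString internals; WrapVsCopyDemo is a made-up name): a read-only view shares storage with its source, so a mutation through the original array is visible through the view, proving no copy was made.

```java
import java.nio.ByteBuffer;

public class WrapVsCopyDemo {
    public static void main(String[] args) {
        byte[] backing = new byte[]{1, 2, 3};
        ByteBuffer original = ByteBuffer.wrap(backing);
        // A read-only view shares the backing array -- analogous to how the
        // unsafe ByteString conversion wraps the buffer instead of copying it.
        ByteBuffer view = original.asReadOnlyBuffer();
        backing[0] = 42;                  // mutate through the backing array
        System.out.println(view.get(0));  // 42: the view saw the change, so no copy happened
    }
}
```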

@sodonnel (Contributor)

I think it would be worth benchmarking this code:

      while (buffer.hasRemaining()) {
        checksum.update(buffer.get());
      }

vs

      withCachedBuffer(buffer.capacity(), cachedBuffer -> {
        cachedBuffer.put(buffer);
        cachedBuffer.flip();
        checksum.update(cachedBuffer);
      });

The former may be as fast as the latter, because:

  1. The checksum has to be updated byte by byte anyway.
  2. It avoids copying the source bytes into a new (cached) buffer and then applying each byte in turn to the checksum.

If both are the same speed, the former avoids any new allocations.
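A rough harness for that comparison could look like the sketch below (ChecksumCopyBenchmark and the 4 MB size are made up for illustration; this is not the benchmark actually run in this thread, and a serious measurement should use JMH to avoid JIT warm-up artifacts):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class ChecksumCopyBenchmark {
    private static final int SIZE = 4 * 1024 * 1024; // 4 MB per op (assumption)
    private static final ByteBuffer SOURCE =
        ByteBuffer.wrap(new byte[SIZE]).asReadOnlyBuffer(); // read-only, like our case
    private static final ByteBuffer CACHED = ByteBuffer.allocate(SIZE);

    static long byteByByte() {
        CRC32 crc = new CRC32();
        ByteBuffer b = SOURCE.duplicate();
        while (b.hasRemaining()) {
            crc.update(b.get());          // one update call per byte
        }
        return crc.getValue();
    }

    static long viaCachedBuffer() {
        CRC32 crc = new CRC32();
        ByteBuffer b = SOURCE.duplicate();
        CACHED.clear();
        CACHED.put(b);                    // one bulk copy into the reused buffer
        CACHED.flip();
        crc.update(CACHED.array(), 0, CACHED.remaining()); // one bulk update
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Warm up, then time one iteration of each variant.
        for (int i = 0; i < 3; i++) { byteByByte(); viaCachedBuffer(); }
        long t0 = System.nanoTime();
        long c1 = byteByByte();
        long t1 = System.nanoTime();
        long c2 = viaCachedBuffer();
        long t2 = System.nanoTime();
        System.out.printf("byte-by-byte: %d ms, cached copy: %d ms, same crc: %b%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, c1 == c2);
    }
}
```

Both variants must produce the same CRC value; only the timings differ.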

@sodonnel (Contributor)

I benchmarked something similar to the above with some code I added in #1910 but has since been removed.

The byte by byte approach is very slow. In my test it was at about 70 ops per second.

Using the original code, assuming hasArray() returns true, gives about 2000 ops per second.

Then I changed my code to have a second byteBuffer. Copied the data into it and then ran the test and it dropped to about 1850 ops per second. This was with an Indirect buffer.

Finally I tried again with a "Direct" buffer (private static ByteBuffer data3 = DirectByteBufferAllocator.INSTANCE.allocate(1024 * 1024)); this was actually slower, at 1400 ops per second. I have no idea why the direct buffer is slower than the indirect one. This was run on my laptop (Apple M1, Java 11), so different OS/Java combinations may vary.

@jojochuang Suggested there may be some reflection hacks to get access to the underlying byte array inside the ByteString, which could allow us to avoid copying the data at all.
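The general shape of such a reflection hack can be sketched against a stand-in class (FakeByteString below is made up; the real protobuf ByteString keeps its data in a private byte[] field, but the field name, access rules, and the actual approach taken in #3759 are implementation details not reproduced here):

```java
import java.lang.reflect.Field;

public class ReflectionUnwrapSketch {
    // Stand-in for a ByteString-like class that hides its backing array.
    static final class FakeByteString {
        private final byte[] bytes;
        FakeByteString(byte[] bytes) { this.bytes = bytes; }
    }

    static byte[] unwrap(FakeByteString bs) throws ReflectiveOperationException {
        Field f = FakeByteString.class.getDeclaredField("bytes");
        f.setAccessible(true);    // may be blocked by the module system for foreign classes
        return (byte[]) f.get(bs); // returns the same array, zero copies
    }

    public static void main(String[] args) throws Exception {
        FakeByteString bs = new FakeByteString(new byte[]{10, 20, 30});
        byte[] raw = unwrap(bs);
        System.out.println(raw.length); // 3 -- no defensive copy was made
    }
}
```

The appeal is that the checksum can then run over the raw array at full speed; the cost is a dependency on private internals that may break across protobuf versions.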

@jojochuang (Contributor)

It's a really silly hack though... #3759

@sodonnel (Contributor)

@jojochuang You mentioned you still noticed some overhead on an NVMe cluster; do you recall what that was? I guess the checksum calculation itself would still have overhead, but this "hack" you mentioned avoids the memory allocation and an extra copy, so it makes the checksum calculation as fast as the algorithm allows.

I guess the write rate onto NVMe may be so high that the checksum overhead is significant. In my benchmarks, I was getting about 2000 ops per second on a single thread, calculating 1M bytes per checksum on 4MB of data, which is about 8000 MB/s per thread.

The hack you suggested may not be perfect, but it's a lot better than what we have currently with the extra copy!

@sodonnel (Contributor)

@Cyrill Could you test with the change in #3759 and see how performance looks?

Also, are you seeing this allocation problem on the DN and have you turned on DN side checksum validation (chunk.data.validation.check)?

@Cyrill (Contributor, Author) commented Sep 28, 2023

So I tested with and without the chunk.data.validation.check property, and I clearly see the effect.

This is the test run without the DN checksum check:
[screenshot: allocation profile without the DN checksum check]

This is the test run with the DN checksum check:
[screenshot: allocation profile with the DN checksum check]

Total allocated byte[] memory is nearly twice as large (665 GB) as when the property is turned off (381 GB).

@Cyrill (Contributor, Author) commented Sep 28, 2023

Also I checked the reflection fix:
[screenshot: allocation profile with the reflection fix]

The results are similar to the cached-buffer fix, though they look more stable.

Here is what we had with the cached buffers:
[screenshot: allocation profile with the cached buffers]

@Cyrill (Contributor, Author) commented Sep 28, 2023

@sodonnel ^^ added test results

@sodonnel (Contributor)

So with either the cached buffers or the reflection fix, the GC is much better. I'd expect the write performance to be better with the reflection fix, as it avoids copying all the data from one buffer to another. Did you notice any impact there?

@Cyrill (Contributor, Author) commented Sep 29, 2023

The observed behaviour is not significantly different when performing writes, since the number of concurrent write operations is lower than for reads.
But GC is still hitting more frequently without the fix:
[screenshot: GC activity without the fix]

Now with the reflection fix:
[screenshot: GC activity with the reflection fix]

@Cyrill (Contributor, Author) commented Oct 2, 2023

@sodonnel
As a summary of all tests that we ran with and without the fix:

  • stable +3% total throughput
  • DN CPU usage during writes decreased by 20-30%
  • S3G CPU usage during reads decreased by 30%
  • fewer newgen cleanups
  • almost got rid of Full GC

Bottom line, the performance looks way better with the fix.
We can go with the reflection one, since it is way simpler than the cached one (this PR).
What would be the next steps to get this change into master?

@sodonnel (Contributor) commented Oct 2, 2023

@Cyrill I've kicked off the build on #3759. @jojochuang Any reason not to commit your change? I see you mentioned there was still some overhead on NVMe, but it makes things a lot better than they are currently, so I think it's worth committing, and we can see what else can be done in the future.

@sodonnel (Contributor) commented Oct 3, 2023

There are some test failures in #3759 - see my last couple of comments on that PR.

@sodonnel (Contributor)

#3759 is now merged, so I think we can close this one?

@Cyrill Cyrill closed this Oct 11, 2023
@Cyrill Cyrill deleted the HDDS-7228 branch October 11, 2023 10:42