HDDS-7228. Avoid excessive memory consumption in checksum calculation #5364
Conversation
Thanks @Cyrill for the patch. Please fix this checkstyle issue: https://github.com/Cyrill/ozone/actions/runs/6313525138/job/17141782688#step:7:11
```java
if (buffer.hasArray()) {
  checksum.update(buffer.array(), buffer.position() + buffer.arrayOffset(),
      buffer.remaining());
} else {
```
As checksum.update() accepts a buffer, why do you need to copy the incoming buffer contents into a new (cached) buffer and then pass that new buffer to checksum.update()? Could we not pass buffer directly to checksum.update()?
Unfortunately not. The code here and in ChecksumByteBufferImpl is merely a copy of checksum.update(), and the latter is prone to the same issue: buffer.hasArray() returns false when the buffer is read-only (which is our case).
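To illustrate why hasArray() fails here, a minimal plain-JDK sketch (not the Ozone classes): a read-only view of a heap buffer reports hasArray() as false, because array() would expose the mutable backing array.

```java
import java.nio.ByteBuffer;

public class ReadOnlyHasArrayDemo {
  public static void main(String[] args) {
    ByteBuffer heap = ByteBuffer.wrap(new byte[16]);
    System.out.println(heap.hasArray());      // true: backing array is accessible

    ByteBuffer readOnly = heap.asReadOnlyBuffer();
    System.out.println(readOnly.hasArray());  // false: array() would allow writes
    // Calling readOnly.array() here would throw ReadOnlyBufferException.
  }
}
```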
I recall an earlier conversation we had on Slack, where you pointed out that the reason the current code does not work well is because it does this:

Looking inside, we can see that it pulls the data from the proto as a ByteString and then creates the asReadOnlyByteBufferList, which basically copies the proto into a ByteBuffer.

So we start with a ByteString in the protobuf. Then we convert that into a ByteBuffer, back to a ByteString, and back to a ByteBuffer, and then in this PR a final copy from ByteBuffer to ByteBuffer (edit: this PR actually makes it better, as it avoids re-allocating the final byte buffer each time, but it feels like there are other improvements needed in the flow to the code changed here)!

Were you seeing excessive memory usage on the DN side when writing chunks and with chunk.data.validation.check=true? (The default is false, which is frankly crazy and will result in data loss some day.)
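A rough illustration of the round trip being described (the method flow and names here are assumptions for illustration, not the actual Ozone call sites):

```java
import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;

class ConversionChainSketch {
  static void illustrate(ByteString data) {         // 1. bytes arrive inside the proto
    ByteBuffer ro = data.asReadOnlyByteBuffer();    // 2. wrapped as a read-only ByteBuffer
    ByteString again = ByteString.copyFrom(ro);     // 3. copied back into a ByteString
    ByteBuffer ro2 = again.asReadOnlyByteBuffer();  // 4. wrapped again for checksumming
    // 5. ...and finally copied once more into the (now cached) scratch
    //    buffer before Checksum.update() runs over it.
  }
}
```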
I think it would be worth benchmarking feeding the checksum byte by byte straight from the read-only buffer vs copying the data out and doing a single bulk update, as the former may be as fast as the latter. If both are the same speed, the former avoids any new allocations.
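For concreteness, a sketch of the two candidates, using java.util.zip.CRC32 as a stand-in for the actual Ozone checksum implementations:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class ChecksumApproaches {

  // Former: update the checksum byte by byte straight from the read-only
  // buffer - no allocations, but one update() call per byte.
  static long byteByByte(ByteBuffer buf) {
    CRC32 crc = new CRC32();
    for (int i = buf.position(); i < buf.limit(); i++) {
      crc.update(buf.get(i));
    }
    return crc.getValue();
  }

  // Latter: copy into a scratch array first, then do one bulk update.
  static long copyThenBulkUpdate(ByteBuffer buf, byte[] scratch) {
    CRC32 crc = new CRC32();
    int len = buf.remaining();
    buf.duplicate().get(scratch, 0, len);  // duplicate() keeps buf's position intact
    crc.update(scratch, 0, len);
    return crc.getValue();
  }
}
```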
I benchmarked something similar to the above with some code I added in #1910, but it has since been removed. The byte-by-byte approach is very slow: in my test it ran at about 70 ops per second. Using the original code, assuming hasArray() returns true, gives about 2000 ops per second. Then I changed my code to have a second ByteBuffer, copied the data into it, and ran the test again; it dropped to about 1850 ops per second. This was with an indirect buffer. Finally, I tried again with a "Direct" buffer.

@jojochuang suggested there may be some reflection hacks to get access to the underlying byte array inside the ByteString, which could allow us to avoid copying the data at all.
It's a really silly hack though... #3759
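The actual change lives in #3759; as a rough idea of the kind of hack being discussed (the field name and fallback below are assumptions about protobuf's LiteralByteString internals and may break across protobuf versions):

```java
import com.google.protobuf.ByteString;
import java.lang.reflect.Field;

final class ByteStringUnwrapSketch {

  // Try to read the private byte[] out of a LiteralByteString without copying.
  static byte[] tryUnwrap(ByteString bs) {
    try {
      Field f = bs.getClass().getDeclaredField("bytes");
      f.setAccessible(true);
      return (byte[]) f.get(bs);  // zero-copy; callers must treat it as read-only
    } catch (ReflectiveOperationException e) {
      return bs.toByteArray();    // fall back to the usual defensive copy
    }
  }
}
```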
@jojochuang You mentioned you still noticed some overhead on an NVMe cluster - do you recall what that was? I guess the checksum calculation itself would still have overhead, but this "hack" you mentioned would avoid the memory allocation and an extra copy, so it makes the checksum calculation as fast as the algorithm allows. I guess the write rate onto NVMe may be so high that the checksum overhead is significant. In my benchmarks, I was getting about 2000 ops per second on a single thread, calculating 1M bytes per checksum on 4MB of data, which is about 8000 MB/s per thread. The hack you suggested may not be perfect, but it's a lot better than what we have currently with the extra copy!
@sodonnel ^^ added test results
So with either the cached buffers or the reflection fix, the GC is much better. I'd expect the write performance to be better with the reflection fix, as it avoids copying all the data from one buffer to another - did you notice any impact there?
@sodonnel Bottom line, the performance looks way better with the fix.
@Cyrill I've kicked off the build on #3759 - @jojochuang Any reason not to commit your change? I see you mentioned there was still some overhead on NVMe, but it makes things a lot better than they are currently, so I think it's worth committing and we can see what else can be done in the future.
There are some test failures in #3759 - see my last couple of comments on that PR.
#3759 is now merged, so I think we can close this one?
What changes were proposed in this pull request?
Cache byte buffers for checksum calculation to reduce GC pressure.
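A minimal sketch of the caching idea, assuming a per-thread scratch buffer that grows to the largest chunk seen (illustrative names, not the exact classes in the patch):

```java
import java.nio.ByteBuffer;

final class ScratchBufferCache {

  private static final ThreadLocal<ByteBuffer> CACHE =
      ThreadLocal.withInitial(() -> ByteBuffer.allocate(0));

  // Return a reusable buffer with at least the requested capacity,
  // allocating (and remembering) a larger one only when needed.
  static ByteBuffer get(int capacity) {
    ByteBuffer buf = CACHE.get();
    if (buf.capacity() < capacity) {
      buf = ByteBuffer.allocate(capacity);
      CACHE.set(buf);
    }
    buf.clear();
    return buf;
  }
}
```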
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7228
How was this patch tested?
Performance tests.