Adding buffer pools for stream decompression #69
Merged
Viq111 merged 3 commits into DataDog:1.x on Sep 25, 2019
Conversation
Digging into pprof showed a lot of time spent in allocations and in GC. The allocations seemed to be mostly concentrated in `NewReaderDict`, where two []byte slices were allocated. The sizes of those slices are based on compile-time decisions in the zstd C library. In an attempt to improve that situation, two changes were made:
* The sizes of those buffers are read, cast, and checked once.
* The buffers come from a pair of `sync.Pool`s in the `NewReaderDict` factory method and are returned to the pool in `reader.Close` (see the sketch just below).
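For illustration, here is a minimal sketch of the pooling pattern, assuming placeholder buffer sizes; the package, pool, type, and constructor names are stand-ins, and in the real library the sizes come from the zstd C API (e.g. `ZSTD_DStreamInSize` / `ZSTD_DStreamOutSize`) rather than constants:
```
// Minimal sketch only, not the library's actual code.
package zstdpool

import "sync"

const (
	srcBufSize = 128 << 10 // placeholder for the size reported by the C library
	dstBufSize = 128 << 10 // placeholder for the size reported by the C library
)

var (
	// One pool per buffer kind; New allocates a slice of the checked size
	// only when the pool is empty.
	srcPool = sync.Pool{New: func() interface{} { return make([]byte, srcBufSize) }}
	dstPool = sync.Pool{New: func() interface{} { return make([]byte, dstBufSize) }}
)

// reader keeps the borrowed scratch buffers for the lifetime of one stream.
type reader struct {
	src, dst []byte
}

// newReader borrows its buffers from the pools instead of allocating fresh
// slices on every call, which removes most of the per-stream allocations.
func newReader() *reader {
	return &reader{
		src: srcPool.Get().([]byte),
		dst: dstPool.Get().([]byte),
	}
}
```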
Using the following test:
```
$ PAYLOAD=zstd_decompress.c go test \
-bench BenchmarkStreamDecompression \
-benchtime=5s \
-benchmem
```
The number of bytes allocated per decompression dropped to roughly 1/705 of its previous value:
```
pkg: github.com/TriggerMail/zstd
BenchmarkStreamDecompression 100000 121617 ns/op 595.48 MB/s 384 B/op 10 allocs/op
pkg: github.com/DataDog/zstd
BenchmarkStreamDecompression 50000 156142 ns/op 463.81 MB/s 270640 B/op 10 allocs/op
```
Note that this improvement covers only the Go-managed heap memory, not the additional heap allocations made on the C side when creating the zstd context. Because those are freed explicitly in `reader.Close`, the remaining concern for them is heap fragmentation rather than GC pressure.
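Continuing the earlier sketch (same illustrative names), the Close side returns the Go buffers to their pools; the real `reader.Close` would additionally free the C-side decompression context explicitly:
```
// Close returns the pooled buffers so later readers can reuse them. In the
// real reader it would also release the zstd C decompression context
// explicitly (omitted here).
func (r *reader) Close() error {
	srcPool.Put(r.src)
	dstPool.Put(r.dst)
	r.src, r.dst = nil, nil
	return nil
}
```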
Viq111
approved these changes
Sep 25, 2019
Collaborator
Viq111
left a comment
Great change!
Indeed, I ran the benchmark locally as well (fyi, you can use benchstat to compare before/after results easily), and it looks promising for both small and large payloads:
Small (zstd_decompress.c):
```
name                    old time/op    new time/op    delta
StreamDecompression-4     198µs ± 9%     120µs ± 3%  -39.76%  (p=0.000 n=10+10)

name                    old speed      new speed      delta
StreamDecompression-4   366MB/s ±10%   606MB/s ± 3%  +65.50%  (p=0.000 n=10+10)
```
Big (mr):
```
name                    old time/op    new time/op    delta
StreamDecompression-4    24.4ms ± 3%    23.0ms ± 2%   -5.77%  (p=0.000 n=9+9)

name                    old speed      new speed      delta
StreamDecompression-4    409MB/s ± 3%   434MB/s ± 2%   +6.11%  (p=0.000 n=9+9)
```
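For anyone wanting to reproduce this kind of comparison, a possible workflow (the exact flags here are an assumption, not necessarily what was used above) is to record the benchmark several times before and after the change and feed both files to benchstat:
```
$ go get golang.org/x/perf/cmd/benchstat
$ PAYLOAD=zstd_decompress.c go test -bench BenchmarkStreamDecompression \
    -benchmem -count 10 > old.txt
# apply the change, then:
$ PAYLOAD=zstd_decompress.c go test -bench BenchmarkStreamDecompression \
    -benchmem -count 10 > new.txt
$ benchstat old.txt new.txt
```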
Collaborator
Actually, since we now use a sync mechanism on the decompression path, do you mind adding a parallel test to your PR? Something like:
```
func TestStreamCompressionDecompressionParallel(t *testing.T) {
	for i := 0; i < 1000; i++ {
		t.Run("", func(t2 *testing.T) {
			t2.Parallel()
			TestStreamCompressionDecompression(t2)
		})
	}
}
```
should be sufficient.
Contributor
Author
Test added. Ran successfully with