perf: use 8KB buffer for local ObjectWriter #5907
wkalt wants to merge 1 commit into lance-format:main
Conversation
We pick the buffer size for object writers according to caller configuration, defaulting to 5MB in order to guarantee a multipart write in object storage. For local storage, the 5MB buffer is not applicable and can be wasteful if many writers are open simultaneously. We encounter that situation during the shuffle stage of an IVF-PQ index build. Change the object writer to use an 8KB buffer when the object store in use is local.
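The selection logic described above can be sketched as follows (the constant names and function signature are illustrative, not the actual lance code):

```rust
// Minimum part size S3 requires for a multipart upload; the default
// buffer must be at least this large for cloud object stores.
const MULTIPART_MIN_PART_SIZE: usize = 5 * 1024 * 1024; // 5 MiB
// Small buffer for local filesystems, where multipart semantics don't apply.
const LOCAL_BUFFER_SIZE: usize = 8 * 1024; // 8 KiB

/// Pick the writer's initial buffer capacity: honor an explicit caller
/// configuration, otherwise choose by store type.
fn initial_capacity(configured: Option<usize>, is_local: bool) -> usize {
    match configured {
        Some(n) => n,
        None if is_local => LOCAL_BUFFER_SIZE,
        None => MULTIPART_MIN_PART_SIZE,
    }
}

fn main() {
    assert_eq!(initial_capacity(None, true), 8 * 1024);
    assert_eq!(initial_capacity(None, false), 5 * 1024 * 1024);
    // An explicit configuration always wins, local or not.
    assert_eq!(initial_capacity(Some(1024), true), 1024);
    println!("ok");
}
```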
@westonpace I see a promising improvement in the memory usage pattern during an IVF-PQ index build, related to some stuff you are working on. I don't yet fully understand why the previous code grows from peak to peak, though.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
I'm surprised this has much impact at all, as the file writer is going to do its own buffering and shouldn't really be sending small writes to the object writer in the first place. I don't think we can easily get rid of the file writer's default buffering, as we want to avoid tiny pages for read performance reasons. However, we could override the file writer's default buffering in the shuffler when we open the file writers, because we don't care that much about read performance there (since we are only going to read it once to write the final index).
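A rough sketch of that idea: a file writer whose page buffer size is a knob the caller can shrink for write-once files like shuffle output. The `page_size` field and type names here are hypothetical, not lance's actual file writer API:

```rust
/// Toy file writer that accumulates bytes and "flushes" a page whenever
/// `page_size` bytes are pending. A shuffler could construct it with a
/// small page_size: more, smaller pages (worse for reads, fine for a
/// shuffle file that is read exactly once).
struct BufferedFileWriter {
    page_size: usize,
    pending: Vec<u8>,
    pages_flushed: usize,
}

impl BufferedFileWriter {
    fn with_page_size(page_size: usize) -> Self {
        Self { page_size, pending: Vec::new(), pages_flushed: 0 }
    }

    fn write(&mut self, data: &[u8]) {
        self.pending.extend_from_slice(data);
        while self.pending.len() >= self.page_size {
            // In a real writer this would encode and emit a page; here we
            // just drop the bytes and count the flush.
            self.pending.drain(..self.page_size).for_each(drop);
            self.pages_flushed += 1;
        }
    }
}

fn main() {
    // A deliberately small page size, as the shuffler might choose.
    let mut w = BufferedFileWriter::with_page_size(1024);
    w.write(&[0u8; 4096]);
    assert_eq!(w.pages_flushed, 4);
    println!("pages flushed: {}", w.pages_flushed);
}
```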
```rust
    is_local: bool,
) -> Bytes {
    let new_capacity = if is_local {
        8 * 1024 // 8 KB for local filesystem
```
Is this going to chop up large writes into tiny 8KiB writes? From a syscall perspective that maybe isn't the best approach. We should probably just send the entire buffer to the OS?
hmm, actually this is way worse than I thought. It is going to do a simulated multipart write on the local FS.
I prototyped a specialized local writer here that doesn't do the multipart simulation. I didn't see an improvement in write throughput, so left it aside, but feel free to play around with it. wjones127@7d7e30a
this works perfectly 👍
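The gist of such a specialized local writer might look like the following. This is a minimal sketch built on `std::io::BufWriter`, not the linked prototype: `BufWriter` coalesces small writes into its 8 KiB buffer, while writes larger than the buffer bypass it and go straight to the OS, which also avoids chopping large writes into tiny syscalls.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

/// Sketch of a local object writer that streams bytes through a small
/// BufWriter instead of accumulating 5MB parts for a simulated
/// multipart upload.
struct LocalObjectWriter {
    inner: BufWriter<File>,
    bytes_written: u64,
}

impl LocalObjectWriter {
    fn create(path: &std::path::Path) -> std::io::Result<Self> {
        let file = File::create(path)?;
        Ok(Self {
            // 8 KiB buffer: small writes are coalesced, writes larger
            // than the buffer pass straight through to the OS.
            inner: BufWriter::with_capacity(8 * 1024, file),
            bytes_written: 0,
        })
    }

    fn write(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.inner.write_all(data)?;
        self.bytes_written += data.len() as u64;
        Ok(())
    }

    /// Flush any remaining buffered bytes and return the total size.
    fn finish(mut self) -> std::io::Result<u64> {
        self.inner.flush()?;
        Ok(self.bytes_written)
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("local_writer_demo.bin");
    let mut w = LocalObjectWriter::create(&path)?;
    w.write(&[0u8; 100])?; // small write: buffered
    w.write(&vec![1u8; 64 * 1024])?; // large write: passes through
    let n = w.finish()?;
    assert_eq!(n, 100 + 64 * 1024);
    println!("wrote {} bytes", n);
    Ok(())
}
```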
@westonpace here is what the heap dumps show:

main: [heap profile screenshot]

patch: [heap profile screenshot]

Those allocations in main are all the 5MB buffers in ObjectWriter::new, one per partition (these are during IVF shuffle for a 100M row dataset). I agree reducing the page size to 8k is not good. Do you have any thoughts on how best to accomplish this? Change ObjectWriter to support that, or bypass ObjectWriter?

I now have a chart with all of my optimizations, with and without this change, so I can disentangle it from the other change that removes the buffer. The difference is much more significant and may indicate something that needs to be looked at.

There are 4096 partitions here, which is 20GB when multiplied by 5MB.

edit: There was some time between those last two thoughts... given the 20GB, this result actually seems reasonable (for the 5MB setting), and the initial allocation/increase is the buffers becoming resident.
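The arithmetic checks out exactly: 4096 partition writers, each holding a 5 MiB buffer, is precisely 20 GiB:

```rust
// Back-of-the-envelope check of the partition-buffer memory above.
fn main() {
    let partitions: u64 = 4096;
    let buffer: u64 = 5 * 1024 * 1024; // 5 MiB default buffer per writer
    let total = partitions * buffer;
    assert_eq!(total, 20 * 1024 * 1024 * 1024); // exactly 20 GiB
    println!("{} GiB held in writer buffers", total / (1024 * 1024 * 1024));
}
```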
superseded by #5939

