
perf: use 8KB buffer for local ObjectWriter #5907

Closed

wkalt wants to merge 1 commit into lance-format:main from wkalt:task/local-writer-smaller-buffer

Conversation

@wkalt
Contributor

@wkalt wkalt commented Feb 7, 2026

We pick the buffer size for object writers according to caller configuration, or default to 5MB in order to guarantee a multipart write in object storage. For local storage, the 5MB buffer is not applicable and can be wasteful if many writers are open simultaneously. We encounter that situation during the shuffle stage of an IVF-PQ index build.

Change the object writer to use an 8KB buffer when the object store in use is local.
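The selection logic described above can be sketched as follows. This is an illustrative sketch, not Lance's actual API: the function and constant names are invented for the example.

```rust
/// Minimum part size that object stores such as S3 require for
/// multipart uploads; using it as the buffer size guarantees each
/// flushed part is a valid multipart write.
const MULTIPART_MIN_PART_SIZE: usize = 5 * 1024 * 1024; // 5 MB

/// A small buffer suffices for local filesystems, which have no
/// multipart minimum and where the OS page cache absorbs small writes.
const LOCAL_BUFFER_SIZE: usize = 8 * 1024; // 8 KB

/// Hypothetical helper: pick the write-buffer capacity for a writer.
fn buffer_capacity(is_local: bool, configured: Option<usize>) -> usize {
    match configured {
        // Caller configuration always wins.
        Some(size) => size,
        // Otherwise choose a default suited to the store.
        None if is_local => LOCAL_BUFFER_SIZE,
        None => MULTIPART_MIN_PART_SIZE,
    }
}

fn main() {
    assert_eq!(buffer_capacity(true, None), 8 * 1024);
    assert_eq!(buffer_capacity(false, None), 5 * 1024 * 1024);
    assert_eq!(buffer_capacity(true, Some(1 << 20)), 1 << 20);
    println!("ok");
}
```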

@wkalt
Contributor Author

wkalt commented Feb 7, 2026

[chart: progress]

@westonpace I see a promising improvement in the memory usage pattern during an IVF-PQ index build. This relates to some of the work you are doing.

I don't yet fully understand why the previous code grows from peak to peak, though.

@codecov

codecov Bot commented Feb 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@westonpace
Member

I'm surprised this has much impact at all as the file writer is going to do its own buffering and shouldn't really be sending small writes to the object writer in the first place. I don't think we can easily get rid of the file writer's default buffering as we want to avoid tiny pages for read performance reasons.

However, we could override the file writer's default buffering in the shuffler when we open the file writers because we don't care that much about read performance (since we are only going to read it once to write the final index).

is_local: bool,
) -> Bytes {
let new_capacity = if is_local {
8 * 1024 // 8 KB for local filesystem
Member

Is this going to chop up large writes into tiny 8KiB writes? From a syscall perspective that may not be the best approach. We should probably just send the entire buffer to the OS?
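A minimal sketch of the pattern being suggested (invented names, not Lance code): flush whatever is buffered and hand any write at least as large as the buffer straight to the underlying sink in one call, rather than copying it through the small buffer in slices. This is also how `std::io::BufWriter` behaves for oversized writes.

```rust
use std::io::{self, Write};

/// Hypothetical writer with a small userspace buffer that forwards
/// large writes directly instead of chopping them up.
struct SmallBufWriter<W: Write> {
    inner: W,
    buf: Vec<u8>,
    capacity: usize,
}

impl<W: Write> SmallBufWriter<W> {
    fn new(inner: W, capacity: usize) -> Self {
        Self { inner, buf: Vec::with_capacity(capacity), capacity }
    }

    fn flush_buf(&mut self) -> io::Result<()> {
        if !self.buf.is_empty() {
            self.inner.write_all(&self.buf)?;
            self.buf.clear();
        }
        Ok(())
    }

    fn write(&mut self, data: &[u8]) -> io::Result<()> {
        if data.len() >= self.capacity {
            // Large write: flush pending bytes, then bypass the buffer
            // entirely so the OS sees one large write, not many small ones.
            self.flush_buf()?;
            self.inner.write_all(data)
        } else {
            if self.buf.len() + data.len() > self.capacity {
                self.flush_buf()?;
            }
            self.buf.extend_from_slice(data);
            Ok(())
        }
    }
}

fn main() -> io::Result<()> {
    let mut w = SmallBufWriter::new(Vec::new(), 8 * 1024);
    w.write(&[0u8; 1024])?; // small write: stays in the buffer
    assert_eq!(w.inner.len(), 0);
    w.write(&vec![1u8; 64 * 1024])?; // large write: flushed + passed through whole
    assert_eq!(w.inner.len(), 1024 + 64 * 1024);
    Ok(())
}
```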

Contributor Author

hmm, actually this is way worse than I thought. It is going to do a simulated multipart write on the local FS.

Contributor

I prototyped a specialized local writer here that doesn't do the multipart simulation. I didn't see an improvement in write throughput, so I set it aside, but feel free to play around with it: wjones127@7d7e30a
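The core idea of such a specialized local writer (a sketch of the concept, not the linked prototype) is to write straight to a file with a small userspace buffer and skip the simulated multipart upload entirely:

```rust
use std::fs::File;
use std::io::{BufWriter, Result, Write};
use std::path::Path;

/// Hypothetical constructor: a plain buffered file writer for local
/// storage, with no multipart simulation in between.
fn local_writer(path: &Path) -> Result<BufWriter<File>> {
    // 8 KB buffer batches tiny writes into reasonable syscalls;
    // BufWriter forwards writes larger than the buffer directly.
    Ok(BufWriter::with_capacity(8 * 1024, File::create(path)?))
}

fn main() -> Result<()> {
    let path = std::env::temp_dir().join("local_writer_demo.bin");
    let mut w = local_writer(&path)?;
    w.write_all(b"hello")?;
    w.flush()?;
    assert_eq!(std::fs::read(&path)?, b"hello");
    std::fs::remove_file(&path)?;
    Ok(())
}
```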

Contributor Author

this works perfectly 👍

@wkalt
Contributor Author

wkalt commented Feb 10, 2026

@westonpace here is what the heap dumps show:

main:

      flat  flat%        cum   cum%  function
20470.93MB  71.5% 20471.43MB  71.5%  lance_io::object_writer::ObjectWriter::new::{{closure}}
 4263.99MB  14.9%  4264.99MB  14.9%  alloc::boxed::Box<T>::new
 2131.86MB   7.4%  6846.11MB  23.9%  <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<...
  811.36MB   2.8%   811.36MB   2.8%  irallocx_prof
  384.00MB   1.3%   384.00MB   1.3%  <lance_file::io::LanceEncodingsIo as lance_encoding::EncodingsIo>::submit_req...
  272.00MB   1.0%   272.00MB   1.0%  bytes::bytes_mut::BytesMut::with_capacity
  124.00MB   0.4%   124.00MB   0.4%  mallocx
   70.00MB   0.2%    70.00MB   0.2%  prost::message::Message::encode_to_vec
   19.50MB   0.1%   286.31MB   1.0%  lance_file::writer::FileWriter::write_page::{{closure}}
   12.50MB   0.0%    12.50MB   0.0%  alloc::sync::Arc<[T],A>::allocate_for_slice_in::{{closure}}
   10.00MB   0.0%    10.00MB   0.0%  lance_io::object_writer::ObjectWriter::next_part_buffer
    9.50MB   0.0% 26813.49MB  93.7%  <lance_index::vector::v3::shuffler::IvfShuffler as lance_index::vector::v3::s...
    9.00MB   0.0%     9.50MB   0.0%  alloc::boxed::Box<T,A>::try_new_uninit_in
    7.52MB   0.0%     7.52MB   0.0%  lance_encoding::data::encode_flat_data
    7.50MB   0.0%    15.00MB   0.1%  prost::encoding::message::merge_repeated
    4.34MB   0.0%     4.34MB   0.0%  <T as alloc::vec::spec_from_elem::SpecFromElem>::from_elem
    4.00MB   0.0%     4.00MB   0.0%  prost::encoding::<impl prost::encoding::sealed::BytesAdapter for alloc::vec::...
    2.50MB   0.0%     2.50MB   0.0%  prost::encoding::uint64::merge_repeated::{{closure}}
    1.50MB   0.0%     1.50MB   0.0%  hashbrown::raw::alloc::inner::do_alloc
    1.02MB   0.0%     1.02MB   0.0%  lance_table::utils::stream::apply_row_id_and_deletes

patch:

      flat  flat%        cum   cum%  function
 4222.48MB  52.3%  4225.48MB  52.3%  alloc::boxed::Box<T>::new
 2020.79MB  25.0%  6702.79MB  83.0%  <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<...
  828.50MB  10.3%   828.50MB  10.3%  irallocx_prof
  384.00MB   4.8%   384.00MB   4.8%  <lance_file::io::LanceEncodingsIo as lance_encoding::EncodingsIo>::submit_req...
  272.00MB   3.4%   272.00MB   3.4%  bytes::bytes_mut::BytesMut::with_capacity
  149.50MB   1.9%   149.50MB   1.9%  mallocx
   92.01MB   1.1%    92.01MB   1.1%  prost::message::Message::encode_to_vec
   30.74MB   0.4%    30.74MB   0.4%  lance_io::object_writer::ObjectWriter::next_part_buffer
   20.50MB   0.3%   345.69MB   4.3%  lance_file::writer::FileWriter::write_page::{{closure}}
   13.50MB   0.2%    17.01MB   0.2%  prost::encoding::message::merge_repeated
   12.50MB   0.2%    12.50MB   0.2%  alloc::sync::Arc<[T],A>::allocate_for_slice_in::{{closure}}
    9.03MB   0.1%     9.03MB   0.1%  lance_encoding::data::encode_flat_data
    8.50MB   0.1%    10.50MB   0.1%  alloc::boxed::Box<T,A>::try_new_uninit_in
    2.50MB   0.0%     2.50MB   0.0%  calloc
    2.00MB   0.0%     2.00MB   0.0%  prost::encoding::<impl prost::encoding::sealed::BytesAdapter for alloc::vec::...
    1.60MB   0.0%     1.60MB   0.0%  lance_table::utils::stream::apply_row_id_and_deletes
    1.50MB   0.0%     1.50MB   0.0%  hashbrown::raw::alloc::inner::do_alloc
    1.00MB   0.0%     1.00MB   0.0%  prost::encoding::uint64::merge_repeated::{{closure}}
    1.00MB   0.0%     1.00MB   0.0%  <T as alloc::slice::<impl [T]>::to_vec_in::ConvertVec>::to_vec
    0.75MB   0.0%     0.75MB   0.0%  lance::io::exec::filtered_read::FilteredReadStream::plan_scan::{{closure}}::{...

Those allocations in main are all the 5MB buffers in ObjectWriter::new; there is one per partition (these are during IVF shuffle for a 100M row dataset).

I agree that reducing the page size to 8k is not good. Do you have any thoughts on how best to accomplish this?

> We should probably just send the entire buffer to the OS?

Change ObjectWriter to support that? or bypass ObjectWriter?

I now have a chart with all of my optimizations, with and without this change, so I can disentangle it from the other change that removes the buffer. The difference is much more significant and may indicate something that needs to be looked at.

[chart: progress]

There are 4096 partitions here, which is 20GB when multiplied by 5MB.

edit: There was some time between those last two thoughts... given the 20GB, this result actually seems reasonable (for the 5MB setting) and the initial allocation/increase is the buffers becoming resident.
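The arithmetic behind those numbers, spelled out (with one writer buffer per partition, peak buffer memory scales linearly in the partition count):

```rust
fn main() {
    let partitions: u64 = 4096;
    let mib: u64 = 1024 * 1024;
    // 5 MB default buffers: 4096 * 5 MiB = 20 GiB, matching the ~20470 MB
    // flat allocation seen in ObjectWriter::new in the main heap dump.
    assert_eq!(partitions * 5 * mib, 20 * 1024 * mib);
    // 8 KB local buffers: 4096 * 8 KiB = 32 MiB total.
    assert_eq!(partitions * 8 * 1024, 32 * mib);
    println!("ok");
}
```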

@wkalt
Contributor Author

wkalt commented Feb 11, 2026

superseded by #5939

@wkalt wkalt closed this Feb 11, 2026