feat: spill page metadata to disk during IVF shuffle #5921
westonpace merged 3 commits into lance-format:main from
Conversation
During IVF shuffle, we have a FileWriter per partition and each accumulates page metadata in memory over the course of the shuffle. With large datasets and large numbers of partitions, this memory grows over time to dominate the memory cost of IVF shuffle. This patch adds optional functionality to the FileWriter that serializes page metadata to a spill file and enables it by default in the IVF shuffler.
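As a rough illustration of the approach, here is a hedged sketch (not the actual Lance implementation; names like `PageMetadataSpill`, `add_page_metadata`, and `demo` are invented for illustration, and raw bytes stand in for encoded page messages): metadata is buffered per column, appended to a spill file when a buffer fills, and only `(offset, length)` pairs stay in memory.

```rust
use std::io::{Read, Seek, SeekFrom, Write};

// Hypothetical sketch of the spilling idea; field names loosely follow
// the patch but the logic here is illustrative only.
struct PageMetadataSpill {
    spill: std::fs::File,
    position: u64,                       // next append offset in the spill file
    column_buffers: Vec<Vec<u8>>,        // not-yet-spilled metadata, per column
    column_chunks: Vec<Vec<(u64, u32)>>, // (offset, len) of spilled chunks, per column
    buffer_limit: usize,                 // per-column budget before spilling
}

impl PageMetadataSpill {
    fn new(spill: std::fs::File, num_columns: usize, buffer_limit: usize) -> Self {
        Self {
            spill,
            position: 0,
            column_buffers: vec![Vec::new(); num_columns],
            column_chunks: vec![Vec::new(); num_columns],
            buffer_limit,
        }
    }

    fn add_page_metadata(&mut self, column: usize, encoded: &[u8]) -> std::io::Result<()> {
        self.column_buffers[column].extend_from_slice(encoded);
        if self.column_buffers[column].len() >= self.buffer_limit {
            self.flush_column(column)?;
        }
        Ok(())
    }

    fn flush_column(&mut self, column: usize) -> std::io::Result<()> {
        let buf = std::mem::take(&mut self.column_buffers[column]);
        if buf.is_empty() {
            return Ok(());
        }
        self.spill.write_all(&buf)?;
        self.column_chunks[column].push((self.position, buf.len() as u32));
        self.position += buf.len() as u64;
        Ok(())
    }

    // At finalize time, read each column's chunks back (random IO on the
    // spill file) and concatenate them in column order.
    fn finalize(mut self) -> std::io::Result<Vec<Vec<u8>>> {
        for col in 0..self.column_buffers.len() {
            self.flush_column(col)?;
        }
        let mut columns = Vec::new();
        for chunks in &self.column_chunks {
            let mut metadata = Vec::new();
            for &(offset, len) in chunks {
                self.spill.seek(SeekFrom::Start(offset))?;
                let mut buf = vec![0u8; len as usize];
                self.spill.read_exact(&mut buf)?;
                metadata.extend_from_slice(&buf);
            }
            columns.push(metadata);
        }
        Ok(columns)
    }
}

fn demo() -> std::io::Result<Vec<Vec<u8>>> {
    let path = std::env::temp_dir().join("page_metadata_spill_demo.bin");
    let spill = std::fs::OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(&path)?;
    let mut writer = PageMetadataSpill::new(spill, 2, 128);
    // Interleave page metadata for two columns, as a shuffle would.
    for i in 0..10u8 {
        writer.add_page_metadata(0, &[i; 100])?;
        writer.add_page_metadata(1, &[i + 100; 100])?;
    }
    let columns = writer.finalize()?;
    std::fs::remove_file(&path)?;
    Ok(columns)
}

fn main() -> std::io::Result<()> {
    let columns = demo()?;
    assert_eq!(columns[0].len(), 1000);
    assert_eq!(columns[1].len(), 1000);
    Ok(())
}
```

The point of the design is the memory profile: per-column state shrinks from all accumulated page metadata down to one small buffer plus a list of chunk locations.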
I have some concerns about excessive file descriptors for large numbers of partitions. I have the same concerns about the existing FileWriters though, so I figured it was probably something we could solve holistically once this is in place. This will 2x the number of file descriptors required for the build.
Here is a comparison with the other open patch #5912
westonpace left a comment:
This is kind of fun and clever. It's a bit of complexity for the file writer but I don't foresee it needing much maintenance so I'm for it!
```rust
path: Path,
position: u64,
column_buffers: Vec<Vec<u8>>,
column_chunks: Vec<Vec<(u64, u32)>>,
```
Can you document a little what these fields are holding?
```rust
// to the spill file. Divided evenly across columns (with a floor of 64 bytes).
const DEFAULT_SPILL_BUFFER_LIMIT: usize = 256 * 1024;

struct PageMetadataSpill {
```
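The "divided evenly across columns (with a floor of 64 bytes)" behavior could be sketched as follows (a hypothetical helper, not code from the patch; `per_column_budget` is an invented name):

```rust
// Hypothetical helper illustrating how a total spill budget like
// DEFAULT_SPILL_BUFFER_LIMIT could be split evenly across columns,
// bottoming out at a 64-byte floor per column.
fn per_column_budget(total: usize, num_columns: usize) -> usize {
    std::cmp::max(total / num_columns.max(1), 64)
}

fn main() {
    // 256 KiB across 1024 columns leaves 256 bytes each...
    assert_eq!(per_column_budget(256 * 1024, 1024), 256);
    // ...but very wide schemas hit the 64-byte floor.
    assert_eq!(per_column_budget(256 * 1024, 100_000), 64);
}
```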
Could you document the structure of the spill file at a high level? It looks like a series of column chunks where each chunk is a series of page messages for a single column?
Updated the comments, thanks.
If we restricted this to local files we could actually spill the metadata into the file itself. This would mean the file would ultimately have junk left around in it, but for the narrow case of IVF shuffling this isn't a big deal since the file itself is temporary. This would remove the need for a second file handle. That being said, due to the local-only restriction, I think I still prefer the current approach.
@westonpace something like this did cross my mind. I think the local-only concern is that you would need to have this column metadata in order to finalize the write to the temporary data file? That would be possible on local disk but not on object storage (reading from a file you're still writing to). I just wanna make sure the concern isn't random IO, because this does do random IO on the metadata file. I figured that would probably suck if we ever spilled to remote storage, but that it would probably be small in the scheme of a large index build.
Thinking through ^ a bit more, I think the current design probably would present some issues for remotely-spilled files. It would work, but I think you'd want to do some optimizations. In situations where there is a lot of spilled data, like wide tables with high rowcounts, you could conceivably end up with tens or even hundreds of GB even on local disk, and reading it randomly would really suck (but it would work). Remote storage would be horrible. So we may end up needing some smarter reading than this when we require those scales. I think that is probably a problem we can solve though, and I think we might need to do some other reworking for the file-handle concern anyway once we get to the really large partition counts, so maybe this will naturally evolve (and since it's just transient index build structures we should be pretty free to change things). Let me know if that seems worth looking into before merging this. I will do some larger-scale validation of this strategy later in the week.
My concern with remote files is that you can't open them for reading while a write is in progress, so you couldn't go back and gather the metadata. Still, I wouldn't worry about it. I had once brainstormed a single-file solution to this problem actually. The file writer could easily support a mode where it writes one array at a time (instead of one batch at a time). Then we just make a single file with
