Skip to content

feat: spill page metadata to disk during IVF shuffle#5921

Merged
westonpace merged 3 commits intolance-format:mainfrom
wkalt:task/spill-page-metadata
Feb 10, 2026
Merged

feat: spill page metadata to disk during IVF shuffle#5921
westonpace merged 3 commits intolance-format:mainfrom
wkalt:task/spill-page-metadata

Conversation

@wkalt
Copy link
Copy Markdown
Contributor

@wkalt wkalt commented Feb 9, 2026

During IVF shuffle, we have a FileWriter per partition and each accumulates page metadata in memory over the course of the shuffle. With large datasets and large numbers of partitions, this memory grows over time to dominate the memory cost of IVF shuffle.

This patch adds optional functionality to the FileWriter that serializes page metadata to a spill file and enables it by default in the IVF shuffler.

During IVF shuffle, we have a FileWriter per partition and each
accumulates page metadata in memory over the course of the shuffle. With
large datasets and large numbers of partitions, this memory grows over
time to dominate the memory cost of IVF shuffle.

This patch adds optional functionality to the FileWriter that serializes
page metadata to a spill file and enables it by default in the IVF
shuffler.
@github-actions github-actions Bot added the enhancement New feature or request label Feb 9, 2026
@wkalt
Copy link
Copy Markdown
Contributor Author

wkalt commented Feb 9, 2026

I have some concerns about excessive file descriptors for large numbers of partitions. I have the same concerns about the existing FileWriters though, so I figured it was probably something we could solve holistically once this is in place. This will 2x the number of file descriptors required for the build.

@wkalt
Copy link
Copy Markdown
Contributor Author

wkalt commented Feb 9, 2026

progress

here is a comparison with the other open patch #5912

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 83.54430% with 26 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-file/src/writer.rs 83.66% 15 Missing and 10 partials ⚠️
rust/lance-index/src/vector/v3/shuffler.rs 80.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of fun and clever. It's a bit of complexity for the file writer but I don't foresee it needing much maintenance so I'm for it!

path: Path,
position: u64,
column_buffers: Vec<Vec<u8>>,
column_chunks: Vec<Vec<(u64, u32)>>,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document a little what these fields are holding?

// to the spill file. Divided evenly across columns (with a floor of 64 bytes).
const DEFAULT_SPILL_BUFFER_LIMIT: usize = 256 * 1024;

struct PageMetadataSpill {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you document the structure of the spill file at a high level? It looks like a series of column chunks where each chunk is a series of page messages for a single column?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comments, thanks

@westonpace
Copy link
Copy Markdown
Member

If we restricted this to local files we could actually spill the metadata into the file itself. This would mean the file would ultimately have junk left around in it but for the narrow case of IVF shuffling this isn't a big deal since the file itself is temporary. This would remove the need for a second file handle.

That being said, due to the local-only restriction, I think I still prefer the current approach.

@wkalt
Copy link
Copy Markdown
Contributor Author

wkalt commented Feb 10, 2026

If we restricted this to local files we could actually spill the metadata into the file itself. This would mean the file would ultimately have junk left around in it but for the narrow case of IVF shuffling this isn't a big deal since the file itself is temporary. This would remove the need for a second file handle.

That being said, due to the local-only restriction, I think I still prefer the current approach.

@westonpace something like this did cross my mind. I think the local-only concern is that you would need to have this column metadata in order to finalize the write to the temporary data file? That would be possible on local disk but not on object storage (reading from a file you're still writing to).

I just wanna make sure the concern isn't random IO, because this does do random IO on the metadata file. I figured that would probably suck if we ever spilled to remote storage, but that it would probably be small in the scheme of a large index build.

@wkalt
Copy link
Copy Markdown
Contributor Author

wkalt commented Feb 10, 2026

thinking through ^ a bit more, I think the current design probably would present some issues for remotely-spilled files. It would work but I think you'd want to do some optimizations.

In situations where there is a lot of spilled data, like wide tables with high rowcounts, you could conceivably end up getting 10s or even 100s of GB on local disk even and reading it randomly would really suck (but it would work). Remote storage would be horrible. So we may end up needing some smarter reading than this when we require those scales.

I think that is probably a problem we can solve though, and I think we might need to do some other reworking for the file-handle concern anyway once we get to the really large partition counts, so maybe this will naturally evolve (and since it's just transient index build structures we should be pretty free to change things).

Let me know if that seems worth looking into before merging this. I will do some larger scale validation of this strategy later in the week.

@westonpace
Copy link
Copy Markdown
Member

My concern with remote files is that you can't open them for reading while a write is in progress. So you couldnt go back and gather the metadata.

Still, I wouldnt worry about it. I had once brainstormed a single file solution to this problem actually.

The file writer could easily support a mode where it writes one array at a time (instead of one batch at a time). Then we just make a single file with num_partitons columns and disable column caches (set min page size to 0 so each write writes a page). Each write will write one array to one column. Then we close the file, reopen it, and read it out one column at a time.

@westonpace westonpace merged commit 94b0308 into lance-format:main Feb 10, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants