feat: spill page metadata to disk during IVF shuffle #5921
westonpace merged 3 commits into lance-format:main from
Conversation
During IVF shuffle, we have a FileWriter per partition and each accumulates page metadata in memory over the course of the shuffle. With large datasets and large numbers of partitions, this memory grows over time to dominate the memory cost of IVF shuffle. This patch adds optional functionality to the FileWriter that serializes page metadata to a spill file and enables it by default in the IVF shuffler.
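As a rough illustration of the approach, here is a hedged sketch (not the actual Lance implementation; names like `PageMetadataSpill`, `add_page_metadata`, and `demo` are invented for illustration, and raw bytes stand in for encoded page messages): metadata is buffered per column, appended to a spill file when a buffer fills, and only `(offset, length)` pairs stay in memory.

```rust
use std::io::{Read, Seek, SeekFrom, Write};

// Hypothetical sketch of the spilling idea; field names loosely follow
// the patch but the logic here is illustrative only.
struct PageMetadataSpill {
    spill: std::fs::File,
    position: u64,                       // next append offset in the spill file
    column_buffers: Vec<Vec<u8>>,        // not-yet-spilled metadata, per column
    column_chunks: Vec<Vec<(u64, u32)>>, // (offset, len) of spilled chunks, per column
    buffer_limit: usize,                 // per-column budget before spilling
}

impl PageMetadataSpill {
    fn new(spill: std::fs::File, num_columns: usize, buffer_limit: usize) -> Self {
        Self {
            spill,
            position: 0,
            column_buffers: vec![Vec::new(); num_columns],
            column_chunks: vec![Vec::new(); num_columns],
            buffer_limit,
        }
    }

    fn add_page_metadata(&mut self, column: usize, encoded: &[u8]) -> std::io::Result<()> {
        self.column_buffers[column].extend_from_slice(encoded);
        if self.column_buffers[column].len() >= self.buffer_limit {
            self.flush_column(column)?;
        }
        Ok(())
    }

    fn flush_column(&mut self, column: usize) -> std::io::Result<()> {
        let buf = std::mem::take(&mut self.column_buffers[column]);
        if buf.is_empty() {
            return Ok(());
        }
        self.spill.write_all(&buf)?;
        self.column_chunks[column].push((self.position, buf.len() as u32));
        self.position += buf.len() as u64;
        Ok(())
    }

    // At finalize time, read each column's chunks back (random IO on the
    // spill file) and concatenate them in column order.
    fn finalize(mut self) -> std::io::Result<Vec<Vec<u8>>> {
        for col in 0..self.column_buffers.len() {
            self.flush_column(col)?;
        }
        let mut columns = Vec::new();
        for chunks in &self.column_chunks {
            let mut metadata = Vec::new();
            for &(offset, len) in chunks {
                self.spill.seek(SeekFrom::Start(offset))?;
                let mut buf = vec![0u8; len as usize];
                self.spill.read_exact(&mut buf)?;
                metadata.extend_from_slice(&buf);
            }
            columns.push(metadata);
        }
        Ok(columns)
    }
}

fn demo() -> std::io::Result<Vec<Vec<u8>>> {
    let path = std::env::temp_dir().join("page_metadata_spill_demo.bin");
    let spill = std::fs::OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(&path)?;
    let mut writer = PageMetadataSpill::new(spill, 2, 128);
    // Interleave page metadata for two columns, as a shuffle would.
    for i in 0..10u8 {
        writer.add_page_metadata(0, &[i; 100])?;
        writer.add_page_metadata(1, &[i + 100; 100])?;
    }
    let columns = writer.finalize()?;
    std::fs::remove_file(&path)?;
    Ok(columns)
}

fn main() -> std::io::Result<()> {
    let columns = demo()?;
    assert_eq!(columns[0].len(), 1000);
    assert_eq!(columns[1].len(), 1000);
    Ok(())
}
```

The point of the design is the memory profile: per-column state shrinks from all accumulated page metadata down to one small buffer plus a list of chunk locations.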
I have some concerns about excessive file descriptors for large numbers of partitions. I have the same concerns about the existing FileWriters though, so I figured it was probably something we could solve holistically once this is in place. This will 2x the number of file descriptors required for the build.
Here is a comparison with the other open patch #5912
westonpace left a comment:
This is kind of fun and clever. It's a bit of complexity for the file writer but I don't foresee it needing much maintenance so I'm for it!
```rust
path: Path,
position: u64,
column_buffers: Vec<Vec<u8>>,
column_chunks: Vec<Vec<(u64, u32)>>,
```
Can you document a little what these fields are holding?
```rust
// to the spill file. Divided evenly across columns (with a floor of 64 bytes).
const DEFAULT_SPILL_BUFFER_LIMIT: usize = 256 * 1024;

struct PageMetadataSpill {
```
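The "divided evenly across columns (with a floor of 64 bytes)" behavior could be sketched as follows (a hypothetical helper, not code from the patch; `per_column_budget` is an invented name):

```rust
// Hypothetical helper illustrating how a total spill budget like
// DEFAULT_SPILL_BUFFER_LIMIT could be split evenly across columns,
// bottoming out at a 64-byte floor per column.
fn per_column_budget(total: usize, num_columns: usize) -> usize {
    std::cmp::max(total / num_columns.max(1), 64)
}

fn main() {
    // 256 KiB across 1024 columns leaves 256 bytes each...
    assert_eq!(per_column_budget(256 * 1024, 1024), 256);
    // ...but very wide schemas hit the 64-byte floor.
    assert_eq!(per_column_budget(256 * 1024, 100_000), 64);
}
```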
Could you document the structure of the spill file at a high level? It looks like a series of column chunks where each chunk is a series of page messages for a single column?
Updated the comments, thanks.
If we restricted this to local files we could actually spill the metadata into the file itself. This would mean the file would ultimately have junk left around in it, but for the narrow case of IVF shuffling this isn't a big deal since the file itself is temporary. This would remove the need for a second file handle. That being said, due to the local-only restriction, I think I still prefer the current approach.
@westonpace something like this did cross my mind. I think the local-only concern is that you would need to have this column metadata in order to finalize the write to the temporary data file? That would be possible on local disk but not on object storage (reading from a file you're still writing to). I just wanna make sure the concern isn't random IO, because this does do random IO on the metadata file. I figured that would probably suck if we ever spilled to remote storage, but that it would probably be small in the scheme of a large index build.
Thinking through ^ a bit more, I think the current design probably would present some issues for remotely-spilled files. It would work, but I think you'd want to do some optimizations. In situations where there is a lot of spilled data, like wide tables with high rowcounts, you could conceivably end up with tens or even hundreds of GB even on local disk, and reading it randomly would really suck (but it would work). Remote storage would be horrible. So we may end up needing some smarter reading than this when we require those scales. I think that is probably a problem we can solve though, and I think we might need to do some other reworking for the file-handle concern anyway once we get to the really large partition counts, so maybe this will naturally evolve (and since it's just transient index build structures we should be pretty free to change things). Let me know if that seems worth looking into before merging this. I will do some larger-scale validation of this strategy later in the week.
My concern with remote files is that you can't open them for reading while a write is in progress, so you couldn't go back and gather the metadata. Still, I wouldn't worry about it. I had once brainstormed a single-file solution to this problem actually. The file writer could easily support a mode where it writes one array at a time (instead of one batch at a time). Then we just make a single file with
