perf: change Dataset::sample to sort its random indices#5915
perf: change Dataset::sample to sort its random indices#5915westonpace merged 2 commits intolance-format:mainfrom
Conversation
This promotes sample() from pub(crate) to pub so that a criterion benchmark can be added, and benchmark.
Prior to this commit, Dataset::sample() drew a random sample of indices and executed take on that unsorted sample. This triggers a slower and more memory-intensive branch of take than is necessary to meet sample's needs. Take uses one of three code paths depending on whether the requested indexes are, * contiguous (path 1) * sorted (path 2) * noncontiguous and unsorted (path 3) Take can take advantage of sorted inputs because neighboring indexes are more likely to reside on the same pages. It gets this benefit automatically in paths 1 and 2. In path 3, in order to take advantage of this optimization, take reorders its the input. Since the contract of Take is to return rows in the order requested, it must then reorder its take results according to the original requested ordering. This reordering step requires two copies of the data to be held in memory at the same time. This causes the IVF training stage of IVF-PQ index building to take about twice as much memory as it needs.
|
the visibility change for Dataset::sample is just for the benchmark. I'm happy to remove the benchmark and revert that part. The included benchmark does not measure memory but does measure a 62% speedup (which was not what I was targeting): The behavior of returning sampled results in physical order is consistent with what postgres does with TABLESAMPLE. I have not investigated other systems. The sample() method is not currently public so there is no precedent to break, but if we end up making it public here we should be mindful that we are choosing this direction. Personally I think this behavior is fine/expected for SQL but maybe there may be stakeholders that feel differently. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Should we add |
|
take on unsorted input does do a sort, but since take needs to return rows in the requested order, it then has to resort on the way out and this is where the cost is. So IMO take is doing what it should be doing (it would be worse if it didn't sort) but callers who don't care about the sorting, should still sort before handling indexes to take. |
|
(note that this benchmarks is for Dataset::sample, not for generic take) |
Yes, some take use cases don't care about order and some do. Take doesn't have the context to know. However, at this level we do know. There is a similar improvement available on the read side. When we search the vector index we get back a list of row addresses that we pass to |
|
Huh...looks like we already had this optimization in the python-level sample method: +1 for using this in rust |
|
Maybe I'm missing something, but I'm confused. Conceptually, sampling rows from a dataset shouldn't require any kind of sorting. Can we have a new API like "give me some random rows, in any order you like"? |
|
@cmccabe one of the commits has a better message than the PR description maybe -- This PR effectively implements what you describe. Prior to this commit you asked for random rows and it gave you those rows in random order. Now you ask for random rows and it gives you the rows in the order it chooses (physical stored order). If you want a different ordering you need to ask for it. |
Thanks for the explanation.
I guess one question in my mind is, how important is this optimization (where neighboring indexes are more likely to reside on the same pages) when doing a random sample for indexing? Intuitively I doubt it's very useful when the dataset is large. Another thing to think about for later, maybe? |
The file reader (well decoder) itself requires indices to be sorted. It's not entirely for I/O optimization purposes (it could still figure out which I/Os are close together if operating in an unsorted manner) but mainly to simplify logic and keep the code manageable. |
|
For example, imagine the reader is asking for one million rows and the first row they ask for and the last row they ask for are both on the same disk page. Do you load that page and cache it for the duration of the read? By forcing the indices to be in sorted order you bypass this question and leave it up to the caller. The various take methods that produce batches (e.g. The various take methods that produce streams ( |
This changes Dataset::sample to sort its random indices. Supplying sorted inputs to take results in a 50% reduction in peak memory consumption. This change causes the IVF training stage of IVF-PQ index builds to take approximately half as much memory.