perf: change Dataset::sample to sort its random indices #5915

Merged
westonpace merged 2 commits into lance-format:main from wkalt:task/sample-sort-ids on Feb 13, 2026
Merged

perf: change Dataset::sample to sort its random indices#5915
westonpace merged 2 commits intolance-format:mainfrom
wkalt:task/sample-sort-ids

Conversation

@wkalt
Contributor

@wkalt wkalt commented Feb 9, 2026

This changes Dataset::sample to sort its random indices. Supplying sorted inputs to take results in a 50% reduction in peak memory consumption. This change causes the IVF training stage of IVF-PQ index builds to take approximately half as much memory.
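The core of the change can be sketched as follows (a hypothetical Python sketch; `sample_indices` is an illustrative name, not the actual lance API): draw the random sample, then sort it before handing it to take, so take sees monotonically increasing indices.

```python
import random

def sample_indices(total_rows, n, seed=None):
    """Draw n distinct random row ids and sort them so a subsequent
    take can use its cheaper sorted code path."""
    rng = random.Random(seed)
    ids = rng.sample(range(total_rows), n)
    ids.sort()  # the key change: hand take sorted indices
    return ids
```

Sorting a few thousand ids costs O(n log n), which is negligible next to the I/O the take performs.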

wkalt added 2 commits February 8, 2026 16:40
This promotes sample() from pub(crate) to pub so that a criterion
benchmark can be added, and adds that benchmark.
Prior to this commit, Dataset::sample() drew a random sample of indices
and executed take on that unsorted sample. This triggers a slower and
more memory-intensive branch of take than is necessary to meet sample's
needs.

Take uses one of three code paths depending on whether the requested
indexes are:
* contiguous (path 1)
* sorted (path 2)
* noncontiguous and unsorted (path 3)

Take can take advantage of sorted inputs because neighboring indexes are
more likely to reside on the same pages.  It gets this benefit
automatically in paths 1 and 2.

In path 3, in order to take advantage of this optimization, take
reorders its input. Since the contract of take is to return rows in
the order requested, it must then reorder its results according to
the original requested ordering. This reordering step requires two
copies of the data to be held in memory at the same time.

This causes the IVF training stage of IVF-PQ index building to take
about twice as much memory as it needs.
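The three paths can be told apart by a simple classification of the requested indices (a hypothetical Python sketch, not the actual take implementation):

```python
def classify(indices):
    """Classify a take request into one of the three code paths."""
    pairs = list(zip(indices, indices[1:]))
    if all(b == a + 1 for a, b in pairs):
        return "contiguous"   # path 1
    if all(b >= a for a, b in pairs):
        return "sorted"       # path 2
    return "unsorted"         # path 3
```

Paths 1 and 2 can read pages in order as-is; only path 3 needs the extra reorder buffer.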
@wkalt
Contributor Author

wkalt commented Feb 9, 2026

the visibility change for Dataset::sample is just for the benchmark. I'm happy to remove the benchmark and revert that part. The included benchmark does not measure memory but does measure a 62% speedup (which was not what I was targeting):

[wyatt@desktop lance](task/sample-sort-ids) $ cargo bench --bench take -p lance -- 'sample' --baseline before
warning: lance-linalg@3.0.0-beta.2: fp16kernels feature is not enabled, skipping build of fp16 kernels
warning: lance-linalg@3.0.0-beta.2: fp16kernels feature is not enabled, skipping build of fp16 kernels
   Compiling lance v3.0.0-beta.2 (/mnt/work/home/wyatt/work/lance/rust/lance)
    Finished `bench` profile [optimized + debuginfo] target(s) in 2m 13s
     Running benches/take.rs (target/release/deps/take-a05c111b1d24c007)
Benchmarking sample(1024 of 102400 rows): Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 18.8s, or reduce sample count to 2650.
sample(1024 of 102400 rows)
                        time:   [1.9270 ms 1.9288 ms 1.9306 ms]
                        change: [-7.6751% -7.5481% -7.4247%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 417 outliers among 10000 measurements (4.17%)
  1 (0.01%) low mild
  262 (2.62%) high mild
  154 (1.54%) high severe

Benchmarking sample(8192 of 102400 rows): Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 41.5s, or reduce sample count to 1200.
sample(8192 of 102400 rows)
                        time:   [4.1826 ms 4.1884 ms 4.1942 ms]
                        change: [-62.015% -61.959% -61.907%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 492 outliers among 10000 measurements (4.92%)
  347 (3.47%) high mild
  145 (1.45%) high severe

The behavior of returning sampled results in physical order is consistent with what postgres does with TABLESAMPLE. I have not investigated other systems. The sample() method is not currently public, so there is no precedent to break, but if we end up making it public here we should be mindful that we are choosing this direction. Personally I think this behavior is fine/expected for SQL, but there may be stakeholders who feel differently.

@codecov

codecov Bot commented Feb 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@BubbleCal
Contributor

BubbleCal commented Feb 9, 2026

Should we add a sort into take? cc @westonpace @Xuanwo
I thought we had something similar in take, but this PR actually improves the performance

@wkalt
Contributor Author

wkalt commented Feb 9, 2026

take on unsorted input does do a sort, but since take needs to return rows in the requested order, it then has to re-sort on the way out, and this is where the cost is. So IMO take is doing what it should be doing (it would be worse if it didn't sort), but callers who don't care about the ordering should still sort before handing indexes to take.
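That sort-then-restore dance can be sketched like so (a hypothetical Python sketch; `take_unsorted` stands in for take's path 3, with `rows` as an in-memory stand-in for the dataset):

```python
def take_unsorted(rows, indices):
    """Fetch rows at the given indices, preserving the requested order."""
    # Sort the indices for locality, remembering each one's original slot.
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    fetched = [rows[indices[i]] for i in order]  # first copy, sorted order
    out = [None] * len(indices)                  # second copy, requested order
    for slot, val in zip(order, fetched):
        out[slot] = val
    return out
```

Both `fetched` and `out` are alive at the same time, which is where the roughly 2x peak memory comes from; a caller that sorts its indices up front never pays for the second copy.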

@wkalt
Contributor Author

wkalt commented Feb 9, 2026

(note that this benchmark is for Dataset::sample, not for generic take)

@westonpace
Member

take on unsorted input does do a sort, but since take needs to return rows in the requested order, it then has to re-sort on the way out, and this is where the cost is. So IMO take is doing what it should be doing (it would be worse if it didn't sort), but callers who don't care about the ordering should still sort before handing indexes to take.

Yes, some take use cases don't care about order and some do. Take doesn't have the context to know. However, at this level we do know.

There is a similar improvement available on the read side. When we search the vector index we get back a list of row addresses that we pass to TakeExec. The TakeExec then assumes the order is important but I think we always sort by distance after the fact and so the order is not important. As a result we do an extra sort in the take which we could potentially get rid of.

@westonpace
Member

Huh...looks like we already had this optimization in the python-level sample method:

        total_num_rows = self.count_rows()
        indices = random.sample(range(total_num_rows), num_rows)
        if not randomize_order:
            # Sort the indices in order to increase the locality and thus reduce
            # the number of random reads.
            indices = sorted(indices)
        return self.take(indices, columns, **kwargs)

+1 for using this in rust

@cmccabe
Contributor

cmccabe commented Feb 9, 2026

Maybe I'm missing something, but I'm confused. Conceptually, sampling rows from a dataset shouldn't require any kind of sorting. Can we have a new API like "give me some random rows, in any order you like"?

@wkalt
Contributor Author

wkalt commented Feb 9, 2026

@cmccabe one of the commits has a better message than the PR description maybe --

Prior to this commit, Dataset::sample() drew a random sample of indices
and executed take on that unsorted sample. This triggers a slower and
more memory-intensive branch of take than is necessary to meet sample's
needs.

Take uses one of three code paths depending on whether the requested
indexes are:
* contiguous (path 1)
* sorted (path 2)
* noncontiguous and unsorted (path 3)

Take can take advantage of sorted inputs because neighboring indexes are
more likely to reside on the same pages.  It gets this benefit
automatically in paths 1 and 2.

In path 3, in order to take advantage of this optimization, take
reorders its input. Since the contract of take is to return rows in
the order requested, it must then reorder its results according to
the original requested ordering. This reordering step requires two
copies of the data to be held in memory at the same time.

This causes the IVF training stage of IVF-PQ index building to take
about twice as much memory as it needs.

This PR effectively implements what you describe. Prior to this commit you asked for random rows and it gave you those rows in random order. Now you ask for random rows and it gives you the rows in the order it chooses (physical stored order). If you want a different ordering you need to ask for it.

@cmccabe
Contributor

cmccabe commented Feb 9, 2026

This PR effectively implements what you describe. Prior to this commit you asked for random rows and it gave you those rows in random order. Now you ask for random rows and it gives you the rows in the order it chooses (physical stored order). If you want a different ordering you need to ask for it.

Thanks for the explanation.

Take can take advantage of sorted inputs because neighboring indexes are
more likely to reside on the same pages. It gets this benefit
automatically in paths 1 and 2.
In path 3, in order to take advantage of this optimization, take
reorders its the input.

I guess one question in my mind is, how important is this optimization (where neighboring indexes are more likely to reside on the same pages) when doing a random sample for indexing? Intuitively I doubt it's very useful when the dataset is large. Another thing to think about for later, maybe?

@westonpace
Member

westonpace commented Feb 9, 2026

I guess one question in my mind is, how important is this optimization (where neighboring indexes are more likely to reside on the same pages) when doing a random sample for indexing? Intuitively I doubt it's very useful when the dataset is large. Another thing to think about for later, maybe?

The file reader (well decoder) itself requires indices to be sorted. It's not entirely for I/O optimization purposes (it could still figure out which I/Os are close together if operating in an unsorted manner) but mainly to simplify logic and keep the code manageable.

@westonpace
Member

For example, imagine the reader is asking for one million rows and the first row they ask for and the last row they ask for are both on the same disk page. Do you load that page and cache it for the duration of the read? By forcing the indices to be in sorted order you bypass this question and leave it up to the caller.

The various take methods that produce batches (e.g. take, sample) do cache it. They load the entire batch in memory, and then sort it.

The various take methods that produce streams (TakeExec, take_scan) will read (and decode) the requested disk page twice. They group the input into batches, order the batch, then resort, then emit. So the first and last row would be in different batches.

@westonpace westonpace merged commit e28a556 into lance-format:main Feb 13, 2026
29 of 30 checks passed