perf: change Dataset::sample to sort its random indices #5915

Merged
westonpace merged 2 commits into lance-format:main from wkalt:task/sample-sort-ids on Feb 13, 2026
Merged

perf: change Dataset::sample to sort its random indices#5915
westonpace merged 2 commits intolance-format:mainfrom
wkalt:task/sample-sort-ids

Conversation

@wkalt
Contributor

@wkalt wkalt commented Feb 9, 2026

This changes Dataset::sample to sort its random indices. Supplying sorted inputs to take results in a 50% reduction in peak memory consumption. This change causes the IVF training stage of IVF-PQ index builds to take approximately half as much memory.
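The core of the change can be sketched as follows (a hypothetical Python sketch; `sample_indices` is an illustrative name, not the actual lance API): draw the random sample, then sort it before handing it to take, so take sees monotonically increasing indices.

```python
import random

def sample_indices(total_rows, n, seed=None):
    """Draw n distinct random row ids and sort them so a subsequent
    take can use its cheaper sorted code path."""
    rng = random.Random(seed)
    ids = rng.sample(range(total_rows), n)
    ids.sort()  # the key change: hand take sorted indices
    return ids
```

Sorting a few thousand ids costs O(n log n), which is negligible next to the I/O the take performs.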

wkalt added 2 commits February 8, 2026 16:40
This promotes sample() from pub(crate) to pub so that a criterion
benchmark can be added, and adds that benchmark.
Prior to this commit, Dataset::sample() drew a random sample of indices
and executed take on that unsorted sample. This triggers a slower and
more memory-intensive branch of take than is necessary to meet sample's
needs.

Take uses one of three code paths depending on whether the requested
indexes are:
* contiguous (path 1)
* sorted (path 2)
* noncontiguous and unsorted (path 3)

Take can take advantage of sorted inputs because neighboring indexes are
more likely to reside on the same pages.  It gets this benefit
automatically in paths 1 and 2.

In path 3, in order to take advantage of this optimization, take
reorders its input. Since the contract of take is to return rows in
the order requested, it must then reorder its results according to
the original requested ordering. This reordering step requires two
copies of the data to be held in memory at the same time.

This causes the IVF training stage of IVF-PQ index building to take
about twice as much memory as it needs.
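The three paths can be told apart by a simple classification of the requested indices (a hypothetical Python sketch, not the actual take implementation):

```python
def classify(indices):
    """Classify a take request into one of the three code paths."""
    pairs = list(zip(indices, indices[1:]))
    if all(b == a + 1 for a, b in pairs):
        return "contiguous"   # path 1
    if all(b >= a for a, b in pairs):
        return "sorted"       # path 2
    return "unsorted"         # path 3
```

Paths 1 and 2 can read pages in order as-is; only path 3 needs the extra reorder buffer.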
@wkalt
Contributor Author

wkalt commented Feb 9, 2026

the visibility change for Dataset::sample is just for the benchmark. I'm happy to remove the benchmark and revert that part. The included benchmark does not measure memory but does measure a 62% speedup (which was not what I was targeting):

[wyatt@desktop lance](task/sample-sort-ids) $ cargo bench --bench take -p lance -- 'sample' --baseline before
warning: lance-linalg@3.0.0-beta.2: fp16kernels feature is not enabled, skipping build of fp16 kernels
warning: lance-linalg@3.0.0-beta.2: fp16kernels feature is not enabled, skipping build of fp16 kernels
   Compiling lance v3.0.0-beta.2 (/mnt/work/home/wyatt/work/lance/rust/lance)
    Finished `bench` profile [optimized + debuginfo] target(s) in 2m 13s
     Running benches/take.rs (target/release/deps/take-a05c111b1d24c007)
Benchmarking sample(1024 of 102400 rows): Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 18.8s, or reduce sample count to 2650.
sample(1024 of 102400 rows)
                        time:   [1.9270 ms 1.9288 ms 1.9306 ms]
                        change: [-7.6751% -7.5481% -7.4247%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 417 outliers among 10000 measurements (4.17%)
  1 (0.01%) low mild
  262 (2.62%) high mild
  154 (1.54%) high severe

Benchmarking sample(8192 of 102400 rows): Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 41.5s, or reduce sample count to 1200.
sample(8192 of 102400 rows)
                        time:   [4.1826 ms 4.1884 ms 4.1942 ms]
                        change: [-62.015% -61.959% -61.907%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 492 outliers among 10000 measurements (4.92%)
  347 (3.47%) high mild
  145 (1.45%) high severe

The behavior of returning sampled results in physical order is consistent with what postgres does with TABLESAMPLE. I have not investigated other systems. The sample() method is not currently public, so there is no precedent to break, but if we end up making it public here we should be mindful that we are choosing this direction. Personally I think this behavior is fine/expected for SQL, but there may be stakeholders who feel differently.

@codecov

codecov Bot commented Feb 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@BubbleCal
Contributor

BubbleCal commented Feb 9, 2026

Should we add a sort into take? cc @westonpace @Xuanwo
I thought we had something similar in take, but this PR actually improves the performance

@wkalt
Contributor Author

wkalt commented Feb 9, 2026

take on unsorted input does do a sort, but since take needs to return rows in the requested order, it then has to re-sort on the way out, and this is where the cost is. So IMO take is doing what it should be doing (it would be worse if it didn't sort), but callers who don't care about the ordering should still sort before handing indexes to take.
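That sort-then-restore dance can be sketched like so (a hypothetical Python sketch; `take_unsorted` stands in for take's path 3, with `rows` as an in-memory stand-in for the dataset):

```python
def take_unsorted(rows, indices):
    """Fetch rows at the given indices, preserving the requested order."""
    # Sort the indices for locality, remembering each one's original slot.
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    fetched = [rows[indices[i]] for i in order]  # first copy, sorted order
    out = [None] * len(indices)                  # second copy, requested order
    for slot, val in zip(order, fetched):
        out[slot] = val
    return out
```

Both `fetched` and `out` are alive at the same time, which is where the roughly 2x peak memory comes from; a caller that sorts its indices up front never pays for the second copy.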

@wkalt
Contributor Author

wkalt commented Feb 9, 2026

(note that this benchmark is for Dataset::sample, not for generic take)

@westonpace
Member

take on unsorted input does do a sort, but since take needs to return rows in the requested order, it then has to re-sort on the way out, and this is where the cost is. So IMO take is doing what it should be doing (it would be worse if it didn't sort), but callers who don't care about the ordering should still sort before handing indexes to take.

Yes, some take use cases don't care about order and some do. Take doesn't have the context to know. However, at this level we do know.

There is a similar improvement available on the read side. When we search the vector index we get back a list of row addresses that we pass to TakeExec. The TakeExec then assumes the order is important but I think we always sort by distance after the fact and so the order is not important. As a result we do an extra sort in the take which we could potentially get rid of.

@westonpace
Member

Huh...looks like we already had this optimization in the python-level sample method:

        total_num_rows = self.count_rows()
        indices = random.sample(range(total_num_rows), num_rows)
        if not randomize_order:
            # Sort the indices in order to increase the locality and thus reduce
            # the number of random reads.
            indices = sorted(indices)
        return self.take(indices, columns, **kwargs)

+1 for using this in rust

@cmccabe
Contributor

cmccabe commented Feb 9, 2026

Maybe I'm missing something, but I'm confused. Conceptually, sampling rows from a dataset shouldn't require any kind of sorting. Can we have a new API like "give me some random rows, in any order you like"?

@wkalt
Contributor Author

wkalt commented Feb 9, 2026

@cmccabe one of the commits has a better message than the PR description maybe --

Prior to this commit, Dataset::sample() drew a random sample of indices
and executed take on that unsorted sample. This triggers a slower and
more memory-intensive branch of take than is necessary to meet sample's
needs.

Take uses one of three code paths depending on whether the requested
indexes are:
* contiguous (path 1)
* sorted (path 2)
* noncontiguous and unsorted (path 3)

Take can take advantage of sorted inputs because neighboring indexes are
more likely to reside on the same pages.  It gets this benefit
automatically in paths 1 and 2.

In path 3, in order to take advantage of this optimization, take
reorders its input. Since the contract of take is to return rows in
the order requested, it must then reorder its results according to
the original requested ordering. This reordering step requires two
copies of the data to be held in memory at the same time.

This causes the IVF training stage of IVF-PQ index building to take
about twice as much memory as it needs.

This PR effectively implements what you describe. Prior to this commit you asked for random rows and it gave you those rows in random order. Now you ask for random rows and it gives you the rows in the order it chooses (physical stored order). If you want a different ordering you need to ask for it.

@cmccabe
Contributor

cmccabe commented Feb 9, 2026

This PR effectively implements what you describe. Prior to this commit you asked for random rows and it gave you those rows in random order. Now you ask for random rows and it gives you the rows in the order it chooses (physical stored order). If you want a different ordering you need to ask for it.

Thanks for the explanation.

Take can take advantage of sorted inputs because neighboring indexes are
more likely to reside on the same pages. It gets this benefit
automatically in paths 1 and 2.
In path 3, in order to take advantage of this optimization, take
reorders its the input.

I guess one question in my mind is, how important is this optimization (where neighboring indexes are more likely to reside on the same pages) when doing a random sample for indexing? Intuitively I doubt it's very useful when the dataset is large. Another thing to think about for later, maybe?

@westonpace
Member

westonpace commented Feb 9, 2026

I guess one question in my mind is, how important is this optimization (where neighboring indexes are more likely to reside on the same pages) when doing a random sample for indexing? Intuitively I doubt it's very useful when the dataset is large. Another thing to think about for later, maybe?

The file reader (well decoder) itself requires indices to be sorted. It's not entirely for I/O optimization purposes (it could still figure out which I/Os are close together if operating in an unsorted manner) but mainly to simplify logic and keep the code manageable.

@westonpace
Member

For example, imagine the reader is asking for one million rows and the first row they ask for and the last row they ask for are both on the same disk page. Do you load that page and cache it for the duration of the read? By forcing the indices to be in sorted order you bypass this question and leave it up to the caller.

The various take methods that produce batches (e.g. take, sample) do cache it. They load the entire batch in memory, and then sort it.

The various take methods that produce streams (TakeExec, take_scan) will read (and decode) the requested disk page twice. They group the input into batches, order the batch, then resort, then emit. So the first and last row would be in different batches.

@westonpace westonpace merged commit e28a556 into lance-format:main Feb 13, 2026
29 of 30 checks passed