perf: speed up filtered scan by up to 18.9× by moving the heavy CPU task out by Xuanwo · Pull Request #5165 · lance-format/lance

Xuanwo · 2025-11-06T10:57:06Z

While working on #5068, I found some heavy CPU task in our IO threads. This PR move out them and end up with speed up random access by up to 1890%.

Here is the bench result:

Python Bench

Tested with existsing random accesss test:

the data is processed by ChatGPT

Version	V1 Query A (s)	V2 Query A (s)	V1 Query B (s)	V2 Query B (s)
Before Optimization	0.007131	0.033655	0.000695	0.000477
After Optimization	0.007273	0.010790	0.000657	0.000431

Rust Bench

Tested with newly added rust bench (following the similiar impls with python one):

the data is processed by ChatGPT

Benchmark	Before (time)	After (time)	Speedup	Result
V2_0 Filtered Scan (10000 limit)	1.242 ms	1.224 ms	1.01×	✅ Slightly faster
V2_0 Random Take 5 rows	66.876 µs	69.681 µs	0.96×	⚙️ No significant change
V2_1 (FSST) Filtered Scan (10000 limit)	25.462 ms	1.355 ms	18.8×	✅ Much faster
V2_1 (FSST) Random Take 5 rows	63.231 µs	66.270 µs	0.95×	⚙️ Stable
V2_1 (FSST disabled) Filtered Scan (10000 limit)	23.733 ms	1.259 ms	18.9×	✅ Much faster
V2_1 (FSST disabled) Random Take 5 rows	63.219 µs	66.777 µs	0.95×	⚙️ Stable

This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off, I'll fix it quickly.

Signed-off-by: Xuanwo <github@xuanwo.io>

BubbleCal

Great work!

codecov-commenter · 2025-11-06T11:48:08Z

Codecov Report

❌ Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.25%. Comparing base (f084677) to head (ed34990).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance-encoding/src/decoder.rs	50.00%	6 Missing ⚠️
rust/lance-encoding/src/buffer.rs	77.77%	2 Missing ⚠️
rust/lance-table/src/utils/stream.rs	83.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5165      +/-   ##
==========================================
+ Coverage   82.05%   82.25%   +0.20%     
==========================================
  Files         342      344       +2     
  Lines      141516   144697    +3181     
  Branches   141516   144697    +3181     
==========================================
+ Hits       116115   119017    +2902     
- Misses      21561    21760     +199     
- Partials     3840     3920      +80

Flag	Coverage Δ
unittests	`82.25% <67.85%> (+0.20%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

westonpace

A few questions, I'm not quite sure how these fixes are speeding things up yet

westonpace · 2025-11-06T13:23:14Z

Also, the benchmarks in the PR description seem to be speeding up filtered scan. However the title says random access. Does the title need updated?

Signed-off-by: Xuanwo <github@xuanwo.io>

Xuanwo · 2025-11-07T19:15:13Z

Hi @westonpace, please take another look. I think we now fully understand what happened.

Signed-off-by: Xuanwo <github@xuanwo.io>

Xuanwo · 2025-11-13T07:44:39Z

Hi @westonpace, I'll merge this PR tomorrow if there no other concerns 💌

westonpace · 2025-11-13T14:01:47Z

                let emitted_batch_size_warning = slf.emitted_batch_size_warning.clone();
                let task = async move {
                    let next_task = next_task?;
-                    next_task.into_batch(emitted_batch_size_warning)


I thought we were going to do the spawn fix by replacing the existing spawn with a spawn_cpu call? Looks like we are still introducing a new spawn call?

Oh, I misunderstood your previous comments. The ReadBatchTask contains a future, but spawn_cpu only accepts a blocking function. Are you suggesting we add a spawn_async_cpu function for our CPU runtime?

I rethink about this and sure that we can remove the spwan inside wrap_with_row_id_and_delete which can speed up our perf a bit better.

But the spawn inside into_stream should keep as it allows the decode task to start as long as it's created instead of been polled.

Benchmarking V2_0 Filtered Scan (10000 limit): Collecting 100 samples in estimated 6.1V2_0 Filtered Scan (10000 limit) time: [1.2100 ms 1.2133 ms 1.2164 ms] change: [-0.5410% -0.0909% +0.3994%] (p = 0.72 > 0.05) No change in performance detected. Found 7 outliers among 100 measurements (7.00%) 2 (2.00%) low mild 5 (5.00%) high mild Benchmarking V2_0 Random Take 5 rows: Collecting 100 samples in estimated 5.2221 s (76V2_0 Random Take 5 rows time: [68.316 µs 68.599 µs 68.855 µs] change: [-1.9455% -1.5034% -1.0910%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low severe 3 (3.00%) high mild Benchmarking V2_1 (FSST) Filtered Scan (10000 limit): Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.7s, enable flat sampling, or reduce sample count to 60. Benchmarking V2_1 (FSST) Filtered Scan (10000 limit): Collecting 100 samples in estimaV2_1 (FSST) Filtered Scan (10000 limit) time: [1.3157 ms 1.3198 ms 1.3241 ms] change: [-2.7881% -2.4058% -1.9991%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild Benchmarking V2_1 (FSST) Random Take 5 rows: Collecting 100 samples in estimated 5.279V2_1 (FSST) Random Take 5 rows time: [64.018 µs 64.162 µs 64.311 µs] change: [-2.7584% -2.3465% -1.9349%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) high mild 3 (3.00%) high severe Benchmarking V2_1 (FSST disabled) Filtered Scan (10000 limit): Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60. Benchmarking V2_1 (FSST disabled) Filtered Scan (10000 limit): Collecting 100 samples V2_1 (FSST disabled) Filtered Scan (10000 limit) time: [1.2192 ms 1.2230 ms 1.2270 ms] change: [-3.0966% -2.7663% -2.4295%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) low mild 2 (2.00%) high mild Benchmarking V2_1 (FSST disabled) Random Take 5 rows: Collecting 100 samples in estimaV2_1 (FSST disabled) Random Take 5 rows time: [63.852 µs 63.967 µs 64.110 µs] change: [-4.0087% -3.6134% -3.2627%] (p = 0.00 < 0.05) Performance has improved.

Hmm, I'm not entirely sure I agree but I don't want to go back and forth too much. We can merge this and revisit later (I still want to get rid of some of the I/O tasks) if you would like.

I think we will also want a more complex benchmark, we could use one of the more compute intensive TPC-H queries.

We will also need to add support for FilteredReadThreadingMode::MultiplePartitions in the Lance table provider.

The goal should be that one thread task does decoding and filtering. This way when we reach the filtering stage, the data is already in the CPU cache. If we put a spawn here then the decoding will happen on one thread task and the filtering on another. This means we will have to transfer the data between main memory.

Tracked in #5242

I agree most of your comments. The blocker here is the change set might be bigger than we expected. Let's revisit this part as follow ups.

Signed-off-by: Xuanwo <github@xuanwo.io>

…ask out (lance-format#5165)

Xuanwo added 4 commits November 6, 2025 17:42

add bench cases for rust

98272a3

Signed-off-by: Xuanwo <github@xuanwo.io>

compare between 2.0 and 2.1

dd9050c

Signed-off-by: Xuanwo <github@xuanwo.io>

Make clippy happy

2951c3a

Signed-off-by: Xuanwo <github@xuanwo.io>

optimize

1990809

Signed-off-by: Xuanwo <github@xuanwo.io>

github-actions Bot added python performance labels Nov 6, 2025

Xuanwo commented Nov 6, 2025

View reviewed changes

Comment thread rust/lance-encoding/src/data.rs

Comment thread rust/lance-encoding/src/decoder.rs Outdated

BubbleCal approved these changes Nov 6, 2025

View reviewed changes

Xuanwo requested a review from westonpace November 6, 2025 12:27

westonpace reviewed Nov 6, 2025

View reviewed changes

Comment thread rust/lance-encoding/src/data.rs

Comment thread rust/lance-encoding/src/decoder.rs Outdated

Xuanwo changed the title ~~perf: speed up random access by up to 17.7x by moving the heavy CPU task out~~ perf: speed up filtered scan by up to 17.7x by moving the heavy CPU task out Nov 6, 2025

This comment was marked as resolved.

Sign in to view

westonpace reviewed Nov 7, 2025

View reviewed changes

Comment thread rust/lance-encoding/src/data.rs Outdated

Comment thread rust/lance-encoding/src/data.rs

Comment thread rust/lance-encoding/src/decoder.rs Outdated

Xuanwo added 3 commits November 7, 2025 18:43

Use tokio spawn instead

19d7264

Signed-off-by: Xuanwo <github@xuanwo.io>

Let's rock

1bd887f

Signed-off-by: Xuanwo <github@xuanwo.io>

Merge branch 'main' into debug-oo-scheduler

7fce0dc

Xuanwo requested a review from westonpace November 7, 2025 19:14

Xuanwo changed the title ~~perf: speed up filtered scan by up to 17.7x by moving the heavy CPU task out~~ perf: speed up filtered scan by up to 18.9× by moving the heavy CPU task out Nov 7, 2025

Xuanwo added 2 commits November 8, 2025 13:04

Fix buffer

98d6c07

Signed-off-by: Xuanwo <github@xuanwo.io>

Merge branch 'main' into debug-oo-scheduler

c4acd47

westonpace reviewed Nov 13, 2025

View reviewed changes

Remove spawn in wrap_with_row_id_and_delete

ed34990

Signed-off-by: Xuanwo <github@xuanwo.io>

Xuanwo mentioned this pull request Nov 14, 2025

Avoid extra tokio spawn by arranging decode and filter task on the same thread #5242

Open

Xuanwo merged commit 7c19c22 into main Nov 14, 2025
29 checks passed

Xuanwo deleted the debug-oo-scheduler branch November 14, 2025 14:35

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

perf: speed up filtered scan by up to 18.9× by moving the heavy CPU t…

8374745

…ask out (lance-format#5165)

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed

Conversation

Xuanwo commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python Bench

Rust Bench

Uh oh!

Uh oh!

Uh oh!

BubbleCal left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

westonpace left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

westonpace commented Nov 6, 2025

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Uh oh!

Xuanwo commented Nov 7, 2025

Uh oh!

Xuanwo commented Nov 13, 2025

Uh oh!

Uh oh!

Uh oh!

westonpace Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

Xuanwo Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Xuanwo Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

westonpace Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Xuanwo Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Xuanwo commented Nov 6, 2025 •

edited

Loading

codecov-commenter commented Nov 6, 2025 •

edited

Loading

Xuanwo Nov 14, 2025 •

edited

Loading

Xuanwo Nov 14, 2025 •

edited

Loading