
perf: speed up filtered scan by up to 18.9× by moving the heavy CPU task out #5165

Merged: Xuanwo merged 10 commits into main from debug-oo-scheduler on Nov 14, 2025

Conversation

@Xuanwo (Collaborator) commented Nov 6, 2025

While working on #5068, I found some heavy CPU tasks running on our I/O threads. This PR moves them out, which ends up speeding up random access by up to 1890%.

Here are the bench results:

Python Bench

Tested with the existing random access test (the data is processed by ChatGPT):

| Version | V1 Query A (s) | V2 Query A (s) | V1 Query B (s) | V2 Query B (s) |
|---|---|---|---|---|
| Before Optimization | 0.007131 | 0.033655 | 0.000695 | 0.000477 |
| After Optimization | 0.007273 | 0.010790 | 0.000657 | 0.000431 |

Rust Bench

Tested with the newly added Rust bench (following a similar implementation to the Python one; the data is processed by ChatGPT):

| Benchmark | Before (time) | After (time) | Speedup | Result |
|---|---|---|---|---|
| V2_0 Filtered Scan (10000 limit) | 1.242 ms | 1.224 ms | 1.01× | ✅ Slightly faster |
| V2_0 Random Take 5 rows | 66.876 µs | 69.681 µs | 0.96× | ⚙️ No significant change |
| V2_1 (FSST) Filtered Scan (10000 limit) | 25.462 ms | 1.355 ms | 18.8× | ✅ Much faster |
| V2_1 (FSST) Random Take 5 rows | 63.231 µs | 66.270 µs | 0.95× | ⚙️ Stable |
| V2_1 (FSST disabled) Filtered Scan (10000 limit) | 23.733 ms | 1.259 ms | 18.9× | ✅ Much faster |
| V2_1 (FSST disabled) Random Take 5 rows | 63.219 µs | 66.777 µs | 0.95× | ⚙️ Stable |

This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I am responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off; I'll fix it quickly.

Signed-off-by: Xuanwo <github@xuanwo.io>
@BubbleCal (Contributor) left a comment:

Great work!

@codecov-commenter commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.25%. Comparing base (f084677) to head (ed34990).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-encoding/src/decoder.rs | 50.00% | 6 Missing ⚠️ |
| rust/lance-encoding/src/buffer.rs | 77.77% | 2 Missing ⚠️ |
| rust/lance-table/src/utils/stream.rs | 83.33% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #5165      +/-   ##
==========================================
+ Coverage   82.05%   82.25%   +0.20%
==========================================
  Files         342      344       +2
  Lines      141516   144697    +3181
  Branches   141516   144697    +3181
==========================================
+ Hits       116115   119017    +2902
- Misses      21561    21760     +199
- Partials     3840     3920      +80
```

| Flag | Coverage Δ |
|---|---|
| unittests | 82.25% <67.85%> (+0.20%) ⬆️ |


@Xuanwo requested a review from @westonpace on November 6, 2025 12:27
@westonpace (Member) left a comment:

A few questions; I'm not quite sure yet how these fixes speed things up.

@westonpace (Member) commented:

Also, the benchmarks in the PR description seem to measure filtered scan, but the title says random access. Does the title need to be updated?

@Xuanwo changed the title from "perf: speed up random access by up to 17.7x by moving the heavy CPU task out" to "perf: speed up filtered scan by up to 17.7x by moving the heavy CPU task out" on Nov 6, 2025
A comment from @Xuanwo was marked as resolved.
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo requested a review from @westonpace on November 7, 2025 19:14
@Xuanwo (Collaborator, Author) commented Nov 7, 2025:

Hi @westonpace, please take another look. I think we now fully understand what happened.

@Xuanwo changed the title from "perf: speed up filtered scan by up to 17.7x by moving the heavy CPU task out" to "perf: speed up filtered scan by up to 18.9× by moving the heavy CPU task out" on Nov 7, 2025
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo (Collaborator, Author) commented Nov 13, 2025:

Hi @westonpace, I'll merge this PR tomorrow if there are no other concerns 💌

Review comment thread on the following snippet:

```rust
let emitted_batch_size_warning = slf.emitted_batch_size_warning.clone();
let task = async move {
    let next_task = next_task?;
    next_task.into_batch(emitted_batch_size_warning)
```

Member comment:
I thought we were going to do the spawn fix by replacing the existing spawn with a spawn_cpu call? Looks like we are still introducing a new spawn call?

Reply from @Xuanwo (Collaborator, Author):
Oh, I misunderstood your previous comments. The ReadBatchTask contains a future, but spawn_cpu only accepts a blocking function. Are you suggesting we add a spawn_async_cpu function for our CPU runtime?

@Xuanwo replied Nov 14, 2025:

I've rethought this and I'm sure we can remove the spawn inside wrap_with_row_id_and_delete, which speeds up our perf a bit more.

But the spawn inside into_stream should stay, as it allows the decode task to start as soon as it's created instead of only when it's polled.

```
V2_0 Filtered Scan (10000 limit)
                        time:   [1.2100 ms 1.2133 ms 1.2164 ms]
                        change: [-0.5410% -0.0909% +0.3994%] (p = 0.72 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild

V2_0 Random Take 5 rows
                        time:   [68.316 µs 68.599 µs 68.855 µs]
                        change: [-1.9455% -1.5034% -1.0910%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild

V2_1 (FSST) Filtered Scan (10000 limit)
                        time:   [1.3157 ms 1.3198 ms 1.3241 ms]
                        change: [-2.7881% -2.4058% -1.9991%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

V2_1 (FSST) Random Take 5 rows
                        time:   [64.018 µs 64.162 µs 64.311 µs]
                        change: [-2.7584% -2.3465% -1.9349%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

V2_1 (FSST disabled) Filtered Scan (10000 limit)
                        time:   [1.2192 ms 1.2230 ms 1.2270 ms]
                        change: [-3.0966% -2.7663% -2.4295%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

V2_1 (FSST disabled) Random Take 5 rows
                        time:   [63.852 µs 63.967 µs 64.110 µs]
                        change: [-4.0087% -3.6134% -3.2627%] (p = 0.00 < 0.05)
                        Performance has improved.
```

Member reply:

Hmm, I'm not entirely sure I agree but I don't want to go back and forth too much. We can merge this and revisit later (I still want to get rid of some of the I/O tasks) if you would like.

I think we will also want a more complex benchmark; we could use one of the more compute-intensive TPC-H queries.

We will also need to add support for FilteredReadThreadingMode::MultiplePartitions in the Lance table provider.

The goal should be that one thread task does both decoding and filtering. That way, when we reach the filtering stage, the data is already in the CPU cache. If we put a spawn here, decoding happens on one thread task and filtering on another, which means the data has to make a round trip through main memory.

@Xuanwo (Collaborator, Author) replied Nov 14, 2025:

Tracked in #5242

I agree with most of your comments. The blocker here is that the change set might be bigger than we expected. Let's revisit this part as a follow-up.

Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo merged commit 7c19c22 into main on Nov 14, 2025
29 checks passed
@Xuanwo deleted the debug-oo-scheduler branch on November 14, 2025 14:35
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
