perf: speed up filtered scan by up to 18.9× by moving the heavy CPU task out #5165
Conversation
Signed-off-by: Xuanwo <github@xuanwo.io>
Codecov Report

@@ Coverage Diff @@
##             main    #5165      +/-   ##
==========================================
+ Coverage   82.05%   82.25%   +0.20%
==========================================
  Files         342      344       +2
  Lines      141516   144697    +3181
  Branches   141516   144697    +3181
==========================================
+ Hits       116115   119017    +2902
- Misses      21561    21760     +199
- Partials     3840     3920      +80

Flags with carried forward coverage won't be shown.
westonpace left a comment:
A few questions; I'm not quite sure how these fixes are speeding things up yet.

Also, the benchmarks in the PR description seem to show a speedup for filtered scan, yet the title says random access. Does the title need to be updated?
This comment was marked as resolved.
Signed-off-by: Xuanwo <github@xuanwo.io>
Hi @westonpace, please take another look. I think we now fully understand what happened.
Signed-off-by: Xuanwo <github@xuanwo.io>
Hi @westonpace, I'll merge this PR tomorrow if there are no other concerns 💌
let emitted_batch_size_warning = slf.emitted_batch_size_warning.clone();
let task = async move {
    let next_task = next_task?;
    next_task.into_batch(emitted_batch_size_warning)
I thought we were going to do the spawn fix by replacing the existing spawn with a spawn_cpu call? Looks like we are still introducing a new spawn call?
Oh, I misunderstood your previous comments. The ReadBatchTask contains a future, but spawn_cpu only accepts a blocking function. Are you suggesting we add a spawn_async_cpu function for our CPU runtime?
I rethought about this and I'm sure we can remove the spawn inside wrap_with_row_id_and_delete, which should improve our perf a bit more.
But the spawn inside into_stream should stay, as it allows the decode task to start as soon as it's created instead of waiting until it's first polled.
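To illustrate the distinction being discussed, here is a minimal std-only sketch (not Lance's actual runtime code; `decode_batch` and the channel-based worker are hypothetical stand-ins): handing CPU-heavy decode work to a dedicated thread means it starts as soon as the work is submitted, while the submitting ("I/O") side stays free rather than blocking on the computation.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for a CPU-heavy decode step that should not
// run on an I/O thread.
fn decode_batch(raw: Vec<u8>) -> usize {
    raw.iter().map(|b| *b as usize).sum()
}

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Dedicated "CPU" thread: decoding begins as soon as work arrives,
    // not when the result is eventually polled/received.
    let cpu = thread::spawn(move || {
        let raw = rx.recv().expect("sender dropped");
        decode_batch(raw)
    });

    tx.send(vec![1, 2, 3]).unwrap();
    // The "I/O" side could keep issuing reads here while decode runs.

    let decoded = cpu.join().unwrap();
    println!("{decoded}"); // 1 + 2 + 3 = 6
}
```

In the real code path a tokio task or a dedicated CPU runtime plays the role of the worker thread, but the scheduling trade-off is the same.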
Benchmarking V2_0 Filtered Scan (10000 limit): Collecting 100 samples in estimated 6.1 s
V2_0 Filtered Scan (10000 limit)
                        time:   [1.2100 ms 1.2133 ms 1.2164 ms]
                        change: [-0.5410% -0.0909% +0.3994%] (p = 0.72 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild

Benchmarking V2_0 Random Take 5 rows: Collecting 100 samples in estimated 5.2221 s
V2_0 Random Take 5 rows time:   [68.316 µs 68.599 µs 68.855 µs]
                        change: [-1.9455% -1.5034% -1.0910%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild

Benchmarking V2_1 (FSST) Filtered Scan (10000 limit): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.7s, enable flat sampling, or reduce sample count to 60.
Benchmarking V2_1 (FSST) Filtered Scan (10000 limit): Collecting 100 samples
V2_1 (FSST) Filtered Scan (10000 limit)
                        time:   [1.3157 ms 1.3198 ms 1.3241 ms]
                        change: [-2.7881% -2.4058% -1.9991%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking V2_1 (FSST) Random Take 5 rows: Collecting 100 samples in estimated 5.279 s
V2_1 (FSST) Random Take 5 rows
                        time:   [64.018 µs 64.162 µs 64.311 µs]
                        change: [-2.7584% -2.3465% -1.9349%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

Benchmarking V2_1 (FSST disabled) Filtered Scan (10000 limit): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
Benchmarking V2_1 (FSST disabled) Filtered Scan (10000 limit): Collecting 100 samples
V2_1 (FSST disabled) Filtered Scan (10000 limit)
                        time:   [1.2192 ms 1.2230 ms 1.2270 ms]
                        change: [-3.0966% -2.7663% -2.4295%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

Benchmarking V2_1 (FSST disabled) Random Take 5 rows: Collecting 100 samples
V2_1 (FSST disabled) Random Take 5 rows
                        time:   [63.852 µs 63.967 µs 64.110 µs]
                        change: [-4.0087% -3.6134% -3.2627%] (p = 0.00 < 0.05)
                        Performance has improved.
Hmm, I'm not entirely sure I agree but I don't want to go back and forth too much. We can merge this and revisit later (I still want to get rid of some of the I/O tasks) if you would like.
I think we will also want a more complex benchmark, we could use one of the more compute intensive TPC-H queries.
We will also need to add support for FilteredReadThreadingMode::MultiplePartitions in the Lance table provider.
The goal should be that one thread task does decoding and filtering. This way, when we reach the filtering stage, the data is already in the CPU cache. If we put a spawn here, then the decoding will happen on one thread task and the filtering on another, which means we have to transfer the data through main memory.
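The locality argument above can be sketched in miniature (hypothetical `decode`/`filter` helpers, not the Lance pipeline): fusing both stages into one function keeps the decoded values warm in cache when the filter reads them, whereas a spawn between the stages would hand the intermediate data to another thread through main memory.

```rust
// Hypothetical decode step: expand raw bytes into typed values.
fn decode(raw: &[u8]) -> Vec<u32> {
    raw.iter().map(|b| *b as u32 * 2).collect()
}

// Hypothetical filter step over the decoded values.
fn filter(decoded: Vec<u32>, pred: impl Fn(&u32) -> bool) -> Vec<u32> {
    decoded.into_iter().filter(|v| pred(v)).collect()
}

// Fused: one thread task runs both stages back to back, so `decoded`
// never leaves this task (and, in practice, rarely leaves the cache).
fn decode_and_filter(raw: &[u8]) -> Vec<u32> {
    filter(decode(raw), |v| *v > 2)
}

fn main() {
    let out = decode_and_filter(&[1, 2, 3]); // decode → [2, 4, 6], keep > 2
    println!("{out:?}"); // [4, 6]
}
```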
Tracked in #5242
I agree with most of your comments. The blocker here is that the change set might be bigger than we expected. Let's revisit this part in follow-ups.
Signed-off-by: Xuanwo <github@xuanwo.io>
While working on #5068, I found some heavy CPU tasks running on our I/O threads. This PR moves them out, which ends up speeding up filtered scan by up to 1890% (18.9×).
Here is the bench result:
Python Bench
Tested with the existing random access test:
the data is processed by ChatGPT
Rust Bench
Tested with the newly added Rust bench (following a similar implementation to the Python one):
the data is processed by ChatGPT
This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off; I'll fix it quickly.