perf: speed up format 2.2 300% by spawning structural decode batch tasks#5982
Conversation
PR Review: This PR correctly aligns the structural decode batch path with the existing non-structural path (lines 1467-1475) by spawning CPU-heavy decode work. No P0/P1 issues found.
LGTM ✓
westonpace
left a comment
We keep going back and forth on this one 😄. You originally added the spawn here (7c19c22) and then I removed it here (70636f6).
I think we have some competing goals. This spawn can improve scan performance because we are reading large blocks of data and the decode is expensive. However, it also hurts random access performance because in that case we have a very cheap decode and the introduction of a spawn increases tokio overhead.
I am also still worried about whether or not this will boost performance in an actual query. For example, if we were filtering on this data then not having the spawn means we would decode and filter in the same thread task. By introducing the spawn the decode and filter happen on different thread tasks which means data might have to get loaded into the CPU cache twice.
Can you add some kind of reader config setting? Ideally in a way where we can change the default value for this setting with an environment variable.
Yep, I realized that.
Seems to be a good idea, will try.
This reverts commit 628683b.
cc @westonpace, I added a flag based on our query pattern and an env variable to allow users to override it. Let me know what you think about this change. This seems like an interesting issue that may require a more complete design for us to address. I will add that as a follow-up.
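A minimal sketch of the kind of reader setting discussed above: a boolean flag whose default can be overridden by an environment variable. The struct, field, env-var name (`LANCE_SPAWN_DECODE`), and accepted values are all illustrative assumptions, not the actual Lance API.

```rust
use std::env;

/// Hypothetical reader config; only the spawn flag is sketched here.
#[derive(Debug, Clone)]
pub struct ReaderConfig {
    /// Whether to spawn structural decode batch conversion onto its own task.
    pub spawn_decode: bool,
}

/// Parse a boolean flag value, falling back to `default` when unset.
/// Separated from env access so the logic is easy to test.
fn parse_flag(value: Option<&str>, default: bool) -> bool {
    match value {
        Some(s) => matches!(s.to_ascii_lowercase().as_str(), "1" | "true" | "on"),
        None => default,
    }
}

impl Default for ReaderConfig {
    fn default() -> Self {
        // Assumed env var name for illustration only.
        let spawn_decode =
            parse_flag(env::var("LANCE_SPAWN_DECODE").ok().as_deref(), true);
        Self { spawn_decode }
    }
}

fn main() {
    println!("{:?}", ReaderConfig::default());
}
```

Defaulting to `true` would favor large-scan throughput (the case benchmarked in this PR), while random-access-heavy workloads could opt out via the env variable without a code change.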
westonpace
left a comment
Thanks for adding the variable!
perf: speed up format 2.2 300% by spawning structural decode batch tasks (lance-format#5982)

`NextDecodeTask::into_batch` is synchronous and can be CPU-heavy. Running it inline in the future poll path blocks Tokio workers and reduces effective decode concurrency. This change becomes more meaningful when using zstd.

Benchmarks were run on AWS EC2 using both local and S3 copies of the same dataset (`fineweb.lance.v2_2.lz4`) with repeated scans.

Main run (3 rounds, 20 repeats each):

- Local median latency:
  - p50: `894675us -> 289781us` (`3.087x`, `-67.61%`)
  - p95: `929515us -> 307874us` (`3.019x`, `-66.88%`)
  - p99: `1034383us -> 375041us` (`2.758x`, `-63.74%`)
- S3 median latency:
  - p50: `3998660us -> 3510771us` (`1.139x`, `-12.20%`)
  - p95: `4068799us -> 3572090us` (`1.139x`, `-12.21%`)
  - p99: `4153371us -> 3592478us` (`1.156x`, `-13.50%`)

## Changes

Move structural decode batch conversion in `StructuralBatchDecodeStream::into_stream` to `tokio::spawn(...).await`.