perf: speed up format 2.2 300% by spawning structural decode batch tasks #5982

Merged
Xuanwo merged 13 commits into main from xuanwo/speed
Feb 25, 2026
Conversation

Collaborator

@Xuanwo Xuanwo commented Feb 23, 2026

NextDecodeTask::into_batch is synchronous and can be CPU-heavy. Running it inline in the future poll path blocks Tokio workers and reduces effective decode concurrency.

This change becomes more meaningful when zstd is used.

Benchmarks were run on AWS EC2 using both local and S3 copies of the same dataset (fineweb.lance.v2_2.lz4) with repeated scans.

Main run (3 rounds, 20 repeats each):

  • Local median latency:
    • p50: 894675us -> 289781us (3.087x, -67.61%)
    • p95: 929515us -> 307874us (3.019x, -66.88%)
    • p99: 1034383us -> 375041us (2.758x, -63.74%)
  • S3 median latency:
    • p50: 3998660us -> 3510771us (1.139x, -12.20%)
    • p95: 4068799us -> 3572090us (1.139x, -12.21%)
    • p99: 4153371us -> 3592478us (1.156x, -13.50%)

Changes

Move the structural decode batch conversion in StructuralBatchDecodeStream::into_stream into tokio::spawn(...).await.
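The idea can be sketched with a simplified, std-only analogue (OS threads stand in for tokio tasks; the function names below are illustrative, not the real lance-encoding API):

```rust
use std::thread;

// CPU-heavy "decode" standing in for `NextDecodeTask::into_batch`
// (hypothetical stand-in; the real type lives in lance-encoding).
fn into_batch(data: Vec<u8>) -> usize {
    data.iter().map(|&b| b as usize).sum()
}

// Inline: the calling thread is blocked for the whole decode,
// like running `into_batch` directly in the future's poll path.
fn decode_inline(data: Vec<u8>) -> usize {
    into_batch(data)
}

// Offloaded: the decode runs elsewhere while the caller only waits,
// analogous to `tokio::spawn(...).await` keeping Tokio workers free
// to drive other I/O and decode futures concurrently.
fn decode_spawned(data: Vec<u8>) -> usize {
    thread::spawn(move || into_batch(data))
        .join()
        .expect("decode task panicked")
}
```

Both paths return the same result; the difference is only where the CPU work runs, which is what restores decode concurrency under heavy compression such as zstd.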

@github-actions
Contributor

PR Review

This PR correctly aligns the structural decode batch path with the existing non-structural path (lines 1467-1475) by spawning CPU-heavy into_batch work in a separate task.

No P0/P1 issues found.

The change is:

  • Minimal and focused
  • Consistent with the existing pattern in the codebase
  • Properly handles JoinError via Error::Wrapped
  • Well-motivated by benchmark results showing 3x improvement for local reads

LGTM ✓

@Xuanwo Xuanwo changed the title from "perf: spawn structural decode batch tasks" to "perf: speed up format 2.2 300% by spawning structural decode batch tasks" Feb 23, 2026
Contributor

@justinrmiller justinrmiller left a comment


Lgtm!

Member

@westonpace westonpace left a comment


We keep going back and forth on this one 😄. You originally added the spawn here (7c19c22) and then I removed it here (70636f6)

I think we have some competing goals. This spawn can improve scan performance because we are reading large blocks of data and the decode is expensive. However, it also hurts random access performance because in that case we have a very cheap decode and the introduction of a spawn increases tokio overhead.

I am also still worried about whether or not this will boost performance in an actual query. For example, if we were filtering on this data then not having the spawn means we would decode and filter in the same thread task. By introducing the spawn the decode and filter happen on different thread tasks which means data might have to get loaded into the CPU cache twice.

Can you add some kind of reader config setting? Ideally in a way where we can change the default value for this setting with an environment variable.

Collaborator Author

Xuanwo commented Feb 23, 2026

We keep going back and forth on this one 😄.

Yep, I realized that.

Can you add some kind of reader config setting? Ideally in a way where we can change the default value for this setting with an environment variable.

Seems to be a good idea, will try.

Collaborator Author

Xuanwo commented Feb 24, 2026

cc @westonpace, I added a flag based on our query pattern and an env var to allow users to override it. Let me know what you think about this change. This seems like an interesting issue that may require a more complete design to address; I will add that as a follow-up.
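A minimal sketch of such an env-overridable default (the variable name `LANCE_DECODE_SPAWN` and the helper below are hypothetical, not the names actually used in this PR):

```rust
use std::env;

// Hypothetical helper: the default follows the query pattern (spawn for
// full scans where decode is expensive, stay inline for random access
// where spawn overhead dominates), while an environment variable can
// force the behavior either way.
fn spawn_decode_enabled(is_full_scan: bool) -> bool {
    match env::var("LANCE_DECODE_SPAWN") {
        // Explicit user override wins over the query-pattern default.
        Ok(v) => matches!(v.to_ascii_lowercase().as_str(), "1" | "true" | "on"),
        Err(_) => is_full_scan,
    }
}
```

This keeps the fast random-access path free of tokio task overhead by default while letting operators flip the default without a code change.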


codecov Bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 85.36585% with 6 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance-encoding/src/decoder.rs | 79.31% | 5 missing, 1 partial ⚠️


Member

@westonpace westonpace left a comment


Thanks for adding the variable!

@Xuanwo Xuanwo merged commit b2d4c8f into main Feb 25, 2026
27 checks passed
@Xuanwo Xuanwo deleted the xuanwo/speed branch February 25, 2026 18:48
wjones127 pushed a commit to wjones127/lance that referenced this pull request Feb 25, 2026
wjones127 pushed a commit to wjones127/lance that referenced this pull request Feb 25, 2026
wjones127 pushed a commit that referenced this pull request Feb 26, 2026

3 participants