
FullZip scan latency regression on cloud storage (S3) due to lazy I/O submission #6504

@hushengquan

Description

Summary

PR #5981 (e25f16909) introduced FullZipReadSource and create_page_load_task as a unified abstraction for scheduling FullZip reads. While the PR itself brought significant performance improvements (especially the full-page scan shortcut and the always-cached rep index), it inadvertently changed the I/O submission timing in two code paths: schedule_ranges_simple and the cached branch of schedule_ranges_rep. In both cases, submit_request was moved inside an async move { ... } block, so the I/O is no longer submitted during the schedule phase; it is deferred until the decode phase polls the future.
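To make the timing difference concrete, here is a minimal sketch of the two patterns. Closures stand in for futures (a Rust future likewise does nothing until polled), and `IoQueue`/`submit_request` are simplified stand-ins for the real types, not the actual Lance API:

```rust
use std::collections::VecDeque;

/// Simplified stand-in for the real I/O queue: a submitted request
/// becomes visible to the I/O scheduler as soon as it is enqueued.
#[derive(Default)]
struct IoQueue {
    pending: VecDeque<String>,
}

impl IoQueue {
    fn submit_request(&mut self, req: &str) {
        self.pending.push_back(req.to_string());
    }
}

/// Eager pattern (pre-#5981 behavior for the simple path): the request
/// is submitted during the schedule phase; only decode work is deferred
/// into the returned closure.
fn schedule_eager(queue: &mut IoQueue, page: &str) -> impl FnOnce() -> String {
    queue.submit_request(page); // I/O enqueued at schedule time
    let page = page.to_string();
    move || format!("decoded {page}")
}

/// Deferred pattern (the regression): submission itself is captured in
/// the closure, so nothing reaches the queue until decode "polls" it.
fn schedule_deferred(page: &str) -> impl FnOnce(&mut IoQueue) -> String {
    let page = page.to_string();
    move |queue: &mut IoQueue| {
        queue.submit_request(&page); // I/O starts only at decode time
        format!("decoded {page}")
    }
}
```

With the eager variant the queue already holds the request when scheduling returns; with the deferred variant the queue stays empty until the decode side runs the closure.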

Affected Code Paths

| Path | Before #5981 | After #5981 | Expected |
| --- | --- | --- | --- |
| `schedule_ranges_simple` (fixed-width, no rep index) | `submit_request` called eagerly during scheduling | `submit_request` inside `create_page_load_task`, deferred to decode | Eager (byte ranges are known at schedule time via simple arithmetic) |
| `schedule_ranges_rep`, cached branch (rep index in memory) | All logic inside one async block (deferred) | `submit_request` inside `create_page_load_task`, still deferred | Eager (byte ranges computable from the in-memory cached rep index; opportunity for optimization) |
| `schedule_ranges_rep`, full-page scan | N/A (new path) | `submit_single` called eagerly | ✅ Correct |
| `schedule_ranges_rep`, uncached (no rep index) | Deferred (two-stage I/O dependency) | Deferred | ✅ Correct (data byte ranges depend on the first I/O result) |

Impact

The scheduling architecture is designed as a two-thread pipeline: the scheduler thread issues I/O as fast as possible, and the decode stream consumes loaded pages. As described in decoder.rs:

> Note that the scheduler thread does not need to wait for I/O to happen at any point. As soon as it starts it will start scheduling one page of I/O after another until it has scheduled the entire file's worth of I/O.

When submit_request is deferred into the future, the I/O request is not enqueued into the IoQueue until the decode stream actually polls it. This eliminates the overlap between I/O and scheduling/decoding of other pages, effectively serializing I/O with decode and adding one full network RTT of latency per page for cloud storage (S3/GCS/Azure).
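A back-of-envelope model shows why this matters on cloud storage. The function names and numbers below are illustrative assumptions (e.g. a 50 ms S3 RTT), not measurements from the codebase:

```rust
/// Deferred submission: each page pays submit -> one full RTT -> decode,
/// with nothing overlapping anything else.
fn serialized_latency_ms(pages: u64, rtt_ms: u64, decode_ms: u64) -> u64 {
    pages * (rtt_ms + decode_ms)
}

/// Eager submission: all requests are in flight early, so only the first
/// RTT sits on the critical path and decode proceeds back-to-back
/// (assumes the IOPS limit is not the bottleneck).
fn pipelined_latency_ms(pages: u64, rtt_ms: u64, decode_ms: u64) -> u64 {
    rtt_ms + pages * decode_ms
}
```

For 100 pages at a 50 ms RTT and 1 ms decode per page, this crude model gives roughly 5.1 s serialized versus 0.15 s pipelined, which is why the regression is dramatic on S3 but barely visible on local disk.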

Note: this does not cause unbounded I/O pressure because IoQueue already enforces IOPS limits (io_parallelism: 64 for cloud, 8 for local) and byte-level backpressure (io_buffer_size_bytes). Early submission simply enqueues requests into the priority queue sooner, giving the I/O scheduler better visibility for prioritization.
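The IOPS limit mentioned above is essentially a counting guard in front of the storage backend. A minimal sketch of that admission-control idea (hypothetical type, not the actual `IoQueue` implementation; the real queue also enforces byte-level backpressure):

```rust
/// Illustrative IOPS-style admission control: at most `limit` requests
/// may be in flight; excess requests wait in the priority queue instead
/// of hitting storage. The issue cites limits of 64 (cloud) and 8 (local).
struct IopsLimiter {
    in_flight: usize,
    limit: usize,
}

impl IopsLimiter {
    fn new(limit: usize) -> Self {
        Self { in_flight: 0, limit }
    }

    /// Returns true if the request may start now, false if it must queue.
    fn try_acquire(&mut self) -> bool {
        if self.in_flight < self.limit {
            self.in_flight += 1;
            true
        } else {
            false
        }
    }

    /// Called when a request completes, freeing a slot.
    fn release(&mut self) {
        self.in_flight -= 1;
    }
}
```

Because this guard caps concurrency regardless of when requests are enqueued, submitting eagerly cannot overload storage; it only gives the scheduler a deeper queue to prioritize from.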
