
perf: add a lightweight scheduler implementation#5773

Merged
westonpace merged 5 commits intolance-format:mainfrom
westonpace:perf/lite-scheduler
Feb 4, 2026

Conversation

@westonpace (Member) commented Jan 21, 2026

The current scheduler introduces too much synchronization overhead in high-IOPS situations. The new scheduler reduces the number of asynchronous context switches. On my desktop it doesn't actually have much impact on performance; however, on a system with more cores and higher RAM bandwidth the new scheduler more than doubles IOPS. Combined with the io_uring reader (coming in a future PR), performance is actually 4x.

This scheduler makes tradeoffs which are not ideal for cloud readers, but which are important for the uring reader:

  • There is no dedicated background I/O loop thread, and tasks are not launched with tokio spawn. This is fine for the local filesystem, since each task is its own spawn blocking call. For cloud stores it is not great: because tasks are not spawned, they need to be polled occasionally, so I/O tasks that are not at the front of the line might get paused if their queues fill up (for example, if the HTTP request queues fill up). With io_uring this isn't a problem, because polling one task actually progresses all I/O requests, not just the one associated with that task.
  • The lite scheduler does not enforce any kind of concurrency throttling. It is mainly meant for local storage use cases, where there is not much penalty for having too many concurrent requests. This would be a problem if this scheduler were used for cloud storage, however.

@westonpace (Member Author)

Drafting until I merge #5755

@westonpace westonpace marked this pull request as draft January 21, 2026 13:24
@github-actions (Contributor)

Code Review Summary

This PR introduces a lightweight scheduler implementation to reduce synchronization overhead in high-IOPS scenarios. The design is sound and the claimed 2-4x performance improvements are significant.

P0/P1 Issues

1. Concurrency throttle is ineffective (P0 - Bug)

In lite.rs:1662-1673, SimpleConcurrencyThrottle::try_acquire always returns true when concurrency_available > 0 but never decrements the counter:

```rust
fn try_acquire(&mut self) -> bool {
    if self.concurrency_available > 0 {
        // ...decrement commented out...
        true // returns true but never decrements the counter
    } else {
        false
    }
}
```

Combined with release() being a no-op, this means the concurrency throttle provides no actual limiting. This appears intentional based on the TODO comment about deadlocks, but the current implementation allows unbounded concurrent I/O, which could cause resource exhaustion under load.

2. PrioritiesInFlight insertion is O(n) (P1 - Performance)

In lite.rs:1718-1724, every push performs a binary search followed by an insert, making it O(n). With up to 256 concurrent tasks as mentioned in comments, this could add measurable overhead in the hot path. Consider using a BTreeMap<u128, usize> to track priority counts instead.

3. Task cancellation leaks backpressure reservations (P1 - Bug)

In lite.rs:1527-1542, IoTask::cancel() creates a dummy BackpressureReservation with num_bytes: 0 regardless of whether the task had a real reservation. The comment in close() at line 2027-2028 acknowledges this is currently safe because the queue is local, but this is fragile if the design changes.

Minor Observations

  • The Reader trait change from async fn to a BoxFuture return type is a reasonable way to make the futures 'static for the lite scheduler
  • The serialized_scheduling option controlled via an env var is a good debugging/testing escape hatch
  • Tests should be added specifically for the lite scheduler path (currently tests only run with use_lite_scheduler: false)

Overall, this is a well-structured performance improvement. Addressing the concurrency throttle behavior (either fix it or document why unbounded is acceptable) would be the main blocker.

@codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 12.26667% with 329 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-io/src/scheduler/lite.rs | 0.00% | 285 Missing ⚠️ |
| rust/lance-io/src/scheduler.rs | 51.16% | 42 Missing ⚠️ |
| rust/lance-tools/src/meta.rs | 0.00% | 2 Missing ⚠️ |


@westonpace westonpace marked this pull request as ready for review January 30, 2026 00:33
@westonpace (Member Author)

Once the io_uring stuff merges I'll probably move away from triggering the lite scheduler with an environment variable and instead let the reader instance decide which scheduler it wants.

@wjones127 (Contributor) left a comment

Seems good, although I'm surprised by the lack of unit tests for the lite scheduler implementation. It might, for example, be helpful to double-check that the ordering for TaskEntry is working as expected.

Also, is there some generic test suite for the scheduler where you can run with use_lite_scheduler set to both true and false?

@westonpace westonpace merged commit 70636f6 into lance-format:main Feb 4, 2026
26 of 27 checks passed
@westonpace (Member Author)

I added a test for ordering and one more to make sure get_range is called at the right time (a nuanced thing that took a long time to debug).

The on/off tests can come when the uring reader is added.

westonpace added a commit that referenced this pull request Mar 31, 2026
This is still a draft while waiting on
#5755 and
#5773

This PR adds a new URI scheme `file+uring`. The scheme uses the same
local file reader as `file` but has two custom `Reader` implementations
that are based on the io_uring API. One of these creates a configurable
number of process-wide ring threads and the reader communicates with
this thread using a queue. The second assumes that the scheduler and
decoder run on the same thread and uses a thread local uring instance.

Both are able to saturate up to 1.5M IOPS when combined with the
scheduler rework. I've tested the thread-local variant up to 2M IOPS.
These numbers assume the data is not in the kernel page cache;
I've seen results as high as 4M IOPS when the data is cached.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
