perf: add a lightweight scheduler implementation #5773

westonpace merged 5 commits into lance-format:main
Conversation

Drafting until I merge #5755
Code Review Summary

This PR introduces a lightweight scheduler implementation to reduce synchronization overhead in high-IOPS scenarios. The design is sound and the claimed 2-4x performance improvements are significant.

P0/P1 Issues

1. Concurrency throttle is ineffective (P0 - Bug)

```rust
fn try_acquire(&mut self) -> bool {
    if self.concurrency_available > 0 {
        // ...commented out...
        true // Returns true but doesn't decrement
    } else {
        false
    }
}
```

2. PrioritiesInFlight insertion is O(n) (P1 - Performance)

3. Task cancellation leaks backpressure reservations (P1 - Bug)

Minor Observations

Overall, this is a well-structured performance improvement. Addressing the concurrency throttle behavior (either fixing it or documenting why unbounded concurrency is acceptable) would be the main blocker.
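A fix for issue 1 would be a sketch along these lines: the throttle must actually consume a permit when it grants one and return it when the task finishes. The `Throttle` struct and `release` method here are hypothetical scaffolding around the `concurrency_available` field quoted in the review, not the PR's actual code.

```rust
// Hypothetical sketch of the issue-1 fix: decrement on acquire, increment on release.
// Only `concurrency_available` and `try_acquire` appear in the review; the rest is assumed.
struct Throttle {
    concurrency_available: usize,
}

impl Throttle {
    fn try_acquire(&mut self) -> bool {
        if self.concurrency_available > 0 {
            self.concurrency_available -= 1; // actually consume a permit
            true
        } else {
            false
        }
    }

    fn release(&mut self) {
        self.concurrency_available += 1; // return the permit when a task completes
    }
}

fn main() {
    let mut throttle = Throttle { concurrency_available: 2 };
    assert!(throttle.try_acquire());
    assert!(throttle.try_acquire());
    assert!(!throttle.try_acquire()); // limit reached: third acquire is refused
    throttle.release();
    assert!(throttle.try_acquire()); // permit returned, can acquire again
    println!("throttle ok");
}
```

With the decrement in place, the concurrency limit is actually enforced instead of silently granting every request.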
Codecov Report: ❌ Patch coverage is
6dfe2be to 9c94785 (force-push)
Once the io_uring stuff merges I'll probably move away from triggering the lite scheduler with an environment variable and instead let the reader instance decide which scheduler it wants.
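The interim environment-variable toggle mentioned above might look like the following sketch. The variable name `LANCE_USE_LITE_SCHEDULER` is hypothetical; the comment does not state the actual name used in the PR.

```rust
use std::env;

// Sketch of an env-var toggle for the lite scheduler.
// The variable name LANCE_USE_LITE_SCHEDULER is an assumption, not from the PR.
fn use_lite_scheduler() -> bool {
    env::var("LANCE_USE_LITE_SCHEDULER")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

fn main() {
    // set_var is unsafe in Rust 2024 because it mutates process-global state.
    unsafe { env::set_var("LANCE_USE_LITE_SCHEDULER", "true") };
    assert!(use_lite_scheduler());
    unsafe { env::remove_var("LANCE_USE_LITE_SCHEDULER") };
    assert!(!use_lite_scheduler());
    println!("toggle ok");
}
```

Letting the reader instance choose the scheduler, as planned, avoids this kind of process-global switch entirely.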
fc6d1f8 to 1346fa5 (force-push)
wjones127 left a comment
Seems good, although I'm surprised by the lack of unit tests for the lite scheduler implementation. It might, for example, be helpful to double-check that the ordering for TaskEntry works as expected.

Also, is there a generic test suite for the scheduler where you can run with use_lite_scheduler set to both true and false?
I added a test for ordering and one more for making sure get_range is called at the right time (a nuanced thing that took a long time to debug). The on/off tests can come when the uring reader is added.
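An ordering test of the kind requested might look like the sketch below. The real `TaskEntry` fields are not shown in this thread, so the struct here is hypothetical; it assumes lower-priority values should be scheduled first and inverts `Ord` so that `BinaryHeap` (a max-heap) pops them in that order.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Hypothetical shape of TaskEntry; the real struct's fields are not shown in the PR.
// Assumption: lower `priority` values should be scheduled first.
#[derive(Eq, PartialEq)]
struct TaskEntry {
    priority: u64,
}

impl Ord for TaskEntry {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed so that BinaryHeap (a max-heap) pops the lowest priority first.
        other.priority.cmp(&self.priority)
    }
}

impl PartialOrd for TaskEntry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut heap = BinaryHeap::new();
    for p in [5_u64, 1, 3] {
        heap.push(TaskEntry { priority: p });
    }
    let order: Vec<u64> = std::iter::from_fn(|| heap.pop().map(|t| t.priority)).collect();
    assert_eq!(order, vec![1, 3, 5]); // lowest priority pops first
    println!("ordering ok");
}
```

A heap-based structure like this would also address the O(n) insertion concern from the review, since pushes are O(log n).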
~~This is still a draft while waiting on #5755 and #5773~~

This PR adds a new URI scheme `file+uring`. The scheme uses the same local file reader as `file` but has two custom `Reader` implementations based on the io_uring API. One creates a configurable number of process-wide ring threads, and the reader communicates with these threads through a queue. The second assumes the scheduler and decoder run on the same thread and uses a thread-local uring instance.

Both are able to saturate up to 1.5M IOPS when combined with the scheduler rework. I've tested the thread-local variant up to 2M IOPS. These numbers assume the data is not in the kernel page cache; I've seen results as high as 4M IOPS when the data is cached.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
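The thread-local variant described above can be sketched with the standard library's `thread_local!` macro. `Ring` here is a stand-in for a real io_uring handle, not the PR's actual type; the point is the design choice that each thread owns its own ring, so no cross-thread queue or lock is needed as long as the scheduler and decoder stay on one thread.

```rust
use std::cell::RefCell;

// Stand-in for an io_uring handle; the real reader would wrap an actual ring.
struct Ring {
    submitted: usize,
}

impl Ring {
    fn new() -> Self {
        Ring { submitted: 0 }
    }

    fn submit_read(&mut self) {
        // A real implementation would queue an io_uring read SQE here.
        self.submitted += 1;
    }
}

thread_local! {
    // One ring per thread: no shared queue or locking on the submission path.
    static RING: RefCell<Ring> = RefCell::new(Ring::new());
}

fn read_via_local_ring() -> usize {
    RING.with(|ring| {
        let mut ring = ring.borrow_mut();
        ring.submit_read();
        ring.submitted
    })
}

fn main() {
    assert_eq!(read_via_local_ring(), 1);
    assert_eq!(read_via_local_ring(), 2); // same thread reuses the same ring
    println!("thread-local ring ok");
}
```

The tradeoff matches the numbers above: dropping the cross-thread queue removes synchronization from the hot path, but ties the reader to a single scheduler/decoder thread.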
The current scheduler introduces too much synchronization overhead in high-IOPS situations. This scheduler reduces the number of asynchronous context switches. On my desktop it doesn't actually have much impact on performance; however, on a system with more cores and higher RAM bandwidth the new scheduler more than doubles IOPS. Combined with io_uring (coming in a future PR), performance is 4x.
This scheduler makes tradeoffs which are not ideal for cloud readers, but which are important for the uring reader: