Description
Is your feature request related to a problem or challenge?
At @akurmustafa's great suggestion on #18260, I submitted a proposal and was accepted to speak at the first TokioConf about using Tokio as the DataFusion CPU runtime engine.
Here is more detail
https://www.tokioconf.com/speakers
Here is the talk summary
Using Tokio for CPU-Bound Tasks (Works Really Well)
The Tokio runtime at the heart of the Rust async ecosystem is also a good choice for CPU-heavy jobs such as those found in analytics engines. We will review what makes Tokio a compelling choice for CPU-bound workloads, address common concerns, and report on our experience using Tokio as the thread scheduler for Apache DataFusion.
Describe the solution you'd like
I want to create this talk and its slides in the open. If the talks aren't recorded, I will also record a second version of the talk.
Describe alternatives you've considered
The high-level idea is to summarize the findings in
https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/
and then revisit the major pitfalls.
Talk outline:
- Analytic DB 101 + Volcano Model: Explain DataFusion's execution model (data flow graphs and vectorized execution)
- Explain why people thought using Tokio for CPU-bound work was a bad idea, and the counterarguments
- Demonstrate how Tokio's scheduler effectively implements the "get_next_batch()" API on the same thread (see the sketch after this list)
- Discuss pitfalls
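
To make the same-thread `get_next_batch()` point concrete, here is a minimal sketch of a Volcano-style filter operator written as a Rust `Stream`. The types and names (`Batch`, `FilterExec`) are illustrative rather than DataFusion's actual operator API; the point is that `poll_next` plays the role of `get_next_batch()`, and the Tokio worker that polls the top of the plan polls its child on the same thread:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::stream::{Stream, StreamExt};

/// Stand-in for a vectorized batch of rows (hypothetical type).
struct Batch(Vec<i64>);

/// A "filter" operator that pulls batches from its child and keeps matching rows.
struct FilterExec<S> {
    input: S,
}

impl<S: Stream<Item = Batch> + Unpin> Stream for FilterExec<S> {
    type Item = Batch;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Batch>> {
        let this = self.get_mut();
        // The moral equivalent of `get_next_batch()`: ask the child operator
        // for its next batch, on the same worker thread that polled us.
        match this.input.poll_next_unpin(cx) {
            Poll::Ready(Some(batch)) => {
                // CPU-bound, vectorized work happens right here on the worker thread.
                let kept: Vec<i64> = batch.0.into_iter().filter(|v| *v > 0).collect();
                Poll::Ready(Some(Batch(kept)))
            }
            other => other,
        }
    }
}

#[tokio::main]
async fn main() {
    // A tiny two-operator "plan": a scan (stream of batches) feeding a filter.
    let input = futures::stream::iter(vec![Batch(vec![-1, 2, 3]), Batch(vec![4, -5])]);
    let mut plan = FilterExec { input };
    while let Some(batch) = plan.next().await {
        println!("{} rows", batch.0.len());
    }
}
```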
Major pitfall 1: Using the same async runtime for IO and CPU-bound tasks
- Explain symptoms (everything just slows down under high concurrency); the theory is that this is due to the network protocol's congestion control
- Explain solution: use separate runtimes and thread the dedicated runtime's handle through the execution path (see the sketch after this list)
- TODO: find DF example of multiple runtimes
- TODO: mention the challenge of having to pass a new runtime to different IO libraries (object_store, etc.)
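
For pitfall 1, the slides could include a minimal sketch of the separate-runtime pattern (the structure is illustrative, not an existing DataFusion API): the main Tokio runtime handles network IO, while CPU-bound query execution is spawned onto a dedicated runtime and its `JoinHandle` is awaited from the IO side:

```rust
use std::thread::available_parallelism;

use tokio::runtime::{Builder, Runtime};

/// Build a Tokio runtime reserved for CPU-bound query execution.
fn cpu_runtime() -> Runtime {
    Builder::new_multi_thread()
        .thread_name("cpu-worker")
        .worker_threads(available_parallelism().map(|n| n.get()).unwrap_or(4))
        .enable_all()
        .build()
        .expect("failed to build CPU runtime")
}

#[tokio::main] // the "IO" runtime: handles network connections, RPC, etc.
async fn main() {
    let cpu_rt = cpu_runtime();

    // Hand the CPU-heavy work (e.g. executing a query plan) to the dedicated
    // runtime, and await the result from the IO runtime; the JoinHandle is
    // itself a future, so the two runtimes compose naturally.
    let join = cpu_rt.spawn(async {
        (0..10_000_000u64).sum::<u64>() // stand-in for query execution
    });

    let result = join.await.expect("CPU task panicked");
    println!("result = {result}");

    // Shut the CPU runtime down without blocking an IO worker thread
    // (dropping a Runtime inside async code would panic).
    cpu_rt.shutdown_background();
}
```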
Major pitfall 2: Hot loops and cancelling
- Basically summarize the contents of https://datafusion.apache.org/blog/2025/06/30/cancellation/ from @pepijnve
- Explain symptoms: the query is cancelled but the plan keeps running
- Solution 1: (obvious one) no hot loops
- Solution 2: (less obvious) make sure we periodically yield back to the scheduler; otherwise tasks keep running and the scheduler never gets a chance to notice that the consumers have been dropped (see the sketch after this list)
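
For pitfall 2, a minimal sketch of the periodic-yield idea (the loop body is a stand-in for real per-batch work, not DataFusion code): yielding creates an await point where an aborted task, or one whose consumer has been dropped, actually stops instead of running to completion:

```rust
use tokio::task::yield_now;

/// Pretend per-batch CPU work (hash, sort, aggregate, ...).
fn process_batch(i: u64) -> u64 {
    (0..10_000).fold(i, |acc, x| acc.wrapping_mul(31).wrapping_add(x))
}

async fn run_hot_loop(num_batches: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..num_batches {
        acc = acc.wrapping_add(process_batch(i));

        // Yield every 64 batches: this is an await point, so an abort (or a
        // dropped consumer in a stream-based plan) takes effect here instead
        // of only after the whole loop has finished.
        if i % 64 == 0 {
            yield_now().await;
        }
    }
    acc
}

#[tokio::main]
async fn main() {
    let task = tokio::spawn(run_hot_loop(1_000_000));

    // Simulate a client cancelling the query shortly after it starts.
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    task.abort();

    match task.await {
        Ok(v) => println!("finished: {v}"),
        Err(e) if e.is_cancelled() => println!("query was cancelled promptly"),
        Err(e) => println!("task failed: {e}"),
    }
}
```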
Additional context
No response