-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10292: [Rust] [DataFusion] Simplify merge #8453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note that there is an important note in Tokio's documentation:
I.e. Tokio's |
|
This looks great, but how do we control the concurrency now? The Without being able to control the concurrency level it will be difficult (but not impossible) to run scalability benchmarks to see how we scale with an increasing concurrency level. |
|
Ah, I see the use-case now. Ok, I think that tokio's recommendation is:
Thus, IMO we should remove the Note that in the benchmarks, we can even specify different concurrency parameter depending on the benchmark (in case we want to perform some scaling). |
|
I wonder if the scope of this is too much. I really like this change and would like to see if merged. Perhaps we just remove the concurrency config and let users of DataFusion create their own tokio runtimes? |
|
It would make sense to create the tokio runtime specifically in the benchmark crate and in the CLI but for other cases, we are being called as a library and should assume that the user has created the tokio runtime, I think. |
|
@andygrove , just to understand, we should avoid merging to master until the release, right? (I have this feeling, but I am not super sure xD) |
|
@jorgecarleitao I see no reason to wait before merging this (unless you want to remove the |
|
Ah, ok, I mistakenly understood that we were in a "feature freeze" until the release (which is why I have not been merging stuff these days). |
|
I think there are _some_ things we would want to hold off on merging, like
changing the Rust nightly version for example, or anything else we feel
would be risky at the last minute.
…On Tue, Oct 13, 2020 at 12:07 PM Jorge Leitao ***@***.***> wrote:
Ah, ok, I mistakenly understood that we were in a "feature freeze" until
the release (which is why I have not been merging stuff these days).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8453 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAHEBRDSPNVFQAQLRA2BPVTSKSJO5ANCNFSM4SOBNGPA>
.
|
For what it is worth, one thing I plan to do as part of my internal project, is to limit "concurrent CPU bound tasks" by using a |
Currently, `mergeExec` uses `tokio::spawn` to parallelize the work, by calling `tokio::spawn` once per logical thread. However, `tokio::spawn` returns a task / future, which `tokio` runtime will then schedule on its thread pool. Therefore, there is no need to limit the number of tasks to the number of logical threads, as tokio's runtime itself is responsible for that work. In particular, since we are using [`rt-threaded`](https://docs.rs/tokio/0.2.22/tokio/runtime/index.html#threaded-scheduler), tokio already declares a thread pool from the number of logical threads available. This PR removes the coupling, in `mergeExec`, between the number of logical threads (`max_concurrency`) and the number of created tasks. I observe no change in performance: <details> <summary>Benchmark results</summary> ``` Switched to branch 'simplify_merge' Your branch is up to date with 'origin/simplify_merge'. Compiling datafusion v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/datafusion) Finished bench [optimized] target(s) in 38.02s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/aggregate_query_sql-5241a705a1ff29ae Gnuplot not found, using plotters backend aggregate_query_no_group_by 15 12 time: [715.17 us 722.60 us 730.19 us] change: [-8.3167% -5.2253% -2.2675%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe aggregate_query_group_by 15 12 time: [5.6538 ms 5.6695 ms 5.6892 ms] change: [+0.1012% +0.5308% +0.9913%] (p = 0.02 < 0.05) Change within noise threshold. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe aggregate_query_group_by_with_filter 15 12 time: [2.6598 ms 2.6665 ms 2.6751 ms] change: [-0.5532% -0.1446% +0.2679%] (p = 0.51 > 0.05) No change in performance detected. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe ``` </details> Closes #8453 from jorgecarleitao/simplify_merge Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Currently,
mergeExecusestokio::spawnto parallelize the work, by callingtokio::spawnonce per logical thread. However,tokio::spawnreturns a task / future, whichtokioruntime will then schedule on its thread pool.Therefore, there is no need to limit the number of tasks to the number of logical threads, as tokio's runtime itself is responsible for that work. In particular, since we are using
rt-threaded, tokio already declares a thread pool from the number of logical threads available.This PR removes the coupling, in
mergeExec, between the number of logical threads (max_concurrency) and the number of created tasks. I observe no change in performance:Benchmark results