ARROW-11962: [Rust][DataFusion] Improve DataFusion docs #9701
| @@ -0,0 +1,79 @@ | |||
| # Developer's guide | |||
I pulled this file into its own separate file so that it didn't appear on https://crates.io/crates/datafusion
rust/datafusion/DEVELOPERS.md
Outdated
|
|
||
| ### Architecture Overview | ||
|
|
||
| * (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) |
I added links to the talks from myself and @andygrove -- hopefully that is not too aggrandizing, but I think the content is helpful
I agree that the links are helpful and should be kept.
I do think that since those slides and presentations did not go through the review / PR process, it may be confusing to say that they describe the architecture.
What do you think of:
There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
| * [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database | ||
| * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform | ||
| * [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) | ||
| * [ROAPI](https://github.com/roapi/roapi) |
FYI @houqp
delta-rs (https://github.com/delta-io/delta-rs) also uses datafusion for querying delta tables :) for example: https://github.com/delta-io/delta-rs/blob/main/rust/tests/datafusion_test.rs
|
|
||
| * [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database | ||
| * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform | ||
| * [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) |
fyi @rdettai
https://github.com/cube-js/cube.js could be added?
Indeed -- https://github.com/cube-js/cube.js/blob/9a4603a857c55b868fe20e8d45536d1f1188cf44/rust/cubestore/Cargo.toml
I will add to the list. Thank you
FYI @paveltiunov
| Here are some of the projects known to use DataFusion: | ||
|
|
||
| * [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database | ||
| * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform |
fyi @andygrove
| capable of parallel execution against partitioned data sources (CSV | ||
| and Parquet) using threads. | ||
|
|
||
| ## Use Cases |
One other thing this page might benefit from is an example (e.g. look at how tokio does it https://crates.io/crates/tokio)
Perhaps we could just lift the nice example from https://docs.rs/datafusion/3.0.0/datafusion/ ?
FYI I will do so as a follow on PR: #9710
| - [x] Projection push down | ||
| - [x] Predicate push down | ||
| - [x] Constant folding | ||
| - [x] Limit Pushdown |
+Join reordering
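For readers unfamiliar with the optimizations named in the checklist above, here is a minimal sketch of what constant folding means. It is a toy illustration only: the `Expr` enum, `fold_binary`, and `fold_constants` are hypothetical names invented for this example and are not DataFusion's types or its optimizer implementation.

```rust
// Toy expression tree. These names are assumptions made for illustration;
// they are not DataFusion's logical expression types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Fold both operands first, then combine them if both became literals.
// `op` returns None on overflow, in which case the node is rebuilt unfolded.
fn fold_binary(
    left: Expr,
    right: Expr,
    op: fn(i64, i64) -> Option<i64>,
    rebuild: fn(Box<Expr>, Box<Expr>) -> Expr,
) -> Expr {
    match (fold_constants(left), fold_constants(right)) {
        (Expr::Literal(a), Expr::Literal(b)) => match op(a, b) {
            Some(value) => Expr::Literal(value),
            None => rebuild(Box::new(Expr::Literal(a)), Box::new(Expr::Literal(b))),
        },
        (l, r) => rebuild(Box::new(l), Box::new(r)),
    }
}

// Replace every sub-expression made only of literals with its value,
// leaving anything that depends on a column untouched.
fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => fold_binary(*l, *r, i64::checked_add, Expr::Add),
        Expr::Mul(l, r) => fold_binary(*l, *r, i64::checked_mul, Expr::Mul),
        leaf => leaf,
    }
}
```

With this sketch, an expression such as `x + (2 * 3)` is rewritten once at planning time to `x + 6`, so the multiplication is not repeated for every row.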
| - [ ] Lists | ||
| - [x] Subqueries | ||
| - [ ] Joins | ||
| - [x] Joins |
Inner/Left/Right (no cross and outer joins)
jorgecarleitao
left a comment
Thanks a lot for this, @alamb; it seriously improved the appearance and selling points :)
rust/datafusion/DEVELOPERS.md
Outdated
| * `cargo test` to test | ||
| * etc. | ||
|
|
||
| ### Architecture Overview |
IMO we could keep this section on the README, since it concerns everyone (not just contributors / developers).
will do
rust/datafusion/README.md
Outdated
|
|
||
| ## Why DataFusion? | ||
|
|
||
| * *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance |
We have no benchmarks against Spark, Dask, pandas, etc., so we need to be careful about this claim. I think that @Dandandan was working on running benchmarks against other engines (ARROW-11252) to see how it performs.
Hehe -- you caught me here selling ahead of the product. You will notice I didn't actually say what DataFusion had superior performance to. I will change this to say "achieves very high performance" without quantifying.
| * [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database | ||
| * [ROAPI](https://github.com/roapi/roapi) | ||
|
|
||
| (if you know of another project, please submit a PR to add a link!) |
datafusion-python? ^_^
It might be easiest to see the changes by looking at the rendered markdown here: https://github.com/alamb/arrow/tree/alamb/update_datafusion_docs/rust/datafusion
Rationale
Changes