Contributor

@alamb alamb commented Mar 14, 2021

It might be easiest to see the changes by looking at the rendered markdown here: https://github.com/alamb/arrow/tree/alamb/update_datafusion_docs/rust/datafusion

Rationale

  1. It would be nice to market / explain DataFusion a bit more and explain what it is good for

Changes

  1. Describe use cases for DataFusion (add some marketing "spin"?)
  2. Add links to other projects using DataFusion
  3. Add links to tech talks about DataFusion
  4. Split out developer docs into their own file
  5. Use the cool new image that came in ARROW-11860: [Rust] [DataFusion] Add DataFusion logos #9630

@@ -0,0 +1,79 @@
# Developer's guide
Contributor Author

I pulled this content into its own file so that it doesn't appear on https://crates.io/crates/datafusion


### Architecture Overview

* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
Contributor Author

I added links to the talks by myself and @andygrove -- hopefully that is not too self-aggrandizing, but I think the content is helpful

Member

I agree that the links are helpful and should be kept.

I do think that, since those slides and presentations did not go through the review / PR process, it may be confusing to say that they describe the architecture.

What do you think of:

> There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.

* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
* [ROAPI](https://github.com/roapi/roapi)
Contributor Author

FYI @houqp

* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
Contributor Author

FYI @rdettai

Here are some of the projects known to use DataFusion:

* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform

capable of parallel execution against partitioned data sources (CSV
and Parquet) using threads.

## Use Cases
Contributor Author

One other thing this page might benefit from is an example (e.g. look at how tokio does it https://crates.io/crates/tokio)

Perhaps we could just lift the nice example from https://docs.rs/datafusion/3.0.0/datafusion/ ?

Contributor Author

FYI I will do so in a follow-on PR: #9710

- [x] Projection push down
- [x] Predicate push down
- [x] Constant folding
- [x] Limit Pushdown
Contributor

+Join reordering

- [ ] Lists
- [x] Subqueries
- [ ] Joins
- [x] Joins
Contributor

Inner/left/right (no cross and outer joins)

Member

@jorgecarleitao jorgecarleitao left a comment

Thanks a lot for this, @alamb; it seriously improved the appearance and selling points :)


* `cargo test` to test
* etc.

### Architecture Overview
Member

IMO we could keep this section in the README, since it concerns everyone (not just contributors / developers).

Contributor Author

will do


## Why DataFusion?

* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance
Member

We have no benchmarks against Spark, Dask, pandas, etc., so we need to be careful about this claim. I think that @Dandandan was working on running benchmarks against other engines (ARROW-11252) to see how it performs.

Contributor Author

@alamb alamb Mar 15, 2021

Hehe -- you caught me here selling ahead of the product. You will notice I didn't actually say what DataFusion had superior performance to. I will change this to say "achieves very high" performance without quantifying

* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
* [ROAPI](https://github.com/roapi/roapi)

(if you know of another project, please submit a PR to add a link!)

@alamb alamb closed this in 23a32fb Mar 15, 2021
@alamb alamb deleted the alamb/update_datafusion_docs branch March 15, 2021 21:18