From d57927d567ec849706d4a073c68dba9133b7ad67 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 14 Mar 2021 07:42:20 -0400 Subject: [PATCH 1/6] ARROW-11962: [Rust][DataFusion] Improve DataFusion docs --- rust/datafusion/DEVELOPERS.md | 79 +++++++++++++++++++ rust/datafusion/README.md | 143 +++++++++++++++------------------- rust/datafusion/src/lib.rs | 3 +- 3 files changed, 142 insertions(+), 83 deletions(-) create mode 100644 rust/datafusion/DEVELOPERS.md diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md new file mode 100644 index 00000000000..e92fc3d9a95 --- /dev/null +++ b/rust/datafusion/DEVELOPERS.md @@ -0,0 +1,79 @@ +# Developer's guide + +This section describes how you can get started at developing DataFusion. + +### Bootstrap environment + +DataFusion is written in Rust and it uses a standard rust toolkit: + +* `cargo build` +* `cargo fmt` to format the code +* `cargo test` to test +* etc. + +### Architecture Overview + +* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) +* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) + + +## How to add a new scalar function + +Below is a checklist of what you need to do to add a new scalar function to DataFusion: + +* Add the actual implementation of the function: + * [here](src/physical_plan/string_expressions.rs) for string functions + * [here](src/physical_plan/math_expressions.rs) for math functions + * [here](src/physical_plan/datetime_expressions.rs) for datetime functions + * create a new module [here](src/physical_plan) for other functions +* In [src/physical_plan/functions](src/physical_plan/functions.rs), add: + * a new variant to `BuiltinScalarFunction` + * a new entry to `FromStr` with the name of the function as called by SQL + * a new line in `return_type` with the expected return type of the function, given an incoming type + * a new line in `signature` with the signature of the function (number and types of its arguments) + * a new line in `create_physical_expr` mapping the built-in to the implementation + * tests to the function. +* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. +* In [src/logical_plan/expr](src/logical_plan/expr.rs), add: + * a new entry of the `unary_scalar_expr!` macro for the new function. +* In [src/logical_plan/mod](src/logical_plan/mod.rs), add: + * a new entry in the `pub use expr::{}` set. + +## How to add a new aggregate function + +Below is a checklist of what you need to do to add a new aggregate function to DataFusion: + +* Add the actual implementation of an `Accumulator` and `AggregateExpr`: + * [here](src/physical_plan/string_expressions.rs) for string functions + * [here](src/physical_plan/math_expressions.rs) for math functions + * [here](src/physical_plan/datetime_expressions.rs) for datetime functions + * create a new module [here](src/physical_plan) for other functions +* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add: + * a new variant to `BuiltinAggregateFunction` + * a new entry to `FromStr` with the name of the function as called by SQL + * a new line in `return_type` with the expected return type of the function, given an incoming type + * a new line in `signature` with the signature of the function (number and types of its arguments) + * a new line in `create_aggregate_expr` mapping the built-in to the implementation + * tests to the function. +* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. + +## How to display plans graphically + +The query plans represented by `LogicalPlan` nodes can be graphically +rendered using [Graphviz](http://www.graphviz.org/). + +To do so, save the output of the `display_graphviz` function to a file.: + +```rust +// Create plan somehow... +let mut output = File::create("/tmp/plan.dot")?; +write!(output, "{}", plan.display_graphviz()); +``` + +Then, use the `dot` command line tool to render it into a file that +can be displayed. For example, the following command creates a +`/tmp/plan.pdf` file: + +```bash +dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf +``` diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md index f8d0d92b516..9b31968e093 100644 --- a/rust/datafusion/README.md +++ b/rust/datafusion/README.md @@ -19,11 +19,48 @@ # DataFusion -DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data. + + +DataFusion is an extensible query execution framework, written in +Rust, that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. + +DataFusion supports both an SQL and a DataFrame API for building +logical query plans as well as a query optimizer and execution engine +capable of parallel execution against partitioned data sources (CSV +and Parquet) using threads. + +## Use Cases + +DataFusion is used to create modern, fast and efficient data +pipelines, ETL processes, and database systems, which need the +performance of Rust and Apache Arrow and want to provide their users +the convenience of an SQL interface or a DataFrame API. + +## Why DataFusion? + +* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance +* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem +* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase +* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems. + +## Known Uses + +Here are some of the projects known to use DataFusion: + +* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database +* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform +* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) +* [ROAPI](https://github.com/roapi/roapi) + +(if you know of another project, please submit a PR to add a link!) + ## Using DataFusion as a library -DataFusion can be used as a library by adding the following to your `Cargo.toml` file. +DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/). + +To get started, add the following to your `Cargo.toml` file: ```toml [dependencies] @@ -32,7 +69,7 @@ datafusion = "4.0.0-SNAPSHOT" ## Using DataFusion as a binary -DataFusion includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information. +DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information. # Status @@ -41,8 +78,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [x] SQL Parser - [x] SQL Query Planner - [x] Query Optimizer -- [x] Projection push down -- [x] Predicate push down + - [x] Projection push down + - [x] Predicate push down + - [x] Constant folding + - [x] Limit Pushdown - [x] Type coercion - [x] Parallel query execution @@ -53,12 +92,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [x] Filter post-aggregate (HAVING) - [x] Limit - [x] Aggregate -- [x] UDFs (user-defined functions) -- [x] UDAFs (user-defined aggregate functions) - [x] Common math functions -- String functions +- Postgres compatible String functions - [x] ascii - - [x] bit_Length + - [x] bit_length - [x] btrim - [x] char_length - [x] character_length @@ -97,7 +134,7 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [ ] Nested types - [ ] Lists - [x] Subqueries -- [ ] Joins +- [x] Joins - [ ] Window ## Data Sources @@ -106,6 +143,19 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [x] Parquet primitive types - [ ] Parquet nested types + +## Extensibility + +DataFusion is designed to be extensible at all points. To that end, you can provide your own custom: + +- [x] User Defined Functions (UDFs) +- [x] User Defined Aggregate Functions (UDAFs) +- [x] User Defined Table Source (`TableProvider`) for tables +- [x] User Defined `Optimizer` passes (plan rewrites) +- [x] User Defined `LogicalPlan` nodes +- [x] User Defined `ExecutionPlan` nodes + + # Supported SQL This library currently supports the following SQL constructs: @@ -125,7 +175,6 @@ DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https:/ Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations. - ## Supported Data Types DataFusion uses Arrow, and thus the Arrow type system, for query @@ -162,74 +211,4 @@ are mapped to Arrow types according to the following table # Developer's guide -This section describes how you can get started at developing DataFusion. - -### Bootstrap environment - -DataFusion is written in Rust and it uses a standard rust toolkit: - -* `cargo build` -* `cargo fmt` to format the code -* `cargo test` to test -* etc. - -## How to add a new scalar function - -Below is a checklist of what you need to do to add a new scalar function to DataFusion: - -* Add the actual implementation of the function: - * [here](src/physical_plan/string_expressions.rs) for string functions - * [here](src/physical_plan/math_expressions.rs) for math functions - * [here](src/physical_plan/datetime_expressions.rs) for datetime functions - * create a new module [here](src/physical_plan) for other functions -* In [src/physical_plan/functions](src/physical_plan/functions.rs), add: - * a new variant to `BuiltinScalarFunction` - * a new entry to `FromStr` with the name of the function as called by SQL - * a new line in `return_type` with the expected return type of the function, given an incoming type - * a new line in `signature` with the signature of the function (number and types of its arguments) - * a new line in `create_physical_expr` mapping the built-in to the implementation - * tests to the function. -* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. -* In [src/logical_plan/expr](src/logical_plan/expr.rs), add: - * a new entry of the `unary_scalar_expr!` macro for the new function. -* In [src/logical_plan/mod](src/logical_plan/mod.rs), add: - * a new entry in the `pub use expr::{}` set. - -## How to add a new aggregate function - -Below is a checklist of what you need to do to add a new aggregate function to DataFusion: - -* Add the actual implementation of an `Accumulator` and `AggregateExpr`: - * [here](src/physical_plan/string_expressions.rs) for string functions - * [here](src/physical_plan/math_expressions.rs) for math functions - * [here](src/physical_plan/datetime_expressions.rs) for datetime functions - * create a new module [here](src/physical_plan) for other functions -* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add: - * a new variant to `BuiltinAggregateFunction` - * a new entry to `FromStr` with the name of the function as called by SQL - * a new line in `return_type` with the expected return type of the function, given an incoming type - * a new line in `signature` with the signature of the function (number and types of its arguments) - * a new line in `create_aggregate_expr` mapping the built-in to the implementation - * tests to the function. -* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. - -## How to display plans graphically - -The query plans represented by `LogicalPlan` nodes can be graphically -rendered using [Graphviz](http://www.graphviz.org/). - -To do so, save the output of the `display_graphviz` function to a file.: - -```rust -// Create plan somehow... -let mut output = File::create("/tmp/plan.dot")?; -write!(output, "{}", plan.display_graphviz()); -``` - -Then, use the `dot` command line tool to render it into a file that -can be displayed. For example, the following command creates a -`/tmp/plan.pdf` file: - -```bash -dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf -``` +Please see [Developers Guide](DEVELOPERS.md) diff --git a/rust/datafusion/src/lib.rs b/rust/datafusion/src/lib.rs index 73dca51a8cf..5126f90f89d 100644 --- a/rust/datafusion/src/lib.rs +++ b/rust/datafusion/src/lib.rs @@ -23,7 +23,8 @@ clippy::type_complexity )] -//! DataFusion is an extensible query execution framework that uses +//! [DataFusion](https://github.com/apache/arrow/tree/master/rust/datafusion) +//! is an extensible query execution framework that uses //! [Apache Arrow](https://arrow.apache.org) as its in-memory format. //! //! DataFusion supports both an SQL and a DataFrame API for building logical query plans From 41b7fc83b6c9e2f3af35792f5a2fc6bdc8dfd443 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 14 Mar 2021 08:24:38 -0400 Subject: [PATCH 2/6] add apache license --- rust/datafusion/DEVELOPERS.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md index e92fc3d9a95..ea3a389c86b 100644 --- a/rust/datafusion/DEVELOPERS.md +++ b/rust/datafusion/DEVELOPERS.md @@ -1,3 +1,22 @@ + + # Developer's guide This section describes how you can get started at developing DataFusion. From 3e956ec26350d4d0017ebf5206e44d6b828f0856 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Mon, 15 Mar 2021 09:30:58 -0400 Subject: [PATCH 3/6] Add other projects, update feature list --- rust/datafusion/README.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md index 9b31968e093..25cbe5ba72e 100644 --- a/rust/datafusion/README.md +++ b/rust/datafusion/README.md @@ -48,9 +48,11 @@ the convenience of an SQL interface or a DataFrame API. Here are some of the projects known to use DataFusion: -* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform * [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) +* [Cube.js](https://github.com/cube-js/cube.js) +* [delta-rs](https://github.com/delta-io/delta-rs) +* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database * [ROAPI](https://github.com/roapi/roapi) (if you know of another project, please submit a PR to add a link!) @@ -78,10 +80,11 @@ DataFusion also includes a simple command-line interactive SQL utility. See the - [x] SQL Parser - [x] SQL Query Planner - [x] Query Optimizer - - [x] Projection push down - - [x] Predicate push down - [x] Constant folding + - [x] Join Reordering - [x] Limit Pushdown + - [x] Projection push down + - [x] Predicate push down - [x] Type coercion - [x] Parallel query execution @@ -135,6 +138,9 @@ DataFusion also includes a simple command-line interactive SQL utility. See the - [ ] Lists - [x] Subqueries - [x] Joins + - [x] INNER JOIN + - [ ] CROSS JOIN + - [] OUTER JOIN - [ ] Window ## Data Sources From 4b0af441657c2765a9d2685fd14941406a33341d Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Mon, 15 Mar 2021 11:48:40 -0400 Subject: [PATCH 4/6] More improvements --- rust/datafusion/DEVELOPERS.md | 6 ------ rust/datafusion/README.md | 14 ++++++++++++-- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md index ea3a389c86b..aa80cb71d3b 100644 --- a/rust/datafusion/DEVELOPERS.md +++ b/rust/datafusion/DEVELOPERS.md @@ -30,12 +30,6 @@ DataFusion is written in Rust and it uses a standard rust toolkit: * `cargo test` to test * etc. -### Architecture Overview - -* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) -* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) - - ## How to add a new scalar function Below is a checklist of what you need to do to add a new scalar function to DataFusion: diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md index 25cbe5ba72e..6d658f29106 100644 --- a/rust/datafusion/README.md +++ b/rust/datafusion/README.md @@ -39,7 +39,7 @@ the convenience of an SQL interface or a DataFrame API. ## Why DataFusion? -* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance +* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance * *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem * *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase * *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems. @@ -51,6 +51,7 @@ Here are some of the projects known to use DataFusion: * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform * [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) * [Cube.js](https://github.com/cube-js/cube.js) +* [datafusion-python](https://pypi.org/project/datafusion) * [delta-rs](https://github.com/delta-io/delta-rs) * [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database * [ROAPI](https://github.com/roapi/roapi) @@ -215,6 +216,15 @@ are mapped to Arrow types according to the following table | `CUSTOM` | *Not yet supported* | | `ARRAY` | *Not yet supported* | + +# Architecture Overview + +There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together. + +* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) +* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) + + # Developer's guide -Please see [Developers Guide](DEVELOPERS.md) +Please see [Developers Guide](DEVELOPERS.md) for information about developing DataFusion. From 5ef9f51ffa4a8701820172b460acddce44f89eed Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Mon, 15 Mar 2021 11:50:27 -0400 Subject: [PATCH 5/6] tweak --- rust/datafusion/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md index 6d658f29106..5fc5687bad2 100644 --- a/rust/datafusion/README.md +++ b/rust/datafusion/README.md @@ -141,7 +141,7 @@ DataFusion also includes a simple command-line interactive SQL utility. See the - [x] Joins - [x] INNER JOIN - [ ] CROSS JOIN - - [] OUTER JOIN + - [ ] OUTER JOIN - [ ] Window ## Data Sources @@ -165,7 +165,7 @@ DataFusion is designed to be extensible at all points. To that end, you can prov # Supported SQL -This library currently supports the following SQL constructs: +This library currently supports many SQL constructs, including * `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations * `SELECT ... FROM ...` together with any expression From a57ea367ddcfb1bc26632dd610bbd7415f5e3002 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Mon, 15 Mar 2021 11:50:58 -0400 Subject: [PATCH 6/6] fix indent --- rust/datafusion/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md index 5fc5687bad2..2b69b8a3dec 100644 --- a/rust/datafusion/README.md +++ b/rust/datafusion/README.md @@ -139,9 +139,9 @@ DataFusion also includes a simple command-line interactive SQL utility. See the - [ ] Lists - [x] Subqueries - [x] Joins - - [x] INNER JOIN - - [ ] CROSS JOIN - - [ ] OUTER JOIN + - [x] INNER JOIN + - [ ] CROSS JOIN + - [ ] OUTER JOIN - [ ] Window ## Data Sources