diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md new file mode 100644 index 00000000000..aa80cb71d3b --- /dev/null +++ b/rust/datafusion/DEVELOPERS.md @@ -0,0 +1,92 @@ + + +# Developer's guide + +This section describes how you can get started at developing DataFusion. + +### Bootstrap environment + +DataFusion is written in Rust and it uses a standard rust toolkit: + +* `cargo build` +* `cargo fmt` to format the code +* `cargo test` to test +* etc. + +## How to add a new scalar function + +Below is a checklist of what you need to do to add a new scalar function to DataFusion: + +* Add the actual implementation of the function: + * [here](src/physical_plan/string_expressions.rs) for string functions + * [here](src/physical_plan/math_expressions.rs) for math functions + * [here](src/physical_plan/datetime_expressions.rs) for datetime functions + * create a new module [here](src/physical_plan) for other functions +* In [src/physical_plan/functions](src/physical_plan/functions.rs), add: + * a new variant to `BuiltinScalarFunction` + * a new entry to `FromStr` with the name of the function as called by SQL + * a new line in `return_type` with the expected return type of the function, given an incoming type + * a new line in `signature` with the signature of the function (number and types of its arguments) + * a new line in `create_physical_expr` mapping the built-in to the implementation + * tests to the function. +* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. +* In [src/logical_plan/expr](src/logical_plan/expr.rs), add: + * a new entry of the `unary_scalar_expr!` macro for the new function. +* In [src/logical_plan/mod](src/logical_plan/mod.rs), add: + * a new entry in the `pub use expr::{}` set. + +## How to add a new aggregate function + +Below is a checklist of what you need to do to add a new aggregate function to DataFusion: + +* Add the actual implementation of an `Accumulator` and `AggregateExpr`: + * [here](src/physical_plan/string_expressions.rs) for string functions + * [here](src/physical_plan/math_expressions.rs) for math functions + * [here](src/physical_plan/datetime_expressions.rs) for datetime functions + * create a new module [here](src/physical_plan) for other functions +* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add: + * a new variant to `BuiltinAggregateFunction` + * a new entry to `FromStr` with the name of the function as called by SQL + * a new line in `return_type` with the expected return type of the function, given an incoming type + * a new line in `signature` with the signature of the function (number and types of its arguments) + * a new line in `create_aggregate_expr` mapping the built-in to the implementation + * tests to the function. +* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. + +## How to display plans graphically + +The query plans represented by `LogicalPlan` nodes can be graphically +rendered using [Graphviz](http://www.graphviz.org/). + +To do so, save the output of the `display_graphviz` function to a file.: + +```rust +// Create plan somehow... +let mut output = File::create("/tmp/plan.dot")?; +write!(output, "{}", plan.display_graphviz()); +``` + +Then, use the `dot` command line tool to render it into a file that +can be displayed. For example, the following command creates a +`/tmp/plan.pdf` file: + +```bash +dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf +``` diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md index f8d0d92b516..2b69b8a3dec 100644 --- a/rust/datafusion/README.md +++ b/rust/datafusion/README.md @@ -19,11 +19,51 @@ # DataFusion -DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data. + + +DataFusion is an extensible query execution framework, written in +Rust, that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. + +DataFusion supports both an SQL and a DataFrame API for building +logical query plans as well as a query optimizer and execution engine +capable of parallel execution against partitioned data sources (CSV +and Parquet) using threads. + +## Use Cases + +DataFusion is used to create modern, fast and efficient data +pipelines, ETL processes, and database systems, which need the +performance of Rust and Apache Arrow and want to provide their users +the convenience of an SQL interface or a DataFrame API. + +## Why DataFusion? + +* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance +* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem +* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase +* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems. + +## Known Uses + +Here are some of the projects known to use DataFusion: + +* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform +* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) +* [Cube.js](https://github.com/cube-js/cube.js) +* [datafusion-python](https://pypi.org/project/datafusion) +* [delta-rs](https://github.com/delta-io/delta-rs) +* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database +* [ROAPI](https://github.com/roapi/roapi) + +(if you know of another project, please submit a PR to add a link!) + ## Using DataFusion as a library -DataFusion can be used as a library by adding the following to your `Cargo.toml` file. +DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/). + +To get started, add the following to your `Cargo.toml` file: ```toml [dependencies] @@ -32,7 +72,7 @@ datafusion = "4.0.0-SNAPSHOT" ## Using DataFusion as a binary -DataFusion includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information. +DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information. # Status @@ -41,8 +81,11 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [x] SQL Parser - [x] SQL Query Planner - [x] Query Optimizer -- [x] Projection push down -- [x] Predicate push down + - [x] Constant folding + - [x] Join Reordering + - [x] Limit Pushdown + - [x] Projection push down + - [x] Predicate push down - [x] Type coercion - [x] Parallel query execution @@ -53,12 +96,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [x] Filter post-aggregate (HAVING) - [x] Limit - [x] Aggregate -- [x] UDFs (user-defined functions) -- [x] UDAFs (user-defined aggregate functions) - [x] Common math functions -- String functions +- Postgres compatible String functions - [x] ascii - - [x] bit_Length + - [x] bit_length - [x] btrim - [x] char_length - [x] character_length @@ -97,7 +138,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [ ] Nested types - [ ] Lists - [x] Subqueries -- [ ] Joins +- [x] Joins + - [x] INNER JOIN + - [ ] CROSS JOIN + - [ ] OUTER JOIN - [ ] Window ## Data Sources @@ -106,9 +150,22 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI - [x] Parquet primitive types - [ ] Parquet nested types + +## Extensibility + +DataFusion is designed to be extensible at all points. To that end, you can provide your own custom: + +- [x] User Defined Functions (UDFs) +- [x] User Defined Aggregate Functions (UDAFs) +- [x] User Defined Table Source (`TableProvider`) for tables +- [x] User Defined `Optimizer` passes (plan rewrites) +- [x] User Defined `LogicalPlan` nodes +- [x] User Defined `ExecutionPlan` nodes + + # Supported SQL -This library currently supports the following SQL constructs: +This library currently supports many SQL constructs, including * `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations * `SELECT ... FROM ...` together with any expression @@ -125,7 +182,6 @@ DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https:/ Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations. - ## Supported Data Types DataFusion uses Arrow, and thus the Arrow type system, for query @@ -160,76 +216,15 @@ are mapped to Arrow types according to the following table | `CUSTOM` | *Not yet supported* | | `ARRAY` | *Not yet supported* | -# Developer's guide -This section describes how you can get started at developing DataFusion. - -### Bootstrap environment - -DataFusion is written in Rust and it uses a standard rust toolkit: - -* `cargo build` -* `cargo fmt` to format the code -* `cargo test` to test -* etc. - -## How to add a new scalar function - -Below is a checklist of what you need to do to add a new scalar function to DataFusion: - -* Add the actual implementation of the function: - * [here](src/physical_plan/string_expressions.rs) for string functions - * [here](src/physical_plan/math_expressions.rs) for math functions - * [here](src/physical_plan/datetime_expressions.rs) for datetime functions - * create a new module [here](src/physical_plan) for other functions -* In [src/physical_plan/functions](src/physical_plan/functions.rs), add: - * a new variant to `BuiltinScalarFunction` - * a new entry to `FromStr` with the name of the function as called by SQL - * a new line in `return_type` with the expected return type of the function, given an incoming type - * a new line in `signature` with the signature of the function (number and types of its arguments) - * a new line in `create_physical_expr` mapping the built-in to the implementation - * tests to the function. -* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. -* In [src/logical_plan/expr](src/logical_plan/expr.rs), add: - * a new entry of the `unary_scalar_expr!` macro for the new function. -* In [src/logical_plan/mod](src/logical_plan/mod.rs), add: - * a new entry in the `pub use expr::{}` set. - -## How to add a new aggregate function - -Below is a checklist of what you need to do to add a new aggregate function to DataFusion: - -* Add the actual implementation of an `Accumulator` and `AggregateExpr`: - * [here](src/physical_plan/string_expressions.rs) for string functions - * [here](src/physical_plan/math_expressions.rs) for math functions - * [here](src/physical_plan/datetime_expressions.rs) for datetime functions - * create a new module [here](src/physical_plan) for other functions -* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add: - * a new variant to `BuiltinAggregateFunction` - * a new entry to `FromStr` with the name of the function as called by SQL - * a new line in `return_type` with the expected return type of the function, given an incoming type - * a new line in `signature` with the signature of the function (number and types of its arguments) - * a new line in `create_aggregate_expr` mapping the built-in to the implementation - * tests to the function. -* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. - -## How to display plans graphically - -The query plans represented by `LogicalPlan` nodes can be graphically -rendered using [Graphviz](http://www.graphviz.org/). - -To do so, save the output of the `display_graphviz` function to a file.: - -```rust -// Create plan somehow... -let mut output = File::create("/tmp/plan.dot")?; -write!(output, "{}", plan.display_graphviz()); -``` +# Architecture Overview -Then, use the `dot` command line tool to render it into a file that -can be displayed. For example, the following command creates a -`/tmp/plan.pdf` file: +There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together. -```bash -dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf -``` +* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) +* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) + + +# Developer's guide + +Please see [Developers Guide](DEVELOPERS.md) for information about developing DataFusion. diff --git a/rust/datafusion/src/lib.rs b/rust/datafusion/src/lib.rs index 73dca51a8cf..5126f90f89d 100644 --- a/rust/datafusion/src/lib.rs +++ b/rust/datafusion/src/lib.rs @@ -23,7 +23,8 @@ clippy::type_complexity )] -//! DataFusion is an extensible query execution framework that uses +//! [DataFusion](https://github.com/apache/arrow/tree/master/rust/datafusion) +//! is an extensible query execution framework that uses //! [Apache Arrow](https://arrow.apache.org) as its in-memory format. //! //! DataFusion supports both an SQL and a DataFrame API for building logical query plans