-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11962: [Rust][DataFusion] Improve DataFusion docs #9701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
d57927d
41b7fc8
3e956ec
4b0af44
5ef9f51
a57ea36
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| <!--- | ||
| Licensed to the Apache Software Foundation (ASF) under one | ||
| or more contributor license agreements. See the NOTICE file | ||
| distributed with this work for additional information | ||
| regarding copyright ownership. The ASF licenses this file | ||
| to you under the Apache License, Version 2.0 (the | ||
| "License"); you may not use this file except in compliance | ||
| with the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, | ||
| software distributed under the License is distributed on an | ||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> | ||
|
|
||
| # Developer's guide | ||
|
|
||
| This section describes how you can get started at developing DataFusion. | ||
|
|
||
| ### Bootstrap environment | ||
|
|
||
| DataFusion is written in Rust and it uses a standard rust toolkit: | ||
|
|
||
| * `cargo build` | ||
| * `cargo fmt` to format the code | ||
| * `cargo test` to test | ||
| * etc. | ||
|
|
||
| ## How to add a new scalar function | ||
|
|
||
| Below is a checklist of what you need to do to add a new scalar function to DataFusion: | ||
|
|
||
| * Add the actual implementation of the function: | ||
| * [here](src/physical_plan/string_expressions.rs) for string functions | ||
| * [here](src/physical_plan/math_expressions.rs) for math functions | ||
| * [here](src/physical_plan/datetime_expressions.rs) for datetime functions | ||
| * create a new module [here](src/physical_plan) for other functions | ||
| * In [src/physical_plan/functions](src/physical_plan/functions.rs), add: | ||
| * a new variant to `BuiltinScalarFunction` | ||
| * a new entry to `FromStr` with the name of the function as called by SQL | ||
| * a new line in `return_type` with the expected return type of the function, given an incoming type | ||
| * a new line in `signature` with the signature of the function (number and types of its arguments) | ||
| * a new line in `create_physical_expr` mapping the built-in to the implementation | ||
| * tests to the function. | ||
| * In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. | ||
| * In [src/logical_plan/expr](src/logical_plan/expr.rs), add: | ||
| * a new entry of the `unary_scalar_expr!` macro for the new function. | ||
| * In [src/logical_plan/mod](src/logical_plan/mod.rs), add: | ||
| * a new entry in the `pub use expr::{}` set. | ||
|
|
||
| ## How to add a new aggregate function | ||
|
|
||
| Below is a checklist of what you need to do to add a new aggregate function to DataFusion: | ||
|
|
||
| * Add the actual implementation of an `Accumulator` and `AggregateExpr`: | ||
| * [here](src/physical_plan/string_expressions.rs) for string functions | ||
| * [here](src/physical_plan/math_expressions.rs) for math functions | ||
| * [here](src/physical_plan/datetime_expressions.rs) for datetime functions | ||
| * create a new module [here](src/physical_plan) for other functions | ||
| * In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add: | ||
| * a new variant to `BuiltinAggregateFunction` | ||
| * a new entry to `FromStr` with the name of the function as called by SQL | ||
| * a new line in `return_type` with the expected return type of the function, given an incoming type | ||
| * a new line in `signature` with the signature of the function (number and types of its arguments) | ||
| * a new line in `create_aggregate_expr` mapping the built-in to the implementation | ||
| * tests to the function. | ||
| * In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. | ||
|
|
||
| ## How to display plans graphically | ||
|
|
||
| The query plans represented by `LogicalPlan` nodes can be graphically | ||
| rendered using [Graphviz](http://www.graphviz.org/). | ||
|
|
||
| To do so, save the output of the `display_graphviz` function to a file.: | ||
|
|
||
| ```rust | ||
| // Create plan somehow... | ||
| let mut output = File::create("/tmp/plan.dot")?; | ||
| write!(output, "{}", plan.display_graphviz()); | ||
| ``` | ||
|
|
||
| Then, use the `dot` command line tool to render it into a file that | ||
| can be displayed. For example, the following command creates a | ||
| `/tmp/plan.pdf` file: | ||
|
|
||
| ```bash | ||
| dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,11 +19,51 @@ | |
|
|
||
| # DataFusion | ||
|
|
||
| DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data. | ||
| <img src="docs/images/DataFusion-Logo-Dark.svg" width="256"/> | ||
|
|
||
| DataFusion is an extensible query execution framework, written in | ||
| Rust, that uses [Apache Arrow](https://arrow.apache.org) as its | ||
| in-memory format. | ||
|
|
||
| DataFusion supports both an SQL and a DataFrame API for building | ||
| logical query plans as well as a query optimizer and execution engine | ||
| capable of parallel execution against partitioned data sources (CSV | ||
| and Parquet) using threads. | ||
|
|
||
| ## Use Cases | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One other thing this page might benefit from is an example (e.g. look at how tokio does it https://crates.io/crates/tokio) Perhaps we could just lift the nice example from https://docs.rs/datafusion/3.0.0/datafusion/ ?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FYI I will do so as a follow on PR: #9710 |
||
|
|
||
| DataFusion is used to create modern, fast and efficient data | ||
| pipelines, ETL processes, and database systems, which need the | ||
| performance of Rust and Apache Arrow and want to provide their users | ||
| the convenience of an SQL interface or a DataFrame API. | ||
|
|
||
| ## Why DataFusion? | ||
|
|
||
| * *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance | ||
| * *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem | ||
| * *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase | ||
| * *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems. | ||
|
|
||
| ## Known Uses | ||
|
|
||
| Here are some of the projects known to use DataFusion: | ||
|
|
||
| * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fyi @andygrove |
||
| * [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust) | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fyi @rdettai
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. https://github.com/cube-js/cube.js could be added?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed -- https://github.com/cube-js/cube.js/blob/9a4603a857c55b868fe20e8d45536d1f1188cf44/rust/cubestore/Cargo.toml FYI @paveltiunov |
||
| * [Cube.js](https://github.com/cube-js/cube.js) | ||
| * [datafusion-python](https://pypi.org/project/datafusion) | ||
| * [delta-rs](https://github.com/delta-io/delta-rs) | ||
| * [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database | ||
| * [ROAPI](https://github.com/roapi/roapi) | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FYi @houqp
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. delta-rs (https://github.com/delta-io/delta-rs) also uses datafusion to for querying delta tables :) for example: https://github.com/delta-io/delta-rs/blob/main/rust/tests/datafusion_test.rs |
||
|
|
||
| (if you know of another project, please submit a PR to add a link!) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. datafusion-python? ^_^ |
||
|
|
||
|
|
||
| ## Using DataFusion as a library | ||
|
|
||
| DataFusion can be used as a library by adding the following to your `Cargo.toml` file. | ||
| DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/). | ||
|
|
||
| To get started, add the following to your `Cargo.toml` file: | ||
|
|
||
| ```toml | ||
| [dependencies] | ||
|
|
@@ -32,7 +72,7 @@ datafusion = "4.0.0-SNAPSHOT" | |
|
|
||
| ## Using DataFusion as a binary | ||
|
|
||
| DataFusion includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information. | ||
| DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information. | ||
|
|
||
| # Status | ||
|
|
||
|
|
@@ -41,8 +81,11 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI | |
| - [x] SQL Parser | ||
| - [x] SQL Query Planner | ||
| - [x] Query Optimizer | ||
| - [x] Projection push down | ||
| - [x] Predicate push down | ||
| - [x] Constant folding | ||
| - [x] Join Reordering | ||
| - [x] Limit Pushdown | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +Join reordering |
||
| - [x] Projection push down | ||
| - [x] Predicate push down | ||
| - [x] Type coercion | ||
| - [x] Parallel query execution | ||
|
|
||
|
|
@@ -53,12 +96,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI | |
| - [x] Filter post-aggregate (HAVING) | ||
| - [x] Limit | ||
| - [x] Aggregate | ||
| - [x] UDFs (user-defined functions) | ||
| - [x] UDAFs (user-defined aggregate functions) | ||
| - [x] Common math functions | ||
| - String functions | ||
| - Postgres compatible String functions | ||
| - [x] ascii | ||
| - [x] bit_Length | ||
| - [x] bit_length | ||
| - [x] btrim | ||
| - [x] char_length | ||
| - [x] character_length | ||
|
|
@@ -97,7 +138,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI | |
| - [ ] Nested types | ||
| - [ ] Lists | ||
| - [x] Subqueries | ||
| - [ ] Joins | ||
| - [x] Joins | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Inner/Left/right/ (no cross and outer joins) |
||
| - [x] INNER JOIN | ||
| - [ ] CROSS JOIN | ||
| - [ ] OUTER JOIN | ||
| - [ ] Window | ||
|
|
||
| ## Data Sources | ||
|
|
@@ -106,9 +150,22 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI | |
| - [x] Parquet primitive types | ||
| - [ ] Parquet nested types | ||
|
|
||
|
|
||
| ## Extensibility | ||
|
|
||
| DataFusion is designed to be extensible at all points. To that end, you can provide your own custom: | ||
|
|
||
| - [x] User Defined Functions (UDFs) | ||
| - [x] User Defined Aggregate Functions (UDAFs) | ||
| - [x] User Defined Table Source (`TableProvider`) for tables | ||
| - [x] User Defined `Optimizer` passes (plan rewrites) | ||
| - [x] User Defined `LogicalPlan` nodes | ||
| - [x] User Defined `ExecutionPlan` nodes | ||
|
|
||
|
|
||
| # Supported SQL | ||
|
|
||
| This library currently supports the following SQL constructs: | ||
| This library currently supports many SQL constructs, including | ||
|
|
||
| * `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations | ||
| * `SELECT ... FROM ...` together with any expression | ||
|
|
@@ -125,7 +182,6 @@ DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https:/ | |
|
|
||
| Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations. | ||
|
|
||
|
|
||
| ## Supported Data Types | ||
|
|
||
| DataFusion uses Arrow, and thus the Arrow type system, for query | ||
|
|
@@ -160,76 +216,15 @@ are mapped to Arrow types according to the following table | |
| | `CUSTOM` | *Not yet supported* | | ||
| | `ARRAY` | *Not yet supported* | | ||
|
|
||
| # Developer's guide | ||
|
|
||
| This section describes how you can get started at developing DataFusion. | ||
|
|
||
| ### Bootstrap environment | ||
|
|
||
| DataFusion is written in Rust and it uses a standard rust toolkit: | ||
|
|
||
| * `cargo build` | ||
| * `cargo fmt` to format the code | ||
| * `cargo test` to test | ||
| * etc. | ||
|
|
||
| ## How to add a new scalar function | ||
|
|
||
| Below is a checklist of what you need to do to add a new scalar function to DataFusion: | ||
|
|
||
| * Add the actual implementation of the function: | ||
| * [here](src/physical_plan/string_expressions.rs) for string functions | ||
| * [here](src/physical_plan/math_expressions.rs) for math functions | ||
| * [here](src/physical_plan/datetime_expressions.rs) for datetime functions | ||
| * create a new module [here](src/physical_plan) for other functions | ||
| * In [src/physical_plan/functions](src/physical_plan/functions.rs), add: | ||
| * a new variant to `BuiltinScalarFunction` | ||
| * a new entry to `FromStr` with the name of the function as called by SQL | ||
| * a new line in `return_type` with the expected return type of the function, given an incoming type | ||
| * a new line in `signature` with the signature of the function (number and types of its arguments) | ||
| * a new line in `create_physical_expr` mapping the built-in to the implementation | ||
| * tests to the function. | ||
| * In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. | ||
| * In [src/logical_plan/expr](src/logical_plan/expr.rs), add: | ||
| * a new entry of the `unary_scalar_expr!` macro for the new function. | ||
| * In [src/logical_plan/mod](src/logical_plan/mod.rs), add: | ||
| * a new entry in the `pub use expr::{}` set. | ||
|
|
||
| ## How to add a new aggregate function | ||
|
|
||
| Below is a checklist of what you need to do to add a new aggregate function to DataFusion: | ||
|
|
||
| * Add the actual implementation of an `Accumulator` and `AggregateExpr`: | ||
| * [here](src/physical_plan/string_expressions.rs) for string functions | ||
| * [here](src/physical_plan/math_expressions.rs) for math functions | ||
| * [here](src/physical_plan/datetime_expressions.rs) for datetime functions | ||
| * create a new module [here](src/physical_plan) for other functions | ||
| * In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add: | ||
| * a new variant to `BuiltinAggregateFunction` | ||
| * a new entry to `FromStr` with the name of the function as called by SQL | ||
| * a new line in `return_type` with the expected return type of the function, given an incoming type | ||
| * a new line in `signature` with the signature of the function (number and types of its arguments) | ||
| * a new line in `create_aggregate_expr` mapping the built-in to the implementation | ||
| * tests to the function. | ||
| * In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result. | ||
|
|
||
| ## How to display plans graphically | ||
|
|
||
| The query plans represented by `LogicalPlan` nodes can be graphically | ||
| rendered using [Graphviz](http://www.graphviz.org/). | ||
|
|
||
| To do so, save the output of the `display_graphviz` function to a file.: | ||
|
|
||
| ```rust | ||
| // Create plan somehow... | ||
| let mut output = File::create("/tmp/plan.dot")?; | ||
| write!(output, "{}", plan.display_graphviz()); | ||
| ``` | ||
| # Architecture Overview | ||
|
|
||
| Then, use the `dot` command line tool to render it into a file that | ||
| can be displayed. For example, the following command creates a | ||
| `/tmp/plan.pdf` file: | ||
| There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together. | ||
|
|
||
| ```bash | ||
| dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf | ||
| ``` | ||
| * (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) | ||
| * (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) | ||
|
|
||
|
|
||
| # Developer's guide | ||
|
|
||
| Please see [Developers Guide](DEVELOPERS.md) for information about developing DataFusion. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled this file into its own separate file so that it didn't appear on https://crates.io/crates/datafusion