apache · alamb · Mar 14, 2021 · Mar 14, 2021 · Mar 15, 2021 · Mar 15, 2021
diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md
@@ -0,0 +1,92 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Developer's guide
+
+This section describes how you can get started at developing DataFusion.
+
+### Bootstrap environment
+
+DataFusion is written in Rust and it uses a standard rust toolkit:
+
+* `cargo build`
+* `cargo fmt` to format the code
+* `cargo test` to test
+* etc.
+
+## How to add a new scalar function
+
+Below is a checklist of what you need to do to add a new scalar function to DataFusion:
+
+* Add the actual implementation of the function:
+  * [here](src/physical_plan/string_expressions.rs) for string functions
+  * [here](src/physical_plan/math_expressions.rs) for math functions
+  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
+  * create a new module [here](src/physical_plan) for other functions
+* In [src/physical_plan/functions](src/physical_plan/functions.rs), add:
+  * a new variant to `BuiltinScalarFunction`
+  * a new entry to `FromStr` with the name of the function as called by SQL
+  * a new line in `return_type` with the expected return type of the function, given an incoming type
+  * a new line in `signature` with the signature of the function (number and types of its arguments)
+  * a new line in `create_physical_expr` mapping the built-in to the implementation
+  * tests to the function.
+* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+* In [src/logical_plan/expr](src/logical_plan/expr.rs), add:
+  * a new entry of the `unary_scalar_expr!` macro for the new function.
+* In [src/logical_plan/mod](src/logical_plan/mod.rs), add:
+  * a new entry in the `pub use expr::{}` set.
+
+## How to add a new aggregate function
+
+Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
+
+* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
+  * [here](src/physical_plan/string_expressions.rs) for string functions
+  * [here](src/physical_plan/math_expressions.rs) for math functions
+  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
+  * create a new module [here](src/physical_plan) for other functions
+* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add:
+  * a new variant to `BuiltinAggregateFunction`
+  * a new entry to `FromStr` with the name of the function as called by SQL
+  * a new line in `return_type` with the expected return type of the function, given an incoming type
+  * a new line in `signature` with the signature of the function (number and types of its arguments)
+  * a new line in `create_aggregate_expr` mapping the built-in to the implementation
+  * tests to the function.
+* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+
+## How to display plans graphically
+
+The query plans represented by `LogicalPlan` nodes can be graphically
+rendered using [Graphviz](http://www.graphviz.org/).
+
+To do so, save the output of the `display_graphviz` function to a file.:
+
+```rust
+// Create plan somehow...
+let mut output = File::create("/tmp/plan.dot")?;
+write!(output, "{}", plan.display_graphviz());
+```
+
+Then, use the `dot` command line tool to render it into a file that
+can be displayed. For example, the following command creates a
+`/tmp/plan.pdf` file:
+
+```bash
+dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
+```
diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md
@@ -19,11 +19,51 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
+<img src="docs/images/DataFusion-Logo-Dark.svg" width="256"/>
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+* *High Quality*:  Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
+* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
+* [Cube.js](https://github.com/cube-js/cube.js)
+* [datafusion-python](https://pypi.org/project/datafusion)
+* [delta-rs](https://github.com/delta-io/delta-rs)
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
+* [ROAPI](https://github.com/roapi/roapi)
+
+(if you know of another project, please submit a PR to add a link!)
+
 
 ## Using DataFusion as a library
 
-DataFusion can be used as a library by adding the following to your `Cargo.toml` file.
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
 
 ```toml
 [dependencies]
@@ -32,7 +72,7 @@ datafusion = "4.0.0-SNAPSHOT"
 
 ## Using DataFusion as a binary
 
-DataFusion includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information.
+DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information.
 
 # Status
 
@@ -41,8 +81,11 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] SQL Parser
 - [x] SQL Query Planner
 - [x] Query Optimizer
-- [x] Projection push down
-- [x] Predicate push down
+ - [x] Constant folding
+ - [x] Join Reordering
+ - [x] Limit Pushdown
+ - [x] Projection push down
+ - [x] Predicate push down
 - [x] Type coercion
 - [x] Parallel query execution
 
@@ -53,12 +96,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] Filter post-aggregate (HAVING)
 - [x] Limit
 - [x] Aggregate
-- [x] UDFs (user-defined functions)
-- [x] UDAFs (user-defined aggregate functions)
 - [x] Common math functions
-- String functions
+- Postgres compatible String functions
   - [x] ascii
-  - [x] bit_Length
+  - [x] bit_length
   - [x] btrim
   - [x] char_length
   - [x] character_length
@@ -97,7 +138,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [ ] Nested types
 - [ ] Lists
 - [x] Subqueries
-- [ ] Joins
+- [x] Joins
+  - [x] INNER JOIN
+  - [ ] CROSS JOIN
+  - [ ] OUTER JOIN
 - [ ] Window
 
 ## Data Sources
@@ -106,9 +150,22 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] Parquet primitive types
 - [ ] Parquet nested types
 
+
+## Extensibility
+
+DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
+
+- [x] User Defined Functions (UDFs)
+- [x] User Defined Aggregate Functions (UDAFs)
+- [x] User Defined Table Source (`TableProvider`) for tables
+- [x] User Defined `Optimizer` passes (plan rewrites)
+- [x] User Defined `LogicalPlan` nodes
+- [x] User Defined `ExecutionPlan` nodes
+
+
 # Supported SQL
 
-This library currently supports the following SQL constructs:
+This library currently supports many SQL constructs, including
 
 * `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
 * `SELECT ... FROM ...` together with any expression
@@ -125,7 +182,6 @@ DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https:/
 
 Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations.
 
-
 ## Supported Data Types
 
 DataFusion uses Arrow, and thus the Arrow type system, for query
@@ -160,76 +216,15 @@ are mapped to Arrow types according to the following table
 | `CUSTOM`        | *Not yet supported*              |
 | `ARRAY`         | *Not yet supported*              |
 
-# Developer's guide
 
-This section describes how you can get started at developing DataFusion.
-
-### Bootstrap environment
-
-DataFusion is written in Rust and it uses a standard rust toolkit:
-
-* `cargo build`
-* `cargo fmt` to format the code
-* `cargo test` to test
-* etc.
-
-## How to add a new scalar function
-
-Below is a checklist of what you need to do to add a new scalar function to DataFusion:
-
-* Add the actual implementation of the function:
-  * [here](src/physical_plan/string_expressions.rs) for string functions
-  * [here](src/physical_plan/math_expressions.rs) for math functions
-  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](src/physical_plan) for other functions
-* In [src/physical_plan/functions](src/physical_plan/functions.rs), add:
-  * a new variant to `BuiltinScalarFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_physical_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
-* In [src/logical_plan/expr](src/logical_plan/expr.rs), add:
-  * a new entry of the `unary_scalar_expr!` macro for the new function.
-* In [src/logical_plan/mod](src/logical_plan/mod.rs), add:
-  * a new entry in the `pub use expr::{}` set.
-
-## How to add a new aggregate function
-
-Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
-
-* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
-  * [here](src/physical_plan/string_expressions.rs) for string functions
-  * [here](src/physical_plan/math_expressions.rs) for math functions
-  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](src/physical_plan) for other functions
-* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add:
-  * a new variant to `BuiltinAggregateFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_aggregate_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
-
-## How to display plans graphically
-
-The query plans represented by `LogicalPlan` nodes can be graphically
-rendered using [Graphviz](http://www.graphviz.org/).
-
-To do so, save the output of the `display_graphviz` function to a file.:
-
-```rust
-// Create plan somehow...
-let mut output = File::create("/tmp/plan.dot")?;
-write!(output, "{}", plan.display_graphviz());
-```
+# Architecture Overview
 
-Then, use the `dot` command line tool to render it into a file that
-can be displayed. For example, the following command creates a
-`/tmp/plan.pdf` file:
+There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
 
-```bash
-dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
-```
+* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+
+
+# Developer's guide
+
+Please see [Developers Guide](DEVELOPERS.md) for information about developing DataFusion.
diff --git a/rust/datafusion/src/lib.rs b/rust/datafusion/src/lib.rs
@@ -23,7 +23,8 @@
     clippy::type_complexity
 )]
 
-//! DataFusion is an extensible query execution framework that uses
+//! [DataFusion](https://github.com/apache/arrow/tree/master/rust/datafusion)
+//! is an extensible query execution framework that uses
 //! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
 //!
 //! DataFusion supports both an SQL and a DataFrame API for building logical query plans