From d57927d567ec849706d4a073c68dba9133b7ad67 Mon Sep 17 00:00:00 2001
From: Andrew Lamb <andrew@nerdnetworks.org>
Date: Sun, 14 Mar 2021 07:42:20 -0400
Subject: [PATCH 1/6] ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

---
 rust/datafusion/DEVELOPERS.md |  79 +++++++++++++++++++
 rust/datafusion/README.md     | 143 +++++++++++++++-------------------
 rust/datafusion/src/lib.rs    |   3 +-
 3 files changed, 142 insertions(+), 83 deletions(-)
 create mode 100644 rust/datafusion/DEVELOPERS.md

diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md
new file mode 100644
index 00000000000..e92fc3d9a95
--- /dev/null
+++ b/rust/datafusion/DEVELOPERS.md
@@ -0,0 +1,79 @@
+# Developer's guide
+
+This section describes how you can get started at developing DataFusion.
+
+### Bootstrap environment
+
+DataFusion is written in Rust and it uses a standard rust toolkit:
+
+* `cargo build`
+* `cargo fmt` to format the code
+* `cargo test` to test
+* etc.
+
+### Architecture Overview
+
+* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+
+
+## How to add a new scalar function
+
+Below is a checklist of what you need to do to add a new scalar function to DataFusion:
+
+* Add the actual implementation of the function:
+  * [here](src/physical_plan/string_expressions.rs) for string functions
+  * [here](src/physical_plan/math_expressions.rs) for math functions
+  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
+  * create a new module [here](src/physical_plan) for other functions
+* In [src/physical_plan/functions](src/physical_plan/functions.rs), add:
+  * a new variant to `BuiltinScalarFunction`
+  * a new entry to `FromStr` with the name of the function as called by SQL
+  * a new line in `return_type` with the expected return type of the function, given an incoming type
+  * a new line in `signature` with the signature of the function (number and types of its arguments)
+  * a new line in `create_physical_expr` mapping the built-in to the implementation
+  * tests to the function.
+* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+* In [src/logical_plan/expr](src/logical_plan/expr.rs), add:
+  * a new entry of the `unary_scalar_expr!` macro for the new function.
+* In [src/logical_plan/mod](src/logical_plan/mod.rs), add:
+  * a new entry in the `pub use expr::{}` set.
+
+## How to add a new aggregate function
+
+Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
+
+* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
+  * [here](src/physical_plan/string_expressions.rs) for string functions
+  * [here](src/physical_plan/math_expressions.rs) for math functions
+  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
+  * create a new module [here](src/physical_plan) for other functions
+* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add:
+  * a new variant to `BuiltinAggregateFunction`
+  * a new entry to `FromStr` with the name of the function as called by SQL
+  * a new line in `return_type` with the expected return type of the function, given an incoming type
+  * a new line in `signature` with the signature of the function (number and types of its arguments)
+  * a new line in `create_aggregate_expr` mapping the built-in to the implementation
+  * tests to the function.
+* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+
+## How to display plans graphically
+
+The query plans represented by `LogicalPlan` nodes can be graphically
+rendered using [Graphviz](http://www.graphviz.org/).
+
+To do so, save the output of the `display_graphviz` function to a file.:
+
+```rust
+// Create plan somehow...
+let mut output = File::create("/tmp/plan.dot")?;
+write!(output, "{}", plan.display_graphviz());
+```
+
+Then, use the `dot` command line tool to render it into a file that
+can be displayed. For example, the following command creates a
+`/tmp/plan.pdf` file:
+
+```bash
+dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
+```
diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md
index f8d0d92b516..9b31968e093 100644
--- a/rust/datafusion/README.md
+++ b/rust/datafusion/README.md
@@ -19,11 +19,48 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
+<img src="docs/images/DataFusion-Logo-Dark.svg" width="256"/>
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+* *High Quality*:  Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
+* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
+* [ROAPI](https://github.com/roapi/roapi)
+
+(if you know of another project, please submit a PR to add a link!)
+
 
 ## Using DataFusion as a library
 
-DataFusion can be used as a library by adding the following to your `Cargo.toml` file.
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
 
 ```toml
 [dependencies]
@@ -32,7 +69,7 @@ datafusion = "4.0.0-SNAPSHOT"
 
 ## Using DataFusion as a binary
 
-DataFusion includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information.
+DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information.
 
 # Status
 
@@ -41,8 +78,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] SQL Parser
 - [x] SQL Query Planner
 - [x] Query Optimizer
-- [x] Projection push down
-- [x] Predicate push down
+ - [x] Projection push down
+ - [x] Predicate push down
+ - [x] Constant folding
+ - [x] Limit Pushdown
 - [x] Type coercion
 - [x] Parallel query execution
 
@@ -53,12 +92,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] Filter post-aggregate (HAVING)
 - [x] Limit
 - [x] Aggregate
-- [x] UDFs (user-defined functions)
-- [x] UDAFs (user-defined aggregate functions)
 - [x] Common math functions
-- String functions
+- Postgres compatible String functions
   - [x] ascii
-  - [x] bit_Length
+  - [x] bit_length
   - [x] btrim
   - [x] char_length
   - [x] character_length
@@ -97,7 +134,7 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [ ] Nested types
 - [ ] Lists
 - [x] Subqueries
-- [ ] Joins
+- [x] Joins
 - [ ] Window
 
 ## Data Sources
@@ -106,6 +143,19 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] Parquet primitive types
 - [ ] Parquet nested types
 
+
+## Extensibility
+
+DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
+
+- [x] User Defined Functions (UDFs)
+- [x] User Defined Aggregate Functions (UDAFs)
+- [x] User Defined Table Source (`TableProvider`) for tables
+- [x] User Defined `Optimizer` passes (plan rewrites)
+- [x] User Defined `LogicalPlan` nodes
+- [x] User Defined `ExecutionPlan` nodes
+
+
 # Supported SQL
 
 This library currently supports the following SQL constructs:
@@ -125,7 +175,6 @@ DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https:/
 
 Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations.
 
-
 ## Supported Data Types
 
 DataFusion uses Arrow, and thus the Arrow type system, for query
@@ -162,74 +211,4 @@ are mapped to Arrow types according to the following table
 
 # Developer's guide
 
-This section describes how you can get started at developing DataFusion.
-
-### Bootstrap environment
-
-DataFusion is written in Rust and it uses a standard rust toolkit:
-
-* `cargo build`
-* `cargo fmt` to format the code
-* `cargo test` to test
-* etc.
-
-## How to add a new scalar function
-
-Below is a checklist of what you need to do to add a new scalar function to DataFusion:
-
-* Add the actual implementation of the function:
-  * [here](src/physical_plan/string_expressions.rs) for string functions
-  * [here](src/physical_plan/math_expressions.rs) for math functions
-  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](src/physical_plan) for other functions
-* In [src/physical_plan/functions](src/physical_plan/functions.rs), add:
-  * a new variant to `BuiltinScalarFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_physical_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
-* In [src/logical_plan/expr](src/logical_plan/expr.rs), add:
-  * a new entry of the `unary_scalar_expr!` macro for the new function.
-* In [src/logical_plan/mod](src/logical_plan/mod.rs), add:
-  * a new entry in the `pub use expr::{}` set.
-
-## How to add a new aggregate function
-
-Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
-
-* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
-  * [here](src/physical_plan/string_expressions.rs) for string functions
-  * [here](src/physical_plan/math_expressions.rs) for math functions
-  * [here](src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](src/physical_plan) for other functions
-* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add:
-  * a new variant to `BuiltinAggregateFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_aggregate_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
-
-## How to display plans graphically
-
-The query plans represented by `LogicalPlan` nodes can be graphically
-rendered using [Graphviz](http://www.graphviz.org/).
-
-To do so, save the output of the `display_graphviz` function to a file.:
-
-```rust
-// Create plan somehow...
-let mut output = File::create("/tmp/plan.dot")?;
-write!(output, "{}", plan.display_graphviz());
-```
-
-Then, use the `dot` command line tool to render it into a file that
-can be displayed. For example, the following command creates a
-`/tmp/plan.pdf` file:
-
-```bash
-dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
-```
+Please see [Developers Guide](DEVELOPERS.md)
diff --git a/rust/datafusion/src/lib.rs b/rust/datafusion/src/lib.rs
index 73dca51a8cf..5126f90f89d 100644
--- a/rust/datafusion/src/lib.rs
+++ b/rust/datafusion/src/lib.rs
@@ -23,7 +23,8 @@
     clippy::type_complexity
 )]
 
-//! DataFusion is an extensible query execution framework that uses
+//! [DataFusion](https://github.com/apache/arrow/tree/master/rust/datafusion)
+//! is an extensible query execution framework that uses
 //! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
 //!
 //! DataFusion supports both an SQL and a DataFrame API for building logical query plans

From 41b7fc83b6c9e2f3af35792f5a2fc6bdc8dfd443 Mon Sep 17 00:00:00 2001
From: Andrew Lamb <andrew@nerdnetworks.org>
Date: Sun, 14 Mar 2021 08:24:38 -0400
Subject: [PATCH 2/6] add apache license

---
 rust/datafusion/DEVELOPERS.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md
index e92fc3d9a95..ea3a389c86b 100644
--- a/rust/datafusion/DEVELOPERS.md
+++ b/rust/datafusion/DEVELOPERS.md
@@ -1,3 +1,22 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 # Developer's guide
 
 This section describes how you can get started at developing DataFusion.

From 3e956ec26350d4d0017ebf5206e44d6b828f0856 Mon Sep 17 00:00:00 2001
From: Andrew Lamb <andrew@nerdnetworks.org>
Date: Mon, 15 Mar 2021 09:30:58 -0400
Subject: [PATCH 3/6] Add other projects, update feature list

---
 rust/datafusion/README.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md
index 9b31968e093..25cbe5ba72e 100644
--- a/rust/datafusion/README.md
+++ b/rust/datafusion/README.md
@@ -48,9 +48,11 @@ the convenience of an SQL interface or a DataFrame API.
 
 Here are some of the projects known to use DataFusion:
 
-* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
 * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
 * [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
+* [Cube.js](https://github.com/cube-js/cube.js)
+* [delta-rs](https://github.com/delta-io/delta-rs)
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
 * [ROAPI](https://github.com/roapi/roapi)
 
 (if you know of another project, please submit a PR to add a link!)
@@ -78,10 +80,11 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
 - [x] SQL Parser
 - [x] SQL Query Planner
 - [x] Query Optimizer
- - [x] Projection push down
- - [x] Predicate push down
  - [x] Constant folding
+ - [x] Join Reordering
  - [x] Limit Pushdown
+ - [x] Projection push down
+ - [x] Predicate push down
 - [x] Type coercion
 - [x] Parallel query execution
 
@@ -135,6 +138,9 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
 - [ ] Lists
 - [x] Subqueries
 - [x] Joins
+ - [x] INNER JOIN
+ - [ ] CROSS JOIN
+ - [] OUTER JOIN
 - [ ] Window
 
 ## Data Sources

From 4b0af441657c2765a9d2685fd14941406a33341d Mon Sep 17 00:00:00 2001
From: Andrew Lamb <andrew@nerdnetworks.org>
Date: Mon, 15 Mar 2021 11:48:40 -0400
Subject: [PATCH 4/6] More improvements

---
 rust/datafusion/DEVELOPERS.md |  6 ------
 rust/datafusion/README.md     | 14 ++++++++++++--
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/rust/datafusion/DEVELOPERS.md b/rust/datafusion/DEVELOPERS.md
index ea3a389c86b..aa80cb71d3b 100644
--- a/rust/datafusion/DEVELOPERS.md
+++ b/rust/datafusion/DEVELOPERS.md
@@ -30,12 +30,6 @@ DataFusion is written in Rust and it uses a standard rust toolkit:
 * `cargo test` to test
 * etc.
 
-### Architecture Overview
-
-* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-
-
 ## How to add a new scalar function
 
 Below is a checklist of what you need to do to add a new scalar function to DataFusion:
diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md
index 25cbe5ba72e..6d658f29106 100644
--- a/rust/datafusion/README.md
+++ b/rust/datafusion/README.md
@@ -39,7 +39,7 @@ the convenience of an SQL interface or a DataFrame API.
 
 ## Why DataFusion?
 
-* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
 * *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
 * *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
 * *High Quality*:  Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
@@ -51,6 +51,7 @@ Here are some of the projects known to use DataFusion:
 * [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
 * [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
 * [Cube.js](https://github.com/cube-js/cube.js)
+* [datafusion-python](https://pypi.org/project/datafusion)
 * [delta-rs](https://github.com/delta-io/delta-rs)
 * [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
 * [ROAPI](https://github.com/roapi/roapi)
@@ -215,6 +216,15 @@ are mapped to Arrow types according to the following table
 | `CUSTOM`        | *Not yet supported*              |
 | `ARRAY`         | *Not yet supported*              |
 
+
+# Architecture Overview
+
+There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
+
+* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+
+
 # Developer's guide
 
-Please see [Developers Guide](DEVELOPERS.md)
+Please see [Developers Guide](DEVELOPERS.md) for information about developing DataFusion.

From 5ef9f51ffa4a8701820172b460acddce44f89eed Mon Sep 17 00:00:00 2001
From: Andrew Lamb <andrew@nerdnetworks.org>
Date: Mon, 15 Mar 2021 11:50:27 -0400
Subject: [PATCH 5/6] tweak

---
 rust/datafusion/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md
index 6d658f29106..5fc5687bad2 100644
--- a/rust/datafusion/README.md
+++ b/rust/datafusion/README.md
@@ -141,7 +141,7 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
 - [x] Joins
  - [x] INNER JOIN
  - [ ] CROSS JOIN
- - [] OUTER JOIN
+ - [ ] OUTER JOIN
 - [ ] Window
 
 ## Data Sources
@@ -165,7 +165,7 @@ DataFusion is designed to be extensible at all points. To that end, you can prov
 
 # Supported SQL
 
-This library currently supports the following SQL constructs:
+This library currently supports many SQL constructs, including
 
 * `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
 * `SELECT ... FROM ...` together with any expression

From a57ea367ddcfb1bc26632dd610bbd7415f5e3002 Mon Sep 17 00:00:00 2001
From: Andrew Lamb <andrew@nerdnetworks.org>
Date: Mon, 15 Mar 2021 11:50:58 -0400
Subject: [PATCH 6/6] fix indent

---
 rust/datafusion/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/rust/datafusion/README.md b/rust/datafusion/README.md
index 5fc5687bad2..2b69b8a3dec 100644
--- a/rust/datafusion/README.md
+++ b/rust/datafusion/README.md
@@ -139,9 +139,9 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
 - [ ] Lists
 - [x] Subqueries
 - [x] Joins
- - [x] INNER JOIN
- - [ ] CROSS JOIN
- - [ ] OUTER JOIN
+  - [x] INNER JOIN
+  - [ ] CROSS JOIN
+  - [ ] OUTER JOIN
 - [ ] Window
 
 ## Data Sources