Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
92 changes: 92 additions & 0 deletions rust/datafusion/DEVELOPERS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Developer's guide
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled this file into its own separate file so that it didn't appear on https://crates.io/crates/datafusion


This section describes how you can get started at developing DataFusion.

### Bootstrap environment

DataFusion is written in Rust and it uses a standard rust toolkit:

* `cargo build`
* `cargo fmt` to format the code
* `cargo test` to test
* etc.

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

* Add the actual implementation of the function:
* [here](src/physical_plan/string_expressions.rs) for string functions
* [here](src/physical_plan/math_expressions.rs) for math functions
* [here](src/physical_plan/datetime_expressions.rs) for datetime functions
* create a new module [here](src/physical_plan) for other functions
* In [src/physical_plan/functions](src/physical_plan/functions.rs), add:
* a new variant to `BuiltinScalarFunction`
* a new entry to `FromStr` with the name of the function as called by SQL
* a new line in `return_type` with the expected return type of the function, given an incoming type
* a new line in `signature` with the signature of the function (number and types of its arguments)
* a new line in `create_physical_expr` mapping the built-in to the implementation
* tests to the function.
* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
* In [src/logical_plan/expr](src/logical_plan/expr.rs), add:
* a new entry of the `unary_scalar_expr!` macro for the new function.
* In [src/logical_plan/mod](src/logical_plan/mod.rs), add:
* a new entry in the `pub use expr::{}` set.

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
* [here](src/physical_plan/string_expressions.rs) for string functions
* [here](src/physical_plan/math_expressions.rs) for math functions
* [here](src/physical_plan/datetime_expressions.rs) for datetime functions
* create a new module [here](src/physical_plan) for other functions
* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add:
* a new variant to `BuiltinAggregateFunction`
* a new entry to `FromStr` with the name of the function as called by SQL
* a new line in `return_type` with the expected return type of the function, given an incoming type
* a new line in `signature` with the signature of the function (number and types of its arguments)
* a new line in `create_aggregate_expr` mapping the built-in to the implementation
* tests to the function.
* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.

## How to display plans graphically

The query plans represented by `LogicalPlan` nodes can be graphically
rendered using [Graphviz](http://www.graphviz.org/).

To do so, save the output of the `display_graphviz` function to a file.:

```rust
// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz());
```

Then, use the `dot` command line tool to render it into a file that
can be displayed. For example, the following command creates a
`/tmp/plan.pdf` file:

```bash
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```
159 changes: 77 additions & 82 deletions rust/datafusion/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,51 @@

# DataFusion

DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
<img src="docs/images/DataFusion-Logo-Dark.svg" width="256"/>

DataFusion is an extensible query execution framework, written in
Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
in-memory format.

DataFusion supports both an SQL and a DataFrame API for building
logical query plans as well as a query optimizer and execution engine
capable of parallel execution against partitioned data sources (CSV
and Parquet) using threads.

## Use Cases
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing this page might benefit from is an example (e.g. look at how tokio does it https://crates.io/crates/tokio)

Perhaps we could just lift the nice example from https://docs.rs/datafusion/3.0.0/datafusion/ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I will do so as a follow on PR: #9710


DataFusion is used to create modern, fast and efficient data
pipelines, ETL processes, and database systems, which need the
performance of Rust and Apache Arrow and want to provide their users
the convenience of an SQL interface or a DataFrame API.

## Why DataFusion?

* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

## Known Uses

Here are some of the projects known to use DataFusion:

* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi @rdettai

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* [Cube.js](https://github.com/cube-js/cube.js)
* [datafusion-python](https://pypi.org/project/datafusion)
* [delta-rs](https://github.com/delta-io/delta-rs)
* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
* [ROAPI](https://github.com/roapi/roapi)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYi @houqp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


(if you know of another project, please submit a PR to add a link!)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



## Using DataFusion as a library

DataFusion can be used as a library by adding the following to your `Cargo.toml` file.
DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).

To get started, add the following to your `Cargo.toml` file:

```toml
[dependencies]
Expand All @@ -32,7 +72,7 @@ datafusion = "4.0.0-SNAPSHOT"

## Using DataFusion as a binary

DataFusion includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information.
DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](docs/cli.md) for more information.

# Status

Expand All @@ -41,8 +81,11 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
- [x] SQL Parser
- [x] SQL Query Planner
- [x] Query Optimizer
- [x] Projection push down
- [x] Predicate push down
- [x] Constant folding
- [x] Join Reordering
- [x] Limit Pushdown
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+Join reordering

- [x] Projection push down
- [x] Predicate push down
- [x] Type coercion
- [x] Parallel query execution

Expand All @@ -53,12 +96,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
- [x] Filter post-aggregate (HAVING)
- [x] Limit
- [x] Aggregate
- [x] UDFs (user-defined functions)
- [x] UDAFs (user-defined aggregate functions)
- [x] Common math functions
- String functions
- Postgres compatible String functions
- [x] ascii
- [x] bit_Length
- [x] bit_length
- [x] btrim
- [x] char_length
- [x] character_length
Expand Down Expand Up @@ -97,7 +138,10 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
- [ ] Nested types
- [ ] Lists
- [x] Subqueries
- [ ] Joins
- [x] Joins
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inner/Left/right/ (no cross and outer joins)

- [x] INNER JOIN
- [ ] CROSS JOIN
- [ ] OUTER JOIN
- [ ] Window

## Data Sources
Expand All @@ -106,9 +150,22 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
- [x] Parquet primitive types
- [ ] Parquet nested types


## Extensibility

DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:

- [x] User Defined Functions (UDFs)
- [x] User Defined Aggregate Functions (UDAFs)
- [x] User Defined Table Source (`TableProvider`) for tables
- [x] User Defined `Optimizer` passes (plan rewrites)
- [x] User Defined `LogicalPlan` nodes
- [x] User Defined `ExecutionPlan` nodes


# Supported SQL

This library currently supports the following SQL constructs:
This library currently supports many SQL constructs, including

* `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
* `SELECT ... FROM ...` together with any expression
Expand All @@ -125,7 +182,6 @@ DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https:/

Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations.


## Supported Data Types

DataFusion uses Arrow, and thus the Arrow type system, for query
Expand Down Expand Up @@ -160,76 +216,15 @@ are mapped to Arrow types according to the following table
| `CUSTOM` | *Not yet supported* |
| `ARRAY` | *Not yet supported* |

# Developer's guide

This section describes how you can get started at developing DataFusion.

### Bootstrap environment

DataFusion is written in Rust and it uses a standard rust toolkit:

* `cargo build`
* `cargo fmt` to format the code
* `cargo test` to test
* etc.

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

* Add the actual implementation of the function:
* [here](src/physical_plan/string_expressions.rs) for string functions
* [here](src/physical_plan/math_expressions.rs) for math functions
* [here](src/physical_plan/datetime_expressions.rs) for datetime functions
* create a new module [here](src/physical_plan) for other functions
* In [src/physical_plan/functions](src/physical_plan/functions.rs), add:
* a new variant to `BuiltinScalarFunction`
* a new entry to `FromStr` with the name of the function as called by SQL
* a new line in `return_type` with the expected return type of the function, given an incoming type
* a new line in `signature` with the signature of the function (number and types of its arguments)
* a new line in `create_physical_expr` mapping the built-in to the implementation
* tests to the function.
* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
* In [src/logical_plan/expr](src/logical_plan/expr.rs), add:
* a new entry of the `unary_scalar_expr!` macro for the new function.
* In [src/logical_plan/mod](src/logical_plan/mod.rs), add:
* a new entry in the `pub use expr::{}` set.

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
* [here](src/physical_plan/string_expressions.rs) for string functions
* [here](src/physical_plan/math_expressions.rs) for math functions
* [here](src/physical_plan/datetime_expressions.rs) for datetime functions
* create a new module [here](src/physical_plan) for other functions
* In [src/physical_plan/aggregates](src/physical_plan/aggregates.rs), add:
* a new variant to `BuiltinAggregateFunction`
* a new entry to `FromStr` with the name of the function as called by SQL
* a new line in `return_type` with the expected return type of the function, given an incoming type
* a new line in `signature` with the signature of the function (number and types of its arguments)
* a new line in `create_aggregate_expr` mapping the built-in to the implementation
* tests to the function.
* In [tests/sql.rs](tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.

## How to display plans graphically

The query plans represented by `LogicalPlan` nodes can be graphically
rendered using [Graphviz](http://www.graphviz.org/).

To do so, save the output of the `display_graphviz` function to a file.:

```rust
// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz());
```
# Architecture Overview

Then, use the `dot` command line tool to render it into a file that
can be displayed. For example, the following command creates a
`/tmp/plan.pdf` file:
There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.

```bash
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```
* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)


# Developer's guide

Please see [Developers Guide](DEVELOPERS.md) for information about developing DataFusion.
3 changes: 2 additions & 1 deletion rust/datafusion/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@
clippy::type_complexity
)]

//! DataFusion is an extensible query execution framework that uses
//! [DataFusion](https://github.com/apache/arrow/tree/master/rust/datafusion)
//! is an extensible query execution framework that uses
//! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
//!
//! DataFusion supports both an SQL and a DataFrame API for building logical query plans
Expand Down