Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions DEVELOPERS.md
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,19 @@ can be displayed. For example, the following command creates a
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```

## Specification

We formalize Datafusion semantics and behaviors through specification
documents. These specifications are useful to be used as references to help
resolve ambiguities during development or code reviews.

You are also welcome to propose changes to existing specifications or create
new specifications as you see fit.

Here is the list current active specifications:

* [Output field name semantic](docs/specification/output-field-name-semantic.md)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


## How to format `.md` document

We are using `prettier` to format `.md` files.
Expand Down
220 changes: 220 additions & 0 deletions docs/specification/output-field-name-semantic.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,220 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Datafusion output field name semantic

This specification documents how field names in output record batches should be
generated based on given user queries. The filed name rules apply to
Datafusion queries planned from both SQL queries and Dataframe APIs.

## Field name rules

* All field names MUST not contain relation/table qualifier.
* Both `SELECT t1.id`, `SELECT id` and `df.select_columns(&["id"])` SHOULD result in field name: `id`
* Function names MUST be converted to lowercase.
* `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
* Literal string MUST not be wrapped with quotes or double quotes.
* `SELECT 'foo'` SHOULD result in field name: `foo`
* Operator expressions MUST be wrapped with parentheses.
* `SELECT -2` SHOULD result in field name: `(- 2)`
* Operator and operand MUST be separated by spaces.
* `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
* Function arguments MUST be separated by a comma `,` and a space.
* `SELECT f(c1,c2)` and `df.select(vec![f.udf("f")?.call(vec![col("c1"), col("c2")])])` SHOULD result in field name: `f(c1, c2)`

## Appendices

### Examples and comparison with other systems

Data schema for test sample queries:

```
CREATE TABLE t1 (id INT, a VARCHAR(5));
INSERT INTO t1 (id, a) VALUES (1, 'foo');
INSERT INTO t1 (id, a) VALUES (2, 'bar');

CREATE TABLE t2 (id INT, b VARCHAR(5));
INSERT INTO t2 (id, b) VALUES (1, 'hello');
INSERT INTO t2 (id, b) VALUES (2, 'world');
```

#### Projected columns

Query:

```
SELECT t1.id, a, t2.id, b
FROM t1
JOIN t2 ON t1.id = t2.id
```

Datafusion Arrow record batches output:

| id | a | id | b |
|----|-----|----|-------|
| 1 | foo | 1 | hello |
| 2 | bar | 2 | world |


Spark, MySQL 8 and PostgreSQL 13 output:

| id | a | id | b |
|----|-----|----|-------|
| 1 | foo | 1 | hello |
| 2 | bar | 2 | world |

SQLite 3 output:

| id | a | b |
|----|-----|-------|
| 1 | foo | hello |
| 2 | bar | world |


#### Function transformed columns

Query:

```
SELECT ABS(t1.id), abs(-id) FROM t1;
```

Datafusion Arrow record batches output:

| abs(id) | abs((- id)) |
|---------|-------------|
| 1 | 1 |
| 2 | 2 |


Spark output:

| abs(id) | abs((- id)) |
|---------|-------------|
| 1 | 1 |
| 2 | 2 |


MySQL 8 output:

| ABS(t1.id) | abs(-id) |
|------------|----------|
| 1 | 1 |
| 2 | 2 |

PostgreSQL 13 output:

| abs | abs |
|-----|-----|
| 1 | 1 |
| 2 | 2 |

SQlite 3 output:

| ABS(t1.id) | abs(-id) |
|------------|----------|
| 1 | 1 |
| 2 | 2 |


#### Function with operators

Query:

```
SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
```

Datafusion Arrow record batches output:

| id + abs(id) | abs(id * id) |
|--------------|--------------|
| 2 | 1 |
| 4 | 4 |


Spark output:

| id + abs(id) | abs(id * id) |
|--------------|--------------|
| 2 | 1 |
| 4 | 4 |

MySQL 8 output:

| t1.id + ABS(id) | ABS(id * t1.id) |
|-----------------|-----------------|
| 2 | 1 |
| 4 | 4 |

PostgreSQL output:

| ?column? | abs |
|----------|-----|
| 2 | 1 |
| 4 | 4 |

SQLite output:

| t1.id + ABS(id) | ABS(id * t1.id) |
|-----------------|-----------------|
| 2 | 1 |
| 4 | 4 |


#### Project literals

Query:

```
SELECT 1, 2+5, 'foo_bar';
```

Datafusion Arrow record batches output:

| 1 | (2 + 5) | foo_bar |
|---|---------|---------|
| 1 | 7 | foo_bar |


Spark output:

| 1 | (2 + 5) | foo_bar |
|---|---------|---------|
| 1 | 7 | foo_bar |

MySQL output:

| 1 | 2+5 | foo_bar |
|---|-----|---------|
| 1 | 7 | foo_bar |


PostgreSQL output:

| ?column? | ?column? | ?column? |
|----------|----------|----------|
| 1 | 7 | foo_bar |


SQLite 3 output:

| 1 | 2+5 | 'foo_bar' |
|---|-----|-----------|
| 1 | 7 | foo_bar |