From a74460a82cac2cf48c19ec892f59960ab88cc763 Mon Sep 17 00:00:00 2001
From: Qingping Hou <dave2008713@gmail.com>
Date: Mon, 24 May 2021 22:24:29 -0700
Subject: [PATCH 1/3] add output field name rfc

---
 docs/rfcs/README.md                     |  28 +++
 docs/rfcs/output-field-name-semantic.md | 236 ++++++++++++++++++++++++
 2 files changed, 264 insertions(+)
 create mode 100644 docs/rfcs/README.md
 create mode 100644 docs/rfcs/output-field-name-semantic.md

diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md
new file mode 100644
index 0000000000000..3f2a84c30e142
--- /dev/null
+++ b/docs/rfcs/README.md
@@ -0,0 +1,28 @@
+# Datafusion RFCs
+
+## Motivation
+
+The RFCs (request for comments) provides a consistent and controlled path for
+Datafusion developers to propose formalized semantics and non-trivial changes
+to the project.
+
+## Creating new RFC
+
+* Create a new markdown file within the `rfcs` folder with RFC title as the file name.
+  * At the very minimal, a RFC should contain the following sections:
+    * Summary
+    * Motivation
+    * Detailed design
+    * Unresolved questions
+* Send a PR for proposed RFC.
+* Once a RFC PR is reviewed and merged, the RFC is considered accepted and active.
+
+## Updating existing RFC
+
+Minor changes can be applied to the existing RFCs directly via follow-up PRs.
+Exactly what counts as minor changes is up to the committers to decide.
+
+## Archiving RFC
+
+If an active RFC becomes inactive for some reason, it should be marked as so at
+the beginning of the document right under the title.
diff --git a/docs/rfcs/output-field-name-semantic.md b/docs/rfcs/output-field-name-semantic.md
new file mode 100644
index 0000000000000..f15447a43382a
--- /dev/null
+++ b/docs/rfcs/output-field-name-semantic.md
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`
+* Operator and operand MUST be separated by spaces.
+  * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
+* Function arguments MUST be separated by a comma `,` and a space.
+  * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
+
+
+### Examples and comparison with other systems
+
+Data schema for test sample queries:
+
+```
+CREATE TABLE t1 (id INT, a VARCHAR(5));
+INSERT INTO t1 (id, a) VALUES (1, 'foo');
+INSERT INTO t1 (id, a) VALUES (2, 'bar');
+
+CREATE TABLE t2 (id INT, b VARCHAR(5));
+INSERT INTO t2 (id, b) VALUES (1, 'hello');
+INSERT INTO t2 (id, b) VALUES (2, 'world');
+```
+
+#### Projected columns
+
+Query:
+
+```
+SELECT t1.id, a, t2.id, b
+FROM t1
+JOIN t2 ON t1.id = t2.id
+```
+
+Datafusion Arrow record batches output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+
+Spark, MySQL 8 and PostgreSQL 13 output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+SQLite 3 output:
+
+| id | a   | b     |
+|----|-----|-------|
+| 1  | foo | hello |
+| 2  | bar | world |
+
+
+#### Function transformed columns
+
+Query:
+
+```
+SELECT ABS(t1.id), abs(-id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+Spark output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+MySQL 8 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+PostgreSQL 13 output:
+
+| abs | abs |
+|-----|-----|
+| 1   | 1   |
+| 2   | 2   |
+
+SQLite 3 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+
+#### Function with operators
+
+Query:
+
+```
+SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| (id + abs(id)) | abs((id * id)) |
+|----------------|----------------|
+| 2              | 1              |
+| 4              | 4              |
+
+
+Spark output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+MySQL 8 output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+PostgreSQL output:
+
+| ?column? | abs |
+|----------|-----|
+| 2        | 1   |
+| 4        | 4   |
+
+SQLite output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+
+#### Project literals
+
+Query:
+
+```
+SELECT 1, 2+5, 'foo_bar';
+```
+
+Datafusion Arrow record batches output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+
+Spark output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+MySQL output:
+
+| 1 | 2+5 | foo_bar |
+|---|-----|---------|
+| 1 | 7   | foo_bar |
+
+
+PostgreSQL output:
+
+| ?column? | ?column? | ?column? |
+|----------|----------|----------|
+| 1        | 7        | foo_bar  |
+
+
+SQLite 3 output:
+
+| 1 | 2+5 | 'foo_bar' |
+|---|-----|-----------|
+| 1 | 7   | foo_bar   |
+
+
+## Alternatives
+
+Postgres's behavior is too simple. It defaults to `?column?` as the column name
+in many cases, which makes output less readable than other designs.
+
+MySQL and SQLite preserve user query input as the field name. This adds extra
+implementation and runtime overhead with little gain for end uers.
+
+In the long run, we could make output field semantic configurable so users can
+pick their own preferred semantic for full compatibility with system of their
+choice.
+
+## Unresolved questions
+
+None so far.

From 78ca11d5ac05c9b1a7a8db39f751d9958d45c6ca Mon Sep 17 00:00:00 2001
From: Qingping Hou <dave2008713@gmail.com>
Date: Wed, 26 May 2021 22:19:23 -0700
Subject: [PATCH 2/3] move to spec model

---
 .../output-field-name-semantic.md             | 44 ++-----------------
 1 file changed, 3 insertions(+), 41 deletions(-)
 rename docs/{rfcs => specification}/output-field-name-semantic.md (73%)

diff --git a/docs/rfcs/output-field-name-semantic.md b/docs/specification/output-field-name-semantic.md
similarity index 73%
rename from docs/rfcs/output-field-name-semantic.md
rename to docs/specification/output-field-name-semantic.md
index f15447a43382a..08ce9a9fb05ac 100644
--- a/docs/rfcs/output-field-name-semantic.md
+++ b/docs/specification/output-field-name-semantic.md
@@ -1,29 +1,6 @@
 # Datafusion output field name semantic
 
-Start Date: 2020-05-24
-
-## Summary
-
-Formally specify how Datafusion should construct its output field names based on
-provided user query.
-
-## Motivation
-
-By formalizing the output field name semantic, users will be able to access
-query output using consistent field names.
-
-## Detailed design
-
-The proposed semantic is chosen for the following reasons:
-
-* Ease of implementation, field names can be derived from physical expression
-without having to add extra logic to pass along arbitrary user provided input.
-Users are encouraged to use ALIAS expressions for full field name control.
-* Mostly compatible with Spark’s behavior except literal string handling.
-* Mostly backward compatible with current Datafusion’s behavior other than
-function name cases and parenthesis around operator expressions.
-
-###  Field name rules
+##  Field name rules
 
 * All field names MUST not contain relation qualifier.
   * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
@@ -39,6 +16,8 @@ function name cases and parenthesis around operator expressions.
   * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
 
 
+## Appendices
+
 ### Examples and comparison with other systems
 
 Data schema for test sample queries:
@@ -217,20 +196,3 @@ SQLite 3 output:
 | 1 | 2+5 | 'foo_bar' |
 |---|-----|-----------|
 | 1 | 7   | foo_bar   |
-
-
-## Alternatives
-
-Postgres's behavior is too simple. It defaults to `?column?` as the column name
-in many cases, which makes output less readable than other designs.
-
-MySQL and SQLite preserve user query input as the field name. This adds extra
-implementation and runtime overhead with little gain for end uers.
-
-In the long run, we could make output field semantic configurable so users can
-pick their own preferred semantic for full compatibility with system of their
-choice.
-
-## Unresolved questions
-
-None so far.

From 3a0193c7c27df344a08a2fd067c64d5111757a3f Mon Sep 17 00:00:00 2001
From: Qingping Hou <dave2008713@gmail.com>
Date: Wed, 26 May 2021 22:41:44 -0700
Subject: [PATCH 3/3] add link to developers docs & add ASF header

---
 DEVELOPERS.md                                 | 13 ++++++++
 docs/rfcs/README.md                           | 28 -----------------
 .../output-field-name-semantic.md             | 30 ++++++++++++++++---
 3 files changed, 39 insertions(+), 32 deletions(-)
 delete mode 100644 docs/rfcs/README.md

diff --git a/DEVELOPERS.md b/DEVELOPERS.md
index 60048c868e6c1..1278eb8df191f 100644
--- a/DEVELOPERS.md
+++ b/DEVELOPERS.md
@@ -93,3 +93,16 @@ can be displayed. For example, the following command creates a
 ```bash
 dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
 ```
+
+## Specification
+
+We formalize Datafusion semantics and behaviors through specification
+documents. These specifications serve as references that help resolve
+ambiguities during development or code reviews.
+
+You are also welcome to propose changes to existing specifications or create
+new specifications as you see fit.
+
+Here is the list of currently active specifications:
+
+* [Output field name semantic](docs/specification/output-field-name-semantic.md)
diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md
deleted file mode 100644
index 3f2a84c30e142..0000000000000
--- a/docs/rfcs/README.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# Datafusion RFCs
-
-## Motivation
-
-The RFCs (request for comments) provides a consistent and controlled path for
-Datafusion developers to propose formalized semantics and non-trivial changes
-to the project.
-
-## Creating new RFC
-
-* Create a new markdown file within the `rfcs` folder with RFC title as the file name.
-  * At the very minimal, a RFC should contain the following sections:
-    * Summary
-    * Motivation
-    * Detailed design
-    * Unresolved questions
-* Send a PR for proposed RFC.
-* Once a RFC PR is reviewed and merged, the RFC is considered accepted and active.
-
-## Updating existing RFC
-
-Minor changes can be applied to the existing RFCs directly via follow-up PRs.
-Exactly what counts as minor changes is up to the committers to decide.
-
-## Archiving RFC
-
-If an active RFC becomes inactive for some reason, it should be marked as so at
-the beginning of the document right under the title.
diff --git a/docs/specification/output-field-name-semantic.md b/docs/specification/output-field-name-semantic.md
index 08ce9a9fb05ac..fd28d118921b2 100644
--- a/docs/specification/output-field-name-semantic.md
+++ b/docs/specification/output-field-name-semantic.md
@@ -1,9 +1,32 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 # Datafusion output field name semantic
 
+This specification documents how field names in output record batches should be
+generated based on given user queries. The field name rules apply to
+Datafusion queries planned from both SQL queries and Dataframe APIs.
+
 ##  Field name rules
 
-* All field names MUST not contain relation qualifier.
-  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* All field names MUST not contain relation/table qualifier.
+  * `SELECT t1.id`, `SELECT id` and `df.select_columns(&["id"])` SHOULD all result in field name: `id`
 * Function names MUST be converted to lowercase.
   * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
 * Literal string MUST not be wrapped with quotes or double quotes.
@@ -13,8 +36,7 @@
 * Operator and operand MUST be separated by spaces.
   * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
 * Function arguments MUST be separated by a comma `,` and a space.
-  * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
-
+  * `SELECT f(c1,c2)` and `df.select(vec![f.udf("f")?.call(vec![col("c1"), col("c2")])])` SHOULD result in field name: `f(c1, c2)`
 
 ## Appendices