From a74460a82cac2cf48c19ec892f59960ab88cc763 Mon Sep 17 00:00:00 2001
From: Qingping Hou
Date: Mon, 24 May 2021 22:24:29 -0700
Subject: [PATCH 1/3] add output field name rfc

---
 docs/rfcs/README.md                     |  28 +++
 docs/rfcs/output-field-name-semantic.md | 236 ++++++++++++++++++++++++
 2 files changed, 264 insertions(+)
 create mode 100644 docs/rfcs/README.md
 create mode 100644 docs/rfcs/output-field-name-semantic.md

diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md
new file mode 100644
index 0000000000000..3f2a84c30e142
--- /dev/null
+++ b/docs/rfcs/README.md
@@ -0,0 +1,28 @@
+# Datafusion RFCs
+
+## Motivation
+
+The RFCs (request for comments) provides a consistent and controlled path for
+Datafusion developers to propose formalized semantics and non-trivial changes
+to the project.
+
+## Creating new RFC
+
+* Create a new markdown file within the `rfcs` folder with RFC title as the file name.
+  * At the very minimal, a RFC should contain the following sections:
+    * Summary
+    * Motivation
+    * Detailed design
+    * Unresolved questions
+* Send a PR for proposed RFC.
+* Once a RFC PR is reviewed and merged, the RFC is considered accepted and active.
+
+## Updating existing RFC
+
+Minor changes can be applied to the existing RFCs directly via follow-up PRs.
+Exactly what counts as minor changes is up to the committers to decide.
+
+## Archiving RFC
+
+If an active RFC becomes inactive for some reason, it should be marked as so at
+the beginning of the document right under the title.
diff --git a/docs/rfcs/output-field-name-semantic.md b/docs/rfcs/output-field-name-semantic.md
new file mode 100644
index 0000000000000..f15447a43382a
--- /dev/null
+++ b/docs/rfcs/output-field-name-semantic.md
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+### Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`
+* Operator and operand MUST be separated by spaces.
+  * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
+* Function arguments MUST be separated by a comma `,` and a space.
+  * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
+
+
+### Examples and comparison with other systems
+
+Data schema for test sample queries:
+
+```
+CREATE TABLE t1 (id INT, a VARCHAR(5));
+INSERT INTO t1 (id, a) VALUES (1, 'foo');
+INSERT INTO t1 (id, a) VALUES (2, 'bar');
+
+CREATE TABLE t2 (id INT, b VARCHAR(5));
+INSERT INTO t2 (id, b) VALUES (1, 'hello');
+INSERT INTO t2 (id, b) VALUES (2, 'world');
+```
+
+#### Projected columns
+
+Query:
+
+```
+SELECT t1.id, a, t2.id, b
+FROM t1
+JOIN t2 ON t1.id = t2.id
+```
+
+Datafusion Arrow record batches output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+
+Spark, MySQL 8 and PostgreSQL 13 output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+SQLite 3 output:
+
+| id | a   | b     |
+|----|-----|-------|
+| 1  | foo | hello |
+| 2  | bar | world |
+
+
+#### Function transformed columns
+
+Query:
+
+```
+SELECT ABS(t1.id), abs(-id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+Spark output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+MySQL 8 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+PostgreSQL 13 output:
+
+| abs | abs |
+|-----|-----|
+| 1   | 1   |
+| 2   | 2   |
+
+SQLite 3 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+
+#### Function with operators
+
+Query:
+
+```
+SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+
+Spark output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+MySQL 8 output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+PostgreSQL output:
+
+| ?column? | abs |
+|----------|-----|
+| 2        | 1   |
+| 4        | 4   |
+
+SQLite output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+
+#### Project literals
+
+Query:
+
+```
+SELECT 1, 2+5, 'foo_bar';
+```
+
+Datafusion Arrow record batches output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+
+Spark output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+MySQL output:
+
+| 1 | 2+5 | foo_bar |
+|---|-----|---------|
+| 1 | 7   | foo_bar |
+
+
+PostgreSQL output:
+
+| ?column? | ?column? | ?column? |
+|----------|----------|----------|
+| 1        | 7        | foo_bar  |
+
+
+SQLite 3 output:
+
+| 1 | 2+5 | 'foo_bar' |
+|---|-----|-----------|
+| 1 | 7   | foo_bar   |
+
+
+## Alternatives
+
+Postgres's behavior is too simple. It defaults to `?column?` as the column name
+in many cases, which makes output less readable than other designs.
+
+MySQL and SQLite preserve user query input as the field name. This adds extra
+implementation and runtime overhead with little gain for end uers.
+
+In the long run, we could make output field semantic configurable so users can
+pick their own preferred semantic for full compatibility with system of their
+choice.
+
+## Unresolved questions
+
+None so far.

From 78ca11d5ac05c9b1a7a8db39f751d9958d45c6ca Mon Sep 17 00:00:00 2001
From: Qingping Hou
Date: Wed, 26 May 2021 22:19:23 -0700
Subject: [PATCH 2/3] move to spec model

---
 .../output-field-name-semantic.md             | 44 ++-----------------
 1 file changed, 3 insertions(+), 41 deletions(-)
 rename docs/{rfcs => specification}/output-field-name-semantic.md (73%)

diff --git a/docs/rfcs/output-field-name-semantic.md b/docs/specification/output-field-name-semantic.md
similarity index 73%
rename from docs/rfcs/output-field-name-semantic.md
rename to docs/specification/output-field-name-semantic.md
index f15447a43382a..08ce9a9fb05ac 100644
--- a/docs/rfcs/output-field-name-semantic.md
+++ b/docs/specification/output-field-name-semantic.md
@@ -1,29 +1,6 @@
 # Datafusion output field name semantic
 
-Start Date: 2020-05-24
-
-## Summary
-
-Formally specify how Datafusion should construct its output field names based on
-provided user query.
-
-## Motivation
-
-By formalizing the output field name semantic, users will be able to access
-query output using consistent field names.
-
-## Detailed design
-
-The proposed semantic is chosen for the following reasons:
-
-* Ease of implementation, field names can be derived from physical expression
-without having to add extra logic to pass along arbitrary user provided input.
-Users are encouraged to use ALIAS expressions for full field name control.
-* Mostly compatible with Spark’s behavior except literal string handling.
-* Mostly backward compatible with current Datafusion’s behavior other than
-function name cases and parenthesis around operator expressions.
-
-### Field name rules
+## Field name rules
 
 * All field names MUST not contain relation qualifier.
   * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
@@ -39,6 +16,8 @@ function name cases and parenthesis around operator expressions.
   * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
 
 
+## Appendices
+
 ### Examples and comparison with other systems
 
 Data schema for test sample queries:
@@ -217,20 +196,3 @@ SQLite 3 output:
 | 1 | 2+5 | 'foo_bar' |
 |---|-----|-----------|
 | 1 | 7   | foo_bar   |
-
-
-## Alternatives
-
-Postgres's behavior is too simple. It defaults to `?column?` as the column name
-in many cases, which makes output less readable than other designs.
-
-MySQL and SQLite preserve user query input as the field name. This adds extra
-implementation and runtime overhead with little gain for end uers.
-
-In the long run, we could make output field semantic configurable so users can
-pick their own preferred semantic for full compatibility with system of their
-choice.
-
-## Unresolved questions
-
-None so far.

From 3a0193c7c27df344a08a2fd067c64d5111757a3f Mon Sep 17 00:00:00 2001
From: Qingping Hou
Date: Wed, 26 May 2021 22:41:44 -0700
Subject: [PATCH 3/3] add link to developers docs & add ASF header

---
 DEVELOPERS.md                                 | 13 ++++++++
 docs/rfcs/README.md                           | 28 -----------------
 .../output-field-name-semantic.md             | 30 ++++++++++++++++---
 3 files changed, 39 insertions(+), 32 deletions(-)
 delete mode 100644 docs/rfcs/README.md

diff --git a/DEVELOPERS.md b/DEVELOPERS.md
index 60048c868e6c1..1278eb8df191f 100644
--- a/DEVELOPERS.md
+++ b/DEVELOPERS.md
@@ -93,3 +93,16 @@ can be displayed. For example, the following command creates a
 ```bash
 dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
 ```
+
+## Specification
+
+We formalize Datafusion semantics and behaviors through specification
+documents. These specifications serve as references that help
+resolve ambiguities during development or code reviews.
+
+You are also welcome to propose changes to existing specifications or create
+new specifications as you see fit.
+
+Here is the list of currently active specifications:
+
+* [Output field name semantic](docs/specification/output-field-name-semantic.md)
diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md
deleted file mode 100644
index 3f2a84c30e142..0000000000000
--- a/docs/rfcs/README.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# Datafusion RFCs
-
-## Motivation
-
-The RFCs (request for comments) provides a consistent and controlled path for
-Datafusion developers to propose formalized semantics and non-trivial changes
-to the project.
-
-## Creating new RFC
-
-* Create a new markdown file within the `rfcs` folder with RFC title as the file name.
-  * At the very minimal, a RFC should contain the following sections:
-    * Summary
-    * Motivation
-    * Detailed design
-    * Unresolved questions
-* Send a PR for proposed RFC.
-* Once a RFC PR is reviewed and merged, the RFC is considered accepted and active.
-
-## Updating existing RFC
-
-Minor changes can be applied to the existing RFCs directly via follow-up PRs.
-Exactly what counts as minor changes is up to the committers to decide.
-
-## Archiving RFC
-
-If an active RFC becomes inactive for some reason, it should be marked as so at
-the beginning of the document right under the title.
diff --git a/docs/specification/output-field-name-semantic.md b/docs/specification/output-field-name-semantic.md
index 08ce9a9fb05ac..fd28d118921b2 100644
--- a/docs/specification/output-field-name-semantic.md
+++ b/docs/specification/output-field-name-semantic.md
@@ -1,9 +1,32 @@
+
+
 # Datafusion output field name semantic
 
+This specification documents how field names in output record batches should be
+generated based on given user queries. The field name rules apply to
+Datafusion queries planned from both SQL queries and Dataframe APIs.
+
 ## Field name rules
 
-* All field names MUST not contain relation qualifier.
-  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* All field names MUST not contain relation/table qualifier.
+  * `SELECT t1.id`, `SELECT id` and `df.select_columns(&["id"])` SHOULD all result in field name: `id`
 * Function names MUST be converted to lowercase.
   * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
 * Literal string MUST not be wrapped with quotes or double quotes.
@@ -13,8 +36,7 @@
 * Operator and operand MUST be separated by spaces.
   * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
 * Function arguments MUST be separated by a comma `,` and a space.
-  * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
-
+  * `SELECT f(c1,c2)` and `df.select(vec![f.udf("f")?.call(vec![col("c1"), col("c2")])])` SHOULD result in field name: `f(c1, c2)`
 
 ## Appendices
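
The field name rules in this series can be read as a small recursive formatter over an expression tree. The sketch below is only an illustration under stated assumptions: the `Expr` enum, the `col` helper and the `field_name` function are hypothetical names invented here and are not DataFusion's actual types or API. It shows one way the six rules (no relation qualifier, lowercase function names, unquoted string literals, parenthesized operator expressions, space-separated operators, and `, `-separated arguments) could compose.

```rust
/// Hypothetical, simplified expression tree used only for this sketch.
/// It is NOT DataFusion's real `Expr` type.
#[allow(dead_code)]
enum Expr {
    /// Column reference with an optional relation qualifier, e.g. `t1.id`.
    Column { relation: Option<String>, name: String },
    /// String literal such as `'foo'`.
    StringLiteral(String),
    /// Integer literal such as `2`.
    IntLiteral(i64),
    /// Unary operator expression such as `-2`.
    Unary { op: String, operand: Box<Expr> },
    /// Binary operator expression such as `1+2`.
    Binary { left: Box<Expr>, op: String, right: Box<Expr> },
    /// Function call such as `AVG(c1)`.
    Function { name: String, args: Vec<Expr> },
}

/// Hypothetical helper that builds a column reference.
fn col(relation: Option<&str>, name: &str) -> Expr {
    Expr::Column {
        relation: relation.map(|r| r.to_string()),
        name: name.to_string(),
    }
}

/// Derive an output field name following the rules listed in the spec.
fn field_name(expr: &Expr) -> String {
    match expr {
        // Rule: the relation qualifier is dropped.
        Expr::Column { name, .. } => name.clone(),
        // Rule: string literals are not wrapped in quotes.
        Expr::StringLiteral(s) => s.clone(),
        Expr::IntLiteral(v) => v.to_string(),
        // Rules: operator expressions are parenthesized, and operator and
        // operand are separated by spaces.
        Expr::Unary { op, operand } => format!("({} {})", op, field_name(operand)),
        Expr::Binary { left, op, right } => {
            format!("({} {} {})", field_name(left), op, field_name(right))
        }
        // Rules: function names are lowercased, and arguments are separated
        // by a comma and a space.
        Expr::Function { name, args } => {
            let rendered: Vec<String> = args.iter().map(field_name).collect();
            format!("{}({})", name.to_lowercase(), rendered.join(", "))
        }
    }
}

fn main() {
    // Mirrors `SELECT AVG(c1)` from the rule list.
    let avg = Expr::Function {
        name: "AVG".to_string(),
        args: vec![col(Some("t1"), "c1")],
    };
    println!("{}", field_name(&avg));

    // Mirrors `SELECT 1+2` from the rule list.
    let sum = Expr::Binary {
        left: Box::new(Expr::IntLiteral(1)),
        op: "+".to_string(),
        right: Box::new(Expr::IntLiteral(2)),
    };
    println!("{}", field_name(&sum));
}
```

Because the sketch parenthesizes every operator expression, it follows the rule list literally; a nested case such as `t1.id + ABS(id)` may therefore render differently from the sample tables in the appendix.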