diff --git a/DEVELOPERS.md b/DEVELOPERS.md index c2daf3a72e5e0..9223d990e1d63 100644 --- a/DEVELOPERS.md +++ b/DEVELOPERS.md @@ -94,6 +94,19 @@ can be displayed. For example, the following command creates a dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf ``` +## Specification + +We formalize Datafusion semantics and behaviors through specification +documents. These specifications are useful to be used as references to help +resolve ambiguities during development or code reviews. + +You are also welcome to propose changes to existing specifications or create +new specifications as you see fit. + +Here is the list current active specifications: + +* [Output field name semantic](docs/specification/output-field-name-semantic.md) + ## How to format `.md` document We are using `prettier` to format `.md` files. diff --git a/docs/specification/output-field-name-semantic.md b/docs/specification/output-field-name-semantic.md new file mode 100644 index 0000000000000..fd28d118921b2 --- /dev/null +++ b/docs/specification/output-field-name-semantic.md @@ -0,0 +1,220 @@ + + +# Datafusion output field name semantic + +This specification documents how field names in output record batches should be +generated based on given user queries. The filed name rules apply to +Datafusion queries planned from both SQL queries and Dataframe APIs. + +## Field name rules + +* All field names MUST not contain relation/table qualifier. + * Both `SELECT t1.id`, `SELECT id` and `df.select_columns(&["id"])` SHOULD result in field name: `id` +* Function names MUST be converted to lowercase. + * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)` +* Literal string MUST not be wrapped with quotes or double quotes. + * `SELECT 'foo'` SHOULD result in field name: `foo` +* Operator expressions MUST be wrapped with parentheses. + * `SELECT -2` SHOULD result in field name: `(- 2)` +* Operator and operand MUST be separated by spaces. + * `SELECT 1+2` SHOULD result in field name: `(1 + 2)` +* Function arguments MUST be separated by a comma `,` and a space. + * `SELECT f(c1,c2)` and `df.select(vec![f.udf("f")?.call(vec![col("c1"), col("c2")])])` SHOULD result in field name: `f(c1, c2)` + +## Appendices + +### Examples and comparison with other systems + +Data schema for test sample queries: + +``` +CREATE TABLE t1 (id INT, a VARCHAR(5)); +INSERT INTO t1 (id, a) VALUES (1, 'foo'); +INSERT INTO t1 (id, a) VALUES (2, 'bar'); + +CREATE TABLE t2 (id INT, b VARCHAR(5)); +INSERT INTO t2 (id, b) VALUES (1, 'hello'); +INSERT INTO t2 (id, b) VALUES (2, 'world'); +``` + +#### Projected columns + +Query: + +``` +SELECT t1.id, a, t2.id, b +FROM t1 +JOIN t2 ON t1.id = t2.id +``` + +Datafusion Arrow record batches output: + +| id | a | id | b | +|----|-----|----|-------| +| 1 | foo | 1 | hello | +| 2 | bar | 2 | world | + + +Spark, MySQL 8 and PostgreSQL 13 output: + +| id | a | id | b | +|----|-----|----|-------| +| 1 | foo | 1 | hello | +| 2 | bar | 2 | world | + +SQLite 3 output: + +| id | a | b | +|----|-----|-------| +| 1 | foo | hello | +| 2 | bar | world | + + +#### Function transformed columns + +Query: + +``` +SELECT ABS(t1.id), abs(-id) FROM t1; +``` + +Datafusion Arrow record batches output: + +| abs(id) | abs((- id)) | +|---------|-------------| +| 1 | 1 | +| 2 | 2 | + + +Spark output: + +| abs(id) | abs((- id)) | +|---------|-------------| +| 1 | 1 | +| 2 | 2 | + + +MySQL 8 output: + +| ABS(t1.id) | abs(-id) | +|------------|----------| +| 1 | 1 | +| 2 | 2 | + +PostgreSQL 13 output: + +| abs | abs | +|-----|-----| +| 1 | 1 | +| 2 | 2 | + +SQlite 3 output: + +| ABS(t1.id) | abs(-id) | +|------------|----------| +| 1 | 1 | +| 2 | 2 | + + +#### Function with operators + +Query: + +``` +SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1; +``` + +Datafusion Arrow record batches output: + +| id + abs(id) | abs(id * id) | +|--------------|--------------| +| 2 | 1 | +| 4 | 4 | + + +Spark output: + +| id + abs(id) | abs(id * id) | +|--------------|--------------| +| 2 | 1 | +| 4 | 4 | + +MySQL 8 output: + +| t1.id + ABS(id) | ABS(id * t1.id) | +|-----------------|-----------------| +| 2 | 1 | +| 4 | 4 | + +PostgreSQL output: + +| ?column? | abs | +|----------|-----| +| 2 | 1 | +| 4 | 4 | + +SQLite output: + +| t1.id + ABS(id) | ABS(id * t1.id) | +|-----------------|-----------------| +| 2 | 1 | +| 4 | 4 | + + +#### Project literals + +Query: + +``` +SELECT 1, 2+5, 'foo_bar'; +``` + +Datafusion Arrow record batches output: + +| 1 | (2 + 5) | foo_bar | +|---|---------|---------| +| 1 | 7 | foo_bar | + + +Spark output: + +| 1 | (2 + 5) | foo_bar | +|---|---------|---------| +| 1 | 7 | foo_bar | + +MySQL output: + +| 1 | 2+5 | foo_bar | +|---|-----|---------| +| 1 | 7 | foo_bar | + + +PostgreSQL output: + +| ?column? | ?column? | ?column? | +|----------|----------|----------| +| 1 | 7 | foo_bar | + + +SQLite 3 output: + +| 1 | 2+5 | 'foo_bar' | +|---|-----|-----------| +| 1 | 7 | foo_bar |