From f17f4577c013663d4a52da8a649d575584e275eb Mon Sep 17 00:00:00 2001
From: Fokko
Date: Mon, 24 Mar 2025 20:12:55 +0100
Subject: [PATCH 01/10] Spec: Allow the use of `source-id` in V3

---
 format/spec.md | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 7d8777f8e4ea..a749455c6fa1 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -494,7 +494,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
 | Transform name | Description | Source types | Result type |
 |-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
 | **`identity`** | Source value, unmodified | Any except for `geometry`, `geography`, and `variant` | Source type |
-| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
+| **`bucket[N]`** | Hash of value, mod `N` (see below) | Any combination of the following `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
 | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string`, `binary` | Source type |
 | **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
 | **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
@@ -540,7 +540,7 @@ Notes:
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
 3. Strings are truncated to a valid UTF-8 string with no more than `L` code points.
 4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.
-
+5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`.

 #### Partition Evolution

@@ -1414,12 +1414,16 @@ Each partition field in `fields` is stored as a JSON object with the following p

 | V1 | V2 | V3 | Field | JSON representation | Example |
 |----------|----------|----------|------------------|---------------------|--------------|
-| required | required | omitted | **`source-id`** | `JSON int` | 1 |
-| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
+| required | required | required¹ | **`source-id`** | `JSON int` | 1 |
+| | | required¹ | **`source-ids`** | `JSON list of ints` | `[1,2]` |
 | | required | required | **`field-id`** | `JSON int` | 1000 |
 | required | required | required | **`name`** | `JSON string` | `id_bucket` |
 | required | required | required | **`transform`** | `JSON string` | `bucket[16]` |

+Notes:
+
+1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
+
 Supported partition transforms are listed below.

 |Transform or Field|JSON representation|Example|
@@ -1453,13 +1457,15 @@ Each sort field in the fields list is stored as an object with the following pro

 | V1 | V2 | V3 | Field | JSON representation | Example |
 |----------|----------|----------|------------------|---------------------|-------------|
-| required | required | required | **`transform`** | `JSON string` | `bucket[4]` |
-| required | required | omitted | **`source-id`** | `JSON int` | 1 |
+| required | required | required¹ | **`transform`** | `JSON string` | `bucket[4]` |
+| required | required | required¹ | **`source-id`** | `JSON int` | 1 |
 | | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
 | required | required | required | **`direction`** | `JSON string` | `asc` |
 | required | required | required | **`null-order`** | `JSON string` | `nulls-last`|

-In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms.
+Notes:
+
+1. For sort fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.

 Older versions of the reference implementation can read tables with transforms unknown to it, ignoring them. But other implementations may break if they encounter unknown transforms. All v3 readers are required to read tables with unknown transforms, ignoring them.

@@ -1605,13 +1611,8 @@ All readers are required to read tables with unknown partition transforms, ignor
 Writing v3 metadata:

 * Partition Field and Sort Field JSON:
-    * `source-ids` was added and is required
-    * `source-id` is no longer required and should be omitted; always use `source-ids` instead
-
-Reading v1 or v2 metadata for v3:
-
-* Partition Field and Sort Field JSON:
-    * `source-ids` should default to a single-value list of the value of `source-id`
+    * `source-ids` was added and is required in case of multi-argument transforms.
+    * `source-id` should still be written in the case of single-argument transforms.

 Row-level delete changes:


From d60f5e924b5d9a41a4bacce367a18ee8463c1e61 Mon Sep 17 00:00:00 2001
From: Fokko Driesprong
Date: Thu, 27 Mar 2025 11:25:28 +0100
Subject: [PATCH 02/10] Add the

Co-authored-by: Gang Wu
---
 format/spec.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/format/spec.md b/format/spec.md
index a749455c6fa1..bc988ea7e716 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -1611,7 +1611,7 @@ All readers are required to read tables with unknown partition transforms, ignor
 Writing v3 metadata:

 * Partition Field and Sort Field JSON:
-    * `source-ids` was added and is required in case of multi-argument transforms.
+    * `source-ids` was added and is required in the case of multi-argument transforms.
     * `source-id` should still be written in the case of single-argument transforms.

 Row-level delete changes:

From 1b25e4a215ca9679c2437954605e4dbf4418640d Mon Sep 17 00:00:00 2001
From: Fokko Driesprong
Date: Wed, 2 Apr 2025 08:16:27 +0200
Subject: [PATCH 03/10] How to handle nulls

---
 format/spec.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/format/spec.md b/format/spec.md
index bc988ea7e716..c7108d423e2d 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -540,7 +540,7 @@ Notes:
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
 3. Strings are truncated to a valid UTF-8 string with no more than `L` code points.
 4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.
-5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`.
+5. For multi-argument bucketing, the hashes for the not-null input values are `xor`'ed: `(hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`. The transform will return `null` when all input values are `null`.

 #### Partition Evolution


From 1de8b8ceab9f900adbbedf2576378786feb89e97 Mon Sep 17 00:00:00 2001
From: Fokko
Date: Thu, 3 Apr 2025 13:20:59 +0200
Subject: [PATCH 04/10] Remove the implementation for now

---
 format/spec.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index a749455c6fa1..1115476271ce 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -494,7 +494,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
 | Transform name | Description | Source types | Result type |
 |-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
 | **`identity`** | Source value, unmodified | Any except for `geometry`, `geography`, and `variant` | Source type |
-| **`bucket[N]`** | Hash of value, mod `N` (see below) | Any combination of the following `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
+| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
 | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string`, `binary` | Source type |
 | **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
 | **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
@@ -540,7 +540,6 @@ Notes:
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
 3. Strings are truncated to a valid UTF-8 string with no more than `L` code points.
 4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.
-5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`.

 #### Partition Evolution


From c103bab246098d85682f8a8faf49863d971e723b Mon Sep 17 00:00:00 2001
From: Fokko
Date: Thu, 3 Apr 2025 13:22:12 +0200
Subject: [PATCH 05/10] Remove conflcit

---
 format/spec.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index ce7aa05ed1aa..f79ab957701a 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -540,10 +540,7 @@ Notes:
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
 3. Strings are truncated to a valid UTF-8 string with no more than `L` code points.
 4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.
-<<<<<<< HEAD
-=======
-5. For multi-argument bucketing, the hashes for the not-null input values are `xor`'ed: `(hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`. The transform will return `null` when all input values are `null`.
->>>>>>> 1b25e4a215ca9679c2437954605e4dbf4418640d
+

 #### Partition Evolution


From 4deead22fe5a3660a867c07a8d49f1282440adef Mon Sep 17 00:00:00 2001
From: Fokko
Date: Fri, 4 Apr 2025 08:42:06 +0200
Subject: [PATCH 06/10] Cleanup

---
 format/spec.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index f79ab957701a..9aaebe3bca5e 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -1414,15 +1414,15 @@ Each partition field in `fields` is stored as a JSON object with the following p

 | V1 | V2 | V3 | Field | JSON representation | Example |
 |----------|----------|----------|------------------|---------------------|--------------|
-| required | required | required¹ | **`source-id`** | `JSON int` | 1 |
-| | | required¹ | **`source-ids`** | `JSON list of ints` | `[1,2]` |
+| required | required | optional | **`source-id`** | `JSON int` | 1 |
+| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
 | | required | required | **`field-id`** | `JSON int` | 1000 |
 | required | required | required | **`name`** | `JSON string` | `id_bucket` |
 | required | required | required | **`transform`** | `JSON string` | `bucket[16]` |

 Notes:

-1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
+1. For partition fields with a transform with a single argument, `source-id` can still be written.

 Supported partition transforms are listed below.

@@ -1611,7 +1611,7 @@ All readers are required to read tables with unknown partition transforms, ignor
 Writing v3 metadata:

 * Partition Field and Sort Field JSON:
-    * `source-ids` was added and is required in the case of multi-argument transforms.
+    * `source-ids` was added and should be written.
     * `source-id` should still be written in the case of single-argument transforms.

 Row-level delete changes:

From f1fac5b4ac8023ba539c1285a8c4ca21c8449f5b Mon Sep 17 00:00:00 2001
From: Fokko
Date: Mon, 14 Apr 2025 18:26:05 +0200
Subject: [PATCH 07/10] Only write the one or the other

---
 format/spec.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 9aaebe3bca5e..3642fd2a31cd 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -1415,14 +1415,14 @@ Each partition field in `fields` is stored as a JSON object with the following p
 | V1 | V2 | V3 | Field | JSON representation | Example |
 |----------|----------|----------|------------------|---------------------|--------------|
 | required | required | optional | **`source-id`** | `JSON int` | 1 |
-| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
+| | | optional | **`source-ids`** | `JSON list of ints` | `[1,2]` |
 | | required | required | **`field-id`** | `JSON int` | 1000 |
 | required | required | required | **`name`** | `JSON string` | `id_bucket` |
 | required | required | required | **`transform`** | `JSON string` | `bucket[16]` |

 Notes:

-1. For partition fields with a transform with a single argument, `source-id` can still be written.
+1. For partition fields with a transform with a single argument, only `source-id` is written. In case of a multi-argument transform, only `source-ids` is written.

 Supported partition transforms are listed below.

@@ -1611,7 +1611,7 @@ All readers are required to read tables with unknown partition transforms, ignor
 Writing v3 metadata:

 * Partition Field and Sort Field JSON:
-    * `source-ids` was added and should be written.
+    * `source-ids` was added and should be written in case of a multi-argument transform.
     * `source-id` should still be written in the case of single-argument transforms.

 Row-level delete changes:

From 9cd65568040861faa76f361ad7ad83dbce638a9c Mon Sep 17 00:00:00 2001
From: Fokko
Date: Mon, 14 Apr 2025 18:30:27 +0200
Subject: [PATCH 08/10] A few more changes

---
 format/spec.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 3642fd2a31cd..b8e9823f8e5c 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -1457,15 +1457,15 @@ Each sort field in the fields list is stored as an object with the following pro

 | V1 | V2 | V3 | Field | JSON representation | Example |
 |----------|----------|----------|------------------|---------------------|-------------|
-| required | required | required¹ | **`transform`** | `JSON string` | `bucket[4]` |
-| required | required | required¹ | **`source-id`** | `JSON int` | 1 |
+| required | required | optional | **`transform`** | `JSON string` | `bucket[4]` |
+| required | required | optional | **`source-id`** | `JSON int` | 1 |
 | | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
 | required | required | required | **`direction`** | `JSON string` | `asc` |
 | required | required | required | **`null-order`** | `JSON string` | `nulls-last`|

 Notes:

-1. For sort fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
+1. For sort fields with a transform with a single argument, only `source-id` is written. In case of a multi-argument transform, only `source-ids` is written.

 Older versions of the reference implementation can read tables with transforms unknown to it, ignoring them. But other implementations may break if they encounter unknown transforms. All v3 readers are required to read tables with unknown transforms, ignoring them.


From 0069eced9915f634edfd8cafbdb7f58529c708f7 Mon Sep 17 00:00:00 2001
From: Fokko Driesprong
Date: Wed, 16 Apr 2025 18:49:06 +0200
Subject: [PATCH 09/10] Fix typo

---
 format/spec.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index b8e9823f8e5c..2e5f2dcc0d5e 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -1457,9 +1457,9 @@ Each sort field in the fields list is stored as an object with the following pro

 | V1 | V2 | V3 | Field | JSON representation | Example |
 |----------|----------|----------|------------------|---------------------|-------------|
-| required | required | optional | **`transform`** | `JSON string` | `bucket[4]` |
+| required | required | required | **`transform`** | `JSON string` | `bucket[4]` |
 | required | required | optional | **`source-id`** | `JSON int` | 1 |
-| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
+| | | optional | **`source-ids`** | `JSON list of ints` | `[1,2]` |
 | required | required | required | **`direction`** | `JSON string` | `asc` |
 | required | required | required | **`null-order`** | `JSON string` | `nulls-last`|


From c84e16280d4fe4625333472eae3b9affd69312da Mon Sep 17 00:00:00 2001
From: Fokko Driesprong
Date: Wed, 16 Apr 2025 20:57:40 +0200
Subject: [PATCH 10/10] Replace `should` with `must`

---
 format/spec.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 2e5f2dcc0d5e..ccbf02ca22da 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -1611,8 +1611,8 @@ All readers are required to read tables with unknown partition transforms, ignor
 Writing v3 metadata:

 * Partition Field and Sort Field JSON:
-    * `source-ids` was added and should be written in case of a multi-argument transform.
-    * `source-id` should still be written in the case of single-argument transforms.
+    * `source-ids` was added and must be written in the case of a multi-argument transform.
+    * `source-id` must be written in the case of single-argument transforms.

 Row-level delete changes:
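
Reviewer note: the writer rule this series ends on (after PATCH 10/10) can be sketched in Python as a reading aid. This is a minimal sketch under stated assumptions, not the reference implementation. The helper names `partition_field_to_json` and `read_source_ids` are hypothetical, the two-column `bucket[16]` field is purely illustrative (PATCH 04/10 removes the multi-argument bucket definition from the spec), and the reader-side fallback is an assumption that the series does not spell out.

```python
import json
from typing import Any, Dict, List


def partition_field_to_json(
    name: str, transform: str, field_id: int, source_ids: List[int]
) -> Dict[str, Any]:
    """Build the v3 JSON object for one partition field (hypothetical helper).

    Follows the rule from the final patch: a single-argument transform
    writes only `source-id`; a multi-argument transform writes only
    `source-ids`.
    """
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")

    field: Dict[str, Any] = {}
    if len(source_ids) == 1:
        field["source-id"] = source_ids[0]
    else:
        field["source-ids"] = list(source_ids)
    field["field-id"] = field_id
    field["name"] = name
    field["transform"] = transform
    return field


def read_source_ids(field: Dict[str, Any]) -> List[int]:
    """Normalize either key to a list of source IDs.

    Assumption: a reader accepts whichever of the two keys is present.
    The patch series only states what writers must emit.
    """
    if "source-ids" in field:
        return list(field["source-ids"])
    return [field["source-id"]]


# Single-argument transform: only `source-id` is written.
single = partition_field_to_json("id_bucket", "bucket[16]", 1000, [1])
# Illustrative multi-argument transform: only `source-ids` is written.
multi = partition_field_to_json("id_ts_bucket", "bucket[16]", 1001, [1, 2])

print(json.dumps(single))
print(json.dumps(multi))
```

Emitting exactly one of the two keys keeps single-argument fields readable by v1/v2-style readers that only know `source-id`, which appears to be the motivation behind the subject line of PATCH 01/10.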