Merged
21 changes: 15 additions & 6 deletions format/spec.md
@@ -296,9 +296,9 @@ Data files are stored in manifests with a tuple of partition values that are use

Tables are configured with a **partition spec** that defines how to produce a tuple of partition values from a record. A partition spec has a list of fields that consist of:

* A **source column id** from the table’s schema
* A **source column id** or a list of **source column ids** from the table’s schema
* A **partition field id** that is used to identify a partition field and is unique within a partition spec. In v2 table metadata, it is unique across all partition specs.
* A **transform** that is applied to the source column to produce a partition value
* A **transform** that is applied to the source column(s) to produce a partition value
* A **partition name**

The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.
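
As a rough illustration of the fields listed above, the sketch below models a partition field that carries one or more source column ids and applies its transform to the matching record values. The `PartitionField` class, the `partition_tuple` helper, and the `identity` and `pair` transforms are hypothetical names made up for this example; they are not part of the spec or of any reference implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PartitionField:
    source_ids: Tuple[int, ...]      # one id, or several for a multi-arg transform
    field_id: int                    # unique within the partition spec
    name: str                        # partition name
    transform: Callable[..., Any]    # hypothetical stand-in for a real transform


def partition_tuple(record: Dict[int, Any], fields: List[PartitionField]) -> Tuple[Any, ...]:
    """Apply each field's transform to its source column value(s), in spec order."""
    return tuple(f.transform(*(record[i] for i in f.source_ids)) for f in fields)


# Illustrative transforms only; neither is claimed to be defined by the spec.
def identity(value):
    return value


def pair(a, b):
    return (a, b)


spec = [
    PartitionField((1,), 1000, "id", identity),
    PartitionField((2, 3), 1001, "region_zone", pair),
]
record = {1: 42, 2: "eu", 3: "west"}  # column id -> value
result = partition_tuple(record, spec)
```

Keeping the source ids as a sequence, even for single-argument transforms, lets one code path handle both cases.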
@@ -383,8 +383,8 @@ Users can sort their data within partitions by columns to gain performance. The

A sort order is defined by a sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of:

* A **source column id** from the table's schema
* A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](#partition-transforms).
* A **source column id** or a list of **source column ids** from the table's schema
* A **transform** that is used to produce values to be sorted on from the source column(s). This is the same transform as described in [partition transforms](#partition-transforms).
* A **sort direction**, that can only be either `asc` or `desc`
* A **null order** that describes the order of null values when sorted. Can only be either `nulls-first` or `nulls-last`
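
A minimal sketch of how a sort direction and a null order might combine when ordering already-transformed values. It assumes the null order is applied independently of the direction (so `nulls-last` keeps nulls at the end even for `desc`); the `sort_values` helper is a hypothetical name used only for illustration.

```python
from typing import Any, List, Optional


def sort_values(values: List[Optional[Any]], direction: str, null_order: str) -> List[Optional[Any]]:
    """Sort values by direction, placing nulls according to null_order."""
    if direction not in ("asc", "desc"):
        raise ValueError(f"invalid sort direction: {direction!r}")
    if null_order not in ("nulls-first", "nulls-last"):
        raise ValueError(f"invalid null order: {null_order!r}")

    # Sort the non-null values on their own, then attach the nulls.
    non_null = sorted((v for v in values if v is not None), reverse=(direction == "desc"))
    nulls = [None] * sum(1 for v in values if v is None)

    # Assumption: null placement does not flip with the sort direction.
    return nulls + non_null if null_order == "nulls-first" else non_null + nulls


ordered = sort_values([3, None, 1], "desc", "nulls-last")
```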

@@ -1128,12 +1128,17 @@ Each partition field in the fields list is stored as an object. See the table fo
|**`month`**|`JSON string: "month"`|`"month"`|
|**`day`**|`JSON string: "day"`|`"day"`|
|**`hour`**|`JSON string: "hour"`|`"hour"`|
|**`Partition Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
|**`Partition Field`** [1,2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|

In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec.

The `field-id` property was added for each partition field in v2. In v1, the reference implementation assigned field ids sequentially in each spec starting at 1,000. See Partition Evolution for more details.

Notes:

1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
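
The two notes can be sketched as a small serializer. This is only an illustration of the rule as written in this change; `partition_field_to_json` and the transform name `my_multi_arg` are hypothetical and do not come from the spec or the reference implementation.

```python
from typing import Any, Dict, Sequence


def partition_field_to_json(
    source_ids: Sequence[int], field_id: int, name: str, transform: str
) -> Dict[str, Any]:
    """Build the JSON object for a partition field following notes 1 and 2."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source column id")

    obj: Dict[str, Any] = {}
    if len(source_ids) == 1:
        # Note 1: single-argument transform, `source-ids` is omitted.
        obj["source-id"] = source_ids[0]
    else:
        # Note 2: multi-argument transform, `source-id` is the -1 placeholder.
        obj["source-id"] = -1
        obj["source-ids"] = list(source_ids)

    obj["field-id"] = field_id
    obj["name"] = name
    obj["transform"] = transform
    return obj


single = partition_field_to_json([1], 1000, "id_bucket", "bucket[16]")
multi = partition_field_to_json([2, 3], 1001, "multi", "my_multi_arg")  # hypothetical transform
```

The single-argument result matches the example object in the table above, so existing metadata would be written unchanged.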

emkornfield (Contributor), Jan 27, 2024

Small nitpick: it seems it would be better to choose a field ID from the existing range of reserved field IDs (e.g. MAX_INT-200) than to use -1, which as far as I can tell is still potentially a valid field ID according to the spec (I might have missed it, but field IDs seem to be defined simply as integers).

Contributor

I'd prefer using the first column from the source ID list instead of a fake ID. That way older readers at least see that the transform is associated with one of the correct columns.

Contributor

I think using a valid source ID here would lead to incorrect results for old clients if a predicate is specified on the column. IIUC, an invalid ID here ensures that reads are either correct or fail, which seems like better semantics if the aim is forward compatibility.

Contributor Author

> It seems that it would be better to choose a field ID from the existing range for reserved field IDs (e.g. MAX_INT-200) than to use -1,

Per my understanding, multi-arg transforms will mostly get new transform names rather than reuse existing ones. Older readers will treat such a multi-arg transform as an UnknownTransform; the persisted source-id is only there to keep old code happy (see this reply as well: #8579 (comment)). So the value of source-id is just a placeholder and doesn't carry much meaning. It could be a field ID from the reserved range or a negative one, since the current reference implementation wouldn't produce a negative field ID.
I simply chose -1 because it seems more natural and doesn't require adding a somewhat odd reserved field to MetadataColumns.java, but I think we can always make a follow-up PR if there are valid concerns/solutions.

Contributor

If it is a different transform (I wasn't clear on the final status there), I think that makes this less important, so at this point it is bikeshedding, but I think having a clear signal that this field is meaningless might be useful. For V3, it might be worthwhile to consider dropping the backwards compatibility.

szehon-ho (Member), Jan 30, 2024

Yeah, I also think older readers will not be able to make any use of the new multi-arg transforms. So they would only be able to read new tables (though without any partition pushdown), and would fail to write. So I agree, it is moot what to even put for source-id here, though I think choosing a reserved one is a good idea. Is it just so the Java reference implementation can properly deserialize it as Unknown and give a better exception message?

Is the idea to write source-id as -1/reserved in v1/v2, and in v3 to write source-ids for everything and drop source-id?

I guess this is a more general discussion and can wait for the new spec PR clarifying v1/v2 vs. v3 behavior.
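
The older-reader behavior described in this thread (read without partition pushdown, fail on write) can be sketched as follows. This is an illustration of the discussion only, not the reference implementation: the set of recognized transform names, the `UnknownTransformError` class, the helper functions, and the `my_multi_arg` transform name are all assumptions made for the example.

```python
import re

# Transform names an older reader is assumed to recognize.
_KNOWN_TRANSFORM = re.compile(
    r"^(identity|void|year|month|day|hour|bucket\[\d+\]|truncate\[\d+\])$"
)


class UnknownTransformError(ValueError):
    """Hypothetical error raised when a writer meets a transform it cannot apply."""


def is_known(transform: str) -> bool:
    return _KNOWN_TRANSFORM.match(transform) is not None


def can_push_down(field: dict) -> bool:
    # An unknown transform cannot be used to prune partitions, so the reader
    # would fall back to scanning without partition pushdown.
    return is_known(field["transform"])


def check_writable(field: dict) -> None:
    # Writing requires computing partition values, which is not possible
    # for a transform the reader does not understand.
    if not is_known(field["transform"]):
        raise UnknownTransformError(f"cannot write with transform {field['transform']!r}")


old_style = {"source-id": 1, "field-id": 1000, "name": "id_bucket", "transform": "bucket[16]"}
new_style = {"source-id": -1, "source-ids": [2, 3], "field-id": 1001,
             "name": "multi", "transform": "my_multi_arg"}  # hypothetical transform name
```

In this sketch the placeholder `source-id` is never consulted for the unknown transform, which matches the point made above that its exact value matters little to older readers.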


### Sort Orders

Sort orders are serialized as a list of JSON objects, each of which contains the following fields:
@@ -1147,7 +1152,11 @@ Each sort field in the fields list is stored as an object with the following pro

|Field|JSON representation|Example|
|--- |--- |--- |
|**`Sort Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br />&nbsp;&nbsp;` "transform": "bucket[4]",`<br />&nbsp;&nbsp;` "source-id": 3,`<br />&nbsp;&nbsp;` "direction": "desc",`<br />&nbsp;&nbsp;` "null-order": "nulls-last"`<br />`}`|
|**`Sort Field`** [1,2]|`JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br />&nbsp;&nbsp;` "transform": "bucket[4]",`<br />&nbsp;&nbsp;` "source-id": 3,`<br />&nbsp;&nbsp;` "direction": "desc",`<br />&nbsp;&nbsp;` "null-order": "nulls-last"`<br />`}`|

Notes:
1. For sort fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
2. For sort fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
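
On the reading side, a parser that understands both properties might resolve the source columns of a sort field as sketched below. The helper name `sort_field_source_ids` and the transform name `my_multi_arg` are hypothetical; the sketch assumes `source-ids` takes precedence whenever it is present.

```python
from typing import Any, Dict, List


def sort_field_source_ids(field: Dict[str, Any]) -> List[int]:
    """Return the source column ids of a sort field parsed from JSON."""
    if "source-ids" in field:
        # Multi-argument transform: `source-id` is only the -1 placeholder.
        return list(field["source-ids"])
    # Single-argument transform: `source-ids` is omitted.
    return [field["source-id"]]


single_arg = {
    "transform": "bucket[4]",
    "source-id": 3,
    "direction": "desc",
    "null-order": "nulls-last",
}
multi_arg = {
    "transform": "my_multi_arg",  # hypothetical transform name
    "source-id": -1,
    "source-ids": [3, 4],
    "direction": "asc",
    "null-order": "nulls-first",
}
```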

The following table describes the possible values for some of the fields within a sort field:
