Docs: Update defaults for distribution mode #10575
Conversation
docs/docs/spark-configuration.md
Outdated
| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write |
| compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write |
| compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write |
| distribution-mode | `Range` if Sort Order Defined; `Hash` if Partition Defined; `None` otherwise | Override this table's distribution mode for this write |
| distribution-mode | `Range` if Sort Order Defined; `Hash` if Partition Defined; `None` otherwise | Override this table's distribution mode for this write |
| distribution-mode | `Range` if sort order is defined; `Hash` if partition is defined; `None` otherwise | Override this table's distribution mode for this write |
docs/docs/configuration.md
Outdated
| write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes |
| write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes |
| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
| write.distribution-mode | none (see engines for specific defaults) | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
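As a hedged illustration (the table name `db.sample` is an assumption), the table-level default above can be set with Iceberg's Spark SQL DDL; engines may still override it per write:

```sql
-- Set the table-level default distribution mode.
-- An engine such as Spark can still override this for a single write
-- (e.g. via the `distribution-mode` write option).
ALTER TABLE db.sample SET TBLPROPERTIES (
    'write.distribution-mode' = 'hash'
);
```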
I am wondering if it would be helpful to hyperlink the engine-specific defaults here, or in a note, for convenient lookup, like:
[Spark](spark-configuration.md#write-options)
Added
@RussellSpitzer do you have time to take a look at this?
docs/docs/configuration.md
Outdated
| write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes |
| write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes |
| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
| write.distribution-mode | none. Engines may override this default, for example [Spark](spark-configuration.md#write-options) | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
I didn't really like this description before, but I think the change is good. It might be nice, in a follow-up, to reword it, since it isn't clear what "distribution of write data" means.
docs/docs/spark-configuration.md
Outdated
| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write |
| compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write |
| compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write |
| distribution-mode | `Range` if Sort Order Defined; `Hash` if Partition Defined; `None` otherwise | Override this table's distribution mode for this write |
Can we change this to a link to the spark-writes section? That may be clearer.
@RussellSpitzer good point. I just linked from both sections; let me know what you think now.
Force-pushed 3d2b21f to d5aa46a
Force-pushed d5aa46a to ab56f03
Looks good to me now. :) Feel free to merge.

Thanks @RussellSpitzer @dramaticlly @ajantha-bhat for reviews!
In recent releases of Iceberg-Spark, the Spark defaults for distribution mode have changed, for example in #7637 (and earlier changes).
I have seen many questions about why an extra shuffle appears in writes, and it is often due to this change in defaults. One source of confusion is that the docs for the table property write.distribution-mode say the default is none, so this PR adds more context to explain the engine-specific defaults.
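To make the precedence concrete, here is a minimal pure-Python sketch (not Iceberg's actual code; the function name and parameters are illustrative) of how the effective mode is resolved per the docs updated in this PR:

```python
def effective_distribution_mode(write_option=None, table_property=None,
                                has_sort_order=False, is_partitioned=False):
    """Resolve the distribution mode for a Spark write to an Iceberg table.

    Precedence, per the updated docs:
      1. the per-write option `distribution-mode`
      2. the table property `write.distribution-mode`
      3. the Spark engine default: range if a sort order is defined,
         hash if the table is partitioned, none otherwise
    """
    if write_option is not None:
        return write_option
    if table_property is not None:
        return table_property
    if has_sort_order:
        return "range"
    if is_partitioned:
        return "hash"
    return "none"

# An unpartitioned, unsorted table falls back to "none"
print(effective_distribution_mode())                      # none
# A partitioned table defaults to "hash" under recent Spark defaults,
# which is where the unexpected extra shuffle comes from
print(effective_distribution_mode(is_partitioned=True))   # hash
# An explicit table property wins over the engine default
print(effective_distribution_mode(table_property="none",
                                  is_partitioned=True))   # none
```

This is why a table with no explicit property can still shuffle on write: the engine default, not the table property default, applies.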