
Conversation

@szehon-ho (Member) commented Jun 27, 2024

In recent releases of Iceberg-Spark, the Spark defaults for distribution mode have changed, through changes like #7637 (and earlier ones).

I have seen many questions about why an extra shuffle is included in writes, and it is often due to this change in default. One source of confusion I have seen is that the doc for the table property write.distribution-mode says the default is none, so this PR adds more detail there to explain this.

github-actions bot added the docs label Jun 27, 2024
| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write |
| compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write |
| compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write |
| distribution-mode | `Range` if Sort Order Defined ; `Hash` if Partition Defined; `None` otherwise | Override this table's distribution mode for this write |
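The default selection described in the last row (Range if a sort order is defined, Hash if the table is partitioned, None otherwise) can be sketched in plain Python. This is an illustrative sketch of the decision rule only, not Iceberg's actual API; the function name and parameters are hypothetical.

```python
# Hypothetical sketch of Spark's default distribution-mode selection
# when nothing is configured explicitly: Range if the table has a sort
# order, Hash if it is partitioned, None otherwise.
def default_distribution_mode(has_sort_order: bool, is_partitioned: bool) -> str:
    if has_sort_order:
        return "range"   # range-distribute by sort key
    if is_partitioned:
        return "hash"    # hash-distribute by partition key
    return "none"        # no shuffle before the write

print(default_distribution_mode(True, False))   # range
print(default_distribution_mode(False, True))   # hash
print(default_distribution_mode(False, False))  # none
```

Note that a sort order takes precedence over partitioning in this rule, which is why sorted partitioned tables see a range shuffle rather than a hash shuffle.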
Member

Suggested change
| distribution-mode | `Range` if Sort Order Defined ; `Hash` if Partition Defined; `None` otherwise | Override this table's distribution mode for this write |
| distribution-mode | `Range` if sort order is defined ; `Hash` if partition is defined; `None` otherwise | Override this table's distribution mode for this write |

| write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes |
| write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes |
| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
| write.distribution-mode | none (see engines for specific defaults) | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
Contributor

I am wondering if it would be helpful to hyperlink the engine-specific defaults here, or in the notes, for the convenience of lookup, like:

[Spark](spark-configuration.md#write-options)

Member Author

Added

@szehon-ho (Member Author)

@RussellSpitzer do you have time to take a look at this?

| write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes |
| write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes |
| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
| write.distribution-mode | none. Engines may override this default, for example [Spark](spark-configuration.md#write-options) | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key; __range__: range distribute by partition key, or by sort key if the table has a SortOrder |
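The "engines may override this default" wording implies a resolution order between the several places the mode can be set. A hedged sketch of that precedence (assuming a per-write option wins over a session conf, which wins over the table property, which wins over the engine's computed default; this mirrors the discussion above but is illustrative, not Iceberg's actual resolver):

```python
# Hypothetical resolver for write.distribution-mode. The parameter
# names and precedence order are assumptions for illustration, not
# Iceberg's real configuration API.
def resolve_distribution_mode(write_option=None, session_conf=None,
                              table_property=None, engine_default="none"):
    # First explicitly-set value wins, in decreasing precedence.
    for value in (write_option, session_conf, table_property):
        if value is not None:
            return value
    return engine_default

# Table property unset, but the engine (e.g. Spark) computes "hash"
# for a partitioned table:
print(resolve_distribution_mode(engine_default="hash"))               # hash
# An explicit per-write option overrides everything:
print(resolve_distribution_mode(write_option="none",
                                engine_default="hash"))               # none
```

This is why the table-property doc alone was confusing: a user who only reads "default: none" never sees the engine-computed default that actually introduces the shuffle.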
Member

I didn't really like this description before, but I think the change is good. It might be nice in a follow-up to revise it further, since it isn't clear what "distribution of write data" means.

| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write |
| compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write |
| compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write |
| distribution-mode | `Range` if Sort Order Defined ; `Hash` if Partition Defined ; `None` otherwise | Override this table's distribution mode for this write |
Member

Can we change this to a link to the spark-writes section? That may be clearer.

Member Author

@RussellSpitzer good point, I just linked from both sections, let me know what you think now?

szehon-ho force-pushed the distribution_mode_doc branch from 3d2b21f to d5aa46a on July 16, 2024 20:07
szehon-ho force-pushed the distribution_mode_doc branch from d5aa46a to ab56f03 on July 16, 2024 20:39
@RussellSpitzer (Member)

Looks good to me now. :) Feel free to merge

szehon-ho merged commit 4a0ae22 into apache:main on Jul 16, 2024
@szehon-ho (Member Author)

Thanks @RussellSpitzer @dramaticlly @ajantha-bhat for reviews!

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024