Conversation

@openinx (Member) commented Jan 11, 2021

Provides a switch in org.apache.flink.sink.FlinkSink to shuffle by partition key, so that each partition/bucket will be written by only one task. This greatly reduces the number of small files produced by the partitioned fanout write policy in the Flink sink.
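
A minimal sketch of the switch's effect (the method name maybeShuffle and the helper partitionValueOf are illustrative, not the PR's actual API):

// Sketch only: when the switch is on, key rows by their partition value so that
// each partition is written by exactly one writer subtask.
DataStream<RowData> maybeShuffle(DataStream<RowData> rows, boolean shuffleByPartition) {
  if (!shuffleByPartition) {
    return rows;
  }
  // partitionValueOf is a hypothetical helper that derives the partition key string for a row.
  return rows.keyBy(row -> partitionValueOf(row));
}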

@kbendick (Contributor) left a comment

This looks great and will be really useful.

SimpleDataUtil.createRecord(3, "ccc")
));

Assert.assertEquals("Should 1 data file in partition 'aaa'", 1, partitionFiles(tableName, "aaa").size());

Contributor

Small nit: consider "There should be [only] 1 data file in partition 'aaa'", and similarly for the other files.

WRITE_SHUFFLE_BY_PARTITION,
WRITE_SHUFFLE_BY_PARTITION_DEFAULT);
} else {
return shuffleByPartition;

Contributor

Somewhat unrelated question: is it possible to set these values at the cluster level (like with Flink's properties), or can they only be set as a code-level property and then a table property?

Member Author

I think providing this option at the job level and the table level is flexible enough; shuffling by partition for every job in a cluster seems too coarse-grained.

Contributor

That's a fair assessment.

I ask because we typically use job clusters at my work and try to specify as much configuration as possible in the cluster's config. This mostly provides one easy place to track configuration, and it is definitely tied to how our build, deployment, and configuration systems are set up internally.

Outside of job clusters, I would agree that it's too coarse grained. And having the possibility to set it as a job config or a table config is good enough. I'm not even sure if Flink would accept arbitrary configurations at the cluster level (that it isn't aware of).

public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;

public static final String WRITE_SHUFFLE_BY_PARTITION = "write.shuffle-by.partition";

Contributor

I think we should make sure all query engines are aligned with this.
In my view, we should support the following cases:

  • local sort using the table sort order
  • repartition using partition spec and local sort by the table sort order
  • global sort using the table sort order

Contributor

In Spark, our plan was to support the following commands:

-- global
ALTER TABLE WRITE
ORDERED BY p1, bucket(id, 128), c1, c2

-- hash + local sort
ALTER TABLE WRITE
DISTRIBUTED BY p1, bucket(id, 128)
LOCALLY ORDERED BY p1, bucket(id, 128), c1, c2

-- local sort
ALTER TABLE WRITE
LOCALLY ORDERED BY p1, bucket(id, 128), c1, c2

@stevenzwu (Contributor) commented Jan 12, 2021

+1 on more generalized semantics. The current option only works if the data is relatively evenly distributed across table partitions. Otherwise, heavy data skew can be problematic for the writers. The other problem is that the effective writer parallelism is now limited by the number of partition values. Let's say the writer parallelism is 100 but there are only 10 unique partition values; then only 10 writer subtasks will receive data.

I will add some notes on the streaming write mode. In a streaming job, it is probably impossible to do true sorting. Instead, what can be useful is some sort of "groupBy/bucketing" shuffle in the streaming sink. It can help reduce the number of concurrently open files per writer and improve read performance (predicate pushdown) through better data locality.

E.g., a table is partitioned by (event_date, country). Without the shuffle, each writer task can write to ~200 files/countries. However, a simple keyBy is also problematic, as it can produce heavy data skew for countries like the US. Instead, we should calculate stats for each bucket/country and distribute the data based on the weight of each bucket. E.g., we may allocate 100 downstream subtasks for the US while allocating 1 downstream subtask for multiple small countries (like bin packing).

This can also be extended to non-partition columns (as logical partitioning), which can improve read performance through filtering. Similar to the above example, but with the tweak that country is no longer a partition column; a groupBy/bucketing shuffle can still help improve data locality.

I was thinking about a groupBy/orderBy operator where each subtask (running in a taskmanager) constantly reports local statistics to the operator coordinator (running in the jobmanager), which then does the global aggregation and notifies the subtasks with the globally aggregated stats.
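
A rough sketch of the weighted assignment idea described above (all class and method names are hypothetical; this is not part of the PR):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of weight-based routing: hot keys (e.g. "US") get a proportional
// block of writer subtasks, while unseen or tiny keys fall back to plain hashing.
// A real implementation would normalize weights and handle overlapping ranges.
class WeightedKeyRouter {
  private final Map<String, int[]> keyToRange = new HashMap<>(); // key -> {startSubtask, slotCount}

  // Rebuild the routing table from globally aggregated key weights (fractions summing to ~1).
  void update(Map<String, Double> keyWeights, int parallelism) {
    keyToRange.clear();
    int next = 0;
    for (Map.Entry<String, Double> entry : keyWeights.entrySet()) {
      // Give each key at least one subtask, and hot keys a proportional share.
      int slots = Math.max(1, (int) Math.round(entry.getValue() * parallelism));
      keyToRange.put(entry.getKey(), new int[] {next, slots});
      next = (next + slots) % parallelism;
    }
  }

  // Choose a subtask for one record: a random slot within the key's assigned block.
  int route(String key, int parallelism) {
    int[] range = keyToRange.get(key);
    if (range == null) {
      return Math.floorMod(key.hashCode(), parallelism); // fall back to plain hashing
    }
    return (range[0] + ThreadLocalRandom.current().nextInt(range[1])) % parallelism;
  }
}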

Contributor

I have the same concern as @stevenzwu that a hash distribution by partition spec would co-locate all entries for the same partition in the same task, potentially leading to having too much data in a task. The global sort in Spark would be a better option here for batch jobs as it will do skew estimation and the sort order can be used to split data for the same partition across multiple tasks.

To sum up, I think we should be flexible and support 3 modes to cover different use cases.

@aokolnychyi (Contributor) commented Jan 12, 2021

cc @jacques-n @omalley @rdblue as it is related to the discussion we had during the last sync.

Contributor

FYI @electrum. You may be interested in this discussion for recommended write behavior from table config.

Member Author

> Flink may eventually provide a way to order within data files, but I think that is less important than clustering data across files so that data files can be skipped in queries.

Agreed. Sorting within a data file would be really helpful for page skipping, but it would introduce more cost for a streaming job. Range distribution by sort keys is coarser-grained, yet it is good enough for a streaming job to cluster keys for filtering among data files; I think it's the better-balanced choice when trading off write efficiency against read performance.

It also makes sense to me to rewrite those range-distributed data files into row-ordered files if there are heavy reads that depend on them.

Contributor

@rdblue thanks for the pointer. Here are my thoughts on how this would work for Trino (formerly Presto SQL).

Trino does streaming execution between stages -- there is no materialized shuffle phase. This means that global sorting would only be possible using a fixed range, not based on statistics, so it would be vulnerable to skew. I'd like to understand the use case for global "sort" compared to "partition".

For local sorting, I see two choices:

  1. Write arbitrarily large files. Use a fixed size in-memory buffer, sort when full, write to temporary file, then merge files at end. There may be multiple merge passes in order to limit the number of files read at once during the merge. This is what we do for Hive bucketed-sorted tables, since sorting per bucket is required.
  2. Write multiple size-limited files. Use a fixed size in-memory buffer, sort when full, write final output file. Repeat until all input data for writer has been consumed.

I would prefer the second option as it is simpler and uses fewer resources. It satisfies the property that each file is sorted and helps with compression and within-file filtering. The downside is that there are more files, but if they are of sufficient size, it shouldn't affect reads as we split files anyway when reading.
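
A minimal sketch of that second option (the writer and flusher interfaces here are hypothetical, not Trino code):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of option 2: buffer rows, sort the buffer when full, and flush each
// sorted buffer as its own size-limited file.
class SortedFileWriter<T> {
  private final int maxBufferedRows;
  private final Comparator<T> sortOrder;
  private final FileFlusher<T> flusher;        // hypothetical: writes one file per call
  private final List<T> buffer = new ArrayList<>();

  SortedFileWriter(int maxBufferedRows, Comparator<T> sortOrder, FileFlusher<T> flusher) {
    this.maxBufferedRows = maxBufferedRows;
    this.sortOrder = sortOrder;
    this.flusher = flusher;
  }

  void write(T row) {
    buffer.add(row);
    if (buffer.size() >= maxBufferedRows) {
      flush();                                 // each flushed file is individually sorted
    }
  }

  void close() {
    if (!buffer.isEmpty()) {
      flush();
    }
  }

  private void flush() {
    buffer.sort(sortOrder);
    flusher.writeFile(buffer);                 // produces one sorted, size-limited file
    buffer.clear();
  }

  interface FileFlusher<T> {
    void writeFile(List<T> sortedRows);
  }
}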

Another option is to sort data using a fixed size buffer before writing each batch of rows. This would help with compression and within-file filtering, but wouldn't provide a guarantee on sorting for readers.

Contributor

@electrum, as far as what a "local sort" means, I think option 2 sounds good to me for a task-level sort. If that sort is needlessly expensive, then it is okay for Trino to skip it. But I think that if a table has a defined sort order, the right thing would be for Trino to apply it.

For data distribution, it sounds like Trino will only support none and hash modes in the short term. That's reasonable given that you can't stage data and use it twice. Even with shuffle data reuse, global sort in Spark is quite expensive in some cases (doing a large join twice, for example). Eventually, we want to get to where the table metadata has a sketch of the data distribution so you can use that to get ranges for a global ordering.

Member Author

> Eventually, we want to get to where the table metadata has a sketch of the data distribution so you can use that to get ranges for a global ordering.

I was also thinking about how to partition (-oo, +oo) into several even key ranges (partition key ranges or sort key ranges) for Flink. This idea seems similar to @stevenzwu's list-of-values column stats from #2064 (comment). Yes, that would help a lot if we had such fine-grained column range stats.
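
For illustration, a sketch of deriving key ranges from sampled sort keys (the class and method names are hypothetical; nothing here is part of this PR):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: derive numRanges-1 split points from a sample of sort keys so that
// each writer subtask receives a roughly even share of the key space (-oo, +oo).
class RangeBoundaries {
  static <K extends Comparable<K>> List<K> fromSample(List<K> sampledKeys, int numRanges) {
    Collections.sort(sampledKeys);
    List<K> boundaries = new ArrayList<>();
    for (int i = 1; i < numRanges; i++) {
      // Evenly spaced quantiles of the sample become the range split points.
      int idx = (int) ((long) i * sampledKeys.size() / numRanges);
      boundaries.add(sampledKeys.get(Math.min(idx, sampledKeys.size() - 1)));
    }
    return boundaries;
  }

  // Route a key to the range whose upper boundary is the first split point >= key.
  static <K extends Comparable<K>> int rangeFor(K key, List<K> boundaries) {
    int idx = Collections.binarySearch(boundaries, key);
    return idx >= 0 ? idx : -idx - 1;
  }
}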

The github-actions bot added the API label on Jan 19, 2021
@openinx changed the title from "Flink: Add option to shuffle by partition key in iceberg sink." to "Flink: Support write.distribution-mode." on Jan 19, 2021
* suitable for the scenarios where rows are located into different partitions with skew distribution.
*/
public enum DistributionMode {
NONE("none"), HASH("hash-partition"), RANGE("range-partition");

Contributor

As I noted in the comment thread, I think that we should use "partition" to describe only table partitions. Otherwise, we are going to create confusion. We can use "hash" and "range" here if there is consensus that "partition" and "sort" are not clear, but I don't think that we should use the term "partition" to refer to distribution within a processing engine.

@stevenzwu (Contributor) commented Jan 19, 2021

"Partition" and "sort" aren't very clear to me. Both "hash partition" and "range partition" are "partitions". But in the table, they are listed as "partition" and "sort".

The general concepts that @rdblue defined in the table above are still very good guidance for us to think about those dimensions. But if Flink and Spark are going to support different behaviors, maybe it is better for them to define different values that more accurately describe the behavior.

@aokolnychyi (Contributor) commented Jan 19, 2021

+1 on not using the term "partition" when talking about distribution.

W.r.t. naming, I did ask myself what the best names would be here. I tend to like "hash" and "range" a bit more, as it may not be clear that "partition" refers to the table's partition spec.

I guess the real question here is what this table property controls. Are we allowing users to control whether to use hash or range distribution, or do we control whether the distribution is based on the partition spec or the sort order?

Contributor

I think it controls whether we use hash distribution or range distribution. I agree that's more clear from a developer's perspective. My concern is that users won't know what hash and range are, but they do understand what a partition is and what sorting is.

Let's go with hash and range for now. I think we can explain it well enough in docs, and we can also add aliases that are more clear if needed.

Contributor

Sounds good to me.

Contributor

What if we named the config write.shuffle-mode? Would that make the hash vs. range distinction clearer to users?

Member Author

@stevenzwu, I think write.shuffle-mode is enough to express the write behavior of Flink, but not the write behavior of Spark, because Spark will also distribute those records with a local sort or a global sort.

+1 on keeping write.distribution-mode and using the hash and range values for now (though range does not fully express the sort semantics from Spark, I cannot think of a better word that captures the exact meaning for both Flink and Spark).

return name;
}

public static DistributionMode fromName(String name) {

Contributor

If we used hash and range, then this would just need to use valueOf(name.toUpperCase(Locale.ROOT)) (with a null check, of course).
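
A minimal sketch of that suggestion (assuming the enum constants become NONE, HASH, and RANGE):

import java.util.Locale;

// Sketch only: resolve a case-insensitive mode name to the enum constant, with a null check.
public static DistributionMode fromName(String modeName) {
  if (modeName == null) {
    throw new IllegalArgumentException("Invalid distribution mode: null");
  }
  return DistributionMode.valueOf(modeName.toUpperCase(Locale.ROOT));
}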


RowType flinkSchema;
case RANGE:
throw new UnsupportedOperationException("The write.distribution-mode=range is not supported in flink now");

Contributor

By throwing an exception here, users could break jobs by setting the distribution mode. Is that okay? I guess it wouldn't affect running jobs because they are already configured.

Member Author

There are two cases:
Case 1: people configure the distribution mode at the job level to RANGE; since we don't support it yet, we'd better throw an UnsupportedOperationException for now.

Case 2: people change an existing table's properties from NONE to RANGE. Running Flink jobs won't be affected until they restart, but any newly started Flink job would be forced to use NONE or HASH. It's not friendly to break all existing jobs on restart, so let me add a warn log and just keep the default NONE behavior.
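
A rough sketch of that fallback in the sink builder (the method name and LOG are assumed; PartitionKeySelector and the arguments follow the diff in this PR):

// Sketch: job-level RANGE fails fast elsewhere; a table-level RANGE logs a warning and
// falls back to the NONE behavior so restarted jobs keep running.
private DataStream<RowData> distributeDataStream(DataStream<RowData> input,
                                                 DistributionMode mode,
                                                 PartitionSpec partitionSpec,
                                                 Schema iSchema,
                                                 RowType flinkRowType) {
  switch (mode) {
    case NONE:
      return input;
    case HASH:
      return partitionSpec.isUnpartitioned()
          ? input
          : input.keyBy(new PartitionKeySelector(partitionSpec, iSchema, flinkRowType));
    case RANGE:
      LOG.warn("Falling back to 'none' distribution mode because 'range' is not supported in Flink now");
      return input;
    default:
      throw new IllegalArgumentException("Unrecognized distribution mode: " + mode);
  }
}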

config.put(CatalogProperties.WAREHOUSE_LOCATION, "file://" + hiveWarehouse.getRoot());
config.put(CatalogProperties.HIVE_URI, getURI(hiveConf));
}
config.put(CatalogProperties.WAREHOUSE_LOCATION, String.format("file://%s", warehouseRoot()));

Contributor

Is this change related? This looks like a fix for something else.

Member Author

It's not a fix, just for abstraction, so that we could get all data files under the given partition here: https://github.com/apache/iceberg/pull/2064/files#diff-0aaa93576853d5b379da121bc5d6161eb888fe15b88e3597374ed894d8c94917R275

@rdblue (Contributor) commented Jan 19, 2021

This looks nearly ready. Mainly, I would like to get consensus on the config values.

if (partitionSpec.isUnpartitioned()) {
return input;
} else {
return input.keyBy(new PartitionKeySelector(partitionSpec, iSchema, flinkRowType));

Contributor

I still have concerns about supporting this hash distribution via keyBy on the partition key, due to the data skew problem mentioned in the discussion thread.

Contributor

This isn't going to cover all cases, but I think it is a necessary first step. Data skew is going to require range distribution.

Contributor

that is fair

@rdblue (Contributor) commented Jan 20, 2021

I think there is consensus around using "none", "hash", and "range" for the distribution mode. Once that's implemented, I think this is ready to commit. I also had some other minor comments.

@rdblue merged commit c75ac35 into apache:master on Jan 20, 2021
@rdblue (Contributor) commented Jan 20, 2021

Looks great, thanks for working on this @openinx!

And thanks to everyone that helped discuss the configuration!
