[FLINK-35237] Allow Sink to Choose HashFunction in PrePartitionOperator #3414
Conversation
lvyanquan left a comment
Thanks for this great contribution, LGTM. Left some minor comments about the Javadoc.
flink-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/HashFunctionProvider.java
flink-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/HashFunction.java
lvyanquan left a comment
LGTM.
And CC @PatrickRen @leonardBang
What's more, considering that the number of buckets and the parallelism may not be consistent, should we remove the constraint on EventPartitioner?
Although the number of buckets and the parallelism will differ, we can only distribute based on parallelism rather than buckets, right? We have already distributed the hash values across the parallel subtasks here, so I think there is no need to change anything.
Got it, there is indeed no need for adjustment.
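To illustrate the point above, here is a minimal sketch (hypothetical names, not the actual EventPartitioner code) of routing purely by downstream parallelism, regardless of how many buckets the sink table has:

public class ChannelSelectionSketch {
    // The pre-partition step can only route by downstream parallelism, so the
    // event's hash value is mapped onto the range [0, parallelism).
    static int selectChannel(int eventHash, int downstreamParallelism) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(eventHash, downstreamParallelism);
    }

    public static void main(String[] args) {
        // e.g. an event hash for table "db.table" routed across 4 parallel subtasks
        System.out.println(selectChannel("db.table".hashCode(), 4));
    }
}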
flink-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/HashFunctionProvider.java
flink-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/HashFunctionProvider.java
Can this PR be merged before the other PR? Both PRs are marked for inclusion in version 3.2, but the other PR depends on this one. I will need some time to make the necessary adjustments.
@leonardBang @PatrickRen can you help to review and merge this?
@yuxiqian CC.
yuxiqian left a comment
Thanks for @dingxin-tech's contribution. I wonder if the DefaultHashFunctionProvider implementation could be improved when migrating from the existing PrePartitionOperator#HashFunction.
...k-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/DefaultHashFunctionProvider.java
...k-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/DefaultHashFunctionProvider.java
...k-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/DefaultHashFunctionProvider.java
flink-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/HashFunction.java
// --------------------------------------------------------------------------------------------
default void open() {}

default void close() {}
Why does a simple Provider interface need these lifecycle methods? Do we use them in any implementation classes?
Since we provide the getHashFunction method based on TableId, connector implementers might use the TableId to obtain the actual schema from the database and perform some caching. We can establish connections and initialize caches in the open method. These two lifecycle methods were added following @lvyanquan's suggestion, and he might respond with further additions.
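For illustration only, a hypothetical sketch of the kind of implementation described here (the JDBC URL, method names, and signatures are assumptions, not the actual connector code): a provider that opens a connection in open(), caches per-table metadata, and hands out hash functions that depend only on the cached data.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

// Hypothetical provider illustrating why open()/close() lifecycle methods were proposed.
public class CachingHashFunctionProviderSketch {
    private Connection connection;
    private final Map<String, List<String>> primaryKeyCache = new ConcurrentHashMap<>();

    public void open() throws SQLException {
        // Establish the (expensive) connection once, up front.
        connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/demo", "user", "pwd");
    }

    public ToIntFunction<String> getHashFunction(String tableId) {
        List<String> primaryKeys = primaryKeyCache.computeIfAbsent(tableId, this::loadPrimaryKeys);
        // The returned function depends only on cached metadata, not on the live connection.
        return event -> Math.abs((tableId + ":" + primaryKeys + ":" + event).hashCode());
    }

    private List<String> loadPrimaryKeys(String tableId) {
        // In a real connector this would query the schema via `connection`; a constant is
        // returned here only to keep the sketch self-contained.
        return List.of("id");
    }

    public void close() throws SQLException {
        if (connection != null) {
            connection.close();
        }
    }
}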
Usually, it will need to obtain partition or bucket information based on the TableId. I wonder if it is possible to cache the catalog or connection here to reuse objects or connections.
A function with lifecycle management makes sense to me, but why should the factory care about resource management? Could we push the logic that uses TableId to obtain partition or bucket info into HashFunction itself? Having HashFunction own open/close may be better; what do you think?
Actually, HashFunctionProvider plays a role similar to a catalog, and HashFunction plays a role similar to a table, and the open/close methods are called on the catalog-like role.
Well, a Catalog is used as a metadata manager instance rather than a table factory; tables are not constructed by that instance. A metadata manager having its own lifecycle makes sense to me, but a factory owning a lifecycle confuses me a little. Coming back to the function itself, Flink also has many functions as well as function factories, and I don't see why a function factory needs to manage runtime resources, whereas a function managing its own resources is pretty common.
You can also look at the Flink code for:
- RichFunction & SourceFunctionProvider
- InputFormat & InputFormatProvider
So the key point is whether what we need is a metadata manager or a function factory. I agree that what we need is a function factory, though it may need the assistance of a specific metadata manager.
And the concern about resource reuse is indeed overthinking, because we can release the database connection after extracting the parameters needed for the calculation.
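A hypothetical sketch of that idea (the names, the JDBC URL, and the constant bucket count are assumptions for illustration): the factory borrows a connection only long enough to extract the parameters and then returns a pure function, so no provider-level lifecycle is needed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.function.ToIntFunction;

// Hypothetical: the provider is just a function factory; any connection it needs is
// opened and released inside getHashFunction.
public class ShortLivedConnectionProviderSketch {
    public ToIntFunction<String> getHashFunction(String tableId) {
        int bucketCount = readBucketCount(tableId);
        // The returned function captures only the extracted parameter, not the connection.
        return event -> Math.floorMod((tableId + "#" + event).hashCode(), bucketCount);
    }

    private int readBucketCount(String tableId) {
        // try-with-resources releases the connection as soon as the parameter is read.
        try (Connection conn =
                DriverManager.getConnection("jdbc:mysql://localhost:3306/demo", "user", "pwd")) {
            // In a real connector this would query the table's bucket/partition metadata;
            // a constant keeps the sketch self-contained.
            return 16;
        } catch (SQLException e) {
            throw new RuntimeException("Failed to read bucket count for " + tableId, e);
        }
    }
}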
Yeah, that's what I want to propose.
In my opinion, HashFunction should not have lifecycle methods. It is merely a hash function, not a Flink operator; it will be cached and then discarded automatically, so after creation it can be treated as a kind of "constant", and we should not concern ourselves with its lifecycle.
Based on this viewpoint, whether HashFunctionProvider should have lifecycle methods depends solely on whether we need to reuse resources created at runtime when constructing a HashFunction.
In fact, I also think the connectors currently have nothing that needs to be done in lifecycle methods, so I have removed them for now.
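As a rough sketch of the resulting shape under this reasoning (the generics and signatures here are assumed for illustration and may differ from the merged code):

// Illustrative only: the provider reduced to a plain function factory, and a hash
// function that is effectively a constant after creation.
interface HashFunctionProviderSketch<T> {
    // No open()/close() lifecycle methods; the provider only creates functions.
    HashFunctionSketch<T> getHashFunction(String tableId);
}

@FunctionalInterface
interface HashFunctionSketch<T> {
    // Maps an event to a hash used by the pre-partition operator for channel selection.
    int hashcode(T event);
}

Under this shape, the pre-partition operator would simply request one function per table from the provider and cache it for the lifetime of the operator.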
flink-cdc-common/src/main/java/org/apache/flink/cdc/common/sink/HashFunctionProvider.java
Hi @leonardBang, can you help to review again and merge this?
9084d60 to 5222dfa
5222dfa to 7ac39fd
I appended one commit to polish the interface and package. Could you also take a look?
Sure, and it looks good to me.
https://issues.apache.org/jira/browse/FLINK-35237