[FLINK-36931][cdc] FlinkCDC YAML supports batch mode #3812

aiwenmo · 2024-12-23T15:57:28Z

Premise

MysqlCDC supports snapshot mode

MysqlCDC in Flink CDC (MySqlSource) supports StartupMode.SNAPSHOT and is of Boundedness.BOUNDED, and can run in RuntimeExecutionMode.BATCH.

Streaming VS Batch

Streaming mode is suitable for job types including: jobs with high real-time requirements; in non-real-time scenarios, stateless jobs with many Shuffle steps; jobs that require continuous and stable data processing capabilities; jobs with small states, simple topologies and low fault tolerance costs.

Batch mode is suitable for job types including: in non-real-time scenarios, jobs with a large number of stateful operators; jobs that require high resource utilization; jobs with large states, complex topologies and low fault tolerance costs.

Expectation

Full snapshot synchronization

The FlinkCDC YAML job only reads the full snapshot data of the database and then writes it to the target database in Streaming or Batch mode. It is mainly used for full catch-up.

Currently, the SNAPSHOT startup strategy of the FlinkCDC YAML job can run correctly in the Streaming mode; it cannot run correctly in the Batch mode.

Full-incremental offline

On the first run, the FlinkCDC YAML job collects the full snapshot data plus the incremental log data from the final Offset of the full-incremental snapshot algorithm up to the current EndingOffset. On subsequent runs, it collects only from the last EndingOffset to the current EndingOffset.

The job runs in Batch mode. Users can schedule the job periodically, tolerate data delays for a certain period of time (such as hourly or daily), and ensure eventual consistency. Since the periodically scheduled incremental job only collects logs between the last EndingOffset and the current EndingOffset, duplicate full collection of data is avoided.
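
The offset bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: OffsetRangePlanner and RunPlan are hypothetical names that do not exist in Flink CDC, and offsets are modeled as plain long values, whereas real binlog offsets are file/position pairs or GTID sets.

```java
import java.util.OptionalLong;

/** Hypothetical platform-side bookkeeping for periodically scheduled batch runs. */
final class OffsetRangePlanner {

    /** One scheduled run: an optional snapshot phase plus a log range from startOffset to endOffset. */
    static final class RunPlan {
        final boolean includeSnapshot;
        final long startOffset;
        final long endOffset;

        RunPlan(boolean includeSnapshot, long startOffset, long endOffset) {
            this.includeSnapshot = includeSnapshot;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // EndingOffset recorded by the previous run; empty before the first run.
    private OptionalLong lastEndingOffset = OptionalLong.empty();

    /**
     * Plans the next run.
     *
     * @param snapshotFinalOffset offset at which the snapshot phase ends (only used on the first run)
     * @param currentEndingOffset latest log offset at scheduling time
     */
    RunPlan planNextRun(long snapshotFinalOffset, long currentEndingOffset) {
        RunPlan plan;
        if (lastEndingOffset.isPresent()) {
            // Subsequent runs: logs only, continuing from the previous EndingOffset.
            plan = new RunPlan(false, lastEndingOffset.getAsLong(), currentEndingOffset);
        } else {
            // First run: full snapshot, then logs from the snapshot's final offset.
            plan = new RunPlan(true, snapshotFinalOffset, currentEndingOffset);
        }
        lastEndingOffset = OptionalLong.of(currentEndingOffset);
        return plan;
    }
}
```

Because each run starts exactly where the previous one ended, no log range is collected twice and the snapshot is taken only once.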

Test

Full snapshot synchronization in Batch mode

After removing the PartitionOperator, all operators will be chained into one PipelinedRegion and can run correctly;
When there are multiple PipelinedRegions, only the first PipelinedRegion is in the jobgraph and it cannot run correctly;
After removing the SchemaOperator, when there are multiple PipelinedRegions, a correct jobgraph can also be generated, but the sink requires the registration of the coordinator operator.

Solution

Use StartupMode.SNAPSHOT + Streaming for full snapshot synchronization

There is no need to modify the source code. For MysqlCDC, after specifying StartupMode.SNAPSHOT, the full snapshot synchronization job of the entire database can be run in the streaming mode. Although it is not the optimal solution, this capability can be achieved currently.

Expand the FlinkPipelineComposer applicable to the Batch mode to support full Batch synchronization

Topology graph: Source -> PreTransform -> PostTransform -> Router -> PartitionBy -> Sink

There are no change events in the Batch mode, and Schema Evolution does not need to be considered. In addition, the automatic table creation is completed before the job starts.
The field derivation of transform can be placed before the job starts instead of during runtime. Other operations such as the derivation of Router can also be placed before the job starts.
Workload: Implement the Batch construction strategy of FlinkPipelineComposer. Router needs to be independent, and Sink needs to be extended or transformed to support the implementation that does not require a coordinator (it would be better if Batch writing can be achieved).

Expand StartupMode to support users specifying the Offset range to support incremental offline synchronization

Allow users to specify the binlog Offset range to collect. The user's own platform then records the EndingOffset of each execution and handles the periodic scheduling.

Discussion

1. Is it necessary to implement support for Batch mode at all, given that the benefits of Batch may be small or its performance may be no better than Streaming? Specifically, which Batch optimizations can be used?

2. Should the full-incremental offline method be implemented (so that users can periodically schedule incremental log synchronization)?

Code implementation

Topology graph: Source -> BatchPreTransform -> PostTransform -> SchemaBatchOperator -> PartitionBy(Batch) -> BatchSink
ps: The data flow only contains CreateTableEvent and DataChangeEvent (insert).

Implementation ideas

1. The Source first sends all CreateTableEvents, then sends the snapshot data.
2. BatchPreTransform does not need to cache state or restore it on resume; PostTransform is otherwise unchanged.
3. When SchemaBatchOperator receives a CreateTableEvent, it only stores the event in its cache and emits nothing.
4. When SchemaBatchOperator receives the first DataChangeEvent, it deduces the widest downstream table structure from the router rules and executes the table creation statement in the external data source. It then sends the wide table structure to BatchPrePartition.
5. BatchPrePartition broadcasts the CreateTableEvent to BatchPostPartition, and partitions and distributes each DataChangeEvent to BatchPostPartition based on the table ID and primary key information.
6. BatchPostPartition forwards the CreateTableEvent and DataChangeEvent to BatchSink, and BatchSink performs batch writing.
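
Steps 3 and 4 (cache the schemas, then flush them when the first data event arrives) might look roughly like the sketch below. It is not the PR's SchemaBatchOperator: BatchSchemaBufferSketch is a hypothetical class, events are modeled as plain strings, and the route-based schema merging and external table creation are reduced to a comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the caching behavior of a batch schema operator. */
final class BatchSchemaBufferSketch {

    // Insertion-ordered cache so schemas are emitted in the order they arrived.
    private final Map<String, String> cachedSchemas = new LinkedHashMap<>();
    private boolean schemasEmitted = false;

    /** Step 3: a CreateTableEvent is cached and nothing is emitted downstream. */
    List<String> onCreateTable(String tableId, String schema) {
        cachedSchemas.put(tableId, schema);
        return new ArrayList<>();
    }

    /** Step 4: the first DataChangeEvent triggers emission of all cached schemas. */
    List<String> onDataChange(String tableId, String row) {
        List<String> output = new ArrayList<>();
        if (!schemasEmitted) {
            // A real operator would first merge schemas per route rule and
            // create the target tables in the external system (omitted here).
            cachedSchemas.forEach((id, schema) -> output.add("CREATE " + id + " " + schema));
            schemasEmitted = true;
        }
        output.add("INSERT " + tableId + " " + row);
        return output;
    }
}
```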

Implementation effect

Computing node 1: Source -> BatchPreTransform -> PostTransform -> SchemaBatchOperator -> BatchPrePartition
Computing node 2: BatchPostPartition -> BatchSink
Batch mode: Computing node 2 starts computing only after computing node 1 is completely finished.
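
The hand-off between the two computing nodes relies on the partitioning from step 5 of the implementation ideas: rows with the same table ID and primary key must land on the same downstream subtask. A minimal sketch of that idea, assuming a plain hash of table ID and key values (the hash function actually used by the PR is not shown in this conversation, and BatchPartitionSketch is a hypothetical name):

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration of partitioning by table ID and primary key. */
final class BatchPartitionSketch {

    /**
     * Returns the downstream subtask index for a row.
     * Math.floorMod keeps the result in [0, downstreamParallelism) even for negative hashes.
     */
    static int partitionFor(String tableId, List<Object> primaryKey, int downstreamParallelism) {
        int hash = Objects.hash(tableId, primaryKey);
        return Math.floorMod(hash, downstreamParallelism);
    }
}
```

Calling partitionFor twice with equal arguments always returns the same index, which is what keeps all rows of one key on one sink subtask.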

aiwenmo · 2024-12-25T14:59:47Z

Code implementation

Topology graph: Source -> PreTransform -> PostTransform -> SchemaBatchOperator-> PartitionBy(Batch) -> BatchSink

Add SchemaBatchOperator, which drops the handling of schema change events and removes the Coordinator.
Add RegularPrePartitionBatchOperator, which removes the SchemaEvolutionClient.
Add DataBatchSinkFunctionOperator and DataBatchSinkWriterOperator, which remove the SchemaEvolutionClient.
Remove SchemaRegistry in batch mode.

aiwenmo · 2024-12-27T17:04:07Z

The DataSource sends a CreateTableCompletedEvent after it has sent all CreateTableEvents.
Add CreateTableCompletedEvent to notify SchemaBatchOperator to merge all CreateTableEvents.

aiwenmo · 2024-12-31T01:16:42Z

During the test, a new bug was discovered and has been fixed. This PR relies on this fix. #3826

lvyanquan

Thanks @aiwenmo for this contribution, left some comments.

An end-to-end test would also be welcome.

...src/main/java/org/apache/flink/cdc/runtime/operators/schema/regular/SchemaBatchOperator.java

flink-cdc-common/src/main/java/org/apache/flink/cdc/common/pipeline/PipelineOptions.java

...src/main/java/org/apache/flink/cdc/runtime/operators/sink/DataBatchSinkFunctionOperator.java

...e/src/main/java/org/apache/flink/cdc/runtime/operators/sink/DataBatchSinkWriterOperator.java

...ain/java/org/apache/flink/cdc/connectors/mysql/source/reader/MySqlPipelineRecordEmitter.java

...rc/main/java/org/apache/flink/cdc/runtime/partitioning/RegularPrePartitionBatchOperator.java

...nector-mysql/src/main/java/org/apache/flink/cdc/connectors/mysql/source/MySqlDataSource.java

...ysql/src/main/java/org/apache/flink/cdc/connectors/mysql/factory/MySqlDataSourceFactory.java

...time/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaDerivator.java

...src/main/java/org/apache/flink/cdc/runtime/operators/sink/DataBatchSinkFunctionOperator.java

...e/src/main/java/org/apache/flink/cdc/runtime/operators/sink/DataBatchSinkWriterOperator.java

...rc/main/java/org/apache/flink/cdc/runtime/operators/transform/PreBatchTransformOperator.java

lvyanquan · 2025-03-12T12:14:51Z

I think an e2e test that runs in batch mode with the transform module is necessary to verify that the whole pipeline is runnable.

aiwenmo · 2025-03-12T12:17:56Z

Hi. I'm in the process of coding and testing.

...eline-e2e-tests/src/test/java/org/apache/flink/cdc/pipeline/tests/MySqlToDorisE2eITCase.java

...ain/java/org/apache/flink/cdc/connectors/mysql/source/reader/MySqlPipelineRecordEmitter.java

...nector-mysql-cdc/src/main/java/org/apache/flink/cdc/connectors/mysql/source/MySqlSource.java

...ain/java/org/apache/flink/cdc/connectors/mysql/source/reader/MySqlPipelineRecordEmitter.java

yuxiqian

Thanks for @aiwenmo's work, just left some comments.

My major concern is that I've seen a lot of code copied and pasted from streaming mode, which will make the code base much harder to maintain in the future. I would suggest extracting the common parts into an abstract parent class (for example, put the common partitioning logic in AbstractPrePartitionOperator) and extending it in StreamingPrePartitionOperator and BatchPrePartitionOperator.

Ignore that if we don't have enough time to finish it before 3.4.0.
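
The suggested refactoring could take roughly the following shape. This is a sketch under assumptions, not code from the PR: the class names carry a Sketch suffix to mark them as hypothetical, and the only mode-specific part modeled here is where primary key information comes from (a lookup function standing in for the schema registry in streaming mode, locally cached CreateTableEvents in batch mode).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Shared partitioning logic; subclasses only decide where schemas come from. */
abstract class AbstractPrePartitionSketch {
    private final int downstreamParallelism;

    AbstractPrePartitionSketch(int downstreamParallelism) {
        this.downstreamParallelism = downstreamParallelism;
    }

    /** Mode-specific: resolve the primary key columns of a table. */
    protected abstract List<String> primaryKeysOf(String tableId);

    /** Common: hash table ID plus primary key values onto a downstream subtask. */
    final int partition(String tableId, Map<String, Object> row) {
        List<Object> keyValues = new ArrayList<>();
        for (String key : primaryKeysOf(tableId)) {
            keyValues.add(row.get(key));
        }
        return Math.floorMod(31 * tableId.hashCode() + keyValues.hashCode(), downstreamParallelism);
    }
}

/** Streaming variant: asks an external lookup (assumed stand-in for a schema registry client). */
final class StreamingPrePartitionSketch extends AbstractPrePartitionSketch {
    private final Function<String, List<String>> schemaLookup;

    StreamingPrePartitionSketch(int parallelism, Function<String, List<String>> schemaLookup) {
        super(parallelism);
        this.schemaLookup = schemaLookup;
    }

    @Override
    protected List<String> primaryKeysOf(String tableId) {
        return schemaLookup.apply(tableId);
    }
}

/** Batch variant: relies only on schemas seen earlier in the data flow. */
final class BatchPrePartitionSketch extends AbstractPrePartitionSketch {
    private final Map<String, List<String>> cachedPrimaryKeys = new HashMap<>();

    BatchPrePartitionSketch(int parallelism) {
        super(parallelism);
    }

    void onCreateTable(String tableId, List<String> primaryKeys) {
        cachedPrimaryKeys.put(tableId, primaryKeys);
    }

    @Override
    protected List<String> primaryKeysOf(String tableId) {
        List<String> keys = cachedPrimaryKeys.get(tableId);
        if (keys == null) {
            throw new IllegalStateException("No CreateTableEvent seen for " + tableId);
        }
        return keys;
    }
}
```

With this split, the hashing code exists once, and both modes produce the same partition for the same row as long as they agree on the primary key columns.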

yuxiqian · 2025-03-19T01:39:18Z

flink-cdc-cli/src/test/resources/definitions/pipeline-definition-full.yaml

  parallelism: 4
  schema.change.behavior: evolve
  schema-operator.rpc-timeout: 1 h
+  batch-mode.enabled: false


minor: since batchMode is disabled by default, maybe we can turn it on here to verify if it could be enabled correctly?

Many parameters are not effective in batch mode, so "false" is written here.

flink-cdc-common/src/main/java/org/apache/flink/cdc/common/pipeline/PipelineOptions.java

flink-cdc-composer/src/main/java/org/apache/flink/cdc/composer/flink/FlinkPipelineComposer.java

yuxiqian · 2025-03-19T01:47:07Z

...omposer/src/main/java/org/apache/flink/cdc/composer/flink/translator/DataSinkTranslator.java

-        // TODO: Hard coding stream mode and checkpoint
-        boolean isBatchMode = false;
+        // TODO: Hard coding checkpoint
        boolean isCheckpointingEnabled = true;


Just curious, is it possible to enable checkpointing in batch mode?

Some sinks need to rely on checkpointing to complete the writing.

...time/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaDerivator.java

yuxiqian · 2025-03-19T02:06:56Z

...untime/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/TableIdRouter.java

        return TableId.parse(route.f1);
    }
+
+    public List<Set<TableId>> groupSourceTablesByRouteRule(Set<TableId> tableIdSet) {


I doubt if it's really a generic and reusable method, without corresponding JavaDocs and test cases. Maybe just write it as a for loop in SchemaDerivator?

It makes use of "routes". In the latest code, I've added documentation for it.
Now that I think about it, it can also be placed in SchemaDerivator. I'll give it a try tonight.

Sorry, that attempt didn't work out. The current approach is the better one.
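
For readers without the diff at hand, the sketch below shows one plausible reading of what a method with this name and signature does: one group of matched source tables per route rule. This is inferred from the name and return type only; the actual TableIdRouter implementation is not shown in this conversation, and RouteGroupingSketch, its string-based table IDs, and its regex patterns are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical illustration of grouping source tables by the route rule they match. */
final class RouteGroupingSketch {

    /** Returns one set per route pattern, containing the table IDs that fully match it. */
    static List<Set<String>> groupByRouteRule(List<String> sourcePatterns, Set<String> tableIds) {
        List<Set<String>> groups = new ArrayList<>();
        for (String regex : sourcePatterns) {
            Pattern pattern = Pattern.compile(regex);
            Set<String> matched = new TreeSet<>(); // sorted for deterministic output
            for (String tableId : tableIds) {
                if (pattern.matcher(tableId).matches()) {
                    matched.add(tableId);
                }
            }
            groups.add(matched);
        }
        return groups;
    }
}
```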

...src/main/java/org/apache/flink/cdc/runtime/operators/sink/DataSinkWriterOperatorFactory.java

...ser/src/main/java/org/apache/flink/cdc/composer/flink/translator/PartitioningTranslator.java

...rc/main/java/org/apache/flink/cdc/runtime/operators/transform/PreBatchTransformOperator.java

aiwenmo · 2025-03-21T01:21:55Z

PTAL @leonardBang @lvyanquan @yuxiqian

yuxiqian · 2025-03-26T08:58:20Z

Hi @aiwenmo, could you please rebase this PR with latest master when available?

Code style verifier has been updated to enforce JUnit 5 + AssertJ framework and these classes might need to be migrated:

JUnit 4 style test annotations should be changed to JUnit 5 equivalents
- org.junit.Test => org.junit.jupiter.api.Test
- @Before, @BeforeClass => @BeforeEach, @BeforeAll
- @After, @AfterClass => @AfterEach, @AfterAll
JUnit Assertions / Hamcrest Assertions are not allowed, including:
- org.junit.Assert
- org.junit.jupiter.api.Assertions
- org.hamcrest.*

Use org.assertj.core.api.Assertions instead.

aiwenmo · 2025-03-26T14:23:36Z

Hi. I have rebased this PR.

yuxiqian

Thanks for @aiwenmo's quick response.

...ser/src/main/java/org/apache/flink/cdc/composer/flink/translator/PartitioningTranslator.java

yuxiqian · 2025-03-28T02:06:00Z

...ysql/src/main/java/org/apache/flink/cdc/connectors/mysql/factory/MySqlDataSourceFactory.java

+        // Batch mode only supports StartupMode.SNAPSHOT.
+        Configuration pipelineConfiguration = context.getPipelineConfiguration();
+        if (pipelineConfiguration != null
+                && pipelineConfiguration.contains(PipelineOptions.PIPELINE_BATCH_MODE_ENABLED)) {
+            startupOptions = StartupOptions.snapshot();
+        }


Just a suggestion, what about throwing an exception explicitly if one enables batch mode with a non-snapshotting source? That would prevent some silent behavior change of startupOptions config.

Alternatively, we may add a method to DataSourceFactory like this to make things clearer:

    @PublicEvolving
    public interface DataSourceFactory extends Factory {

        /** Creates a {@link DataSource} instance. */
        DataSource createDataSource(Context context);

        /** Checks whether this {@link DataSource} can be created in batch mode. */
        boolean supportsBatchPipeline(Context context);
    }

and verify it while translating the pipeline job graph. WDYT?
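
The fail-fast check proposed in this comment could be sketched as follows. All names here (BatchAwareFactorySketch, FixedFactorySketch, BatchModeValidatorSketch) are hypothetical and self-contained; they do not use the real DataSourceFactory or Context types, and the exception type is an assumption.

```java
/** Hypothetical, simplified stand-in for a factory that can report batch support. */
interface BatchAwareFactorySketch {
    String identifier();

    boolean supportsBatchPipeline();
}

/** Trivial implementation used for illustration only. */
final class FixedFactorySketch implements BatchAwareFactorySketch {
    private final String identifier;
    private final boolean supportsBatch;

    FixedFactorySketch(String identifier, boolean supportsBatch) {
        this.identifier = identifier;
        this.supportsBatch = supportsBatch;
    }

    @Override
    public String identifier() {
        return identifier;
    }

    @Override
    public boolean supportsBatchPipeline() {
        return supportsBatch;
    }
}

/** Validation that would run while the pipeline is being translated. */
final class BatchModeValidatorSketch {

    /** Rejects the job explicitly instead of silently overriding the startup options. */
    static void validate(BatchAwareFactorySketch factory, boolean batchModeEnabled) {
        if (batchModeEnabled && !factory.supportsBatchPipeline()) {
            throw new IllegalArgumentException(
                    "Source '" + factory.identifier() + "' cannot run in batch mode with its current configuration.");
        }
    }
}
```

Failing at translation time surfaces the misconfiguration before any job is submitted, which is the behavior the comment asks for in place of the silent override.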

yuxiqian · 2025-03-28T02:09:11Z

...ain/java/org/apache/flink/cdc/connectors/mysql/source/reader/MySqlPipelineRecordEmitter.java

-        if (isLowWatermarkEvent(element) && splitState.isSnapshotSplitState()) {
+        if (StartupOptions.snapshot().equals(sourceConfig.getStartupOptions())) {
+            // In snapshot mode, we simply emit all schemas at once.
+            if (!alreadySendAllCreateTableTables) {


In snapshot mode, we will:

obtain and check startup mode

then check the flag

In any other modes, we will:

obtain and check startup mode

I would suggest renaming alreadySendAllCreateTableTables to shouldEmitAllCtesInSnapshotMode, setting it to true in snapshot mode and false in other modes, so that the check can be simplified to:

    if (shouldEmitAllCtesInSnapshotMode) {
        createTableEventCache.forEach(
                (tableId, createTableEvent) -> output.collect(createTableEvent));
        shouldEmitAllCtesInSnapshotMode = false;
    }

so we don't have to check the startup mode every time when we receive a SourceRecord.
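
A self-contained sketch of this flag-based pattern, with events modeled as strings and a plain list standing in for the output collector (CreateTableEmitterSketch is a hypothetical class, not the actual MySqlPipelineRecordEmitter):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical emitter that sends all cached CreateTableEvents once, before the first record. */
final class CreateTableEmitterSketch {
    private final Map<String, String> createTableEventCache;
    // Decided once at construction time, so the startup mode is never re-checked per record.
    private boolean shouldEmitAllCtesInSnapshotMode;

    CreateTableEmitterSketch(boolean snapshotMode, Map<String, String> createTableEventCache) {
        this.shouldEmitAllCtesInSnapshotMode = snapshotMode;
        this.createTableEventCache = new LinkedHashMap<>(createTableEventCache);
    }

    void processElement(String record, List<String> output) {
        if (shouldEmitAllCtesInSnapshotMode) {
            createTableEventCache.forEach((tableId, createTableEvent) -> output.add(createTableEvent));
            // Reset outside the loop: one assignment, regardless of how many tables exist.
            shouldEmitAllCtesInSnapshotMode = false;
        }
        output.add(record);
    }
}
```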

yuxiqian · 2025-03-28T02:14:29Z

...ain/java/org/apache/flink/cdc/connectors/mysql/source/reader/MySqlPipelineRecordEmitter.java

+                createTableEventCache.forEach(
+                        (tableId, createTableEvent) -> {
+                            output.collect(createTableEvent);
+                            alreadySendAllCreateTableTables = true;


nit: move alreadySendAllCreateTableTables = true out of the loop

yuxiqian · 2025-03-28T02:17:52Z

...pipeline-e2e-tests/src/test/java/org/apache/flink/cdc/pipeline/tests/TransformE2eITCase.java

I think TransformE2e and UdfE2e have nothing to do with batch mode. Shall we refactor the original cases with @ParameterizedTest, or just remove these cases to avoid code inflation and longer CI execution time?

yuxiqian · 2025-03-28T02:19:06Z

...time/src/main/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaDerivator.java

    }
+
+    /** Deduce merged CreateTableEvent in batch mode. */
+    public static List<CreateTableEvent> deduceMergedCreateTableEventInBatchMode(


Yes, but SchemaDerivator itself shouldn't be aware of streaming / batch execution mode. Similar initial deducing logic could be ported to streaming mode later.
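
Independent of execution mode, the core of deducing a merged CreateTableEvent is combining the columns of several source tables into one wide schema with a stable column order. The sketch below illustrates only that idea. MergedSchemaSketch is a hypothetical class, schemas are modeled as ordered maps from column name to type name, and the fallback to STRING on a type conflict is a simplifying assumption rather than the type-widening rule SchemaDerivator actually applies.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of merging several table schemas into one wide schema. */
final class MergedSchemaSketch {

    /**
     * Unions the columns of all source schemas.
     * LinkedHashMap keeps first-seen column order, so the result is deterministic.
     */
    static Map<String, String> mergeColumns(List<Map<String, String>> sourceSchemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> schema : sourceSchemas) {
            for (Map.Entry<String, String> column : schema.entrySet()) {
                // Assumption: conflicting types fall back to STRING.
                merged.merge(
                        column.getKey(),
                        column.getValue(),
                        (existing, incoming) -> existing.equals(incoming) ? existing : "STRING");
            }
        }
        return merged;
    }
}
```

Using an insertion-ordered map avoids column order that varies between runs, which would otherwise make the generated table definitions unstable.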

yuxiqian · 2025-03-28T02:20:39Z

...src/main/java/org/apache/flink/cdc/runtime/operators/schema/regular/SchemaBatchOperator.java

minor: use consistent names for new operator classes: either BatchXXXOperator (like BatchPreTransformOp) or XXXBatchOperator (SchemaBatchOperator).

Personally I prefer the former one since it is easy to distinguish it from normal Streaming operators.

yuxiqian · 2025-03-28T02:22:51Z

.../src/test/java/org/apache/flink/cdc/runtime/operators/schema/common/SchemaDerivatorTest.java

    }
+
+    @Test
+    void testDeduceMergedCreateTableEventInBatchMode() {


ditto, remove inBatchMode

leonardBang

Thanks @aiwenmo for the great work and all for the review, LGTM

This closes apache#3812 Co-authored-by: yuxiqian <34335406+yuxiqian@users.noreply.github.com>

github-actions bot added values-pipeline-connector composer common labels Dec 23, 2024

aiwenmo force-pushed the FLINK-36931 branch from 04b88da to 9b3b534 Compare December 24, 2024 15:54

github-actions bot added the runtime label Dec 25, 2024

github-actions bot added the mysql-pipeline-connector label Dec 26, 2024

aiwenmo force-pushed the FLINK-36931 branch from d50609f to 352951b Compare February 10, 2025 16:53

lvyanquan reviewed Mar 3, 2025

aiwenmo force-pushed the FLINK-36931 branch from 030f792 to 3da452c Compare March 3, 2025 13:56

lvyanquan reviewed Mar 11, 2025

github-actions bot added the cli label Mar 11, 2025

lvyanquan reviewed Mar 12, 2025

github-actions bot added the e2e-tests label Mar 12, 2025

lvyanquan reviewed Mar 13, 2025

...eline-e2e-tests/src/test/java/org/apache/flink/cdc/pipeline/tests/MySqlToDorisE2eITCase.java

...ain/java/org/apache/flink/cdc/connectors/mysql/source/reader/MySqlPipelineRecordEmitter.java

aiwenmo force-pushed the FLINK-36931 branch from c125a9e to c59a7f8 Compare March 13, 2025 12:59

github-actions bot added the mysql-cdc-connector label Mar 17, 2025

lvyanquan reviewed Mar 18, 2025

...nector-mysql-cdc/src/main/java/org/apache/flink/cdc/connectors/mysql/source/MySqlSource.java

lvyanquan reviewed Mar 18, 2025

...ain/java/org/apache/flink/cdc/connectors/mysql/source/reader/MySqlPipelineRecordEmitter.java

yuxiqian requested changes Mar 19, 2025

github-actions bot added the docs Improvements or additions to documentation label Mar 19, 2025

aiwenmo force-pushed the FLINK-36931 branch from 8262623 to ad298b2 Compare March 26, 2025 14:11

yuxiqian requested changes Mar 28, 2025

aiwenmo and others added 24 commits April 21, 2025 22:57

Fix test after rebased master

aefd646

Fix the random field order in merged tables.

a177d52

Remove CreateTableCompletedEvent

bf25db5

Remove the useless code

1ab0f40

add parse test

8f18e8b

Add e2e test

c6ebb46

Fix rebase

42b1cc7

Fix deduceMergedCreateTableEventInBatchMode

bce0c60

Fix groupSourceTablesByRouteRule

3360e49

Remove testSyncWholeDatabaseWithDebeziumJsonInBatchMode and testSyncWholeDatabaseWithCanalJsonInBatchMode

2e1b651

Fix multi-parallelism

3f989fa

Fix sink multi-parallelism

d5ad269

Optimize code

8deb837

Optimize code

dc536c6

Update TableIdRouter.java

fba8e48

Update OceanBaseE2eITCase.java

b53e63f

Add verifyBatchMode

18874a9

Add scan.startup.mode: snapshot into e2e test

574121a

add verifyBatchMode into test

266e6ff

[FLINK-36931-contrib] Simplify UdfE2eITCase using @ParameterizedTest (#12)

* nit: simplify UdfE2eITCase
* fix: ci

4ef41df

add some logs

e2be0cf

Modify batch-mode.enabled to runtime-mode

96f6341

add Javadoc comment

417eeb9

Optimize code

518feac

aiwenmo force-pushed the FLINK-36931 branch from ccf4558 to 518feac Compare April 21, 2025 15:03

leonardBang approved these changes Apr 22, 2025

github-actions bot added the approved label Apr 22, 2025

leonardBang merged commit 61702a6 into apache:master Apr 22, 2025
28 of 29 checks passed

morozov mentioned this pull request Apr 23, 2025

[FLINK-37604] Generate static UIDs for pipeline operators #3977

Merged

linjianchang pushed a commit to linjianchang/flink-cdc that referenced this pull request May 16, 2025

[FLINK-36931][cdc-runtime] Supports batch pipeline for CDC YAML

8a569a6

This closes apache#3812 Co-authored-by: yuxiqian <34335406+yuxiqian@users.noreply.github.com>
