[flink] Copy bytes with multiple threads when performing precommit compact for changelogs #4907

tsreaper · 2025-01-14T08:21:50Z

Purpose

In #4380 we introduced pre-commit compaction for changelog files. Multiple changelog files from the same partition are merged into one big file by a single parallel instance of the worker operator, to decrease the number of small files.

However, when the number of changelog files to merge is large (even though each file is small), the copying process is slow, because opening that many files from the filesystem takes a lot of time.

In this PR, we add a thread pool to the worker operator, so that when performing pre-commit compact for changelogs, we can copy the bytes with multiple threads, thus speeding up the process.
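The idea can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: `ParallelCopySketch` and `mergeFiles` are hypothetical names, and local files via `java.nio.file` stand in for Paimon's `FileIO`. Each small file is read by a pooled thread, and the results are concatenated in submission order.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch (not Paimon code): read many small files in parallel, merge in order. */
final class ParallelCopySketch {

    static byte[] mergeFiles(List<Path> files, int numThreads)
            throws IOException, InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        try {
            // Submit one read task per file so that opening the files overlaps across threads.
            List<Future<byte[]>> pending = new ArrayList<>();
            for (Path file : files) {
                pending.add(executor.submit(() -> Files.readAllBytes(file)));
            }
            // Collect results in submission order so the merged output is deterministic.
            ByteArrayOutputStream merged = new ByteArrayOutputStream();
            for (Future<byte[]> future : pending) {
                try {
                    merged.write(future.get());
                } catch (ExecutionException e) {
                    throw new IOException("Failed to read a small file", e.getCause());
                }
            }
            return merged.toByteArray();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that this sketch keeps every file's bytes in memory until they are merged, so it only makes sense when the total size of the inputs is bounded.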

Tests

Existing IT cases should cover this change. This PR also adds a unit test for the coordinator operator.

API and Format

No format changes.

Documentation

Documentation is also updated.

wwj6591812 · 2025-01-14T14:07:58Z

...n-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/FlinkConnectorOptions.java

+                    .intType()
+                    .noDefaultValue()
+                    .withDescription(
+                            "Maximum number of threads to copy bytes form small changelog files. "


form -> from

wwj6591812 · 2025-01-14T14:08:37Z

.../src/main/java/org/apache/paimon/flink/compact/changelog/ChangelogCompactWorkerOperator.java

+        int numThreads =
+                options.getOptional(FlinkConnectorOptions.CHANGELOG_PRECOMMIT_COMPACT_THREAD_NUM)
+                        .orElse(Runtime.getRuntime().availableProcessors());
+        LOG.info("Creating thread poll of size {} for changelog compaction.", numThreads);


poll -> pool

wwj6591812 · 2025-01-14T15:08:58Z

...ink-common/src/main/java/org/apache/paimon/flink/compact/changelog/ChangelogCompactTask.java

+        private void readFully() {
+            try {
+                result = IOUtils.readFully(table.fileIO().newInputStream(path), true);
+                table.fileIO().deleteQuietly(path);


If the job fails over after `table.fileIO().deleteQuietly(path);` but before all files have been copied into the new big file, is there a risk of file loss here?

tsreaper · 2025-01-15

If the job fails, no changelog will be committed, so there is no risk of file loss.

wwj6591812 · 2025-01-15T12:41:29Z

+1

JingsongLi · 2025-01-15T14:20:20Z

...ink-common/src/main/java/org/apache/paimon/flink/compact/changelog/ChangelogCompactTask.java

+                ThreadPoolUtils.randomlyExecuteSequentialReturn(
+                        executor,
+                        t -> {
+                            // Total lengths of all bytes will not exceed `targetFileSize * 2`,


I feel it is better to use workers, a queue, and a consumer. Even with a maximum of `targetFileSize * 2`, if the target file size is 1 GB, this is still too large.

Workers and a queue are safer.
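A minimal sketch of the workers-plus-queue pattern suggested here, under stated assumptions: `BoundedQueueCopySketch`, `Chunk`, and `copy` are hypothetical names, local files stand in for Paimon's `FileIO`, and this is not the implementation that was merged. Worker threads read files and hand them to a bounded `BlockingQueue`; a single consumer appends each file to the output and records its start offset. At most `queueCapacity + numWorkers` files are buffered in memory at any time, regardless of the target file size.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch (not Paimon code): workers fill a bounded queue, one consumer writes. */
final class BoundedQueueCopySketch {

    /** One fully read small file, or the failure that occurred while reading it. */
    private static final class Chunk {
        final Path source;
        final byte[] bytes;
        final Exception error;

        Chunk(Path source, byte[] bytes, Exception error) {
            this.source = source;
            this.bytes = bytes;
            this.error = error;
        }
    }

    /** Copies all files into {@code out} and returns the start offset of each file. */
    static Map<Path, Long> copy(
            List<Path> files, OutputStream out, int numWorkers, int queueCapacity)
            throws IOException, InterruptedException {
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        try {
            for (Path file : files) {
                workers.execute(() -> {
                    Chunk chunk;
                    try {
                        chunk = new Chunk(file, Files.readAllBytes(file), null);
                    } catch (Exception e) {
                        // Report the failure to the consumer instead of leaving it waiting.
                        chunk = new Chunk(file, null, e);
                    }
                    try {
                        queue.put(chunk); // Blocks while the queue is full (back-pressure).
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Single consumer: files arrive in completion order, so offsets are recorded.
            Map<Path, Long> offsets = new LinkedHashMap<>();
            long position = 0;
            for (int i = 0; i < files.size(); i++) {
                Chunk chunk = queue.take();
                if (chunk.error != null) {
                    throw new IOException("Failed to read " + chunk.source, chunk.error);
                }
                offsets.put(chunk.source, position);
                out.write(chunk.bytes);
                position += chunk.bytes.length;
            }
            return offsets;
        } finally {
            workers.shutdownNow(); // Also interrupts workers blocked on a full queue.
        }
    }
}
```

Because files reach the consumer in completion order rather than submission order, the sketch returns an offset per source file instead of relying on a fixed layout.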

JingsongLi · 2025-04-06T11:27:04Z

+1

wwj6591812 reviewed Jan 14, 2025

JingsongLi reviewed Jan 15, 2025

Commit b299078: [flink] Copy bytes with multiple threads when performing precommit compact for changelogs

JingsongLi merged commit 241ac76 into apache:master Apr 6, 2025
19 checks passed
