@LsomeYeah
Contributor

Purpose

Linked issue: close #xxx

Add a coordinator node to the small changelog file compaction pipeline. The coordinator decides how to concatenate small changelog files into one or more result files of the target file size. Also add a worker node that merges those small files.
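
As a rough illustration of the coordinator's decision, the sketch below greedily packs file sizes into groups and closes a group once it reaches the target size. This is an assumption about one possible grouping strategy, not the logic in this PR; the class name `ChangelogGroupingSketch`, the method `group`, and the use of plain `Long` sizes in place of `DataFileMeta` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: group small file sizes into batches of roughly targetSize bytes. */
public class ChangelogGroupingSketch {

    /** Greedily packs sizes (in bytes) into groups; a group is closed once it reaches targetSize. */
    public static List<List<Long>> group(List<Long> fileSizes, long targetSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            current.add(size);
            currentSize += size;
            if (currentSize >= targetSize) {
                // The group is large enough to become one result file.
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            // Leftover files form a final, smaller group.
            groups.add(current);
        }
        return groups;
    }
}
```

Each returned group would correspond to one task handed to a worker, which is why the result can be one file or several.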

Tests

API and Format

Documentation

tsreaper and others added 2 commits October 25, 2024 14:09
# Conflicts:
#	paimon-flink/paimon-flink-common/src/test/java/org/apache/paimon/flink/PrimaryKeyFileStoreTableITCase.java
}

private void emitPartitionChangelogCompactTask(BinaryRow partition) {
PartitionChangelog partitionChangelog = partitionChangelogs.get(partition);
Contributor

Can partitionChangelog be null here?

private final Map<Integer, List<DataFileMeta>> newFileChangelogFiles;
private final Map<Integer, List<DataFileMeta>> compactChangelogFiles;

public long totalFileSize() {
Contributor

This method is never called. Delete it?


private static class PartitionChangelog {
private long totalFileSize;
private final Map<Integer, List<DataFileMeta>> newFileChangelogFiles;
Contributor

Rename this to newChangelogFiles.

partitionChangelogs.remove(partition);
}

private void emitAllPartitionsChanglogCompactTask() {
Contributor

partitionChangelogs.keySet().forEach(this::emitPartitionChangelogCompactTask);
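
One point to check with this suggestion: the hunk above ends with `partitionChangelogs.remove(partition)`, which appears to belong to the per-partition emit method. If `partitionChangelogs` is a plain `HashMap` (an assumption, the declaration is not shown here), removing entries while iterating over `keySet()` throws `ConcurrentModificationException`. A minimal sketch that iterates over a snapshot of the keys instead, with `String` and `Long` standing in for `BinaryRow` and `PartitionChangelog` and all class and method names hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: emit every buffered partition while removing entries safely. */
public class EmitAllSketch {
    // String keys and Long values stand in for BinaryRow and PartitionChangelog.
    private final Map<String, Long> partitionChangelogs = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    public void add(String partition, long size) {
        partitionChangelogs.merge(partition, size, Long::sum);
    }

    private void emitPartition(String partition) {
        Long changelog = partitionChangelogs.get(partition);
        if (changelog == null) {
            return; // nothing buffered for this partition
        }
        emitted.add(partition);
        partitionChangelogs.remove(partition);
    }

    public void emitAll() {
        // Iterate over a copy of the keys, because emitPartition removes entries from the map.
        new ArrayList<String>(partitionChangelogs.keySet()).forEach(this::emitPartition);
    }

    public List<String> emitted() {
        return emitted;
    }

    public boolean isEmpty() {
        return partitionChangelogs.isEmpty();
    }
}
```

The null check in `emitPartition` also covers the question raised on the `partitionChangelogs.get(partition)` line.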

public class ChangelogCompactTask implements Serializable {
private final long checkpointId;
private final BinaryRow partition;
private final Map<Integer, List<DataFileMeta>> newFileChangelogFiles;
Contributor

Rename this to newChangelogFiles.

public List<Committable> doCompact(FileStoreTable table) throws Exception {
FileStorePathFactory pathFactory = table.store().pathFactory();

// copy all changelog files to a new big file
Contributor

The two for loops share a lot of the same code. You can avoid the duplication.
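
One way to do this, sketched under the assumption that the two loops differ only in which map they iterate and in the `isCompactResult` flag passed to `copyFile`: move the loop body into a helper. `String` stands in for `DataFileMeta`, and `CopyLoopSketch` and `copyAll` are hypothetical names, not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one shared loop for both changelog file maps. */
public class CopyLoopSketch {
    private final List<String> copied = new ArrayList<>();

    // Stand-in for the real copy step; it only records what would be copied.
    private void copyFile(int bucket, boolean isCompactResult, String fileName) {
        copied.add(bucket + ":" + isCompactResult + ":" + fileName);
    }

    // Shared loop body; only the source map and the flag differ between the two calls.
    private void copyAll(Map<Integer, List<String>> files, boolean isCompactResult) {
        for (Map.Entry<Integer, List<String>> entry : files.entrySet()) {
            int bucket = entry.getKey();
            for (String file : entry.getValue()) {
                copyFile(bucket, isCompactResult, file);
            }
        }
    }

    public List<String> doCompact(
            Map<Integer, List<String>> newFileChangelogFiles,
            Map<Integer, List<String>> compactChangelogFiles) {
        copyAll(newFileChangelogFiles, false);
        copyAll(compactChangelogFiles, true);
        return copied;
    }
}
```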


// copy all changelog files to a new big file
for (Map.Entry<Integer, List<DataFileMeta>> entry : newFileChangelogFiles.entrySet()) {
Integer bucket = entry.getKey();
Contributor

Use int instead of Integer.

+ CompactedChangelogReadOnlyFormat.getIdentifier(
baseResult.meta.fileFormat())));

List<Committable> newCommittables = new ArrayList<>();
Contributor

List<Committable> newCommittables = new ArrayList<>(bucketedResults.entrySet().size());

import org.apache.flink.api.common.typeutils.TypeSerializer;

/** Type information for {@link ChangelogCompactTask}. */
public class ChangelogTaskTypeInfo extends TypeInformation<ChangelogCompactTask> {
Contributor

Add a blank line here.

private void copyFile(
FileStoreTable table, Path path, int bucket, boolean isCompactResult, DataFileMeta meta)
throws Exception {
if (outputStream == null) {
Contributor

copyFile is only called in doCompact, so outputStream can be a local variable instead of a class member.
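
A minimal sketch of that shape, assuming the stream can be opened once in `doCompact` and handed to `copyFile` as a parameter. It uses a `ByteArrayOutputStream` and raw byte arrays instead of the real file system stream and `DataFileMeta`; `LocalStreamSketch` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;

/** Hypothetical sketch: the output stream is local to doCompact, not a class member. */
public class LocalStreamSketch {

    // The stream is passed in, so no field is needed.
    private static void copyFile(OutputStream out, byte[] content) throws IOException {
        out.write(content);
    }

    public static byte[] doCompact(List<byte[]> files) {
        // The stream lives only for this call; try-with-resources closes it on every path.
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            for (byte[] file : files) {
                copyFile(out, file);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the stream local also removes the `outputStream == null` check, since the stream's lifetime is tied to a single `doCompact` call.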

assertThat(compactedChangelogs2).hasSize(2);
assertThat(listAllFilesWithPrefix("changelog-")).isEmpty();

// write update data
Contributor

nit: extra spaces

+ "'changelog-producer' = 'lookup', "
+ "'lookup-wait' = '%s', "
+ "'deletion-vectors.enabled' = '%s', "
+ "'changelog.compact.parallelism' = '%s'",
Contributor

What is this table option? Also, why did you change the write buffer size?

3 participants