KAFKA-16770; [1/N] Coalesce records into bigger batches #15964

Merged
dajac merged 3 commits into apache:trunk from dajac:KAFKA-16770
May 21, 2024
Conversation

@dajac
Member

@dajac dajac commented May 15, 2024

During large-scale performance tests, we discovered that the current write path of the new coordinator does not scale well. The issue is that each write operation writes synchronously from the coordinator threads. Coalescing records into bigger batches helps drastically because it amortizes the cost of writes. Aligning the batches with the snapshots of the timeline data structures also reduces the number of in-flight snapshots.

This patch is the first in a series that will bring record coalescing to the coordinator runtime. As a first step, we had to rework the PartitionWriter interface and move the logic that builds MemoryRecords from it to the CoordinatorRuntime. The main changes are in these two classes; the others are related mechanical changes.
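
As a rough illustration of that split, and not the actual Kafka interfaces, the sketch below shows a runtime that builds the batch itself and a writer that only appends the already-built bytes. All names (`BatchWriter`, `InMemoryBatchWriter`, `SketchRuntime`) are hypothetical, a `byte[]` stands in for MemoryRecords, and the newline-joined encoding is an assumption made only to keep the example small.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical writer: it receives an already-built batch and only appends it. */
interface BatchWriter {
    /** Appends the batch and returns the offset following the last appended record. */
    long append(String partition, byte[] batch, int recordCount);
}

/** In-memory writer that keeps the raw batches, so a test can inspect them. */
final class InMemoryBatchWriter implements BatchWriter {
    final List<byte[]> batches = new ArrayList<>();
    private long endOffset = 0L;

    @Override
    public long append(String partition, byte[] batch, int recordCount) {
        batches.add(batch);
        endOffset += recordCount;
        return endOffset;
    }
}

/** Hypothetical runtime: it owns the logic that turns records into a batch. */
final class SketchRuntime {
    private final BatchWriter writer;

    SketchRuntime(BatchWriter writer) {
        this.writer = writer;
    }

    /** Builds one batch from all records of a write operation, then hands it to the writer. */
    long write(String partition, List<String> records) {
        // Assumption: a newline-joined UTF-8 string stands in for the real batch format.
        byte[] batch = String.join("\n", records).getBytes(StandardCharsets.UTF_8);
        return writer.append(partition, batch, records.size());
    }
}
```

Because the runtime owns the batch construction in this arrangement, it is also the natural place to later keep a batch open across several write operations.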

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dajac dajac added the KIP-848 The Next Generation of the Consumer Rebalance Protocol label May 15, 2024
@dajac dajac requested a review from jolshan May 15, 2024 14:48
MockTimer timer = new MockTimer();
// The partition writer only accept on write.
MockPartitionWriter writer = new MockPartitionWriter(2);
// The partition writer only accept one write.
Member

@jolshan jolshan May 16, 2024

For my understanding: we always batched the (in this case 2) records that were part of the same write operation. For now we aren't changing this, but we are moving the logic to the coordinator runtime to make space for the batching logic as a follow-up?

Member Author

You got it right. A write operation produces a single batch with all the records generated by it. This patch does not change that, but it changes where the MemoryRecords are built. The next patch will add the logic to keep the batch open until it is full or until a linger time is reached. With this, records produced by many write operations will end up in the same batch.
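
To make the "open until full or until a linger time is reached" idea concrete, here is a minimal sketch. It is not the CoordinatorRuntime implementation: `LingerBatcher` and its methods are hypothetical, the batch limit is counted in records rather than bytes, and time is passed in explicitly instead of being driven by a timer.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a batch that stays open across write operations.
 * It is flushed when it is full or when the linger time has elapsed.
 */
final class LingerBatcher {
    private final int maxBatchSize; // assumption: limit in records, not bytes
    private final long lingerMs;    // how long a batch may stay open; 0 disables lingering
    private final List<String> open = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();
    private long openedAtMs = -1L;

    LingerBatcher(int maxBatchSize, long lingerMs) {
        this.maxBatchSize = maxBatchSize;
        this.lingerMs = lingerMs;
    }

    /** Appends all records of one write operation to the open batch. */
    void write(List<String> records, long nowMs) {
        // Close a batch whose linger time expired before this write arrived.
        maybeFlush(nowMs);
        // Keep the records of one write operation together in a single batch.
        if (!open.isEmpty() && open.size() + records.size() > maxBatchSize) {
            flush();
        }
        if (open.isEmpty()) {
            openedAtMs = nowMs;
        }
        open.addAll(records);
        if (open.size() >= maxBatchSize) {
            flush(); // the batch is full
        } else {
            maybeFlush(nowMs); // flushes immediately when lingerMs is 0
        }
    }

    /** Flushes the open batch if it has been open for at least the linger time. */
    void maybeFlush(long nowMs) {
        if (!open.isEmpty() && nowMs - openedAtMs >= lingerMs) {
            flush();
        }
    }

    private void flush() {
        flushed.add(new ArrayList<>(open));
        open.clear();
        openedAtMs = -1L;
    }

    /** Batches flushed so far, oldest first. */
    List<List<String>> flushedBatches() {
        return flushed;
    }
}
```

With a linger time of 10 ms, two single-record writes at 0 ms and 5 ms share one batch, which is flushed by a `maybeFlush` call at 10 ms. With a linger time of zero, every write operation is flushed as its own batch, which matches the behaviour of this patch.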

@jolshan
Member

jolshan commented May 16, 2024

I took a first pass to get a general understanding. I will come back tomorrow and take a deeper dive into some of the minor changes and let you know if I think of anything missed.

result
VerificationGuard.SENTINEL,
MemoryRecords.withEndTransactionMarker(
time.milliseconds(),
Member

It seems we didn't specify this time value before. Was that a bug? I guess it also just gets the system time in the method.

Member Author

withEndTransactionMarker takes the current time if we don't specify it. The reason why I set it explicitly here is to ensure that the mock time is used in tests.
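
The general pattern can be sketched without any Kafka dependency. In the example below, `Clock`, `ManualClock` and `Marker` are hypothetical stand-ins (they are not Kafka's Time, MockTime or MemoryRecords); the point is only that an overload defaulting to the system clock cannot be controlled by a test, while an explicit timestamp taken from an injected clock can.

```java
/** Hypothetical clock abstraction, standing in for an injectable time source. */
interface Clock {
    long milliseconds();
}

/** Test clock that only advances when told to. */
final class ManualClock implements Clock {
    private long nowMs;

    ManualClock(long startMs) {
        this.nowMs = startMs;
    }

    @Override
    public long milliseconds() {
        return nowMs;
    }

    void sleep(long ms) {
        nowMs += ms;
    }
}

/** Hypothetical marker record that carries a timestamp. */
final class Marker {
    final long timestampMs;

    private Marker(long timestampMs) {
        this.timestampMs = timestampMs;
    }

    /** Convenience overload: falls back to wall-clock time, which tests cannot control. */
    static Marker create() {
        return new Marker(System.currentTimeMillis());
    }

    /** Explicit overload: the caller supplies the timestamp, e.g. from an injected clock. */
    static Marker create(long timestampMs) {
        return new Marker(timestampMs);
    }
}
```

A test that calls `Marker.create(clock.milliseconds())` with a `ManualClock` started at 1000 always sees a timestamp of 1000, so the produced records can be compared exactly against expected ones.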

byte magic = logConfig.recordVersion().value;
int maxBatchSize = logConfig.maxMessageSize();
long currentTimeMs = time.milliseconds();
ByteBuffer buffer = context.bufferSupplier.get(Math.min(16384, maxBatchSize));
Member

Nice we got rid of the thread local. 👍

// coordinator is the single writer to the underlying partition so we can
// deduce it like this.
for (int i = 0; i < result.records().size(); i++) {
MemoryRecordsBuilder builder = MemoryRecords.builder(
Member

nit: is there a benefit from putting this here and not right before the append method?

Member Author

The builder is used in the above loop (L801) so we need it here.

/**
* Listener allowing to listen to high watermark changes. This is meant
* to be used in conjunction with {{@link PartitionWriter#append(TopicPartition, List)}}.
* to be used in conjunction with {{@link PartitionWriter#append(TopicPartition, VerificationGuard, MemoryRecords)}}.
Member

Is there a programmatic way to check if these links are broken due to refactoring, or do you need to do it manually?

Just wondering if there is an easy way to check you did them all :)

Member Author

IntelliJ reports them as warnings. I suppose that we would get warnings when we generate the Javadoc too.

*/
public class InMemoryPartitionWriter<T> implements PartitionWriter<T> {

public static class LogEntry {
Member

nice that we could just use the real memory records

}

@Test
def testWriteRecords(): Unit = {
Member

Do we have an equivalent test for the writing of the records in CoordinatorRuntimeTest? I didn't really notice new tests, but saw we have some of the builder logic there. Is it tested by checking equality between the records generated by the helper methods and the output from running the CoordinatorRuntime code?

Member Author

Right. We have many tests in CoordinatorRuntimeTest doing writes. As we fully validate the records now, they cover this.

@dajac
Member Author

dajac commented May 17, 2024

@jolshan Thanks for your comments. I replied to them.

@dajac dajac merged commit b4c2d66 into apache:trunk May 21, 2024
@dajac dajac deleted the KAFKA-16770 branch May 21, 2024 06:47
rreddy-22 pushed a commit to rreddy-22/kafka-rreddy that referenced this pull request May 24, 2024
Reviewers: Justine Olshan <jolshan@confluent.io>
TaiJuWu pushed a commit to TaiJuWu/kafka that referenced this pull request Jun 8, 2024
Reviewers: Justine Olshan <jolshan@confluent.io>
dajac added a commit that referenced this pull request Jun 12, 2024
This patch is the continuation of #15964. It introduces record coalescing to the CoordinatorRuntime. It also introduces a new configuration, `group.coordinator.append.linger.ms`, which allows administrators to choose the linger time or to disable lingering by setting it to zero. The new configuration defaults to 10ms.

Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>
gongxuanzhang pushed a commit to gongxuanzhang/kafka that referenced this pull request Jun 12, 2024
Reviewers: Justine Olshan <jolshan@confluent.io>
gongxuanzhang pushed a commit to gongxuanzhang/kafka that referenced this pull request Jun 12, 2024
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>
apourchet added a commit to apourchet/kafka that referenced this pull request Jun 12, 2024
commit 9368ef8
Author: Gantigmaa Selenge <39860586+tinaselenge@users.noreply.github.com>
Date:   Wed Jun 12 16:04:24 2024 +0100

    KAFKA-16865: Add IncludeTopicAuthorizedOperations option for DescribeTopicPartitionsRequest (apache#16136)

    Reviewers: Mickael Maison <mickael.maison@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Calvin Liu <caliu@confluent.io>, Andrew Schofield <andrew_schofield@live.com>, Apoorv Mittal <amittal@confluent.io>

commit 46eb081
Author: gongxuanzhang <gongxuanzhang@foxmail.com>
Date:   Wed Jun 12 22:23:39 2024 +0800

    KAFKA-10787 Apply spotless to log4j-appender, trogdor, jmh-benchmarks, examples, shell and generator (apache#16296)

    Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

commit 79b9c44
Author: gongxuanzhang <gongxuanzhang@foxmail.com>
Date:   Wed Jun 12 22:19:47 2024 +0800

    KAFKA-10787 Apply spotless to connect module (apache#16299)

    Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

commit b5fb654
Author: Abhijeet Kumar <abhijeet.cse.kgp@gmail.com>
Date:   Wed Jun 12 19:47:46 2024 +0530

    KAFKA-15265: Dynamic broker configs for remote fetch/copy quotas (apache#16078)

    Reviewers: Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>

commit faee6a4
Author: Dmitry Werner <grimekillah@gmail.com>
Date:   Wed Jun 12 15:44:11 2024 +0500

    MINOR: Use predetermined dir IDs in ReplicationQuotasTest

    Use predetermined directory IDs instead of Uuid.randomUuid() in ReplicationQuotasTest.

    Reviewers: Igor Soarez <soarez@apple.com>

commit 638844f
Author: David Jacot <djacot@confluent.io>
Date:   Wed Jun 12 08:29:50 2024 +0200

    KAFKA-16770; [2/2] Coalesce records into bigger batches (apache#16215)

    This patch is the continuation of apache#15964. It introduces record coalescing to the CoordinatorRuntime. It also introduces a new configuration, `group.coordinator.append.linger.ms`, which allows administrators to choose the linger time or to disable lingering by setting it to zero. The new configuration defaults to 10ms.

    Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>

commit 39ffdea
Author: Bruno Cadonna <cadonna@apache.org>
Date:   Wed Jun 12 07:51:38 2024 +0200

    KAFKA-10199: Enable state updater by default (apache#16107)

    We have already enabled the state updater by default once.
    However, we ran into issues that forced us to disable it again.
    We think that we fixed those issues. So we want to enable the
    state updater again by default.

    Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Matthias J. Sax <matthias@confluent.io>

commit 0782232
Author: Antoine Pourchet <antoine@responsive.dev>
Date:   Tue Jun 11 22:31:43 2024 -0600

    KAFKA-15045: (KIP-924 pt. 22) Add RackAwareOptimizationParams and other minor TaskAssignmentUtils changes (apache#16294)

    We now provide a way to more easily customize the rack aware
    optimizations that we provide by way of a configuration class called
    RackAwareOptimizationParams.

    We also simplified the APIs for the optimizeXYZ utility functions since
    they were mutating the inputs anyway.

    Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>

commit 226ac5e
Author: Murali Basani <muralidhar.basani@aiven.io>
Date:   Wed Jun 12 05:38:50 2024 +0200

    KAFKA-16922 Adding unit tests for NewTopic  (apache#16255)

    Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

commit 23fe71d
Author: Abhijeet Kumar <abhijeet.cse.kgp@gmail.com>
Date:   Wed Jun 12 06:27:02 2024 +0530

    KAFKA-15265: Integrate RLMQuotaManager for throttling copies to remote storage (apache#15820)

    - Added the integration of the quota manager to throttle copy requests to the remote storage. Reference KIP-956
    - Added unit-tests for the copy throttling logic.

    Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>

commit 2fa2c72
Author: Chris Egerton <chrise@aiven.io>
Date:   Tue Jun 11 23:15:07 2024 +0200

    MINOR: Wait for embedded clusters to start before using them in Connect OffsetsApiIntegrationTest (apache#16286)

    Reviewers: Greg Harris <greg.harris@aiven.io>
Labels

KIP-848 The Next Generation of the Consumer Rebalance Protocol

2 participants