
KAFKA-17705: Add Transactions V2 system tests and mark as production ready #18132

Merged
jolshan merged 7 commits into apache:trunk from jolshan:kafka-17705
Dec 21, 2024

Conversation

@jolshan
Member

@jolshan jolshan commented Dec 10, 2024

Added transaction version 2 to some of the system tests. Also marking TV2 as production ready.
Will share the results of the tests when I get them.

@github-actions github-actions bot added the triage (PRs from the community), core (Kafka Broker), producer, clients, and small (Small PRs) labels Dec 10, 2024
@mumrah
Member

mumrah commented Dec 11, 2024

@jolshan #17881 adds a "triage" label to PRs from non-committers. It turns out this also affects committers if their membership visibility in the ASF GitHub org is not public. I added instructions for setting your membership visibility to public: https://github.com/apache/kafka/blob/trunk/.github/workflows/README.md#pr-triage

@mumrah mumrah removed the triage (PRs from the community) label Dec 11, 2024
@github-actions github-actions bot added the kraft label Dec 11, 2024
          MetadataVersion.latestProduction().featureLevel()));
  for (Feature feature : Feature.PRODUCTION_FEATURES) {
-     short maxVersion = enableUnstable ? feature.latestTesting() : feature.latestProduction();
+     short maxVersion = enableUnstable ? feature.latestTesting() : feature.defaultLevel(MetadataVersion.LATEST_PRODUCTION);
Member Author

@junrao @dongnuo123 I noticed we didn't change the defaults here on the previous PR. I have done so here. A test was failing since the production version for transaction version is now not the same as the default based on the latest production MV.

Contributor

@jolshan : Not sure that I understand this change. The result of defaultFeatureMap is used for Controller/Broker registration. So, it seems that we should pass in the max supported version of each feature, instead of the default version, right? In fact, defaultFeatureMap should be renamed to something like supportedFeatureMap.

> A test was failing since the production version for transaction version is now not the same as the default based on the latest production MV.

Hmm, I thought that with #17886, it's ok for the latest production version for TV to be different from the default. It just needs to be larger.

Member Author

Yes. My understanding from #17886 was that we want a separate production vs default value.

I thought these methods were also meant to create the default features, not the max supported ones. It's my bad if I misunderstood that. I will take another look and if that is the case, fix the test.

Member Author

Updated with the changes to the name and the test
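
To make the distinction discussed in this thread concrete, here is a minimal Python sketch (not Kafka's Java code) of a feature that carries separate default and latest-production levels, and of the two maps that can be built from it. The `Feature` dataclass, its field names, both helper functions, and the example levels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a versioned feature; not Kafka's Feature class."""
    name: str
    default_level: int      # level assumed to be chosen by default
    latest_production: int  # highest level considered production ready
    latest_testing: int     # highest level when unstable versions are enabled


def supported_feature_map(features, enable_unstable=False):
    """Max level of each feature, as a node might advertise at registration."""
    return {
        f.name: (f.latest_testing if enable_unstable else f.latest_production)
        for f in features
    }


def default_feature_map(features):
    """Default level of each feature, which may be lower than the max."""
    return {f.name: f.default_level for f in features}


# Illustrative levels only (assumption): default below latest production.
FEATURES = [
    Feature("transaction.version", default_level=0,
            latest_production=2, latest_testing=2),
]
```

With these assumed levels the two maps differ, which is the situation the thread describes: the latest production level only needs to be at least as large as the default.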

@github-actions github-actions bot removed the small (Small PRs) label Dec 11, 2024
Contributor

@junrao junrao left a comment

@jolshan : Thanks for the updated PR. Just a minor comment.

latestFinalizedFeaturesEpoch = info.finalizedFeaturesEpoch;
Short transactionVersion = info.finalizedFeatures.get("transaction.version");
isTransactionV2Enabled = transactionVersion != null && transactionVersion >= 2;
log.debug("Updating isTV2 enabled to {} at with FinalizedFeaturesEpoch {}", isTransactionV2Enabled, latestFinalizedFeaturesEpoch);
Contributor

at with FinalizedFeaturesEpoch => with FinalizedFeaturesEpoch

Member Author

oops. good catch :)
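
For readers following the snippet under review, its null-safe check can be sketched in Python as below. This only illustrates the logic, assuming the finalized features arrive as a plain dict; the function name is hypothetical and does not exist in Kafka.

```python
def is_transaction_v2_enabled(finalized_features):
    """Return True only when transaction.version is finalized at level 2 or higher."""
    # .get returns None when the feature is absent, mirroring the null check.
    version = finalized_features.get("transaction.version")
    return version is not None and version >= 2
```

A missing entry and a level below 2 are both treated as "not enabled", so callers never compare against a missing value.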

@jolshan
Member Author

jolshan commented Dec 12, 2024

Contributor

@junrao junrao left a comment

@jolshan : Thanks for the updated PR. Is the unit test failure related? Also, it seems that a bunch of system tests timed out.

@jolshan
Member Author

jolshan commented Dec 12, 2024

Thanks Jun. Taking a look.

@jolshan
Member Author

jolshan commented Dec 12, 2024

Here's the new run of just the changed tests: https://confluent-open-source-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/trunk/2024-12-12--001.65845de4-1d44-4e7d-8d23-ac15c9440cb2--1733986841--jolshan--kafka-17705--92dfdf6028/report.html

It is still a bit flaky even with the timeout increased. I will look at that. I also need to see whether the consumer failures are unique to this PR or were already in trunk at the time I branched.

@jolshan
Member Author

jolshan commented Dec 12, 2024

The unit test failure does not seem related.

@jolshan
Member Author

jolshan commented Dec 12, 2024

#18036 explains the main divergence from the trunk failures. I do see some issues with fencing in the changes to the tests, so I will continue to investigate.

@jolshan
Member Author

jolshan commented Dec 13, 2024

System tests uncovered a bug! Will fix that and come back here :) https://issues.apache.org/jira/browse/KAFKA-18227

@jolshan
Member Author

jolshan commented Dec 19, 2024

I've merged #18176, so I will update this now and rerun tests.

@junrao
Contributor

junrao commented Dec 19, 2024

@jolshan : Thanks for rerunning the tests. Is the build scan failure related to this PR?

@jolshan
Member Author

jolshan commented Dec 20, 2024

@junrao nope -- I confirmed with David Arthur that the build scan issue for quarantined tests is unrelated.

Here are the latest test results (without the log4j change that seems to be causing issues for our test running infra)
https://confluent-open-source-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/trunk/2024-12-20--001.06d0f058-3bac-4707-89dd-16e641f2bd13--1734724283--jolshan--testing-17705-2--2f7729dfca/report.html

Contributor

@junrao junrao left a comment

@jolshan : Thanks for looking into the test failure. The PR LGTM.

@jolshan jolshan merged commit 8bd3746 into apache:trunk Dec 21, 2024
jolshan added a commit that referenced this pull request Dec 21, 2024
…ready (#18132)

Added transaction version 2 to some of the system tests. Also marking TV2 as production ready.

Also fixes the defaultVersion test. 

Reviewers: Jun Rao <jun@confluent.io>
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
…ready (apache#18132)

Added transaction version 2 to some of the system tests. Also marking TV2 as production ready.

Also fixes the defaultVersion test. 

Reviewers: Jun Rao <jun@confluent.io>
cmd += " --standalone"
self.standalone_controller_bootstrapped = True
if self.use_transactions_v2:
    cmd += " --feature transaction.version=2"
Member

We missed setting transaction.version for isolated KRaft, so the cluster ends up using v0. Since this nullifies the suite's use_transactions_v2=True setting, we will submit a patch later.

Member Author

Hmm, is there a separate place to see this? That is unfortunate.

Member

I observed this issue while running tests/kafkatest/tests/core/transactions_test.py

The settings "isolated_kraft" and "use_transactions_v2=true" failed to enable tv2 on the cluster

Member Author

Do we know how this maps to the kafka.py file? Ie, why this line of code doesn't apply for isolated kraft?

Member

> Do we know how this maps to the kafka.py file? Ie, why this line of code doesn't apply for isolated kraft?

see #21164
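
The behavior this thread expects can be stated as a small check: whatever the quorum mode, the format command should carry the feature flag when `use_transactions_v2` is set. The sketch below is a hypothetical helper, not the code in kafka.py, and it does not claim to show why the flag was missing for isolated KRaft. Only the `--standalone` and `--feature transaction.version=2` strings come from the diff excerpt above; `build_format_cmd` and `base_cmd` are assumptions made for illustration.

```python
def build_format_cmd(base_cmd, standalone, use_transactions_v2):
    """Hypothetical helper: assemble a storage-format command string."""
    cmd = base_cmd
    if standalone:
        cmd += " --standalone"
    # Kept outside the standalone branch so the flag does not depend on
    # which quorum mode the cluster runs in (assumed desired behavior).
    if use_transactions_v2:
        cmd += " --feature transaction.version=2"
    return cmd
```

A test along these lines, run once per quorum mode, would catch a configuration in which `use_transactions_v2=True` silently leaves the cluster at v0.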
