Skip to content

KAFKA-18346: Fix e2e TestKRaftUpgrade from v3.3.2#18386

Merged
chia7712 merged 7 commits intoapache:trunkfrom
TaiJuWu:e2e_kraft
Jan 15, 2025
Merged

KAFKA-18346: Fix e2e TestKRaftUpgrade from v3.3.2#18386
chia7712 merged 7 commits intoapache:trunkfrom
TaiJuWu:e2e_kraft

Conversation

@TaiJuWu
Copy link
Copy Markdown
Collaborator

@TaiJuWu TaiJuWu commented Jan 4, 2025

jira: https://issues.apache.org/jira/browse/KAFKA-18346

This failure related to #13130

see #18386 (comment)

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@github-actions github-actions bot added triage PRs from the community tests Test fixes (including flaky tests) small Small PRs labels Jan 4, 2025
@TaiJuWu
Copy link
Copy Markdown
Collaborator Author

TaiJuWu commented Jan 4, 2025

================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.12.0
session_id:       2025-01-04--014
run time:         4 minutes 44.207 seconds
tests run:        1
passed:           1
flaky:            0
failed:           0
ignored:          0
================================================================================
test_id:    kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade_downgrade.from_kafka_version=3.3.2.metadata_quorum=ISOLATED_KRAFT
status:     PASS
run time:   4 minutes 44.023 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_combined_mode_upgrade.from_kafka_version=3.3.2.metadata_quorum=COMBINED_KRAFT
status:     PASS
run time:   3 minutes 28.845 seconds

@TaiJuWu
Copy link
Copy Markdown
Collaborator Author

TaiJuWu commented Jan 10, 2025

Hi @ijuma , I also hope this PR can bring your attention and I have written the root cause.
But I don't think we should apply this fix.

@TaiJuWu TaiJuWu closed this Jan 10, 2025
@TaiJuWu TaiJuWu deleted the e2e_kraft branch January 10, 2025 16:50
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jan 10, 2025

I had missed this - don't we need it anymore? cc @mumrah @cmccabe

@TaiJuWu
Copy link
Copy Markdown
Collaborator Author

TaiJuWu commented Jan 10, 2025

I had missed this - don't we need it anymore? cc @mumrah @cmccabe

Even if we apply this fix, just hide this issue in system test but it still exists and the only way is back port to 3.3 but it won’t be released.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jan 10, 2025

It seems like we are testing that we can upgrade from 3.3 with a single log directory - is that right? If so, that makes sense to me.

@TaiJuWu
Copy link
Copy Markdown
Collaborator Author

TaiJuWu commented Jan 10, 2025

We are testing DOWNGRADE from 4.0 to 3.3 and it ONLY work for single folder but in KRAFT mode it should often be three folders.

The error throw

kafka.common.InconsistentBrokerMetadataException: BrokerMetadata is not consistent across log.dirs. This could happen if multiple brokers shared a log directory (log.dirs) or partial data was manually copied from another broker. Found:
- /mnt/kafka/kafka-metadata-logs -> {node.id=3001, directory.id=hqW-A8W0pbE8dJmM2otDPQ, version=1, cluster.id=I2eXt9rvSnyhct8BYmW6-w}
- /mnt/kafka/kafka-data-logs-2 -> {node.id=3001, directory.id=nbiO8yKUacnyIijAkoFcFg, version=1, cluster.id=I2eXt9rvSnyhct8BYmW6-w}
- /mnt/kafka/kafka-data-logs-1 -> {node.id=3001, directory.id=2-xUs6eYlJtkVUcl6lJmtQ, version=1, cluster.id=I2eXt9rvSnyhct8BYmW6-w}

What #13130 do is split ZK mode and KRAFT mode, and if we work on KRAFT mode we only check clusterID and nodeID to avoid compatibility issue.
That why I said the only solution is back port that PR to 3.3, does it make sense?

By the way, ZK mode we check nodeID + clusterID + directoryID so we have such issue.

@chia7712
Copy link
Copy Markdown
Member

@TaiJuWu I feel that approach has two issue.

  1. In rolling-upgrade, the version is changed from 3.3 to trunk. Hence, the metadata folder will be inconsistent as we don't change the path when the version is not 3.3. That means all metadata gets lost after upgrading in e2e.
  2. the path in this PR is incorrect. config_property.METADATA_LOG_DIR is key rather than a path

@chia7712
Copy link
Copy Markdown
Member

To mitigate the 3.3 issue, we could add a flag to the KafkaService class to indicate whether different log directories have been used. During node restarts, this flag can be checked when generating properties.

@TaiJuWu what are your thoughts? You have identified the root cause, so it would be excellent if you could complete this PR 😃

@TaiJuWu
Copy link
Copy Markdown
Collaborator Author

TaiJuWu commented Jan 11, 2025

To mitigate the 3.3 issue, we could add a flag to the KafkaService class to indicate whether different log directories have been used. During node restarts, this flag can be checked when generating properties.

This is the issue about this fix and is not related to the root cause, I would like to discuss the root cause about this issue but I still very appreciate for your review.

@TaiJuWu what are your thoughts? You have identified the root cause, so it would be excellent if you could complete this PR 😃

After discussion offline with @chia7712 , the main question becomes should we limit the downgrade path to single folder in our system test or we want to remove this version directly?
The issue about multiple folder downgrade is a known issue (#13130) and we won't release v3.3 again.

The former means we still support limited downgrade and the latter means we don't give downgrade promission about v3.3.
No matter what we should document this.

cc @ijuma @mumrah @cmccabe
ping @dajac since you are RM for 4.0

@TaiJuWu
Copy link
Copy Markdown
Collaborator Author

TaiJuWu commented Jan 11, 2025

reopen for discussion.

@TaiJuWu TaiJuWu restored the e2e_kraft branch January 11, 2025 07:37
@TaiJuWu TaiJuWu reopened this Jan 11, 2025
@chia7712
Copy link
Copy Markdown
Member

@TaiJuWu you're correct. 3.3 had some issues, and we didn't backport the fix (#13130). However, 3.3 is an outdated release, and we don't need to prioritize "how to downgrade multi-folders Kraft to 3.3." I believe this is a rare use case with limited benefits. In this PR, we can revise the test case to ensure a path exists for downgrading to 3.3. I think this is a sufficient approach."

@chia7712
Copy link
Copy Markdown
Member

we want to remove this version directly?

I prefer to include 3.3 in the upgrading/downgrading tests, as we already know how to successfully execute these scenarios, right?

Comment thread tests/kafkatest/services/kafka/kafka.py Outdated
@TaiJuWu TaiJuWu force-pushed the e2e_kraft branch 2 times, most recently from ad847fa to f40321b Compare January 11, 2025 15:07
@TaiJuWu TaiJuWu marked this pull request as ready for review January 11, 2025 15:08
@github-actions github-actions bot removed the triage PRs from the community label Jan 12, 2025
Copy link
Copy Markdown
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TaiJuWu thanks for this fix.

configs.update(override_configs)

prop_file = self.render_configs(configs)
filtered_configs = {k: v for k, v in configs.items() if v not in [None, ""]}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add this behavior to the document (line#258)

- Finally, validate that every message acked by the producer was consumed by the consumer.
"""

# Due to 3.3 compatability issue we need to set one folder
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please revise the comment with more details?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments are address, please take a look.

Copy link
Copy Markdown
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

run the e2e tests/kafkatest/tests/core/kraft_upgrade_test.py on my local with this patch. all pass

@chia7712 chia7712 merged commit ceee1a7 into apache:trunk Jan 15, 2025
chia7712 pushed a commit to chia7712/kafka that referenced this pull request Jan 15, 2025
Due to an issue with handling folders in Kafka version 3.3.2 (see apache#13130), this end-to-end test requires using a single folder for upgrade/downgrade scenarios involving 3.3.2.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
@TaiJuWu TaiJuWu deleted the e2e_kraft branch January 15, 2025 12:44
chia7712 pushed a commit that referenced this pull request Jan 15, 2025
Due to an issue with handling folders in Kafka version 3.3.2 (see #13130), this end-to-end test requires using a single folder for upgrade/downgrade scenarios involving 3.3.2.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
pranavt84 pushed a commit to pranavt84/kafka that referenced this pull request Jan 27, 2025
Due to an issue with handling folders in Kafka version 3.3.2 (see apache#13130), this end-to-end test requires using a single folder for upgrade/downgrade scenarios involving 3.3.2.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
airlock-confluentinc bot pushed a commit to confluentinc/kafka that referenced this pull request Jan 27, 2025
Due to an issue with handling folders in Kafka version 3.3.2 (see apache#13130), this end-to-end test requires using a single folder for upgrade/downgrade scenarios involving 3.3.2.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025
Due to an issue with handling folders in Kafka version 3.3.2 (see apache#13130), this end-to-end test requires using a single folder for upgrade/downgrade scenarios involving 3.3.2.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved small Small PRs tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants