KAFKA-16195: ignore metadata.log.dir failure in ZK mode by gaurav-narula · Pull Request #15262 · apache/kafka

gaurav-narula · 2024-01-25T13:56:30Z

This change ensures that, when handling a failure of metadata.log.dir, we first check that the broker is running in KRaft mode or is undergoing a migration.

This avoids halting a broker in ZK mode when metadata.log.dir fails. When unconfigured, metadata.log.dir defaults to the first log directory.

gaurav-narula · 2024-01-25T13:58:12Z

CC: @cmccabe @showuon @OmniaGM

showuon

I think the impact of this bug is that if a ZK broker has more than one log dir and the first of them fails, we will shut down the broker unexpectedly. But if there's only one log dir, it should be fine to shut down the broker, since no online log dir is available (there would just be a strange log saying Shutdown broker because the metadata log dir). Is my understanding correct?

If so, I think the change makes sense. Could you add tests for it?

gaurav-narula · 2024-01-26T12:00:43Z

That's correct. The case where all log directories fail is handled in logManager.handleLogDirFailure, which is invoked on line 2588.

Will follow up with a test shortly.

gaurav-narula · 2024-01-26T13:54:34Z

@showuon added a test with commit d0858de. Please take a look

OmniaGM · 2024-01-29T14:37:27Z

Test compilation is failing with:

[2024-01-26T16:18:05.648Z] > Task :core:compileTestScala
[2024-01-26T16:18:05.648Z] [Error] /home/jenkins/jenkins-agent/712657a4/workspace/Kafka_kafka-pr_PR-15262/core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:6410:23: Invalid literal number
[2024-01-26T16:18:06.555Z] [Error] /home/jenkins/jenkins-agent/712657a4/workspace/Kafka_kafka-pr_PR-15262/core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:6413:5: ')' expected but '}' found.
[2024-01-26T16:18:07.582Z] two errors found
[2024-01-26T16:18:07.582Z]

OmniaGM

The change looks straightforward! @showuon can you have another look please assuming the tests are okay after the latest fix from @gaurav-narula?

cc: @rondagostino, @cmccabe and @pprovenzano

pprovenzano

LGTM

cmccabe · 2024-01-31T17:10:50Z

    val uuid = logManager.directoryId(dir)
    logManager.handleLogDirFailure(dir)
-    if (dir == config.metadataLogDir) {
+    if (dir == new File(config.metadataLogDir).getAbsolutePath && (zkClient.isEmpty || config.migrationEnabled)) {


This is not quite the correct check... you should check config.processRoles (probably config.processRoles.isNotEmpty || config.migrationEnabled)

Addressed in b28e21a
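
A minimal sketch of the check discussed in this review thread, assuming the shape that commit b28e21a's title suggests (`config.processRoles.nonEmpty`). `BrokerConfigSketch` and `shouldHaltOnFailure` are hypothetical names made up for illustration; they stand in for the few `KafkaConfig` fields mentioned above and are not the actual Kafka code.

```scala
import java.io.File

// Hypothetical stand-in for the KafkaConfig fields mentioned in the review.
final case class BrokerConfigSketch(
  metadataLogDir: String,
  processRoles: Set[String], // empty in ZK mode, e.g. Set("broker") in KRaft mode
  migrationEnabled: Boolean
)

object MetadataDirCheck {
  // True only when the failed directory is the metadata log directory AND the
  // broker actually keeps a metadata log (KRaft mode or ZK-to-KRaft migration).
  def shouldHaltOnFailure(failedDir: String, config: BrokerConfigSketch): Boolean = {
    val metadataDir = new File(config.metadataLogDir).getAbsolutePath
    val hasMetadataLog = config.processRoles.nonEmpty || config.migrationEnabled
    failedDir == metadataDir && hasMetadataLog
  }
}
```

Under these assumptions, a pure ZK broker (no process roles, migration disabled) never halts from this check, even when the failed directory equals the configured or defaulted metadata.log.dir.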

cmccabe

LGTM. Thanks, all.

In KRaft mode, or on ZK brokers that are migrating to KRaft, we have a local __cluster_metadata log. This log is stored in a single log directory which is configured via metadata.log.dir. If there is no metadata.log.dir given, it defaults to the first entry in log.dirs. In the future we may support multiple metadata log directories, but we don't yet. For now, we must abort the process when this log directory fails. In ZK mode, it is not necessary to abort the process when this directory fails, since there is no __cluster_metadata log there. This PR changes the logic so that we check whether we're in ZK mode and do not abort in that scenario (unless we lost the final remaining log directory, of course).

Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>
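
The behaviour described in the commit message can be sketched as a small decision function. This is an illustration under stated assumptions, not Kafka's implementation: `LogDirFailureSketch`, `effectiveMetadataLogDir`, `onFailure`, and the `Outcome` cases are all hypothetical names, and `hasMetadataLog` stands for "KRaft mode or migrating to KRaft".

```scala
object LogDirFailureSketch {
  sealed trait Outcome
  case object HaltNoLogDirsLeft extends Outcome
  case object HaltMetadataDirFailed extends Outcome
  case object KeepRunning extends Outcome

  // metadata.log.dir falls back to the first entry of log.dirs when unset.
  def effectiveMetadataLogDir(metadataLogDir: Option[String], logDirs: Seq[String]): String =
    metadataLogDir.getOrElse(logDirs.head)

  // Decide what happens when `failedDir` goes offline.
  def onFailure(
    failedDir: String,
    onlineDirs: Set[String],
    metadataLogDir: Option[String],
    logDirs: Seq[String],
    hasMetadataLog: Boolean // assumed: true in KRaft mode or during migration
  ): Outcome = {
    val remaining = onlineDirs - failedDir
    if (remaining.isEmpty) HaltNoLogDirsLeft // losing the last directory always halts
    else if (hasMetadataLog && failedDir == effectiveMetadataLogDir(metadataLogDir, logDirs))
      HaltMetadataDirFailed
    else KeepRunning
  }
}
```

With two log directories and no explicit metadata.log.dir, a ZK broker that loses the first directory keeps running in this sketch, while a KRaft broker in the same situation halts.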

jolshan · 2024-02-10T01:08:32Z

Hey folks. We've seen a large increase in LogDirFailureTest failures after this PR. Can we take a look and see if something here caused it?

gradle enterprise: https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.container=kafka.server.LogDirFailureTest
jira: https://issues.apache.org/jira/browse/KAFKA-16225

jolshan · 2024-02-10T01:12:14Z

The issues started for 3.7 on the same day, so it is one of the 3 commits backported on Feb 2:
https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=3.7&tests.container=kafka.server.LogDirFailureTest

OmniaGM · 2024-02-12T15:09:37Z

Hi @jolshan, just introduced a fix here: #15354

showuon reviewed Jan 26, 2024

KAFKA-16195: add test

eb134d3

gaurav-narula force-pushed the KAFKA-16195 branch from d0858de to eb134d3 on January 29, 2024, 14:37

OmniaGM approved these changes Jan 29, 2024

pprovenzano approved these changes Jan 29, 2024

cmccabe reviewed Jan 31, 2024

KAFKA-16195: use config.processRoles.nonEmpty

b28e21a

cmccabe approved these changes Feb 2, 2024

cmccabe merged commit 3d95a69 into apache:trunk Feb 2, 2024

gaurav-narula mentioned this pull request Feb 12, 2024

KAFKA-16225: Set metadata.log.dir to broker in KRAFT mode in integration test #15354

Merged

3 tasks

chia7712 mentioned this pull request Sep 9, 2024

KAFKA-17417: Backport KAFKA-15751 and KAFKA-15752 to 3.8 and 3.7 #17102

Merged

3 tasks

6 participants
