Skip to content

KAFKA-16195: ignore metadata.log.dir failure in ZK mode#15262

Merged
cmccabe merged 3 commits intoapache:trunkfrom
gaurav-narula:KAFKA-16195
Feb 2, 2024
Merged

KAFKA-16195: ignore metadata.log.dir failure in ZK mode#15262
cmccabe merged 3 commits intoapache:trunkfrom
gaurav-narula:KAFKA-16195

Conversation

@gaurav-narula
Copy link
Copy Markdown
Contributor

This change ensures we check that the broker is running in Kraft mode or is undergoing a migration while handling a failure for metadata.log.dir.

This avoids halting a broker in ZK mode when the metadata.log.dir fails. When unconfigured, it defaults to the first log directory.

This change ensures we check that the broker is running in Kraft mode
or is undergoing a migration while handling a failure for metadata.log.dir.

This avoids halting a broker in ZK mode when the metadata.log.dir fails.
When unconfigured, it defaults to the first log directory.
@gaurav-narula
Copy link
Copy Markdown
Contributor Author

gaurav-narula commented Jan 25, 2024

CC: @cmccabe @showuon @OmniaGM

Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the impact of this bug is that if there are more than 1 log dirs in ZK broker, and when the 1st of them is failed, we will shutdown the broker unexpectedly. But if there's only 1 log dir, it should be fine to shutdown the broker since no available online log dir. (just have a strange log saying Shutdown broker because the metadata log dir. Is my understanding correct?

If so, I think the change makes sense. Could you add tests for it?

@gaurav-narula
Copy link
Copy Markdown
Contributor Author

I think the impact of this bug is that if there are more than 1 log dirs in ZK broker, and when the 1st of them is failed, we will shutdown the broker unexpectedly. But if there's only 1 log dir, it should be fine to shutdown the broker since no available online log dir. (just have a strange log saying Shutdown broker because the metadata log dir. Is my understanding correct?

If so, I think the change makes sense. Could you add tests for it?

That's correct. The case where all log directories fail is handled in logManager.handleLogDirFailure which is invoked on line 2588.

Will follow up with a test shortly.

@gaurav-narula
Copy link
Copy Markdown
Contributor Author

@showuon added a test with commit d0858de. Please take a look

@OmniaGM
Copy link
Copy Markdown
Contributor

OmniaGM commented Jan 29, 2024

Tests Compilation is failing with

[2024-01-26T16:18:05.648Z] > Task :core:compileTestScala
[2024-01-26T16:18:05.648Z] [Error] /home/jenkins/jenkins-agent/712657a4/workspace/Kafka_kafka-pr_PR-15262/core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:6410:23: Invalid literal number
[2024-01-26T16:18:06.555Z] [Error] /home/jenkins/jenkins-agent/712657a4/workspace/Kafka_kafka-pr_PR-15262/core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:6413:5: ')' expected but '}' found.
[2024-01-26T16:18:07.582Z] two errors found
[2024-01-26T16:18:07.582Z] 

Copy link
Copy Markdown
Contributor

@OmniaGM OmniaGM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks straightforward! @showuon can you have another look please assuming the tests are okay after the latest fix from @gaurav-narula?

cc: @rondagostino, @cmccabe and @pprovenzano

Copy link
Copy Markdown
Contributor

@pprovenzano pprovenzano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

val uuid = logManager.directoryId(dir)
logManager.handleLogDirFailure(dir)
if (dir == config.metadataLogDir) {
if (dir == new File(config.metadataLogDir).getAbsolutePath && (zkClient.isEmpty || config.migrationEnabled)) {
Copy link
Copy Markdown
Contributor

@cmccabe cmccabe Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not quite the correct check... you should check config.processRoles (probably config.processRoles.isNotEmpty || config.migrationEnabled )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in b28e21a

Copy link
Copy Markdown
Contributor

@cmccabe cmccabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks, all.

@cmccabe cmccabe merged commit 3d95a69 into apache:trunk Feb 2, 2024
cmccabe pushed a commit that referenced this pull request Feb 2, 2024
In KRaft mode, or on ZK brokers that are migrating to KRaft, we have a local __cluster_metadata
log. This log is stored in a single log directory which is configured via metadata.log.dir. If
there is no metadata.log.dir given, it defaults to the first entry in log.dirs. In the future we
may support multiple metadata log directories, but we don't yet. For now, we must abort the
process when this log directory fails.

In ZK mode, it is not necessary to abort the process when this directory fails, since there is no
__cluster_metadata log there. This PR changes the logic so that we check for whether we're in ZK
mode and do not abort in that scenario (unless we lost the final remaining log directory. of
course.)

Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>
@jolshan
Copy link
Copy Markdown
Member

jolshan commented Feb 10, 2024

@jolshan
Copy link
Copy Markdown
Member

jolshan commented Feb 10, 2024

The issues started for 3.7 on the same day so it is one of the 3 commits backported feb 2
https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=3.7&tests.container=kafka.server.LogDirFailureTest

@OmniaGM
Copy link
Copy Markdown
Contributor

OmniaGM commented Feb 12, 2024

yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
In KRaft mode, or on ZK brokers that are migrating to KRaft, we have a local __cluster_metadata
log. This log is stored in a single log directory which is configured via metadata.log.dir. If
there is no metadata.log.dir given, it defaults to the first entry in log.dirs. In the future we
may support multiple metadata log directories, but we don't yet. For now, we must abort the
process when this log directory fails.

In ZK mode, it is not necessary to abort the process when this directory fails, since there is no
__cluster_metadata log there. This PR changes the logic so that we check for whether we're in ZK
mode and do not abort in that scenario (unless we lost the final remaining log directory. of
course.)

Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
In KRaft mode, or on ZK brokers that are migrating to KRaft, we have a local __cluster_metadata
log. This log is stored in a single log directory which is configured via metadata.log.dir. If
there is no metadata.log.dir given, it defaults to the first entry in log.dirs. In the future we
may support multiple metadata log directories, but we don't yet. For now, we must abort the
process when this log directory fails.

In ZK mode, it is not necessary to abort the process when this directory fails, since there is no
__cluster_metadata log there. This PR changes the logic so that we check for whether we're in ZK
mode and do not abort in that scenario (unless we lost the final remaining log directory. of
course.)

Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024
In KRaft mode, or on ZK brokers that are migrating to KRaft, we have a local __cluster_metadata
log. This log is stored in a single log directory which is configured via metadata.log.dir. If
there is no metadata.log.dir given, it defaults to the first entry in log.dirs. In the future we
may support multiple metadata log directories, but we don't yet. For now, we must abort the
process when this log directory fails.

In ZK mode, it is not necessary to abort the process when this directory fails, since there is no
__cluster_metadata log there. This PR changes the logic so that we check for whether we're in ZK
mode and do not abort in that scenario (unless we lost the final remaining log directory. of
course.)

Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants