KAFKA-16195: ignore metadata.log.dir failure in ZK mode #15262
cmccabe merged 3 commits into apache:trunk
This change ensures we check that the broker is running in KRaft mode or is undergoing a migration before halting on a metadata.log.dir failure. This avoids halting a broker in ZK mode when the metadata.log.dir fails. When unconfigured, metadata.log.dir defaults to the first log directory.
showuon left a comment
I think the impact of this bug is that if a ZK broker has more than one log dir and the first of them fails, we will shut down the broker unexpectedly. But if there's only one log dir, it should be fine to shut down the broker, since no online log dir remains (we just get a strange log message saying "Shutdown broker because the metadata log dir"). Is my understanding correct?
If so, I think the change makes sense. Could you add tests for it?
That's correct. The case where all log directories fail is handled in Will follow up with a test shortly.
Tests Compilation is failing with
d0858de to eb134d3
OmniaGM left a comment
The change looks straightforward! @showuon can you have another look please assuming the tests are okay after the latest fix from @gaurav-narula?
cc: @rondagostino, @cmccabe and @pprovenzano
  val uuid = logManager.directoryId(dir)
  logManager.handleLogDirFailure(dir)
- if (dir == config.metadataLogDir) {
+ if (dir == new File(config.metadataLogDir).getAbsolutePath && (zkClient.isEmpty || config.migrationEnabled)) {
This is not quite the correct check... you should check config.processRoles (probably config.processRoles.nonEmpty || config.migrationEnabled)
In KRaft mode, or on ZK brokers that are migrating to KRaft, we have a local __cluster_metadata log. This log is stored in a single log directory, which is configured via metadata.log.dir. If no metadata.log.dir is given, it defaults to the first entry in log.dirs. In the future we may support multiple metadata log directories, but we don't yet. For now, we must abort the process when this log directory fails. In ZK mode, it is not necessary to abort the process when this directory fails, since there is no __cluster_metadata log there. This PR changes the logic so that we check whether we're in ZK mode and do not abort in that scenario (unless we lost the final remaining log directory, of course).
Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>
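The decision described in the commit message can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code merged in this PR: `BrokerSettings`, `hasMetadataLog`, `effectiveMetadataLogDir`, and `mustHalt` are hypothetical names, and process roles are modeled as plain strings rather than Kafka's own types. It assumes `logDirs` is non-empty and that both paths are compared in absolute form, as the diff above does.

```scala
import java.io.File

object LogDirFailureSketch {
  // Hypothetical, simplified stand-in for the relevant broker settings
  // (not Kafka's KafkaConfig).
  final case class BrokerSettings(
    logDirs: Seq[String],           // log.dirs, assumed non-empty
    metadataLogDir: Option[String], // metadata.log.dir, if configured
    processRoles: Set[String],      // empty in ZK mode
    migrationEnabled: Boolean       // ZK broker migrating to KRaft
  ) {
    // metadata.log.dir falls back to the first entry in log.dirs when unset.
    def effectiveMetadataLogDir: String =
      new File(metadataLogDir.getOrElse(logDirs.head)).getAbsolutePath

    // A local __cluster_metadata log exists only in KRaft mode or during migration.
    def hasMetadataLog: Boolean = processRoles.nonEmpty || migrationEnabled
  }

  // True when the broker should halt after `failedDir` goes offline.
  def mustHalt(settings: BrokerSettings, failedDir: String, onlineDirsAfterFailure: Int): Boolean = {
    val failedAbs = new File(failedDir).getAbsolutePath
    val lostMetadataLog =
      settings.hasMetadataLog && failedAbs == settings.effectiveMetadataLogDir
    // Losing the last remaining log directory is fatal in every mode.
    lostMetadataLog || onlineDirsAfterFailure == 0
  }
}
```

Under these assumptions, a ZK broker with two log directories keeps running when the first one fails, while a KRaft or migrating broker with the same layout halts.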
Hey folks. We've seen a large increase in LogDirFailureTest failures after this PR. Can we take a look and see if something here caused it? Gradle Enterprise: https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.container=kafka.server.LogDirFailureTest
The issues started on 3.7 on the same day, so it is one of the 3 commits backported on Feb 2.