Skip to content

KAFKA-16225: Set metadata.log.dir to broker in KRAFT mode in integration test#15354

Merged
jolshan merged 1 commit intoapache:trunkfrom
OmniaGM:KAFKA-16225_fix_LogDirFailureTest_flakiness
Feb 13, 2024
Merged

KAFKA-16225: Set metadata.log.dir to broker in KRAFT mode in integration test#15354
jolshan merged 1 commit intoapache:trunkfrom
OmniaGM:KAFKA-16225_fix_LogDirFailureTest_flakiness

Conversation

@OmniaGM
Copy link
Copy Markdown
Contributor

@OmniaGM OmniaGM commented Feb 12, 2024

Fix the flakiness of LogDirFailureTest by setting a separate metadata.log.dir for brokers in KRAFT mode

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@OmniaGM
Copy link
Copy Markdown
Contributor Author

OmniaGM commented Feb 12, 2024

@jolshan, @gharris1727, @gaurav-narula can you have a look please?

Copy link
Copy Markdown
Contributor

@gaurav-narula gaurav-narula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

cfgs.foreach(_.setProperty(KafkaConfig.NewGroupCoordinatorEnableProp, "true"))
}

if(isKRaftTest()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space after if

So the issue was that the metadata and other log directories were using the same temp directory?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue was that we don't set metadata dir which in this case we fall back to the first log dir.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add, #15262 made it apparent because prior to it, the check for metadata log dir failure didn't handle relative paths correctly. Refer https://github.com/apache/kafka/pull/15262/files#diff-78812e247ffeae6f8c49b1b22506434701b1e1bafe7f92ef8f8708059e292bf0R2589

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was flaky because as we call causeLogDirFailure some times we impact the first log.dir which also is KafkaConfig.metadataLogDir as we don't have metadata.log.dir. So to fix the flakiness we need to explicitly set metadata.log.dir to diff log dir than the ones we could potentially fail for the tests. Hope this explains it!

Copy link
Copy Markdown
Member

@jolshan jolshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will wait for the tests before merging.

Thanks @OmniaGM for the fix!

@jolshan
Copy link
Copy Markdown
Member

jolshan commented Feb 12, 2024

I filed a JIRA for DescribeConsumerGroupTest https://issues.apache.org/jira/browse/KAFKA-16245

@jolshan
Copy link
Copy Markdown
Member

jolshan commented Feb 12, 2024

I still see one LogDirFailureTest failure:

org.apache.kafka.server.fault.FaultHandlerException: quorumTestHarnessFaultHandler: Error applying topics delta in MetadataDelta up to 70: Log for partition topic-0 is not available on broker 1

Can we confirm it is not because the log directory is missing?

@OmniaGM
Copy link
Copy Markdown
Contributor Author

OmniaGM commented Feb 13, 2024

I still see one LogDirFailureTest failure:

org.apache.kafka.server.fault.FaultHandlerException: quorumTestHarnessFaultHandler: Error applying topics delta in MetadataDelta up to 70: Log for partition topic-0 is not available on broker 1

Can we confirm it is not because the log directory is missing?

So far I can not reproduce this failure locally to debug it. But I think the test try to offline disk before the broker catchup with topic creation delta as topic creation is fire and forget. I pushed a comment that ensure that brokers catch-up first before start testing.

@OmniaGM
Copy link
Copy Markdown
Contributor Author

OmniaGM commented Feb 13, 2024

What I noticed locally that testIOExceptionDuringCheckpoint still flaky because the test try to offline log dir before highWatermarkCheckPointThreadStarted create the high watermark replication-offset-checkpoint.tmp file. I can't find an easy way to make sure that the highWatermarkCheckPointThreadStarted has started before failing the log dir yet. Open for suggestions if anyone have any!

@OmniaGM OmniaGM force-pushed the KAFKA-16225_fix_LogDirFailureTest_flakiness branch 2 times, most recently from f38b268 to a2bc46f Compare February 13, 2024 14:34
@OmniaGM OmniaGM force-pushed the KAFKA-16225_fix_LogDirFailureTest_flakiness branch from a2bc46f to 5bf9b2f Compare February 13, 2024 14:40
@gharris1727
Copy link
Copy Markdown
Contributor

Does this make sense to change across every test suite, not just LogDirFailureTest?
Is this reducing our test coverage of the default metadata.log.dir configuration?
Is a separate metadata.log.dir a more common configuration in practice?

@OmniaGM
Copy link
Copy Markdown
Contributor Author

OmniaGM commented Feb 13, 2024

Is this reducing our test coverage of the default metadata.log.dir configuration?

If we want to limited to LogDirFailureTest test only that's okay it will properly have some duplicated code to override generateConfigs only for this tests. But either way I don't believe it would reduce the test coverage as most other tests don't care about metadata log dir or try to offline log dir

Is a separate metadata.log.dir a more common configuration in practice?

I don't think it is a question of common practice or not (doc doesn't mention which is the recommended way to set this) but as no other tests try to fail metadata log dir or test where the metadata land it wouldn't matter. We don't have coverage anywhere as far as I can see for metadata log dir set separately or default!

@jolshan
Copy link
Copy Markdown
Member

jolshan commented Feb 13, 2024

Ok -- I think in order to reduce the build failures we should merge this change. My understanding is that this doesn't introduce new failures.

I also agree that it seems ok to make this change since the metadata log is special and it doesn't seem like log dir failure test was ever trying to test its failure modes. If we do want such testing we should also file a jira for that.

@jolshan
Copy link
Copy Markdown
Member

jolshan commented Feb 13, 2024

I will keep the JIRA open as we try to figure out the fix of the other errors.

@jolshan jolshan merged commit be6653c into apache:trunk Feb 13, 2024
Comment on lines +172 to +175
if (isKRaftTest()) {
val value = configs.map(c => c.brokerId -> c.logDirs.contains(c.metadataLogDir))
logger.warn(s">>>>>> ${value.mkString(",")}")
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to include this change? The logging format and lack of description seems odd.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I removed this one!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will raise a minor pr to remove it. I removed it but didn't push the commit ops

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chia7712
Copy link
Copy Markdown
Member

@jolshan Do you plan to backport this to 3.7? I recently try to backport other PRs to 3.7 and encounter the same flaky

@chia7712
Copy link
Copy Markdown
Member

chia7712 commented Sep 9, 2024

I have backport this PR to 3.7 6f76ed9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants