
KAFKA-14240; Validate kraft snapshot state on startup#12653

Merged
hachikuji merged 5 commits into apache:trunk from hachikuji:KAFKA-14240 on Sep 19, 2022

Conversation

@hachikuji
Contributor

We should prevent the metadata log from initializing in a known bad state. If the log start offset of the first segment is greater than 0, then there must be a snapshot at an offset greater than or equal to it in order to ensure that the initialized state is complete.
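The invariant described above can be modeled in isolation. The following is a minimal, self-contained sketch (the `LogState` type and `validate` method are illustrative stand-ins, not Kafka's actual API):

```scala
// Minimal model of the startup invariant: if the log does not start at
// offset 0, some snapshot must cover the state up to the log start offset.
object SnapshotStateCheck {
  // Illustrative stand-in for the relevant log and snapshot state.
  final case class LogState(logStartOffset: Long, snapshotOffsets: List[Long])

  def validate(state: LogState): Unit = {
    if (state.logStartOffset > 0) {
      val latestSnapshotOffset = state.snapshotOffsets.sorted.lastOption
      if (!latestSnapshotOffset.exists(_ >= state.logStartOffset)) {
        throw new IllegalStateException(
          s"Inconsistent snapshot state: need a snapshot at an offset >= " +
          s"log start offset ${state.logStartOffset}, but the latest " +
          s"snapshot offset is $latestSnapshotOffset")
      }
    }
  }
}
```

A log starting at offset 0 always passes; a log starting at offset 100 passes only if some snapshot sits at offset 100 or beyond.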

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Member

@jsancio jsancio left a comment


Thanks for the change @hachikuji .

Comment on lines +51 to +61
// If the log start offset is not 0, then we must have a snapshot which covers the
// initial state up to the current log start offset.
if (log.logStartOffset > 0) {
  val latestSnapshotId = snapshots.lastOption.map(_._1)
  if (!latestSnapshotId.exists(snapshotId => snapshotId.offset >= log.logStartOffset)) {
    throw new IllegalStateException("Inconsistent snapshot state: there must be a snapshot " +
      s"at an offset greater than or equal to the current log start offset ${log.logStartOffset}, " +
      s"but the latest snapshot is $latestSnapshotId")
  }
}

Member


What do you think about moving this check to recoverSnapshots or before calling recoverSnapshots?

recoverSnapshots makes changes to the log dir. Kafka should probably validate the log dir before making changes to it?
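The ordering being suggested can be sketched as follows. This is a hypothetical model (the method names and step tracking are illustrative, not Kafka's actual internals) showing the design point: run the read-only validation before any step that mutates the log directory.

```scala
// Sketch of validate-before-recover ordering: validation is a read-only
// check and fails fast, so it runs before recovery mutates the log dir.
object StartupOrder {
  private val steps = scala.collection.mutable.ListBuffer.empty[String]

  def validateSnapshotState(): Unit = steps += "validate" // read-only consistency check
  def recoverSnapshots(): Unit = steps += "recover"       // may delete partial snapshot files

  def initialize(): List[String] = {
    validateSnapshotState() // fail fast on an inconsistent log dir
    recoverSnapshots()      // only mutate the log dir after validation passes
    steps.toList
  }
}
```

The benefit, as noted below in the thread, is that recovery never gets a chance to delete snapshots from a log dir that is already inconsistent.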

Contributor Author


Makes sense.

Contributor Author


I guess a specific case where doing this prior to recoverSnapshots might help is when a deleted file corresponds to a snapshot in the range we are looking for.

Member


Yes. recoverSnapshots doesn't make a log dir invalid but it may make it "more" invalid by deleting more snapshots.

Member

@jsancio jsancio left a comment


Great tests. LGTM after one minor comment.

Comment thread: core/src/main/scala/kafka/raft/KafkaMetadataLog.scala
Member

@jsancio jsancio left a comment


LGTM if the tests are green.

@jsancio added the core (Kafka Broker) and 3.3 labels Sep 16, 2022
@showuon
Member

showuon commented Sep 17, 2022

2 out of 3 builds failed (terminated) without results. I've re-run it: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12653/7/

@hachikuji
Contributor Author

Test failures look unrelated. I've triggered one more build to be on the safe side.

@showuon
Member

showuon commented Sep 18, 2022

@hachikuji
Contributor Author

Sigh.

[2022-09-18T10:01:25.947Z] * What went wrong:
[2022-09-18T10:01:25.947Z] Execution failed for task ':core:unitTest'.
[2022-09-18T10:01:25.947Z] > Process 'Gradle Test Executor 116' finished with non-zero exit value 1

@showuon
Member

showuon commented Sep 19, 2022

2 out of 3 builds failed again in the latest build.
Could we consider merging the fix in PR #12639 soon to solve this build issue? I've spent time investigating that issue and identified the root cause. IMO, merging that PR is the top priority for making our development process more efficient!

@hachikuji hachikuji merged commit 8c8b536 into apache:trunk Sep 19, 2022
hachikuji added a commit that referenced this pull request Sep 19, 2022
We should prevent the metadata log from initializing in a known bad state. If the log start offset of the first segment is greater than 0, then there must be a snapshot at an offset greater than or equal to it in order to ensure that the initialized state is complete.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
ijuma added a commit to confluentinc/kafka that referenced this pull request Sep 21, 2022
…eptember 2022)

`Jenkinsfile` was the only conflict and we ignore the changes since
they are not relevant to the Confluent build.

* apache-github/3.3: (61 commits)
  KAFKA-14214: Introduce read-write lock to StandardAuthorizer for consistent ACL reads. (apache#12628)
  KAFKA-14243: Temporarily disable unsafe downgrade (apache#12664)
  KAFKA-14240; Validate kraft snapshot state on startup (apache#12653)
  KAFKA-14233: disable testReloadUpdatedFilesWithoutConfigChange first to fix the build (apache#12658)
  KAFKA-14238; KRaft metadata log should not delete segment past the latest snapshot (apache#12655)
  KAFKA-14156: Built-in partitioner may create suboptimal batches (apache#12570)
  MINOR: Adds KRaft versions of most streams system tests (apache#12458)
  MINOR; Add missing li end tag (apache#12640)
  MINOR: Mention that kraft is production ready in upgrade notes (apache#12635)
  MINOR: Add upgrade note regarding the Strictly Uniform Sticky Partitioner (KIP-794)  (apache#12630)
  KAFKA-14222; KRaft's memory pool should always allocate a buffer (apache#12625)
  KAFKA-14208; Do not raise wakeup in consumer during asynchronous offset commits (apache#12626)
  KAFKA-14196; Do not continue fetching partitions awaiting auto-commit prior to revocation (apache#12603)
  KAFKA-14215; Ensure forwarded requests are applied to broker request quota (apache#12624)
  MINOR; Remove end html tag from upgrade (apache#12605)
  Remove the html end tag from upgrade.html
  KAFKA-14205; Document how to replace the disk for the KRaft Controller (apache#12597)
  KAFKA-14203 Disable snapshot generation on broker after metadata errors (apache#12596)
  KAFKA-14216: Remove ZK reference from org.apache.kafka.server.quota.ClientQuotaCallback javadoc (apache#12617)
  KAFKA-14217: app-reset-tool.html should not show --zookeeper flag that no longer exists (apache#12618)
  ...
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
We should prevent the metadata log from initializing in a known bad state. If the log start offset of the first segment is greater than 0, then there must be a snapshot at an offset greater than or equal to it in order to ensure that the initialized state is complete.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>

Labels

core Kafka Broker


3 participants