Skip to content

KAFKA-12381: remove live broker checks for forwarding topic creation#10240

Merged
abbccdda merged 5 commits intoapache:trunkfrom
abbccdda:fix-leader-not-available-KAFKA-12381
Mar 5, 2021
Merged

KAFKA-12381: remove live broker checks for forwarding topic creation#10240
abbccdda merged 5 commits intoapache:trunkfrom
abbccdda:fix-leader-not-available-KAFKA-12381

Conversation

@abbccdda
Copy link
Copy Markdown

@abbccdda abbccdda commented Mar 2, 2021

We introduced a regression in #9579 where originally we only return INVALID_REPLICATION_FACTOR for internal topic creation when there are not enough brokers. This PR addressed the fundamental problem which is a stale forwarding broker should not make the decision on whether to define a replication factor as invalid or not. This should only be verified on the active controller side.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@abbccdda
Copy link
Copy Markdown
Author

abbccdda commented Mar 2, 2021

Triggered system tests:
trunk: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4408/
fix: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4409/
to see if we have the verifiable producer test fixed.

@abbccdda abbccdda added kraft core Kafka Broker labels Mar 2, 2021
} else if (!hasEnoughLiveBrokers(topic, aliveBrokers)) {
Some(Errors.INVALID_REPLICATION_FACTOR)
if (Topic.isInternal(topic)) {
Some(Errors.INVALID_REPLICATION_FACTOR)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot come up with a good reason why we should preserve this inconsistent handling. I traced back the origin of it to here: 063d534. It looks to me like the intent is for the client to keep retrying until the topic can be created, which makes sense since we return COORDINATOR_NOT_AVAILABLE in response to FindCoordinator requests for the same case. However, INVALID_REPLICATION_FACTOR is a non-retriable error, so it's likely this was poorly understood at the time. I suggest that we return LEADER_NOT_AVAILABLE consistently.

@hachikuji
Copy link
Copy Markdown
Contributor

@abbccdda Hmm.. Some of the failing tests suggest that INVALID_REPLICATION_FACTOR was intentional. For example, MetadataRequestTest.testAutoCreateTopicWithInvalidReplicationFactor. There must be some other difference in the new logic...

@hachikuji
Copy link
Copy Markdown
Contributor

Ok, I think I see what is going on now. The failing system test is verifying what happens when inter-broker communication no longer works. This results in different behavior because AutoTopicCreationManager relies on the MetadataCache in order to determine the number of live brokers while the old logic checked zk directly. That makes the INVALID_REPLICATION_FACTOR more dangerous since it is not retriable and the cache may be stale. In particular, when inter-broker communication is down, the cache will be empty and the broker will end up trying to auto-create all topics.

I can think of a few options to address the problem:

  1. Bring back the old logic to check Zookeeper for the live brokers. This might be fine for 2.8, but it does not address the problem for KIP-500.
  2. Return a retriable error instead. Really UNKNOWN_TOPIC_OR_PARTITION would be a better error in this case.
  3. Make INVALID_REPLICATION_FACTOR a retriable error. I guess we have to understand how clients

My inclination is probably option 2. The downside is that the user would no longer get a clear error when a topic cannot be auto-created. But I feel overall it's the safest and most consistent way to handle this case. There might be other options though.

It's interesting to note that this relates back to some of the discussion in the auto-create PR itself. We had discussed skipping the replication factor check on the broker and sending the request to the controller. But either way, we have to rely on the metadata cache locally at least to determine whether the topic already exists or not, so it might not have really helped.

@abbccdda
Copy link
Copy Markdown
Author

abbccdda commented Mar 3, 2021

After offline sync with @hachikuji , we decided that the invalid replication factor check would be redundant to be performed on the forwarding broker. Will remove that logic to ensure we don't accidentally return any wrong error code to the client, due to the staleness of metadata cache.

@abbccdda abbccdda changed the title KAFKA-12381: only return leader not available for internal topic creation KAFKA-12381: remove live broker checks for forwarding topic creation Mar 3, 2021
@abbccdda
Copy link
Copy Markdown
Author

abbccdda commented Mar 3, 2021

When security_protocol=SSL, client SSL handshakes are expected to fail due to hostname verification failure.
When security_protocol=PLAINTEXT and interbroker_security_protocol=SSL, controller connections fail
with hostname verification failure. Hence clients are expected to fail with LEADER_NOT_AVAILABLE.
with hostname verification failure. Hence clients are expected to fail with INVALID_REPLICATION_FACTOR.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this error is not a very intuitive way to handle the absence of metadata. Maybe we can rephrase the explanation a little bit.

Since metadata cannot be propagated in the cluster without a valid certificate, the broker's metadata caches will be empty. Hence we expect Metadata requests to fail with an INVALID_REPLICATION_FACTOR error since the broker will attempt to create the topic automatically as it does not exist in the metadata cache, and there will be no online brokers.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will add this part!

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left one nitpick.

private def testErrorWithCreationInZk(error: Errors,
topicName: String,
isInternal: Boolean,
expectedError: Errors = null): Unit = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a little more idiomatic to use Option[Errors] for a case like this

@abbccdda
Copy link
Copy Markdown
Author

abbccdda commented Mar 5, 2021

System test pass: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4419/ and only flaky tests are failing, merging

@abbccdda abbccdda merged commit 17851da into apache:trunk Mar 5, 2021
abbccdda pushed a commit that referenced this pull request Mar 5, 2021
…10240)

Removed broker number checks for invalid replication factor when doing the forwarding, in order to reduce false alarms for clients.

Reviewers: Jason Gustafson <jason@confluent.io>
@abbccdda
Copy link
Copy Markdown
Author

abbccdda commented Mar 5, 2021

Cherry-picked to 2.8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Kafka Broker kraft

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants