KAFKA-12381: remove live broker checks for forwarding topic creation #10240
Triggered system tests: |
} else if (!hasEnoughLiveBrokers(topic, aliveBrokers)) {
  Some(Errors.INVALID_REPLICATION_FACTOR)
  if (Topic.isInternal(topic)) {
    Some(Errors.INVALID_REPLICATION_FACTOR)
I cannot come up with a good reason why we should preserve this inconsistent handling. I traced back the origin of it to here: 063d534. It looks to me like the intent is for the client to keep retrying until the topic can be created, which makes sense since we return COORDINATOR_NOT_AVAILABLE in response to FindCoordinator requests for the same case. However, INVALID_REPLICATION_FACTOR is a non-retriable error, so it's likely this was poorly understood at the time. I suggest that we return LEADER_NOT_AVAILABLE consistently.
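To make the retriable distinction concrete, here is a minimal self-contained sketch. The `ApiError` type and the `retryUntilCreated` helper are hypothetical stand-ins, not Kafka's client code; only the retriable/non-retriable classification of the three error names follows the comment above.

```scala
// Hypothetical model of error codes; this is not Kafka's real Errors enum.
sealed abstract class ApiError(val retriable: Boolean)
case object LeaderNotAvailable extends ApiError(retriable = true)
case object CoordinatorNotAvailable extends ApiError(retriable = true)
case object InvalidReplicationFactor extends ApiError(retriable = false)

object RetryDemo {
  // True only when the result is an error that the client is allowed to retry.
  private def shouldRetry(result: Either[ApiError, String]): Boolean = result match {
    case Left(error) => error.retriable
    case Right(_)    => false
  }

  // Calls `attempt` until it succeeds, hits a non-retriable error, or runs out of tries.
  // Returns the last result together with the number of attempts made.
  def retryUntilCreated(maxAttempts: Int)(attempt: Int => Either[ApiError, String]): (Either[ApiError, String], Int) = {
    var n = 1
    var result = attempt(n)
    while (n < maxAttempts && shouldRetry(result)) {
      n += 1
      result = attempt(n)
    }
    (result, n)
  }
}
```

With a retriable error the loop keeps going until the topic shows up; with a non-retriable error it stops after the first attempt, which is why the choice of error code matters for auto-creation.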
@abbccdda Hmm.. Some of the failing tests suggest that |
Ok, I think I see what is going on now. The failing system test is verifying what happens when inter-broker communication no longer works. This results in different behavior because I can think of a few options to address the problem:
My inclination is probably option 2. The downside is that the user would no longer get a clear error when a topic cannot be auto-created. But I feel overall it's the safest and most consistent way to handle this case. There might be other options though. It's interesting to note that this relates back to some of the discussion in the auto-create PR itself. We had discussed skipping the replication factor check on the broker and sending the request to the controller. But either way, we have to rely on the metadata cache locally at least to determine whether the topic already exists or not, so it might not have really helped. |
After an offline sync with @hachikuji, we decided that performing the invalid replication factor check on the forwarding broker is redundant. I will remove that logic so that we don't accidentally return a wrong error code to the client because of a stale metadata cache.
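A rough sketch of the decision described here, assuming a simplified model. `Outcome`, `withLocalCheck`, and `withoutLocalCheck` are hypothetical names used for illustration, not the actual methods touched by this PR.

```scala
// Hypothetical model of the forwarding decision; not Kafka's actual classes.
sealed trait Outcome
case object ForwardToController extends Outcome
final case class Reject(error: String) extends Outcome

object AutoCreateSketch {
  // Before: the forwarding broker consulted its own, possibly stale, view of live brokers.
  def withLocalCheck(replicationFactor: Int, cachedLiveBrokers: Int): Outcome =
    if (cachedLiveBrokers < replicationFactor) Reject("INVALID_REPLICATION_FACTOR")
    else ForwardToController

  // After: the forwarding broker ignores its cached broker count and always forwards;
  // validating the replication factor is left to the controller.
  def withoutLocalCheck(replicationFactor: Int, cachedLiveBrokers: Int): Outcome =
    ForwardToController
}
```

With an empty or stale cache (`cachedLiveBrokers = 0`), the first variant rejects a request that the controller might have accepted, while the second defers the decision.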
New system test run: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4415/ |
When security_protocol=SSL, client SSL handshakes are expected to fail due to hostname verification failure.
When security_protocol=PLAINTEXT and interbroker_security_protocol=SSL, controller connections fail
with hostname verification failure. Hence clients are expected to fail with LEADER_NOT_AVAILABLE.
with hostname verification failure. Hence clients are expected to fail with INVALID_REPLICATION_FACTOR.
I still think this error is not a very intuitive way to handle the absence of metadata. Maybe we can rephrase the explanation a little bit.
Since metadata cannot be propagated in the cluster without a valid certificate, the broker's metadata caches will be empty. Hence we expect Metadata requests to fail with an INVALID_REPLICATION_FACTOR error since the broker will attempt to create the topic automatically as it does not exist in the metadata cache, and there will be no online brokers.
Sounds good, will add this part!
hachikuji
left a comment
LGTM. Left one nitpick.
private def testErrorWithCreationInZk(error: Errors,
                                      topicName: String,
                                      isInternal: Boolean,
                                      expectedError: Errors = null): Unit = {
nit: a little more idiomatic to use Option[Errors] for a case like this
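A minimal sketch of what this nit could look like. The `Errors` enumeration below is a stand-in so the snippet compiles on its own, and `resolveExpected` is a hypothetical helper; the assumption that a missing `expectedError` falls back to the injected `error` is mine, not something stated in the PR.

```scala
// Stand-in for Kafka's Errors enum so the sketch compiles on its own.
object Errors extends Enumeration {
  val NONE, INVALID_REPLICATION_FACTOR, LEADER_NOT_AVAILABLE = Value
}

object OptionDefaultSketch {
  // `None` replaces the `null` default: callers either pass Some(error) or nothing at all.
  // Assumption: when no expectation is given, the injected error is expected back unchanged.
  def resolveExpected(error: Errors.Value,
                      expectedError: Option[Errors.Value] = None): Errors.Value =
    expectedError.getOrElse(error)
}
```

The `Option` default avoids a null check inside the helper and makes the "no override" case explicit at the call site.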
System tests pass: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4419/. Only flaky tests are failing, so merging.
…10240) Removed broker number checks for invalid replication factor when doing the forwarding, in order to reduce false alarms for clients. Reviewers: Jason Gustafson <jason@confluent.io>
Cherry-picked to 2.8 |
We introduced a regression in #9579: originally, we returned INVALID_REPLICATION_FACTOR only for internal topic creation when there were not enough brokers. This PR addresses the fundamental problem, which is that a stale forwarding broker should not decide whether a replication factor is invalid. That should only be verified on the active controller side.