KAFKA-6145: KIP-441: Fix assignor config passthrough (#8716)
vvcephei merged 2 commits into apache:trunk from vvcephei:kafka-6145-one-shot-balance
Conversation
vvcephei
left a comment
Hey @cadonna and @ableegoldman, can you take a look at these fixes? I'm going to add a couple more tests, but I'd like to get your early feedback if you have time.
This is where we forgot to copy over the configs.
Can we add a note to AssignorConfiguration pointing to this for any new assignor-related configs that get added?
Do we automatically pass through the internal configs? I notice we don't copy over the task assignor class, or the new assignment listener callback I added for the integration tests. But both of them seem to get through.
I know it's deprecated, but I think the PartitionGrouper config is missing as well.
Yes, we copy over the other internal configs in org.apache.kafka.streams.processor.internals.StreamThread#create
I'll add your listener to the config test. I'm not sure about PartitionGrouper.
Turns out PartitionGrouper does not get copied over. I'll create a jira to track this, so that we don't have to get sidetracked in this PR with the question of what we should do about this deprecated config.
we automatically pass through any config that isn't registered
I have to say, this seems totally backwards to me. So basically we just happen to pass in any number of configs that we may or may not need, but will strip out specific configs that we do need unless explicitly told to include them? I understand the custom-configs motivation, but then why not just pass through all the configs?
What if I wanted to access the value of one of my registered Streams configs in my plugged-in component? I'd have to add the same value a second time, but with an unregistered config name. Huh?
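To make the behavior under discussion concrete, here is a minimal sketch of the translation pattern, using plain maps and made-up config names rather than the actual Kafka classes: registered config names are filtered out before the map reaches the consumer, so they survive only if copied explicitly, while unregistered (custom or internal) configs pass straight through.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only (hypothetical names, not actual Kafka code):
// registered configs are stripped unless explicitly copied over;
// anything unregistered passes through untouched.
public class PassthroughSketch {
    static final Set<String> REGISTERED = Set.of(
            "acceptable.recovery.lag", "max.warmup.replicas");

    static Map<String, Object> consumerConfigs(Map<String, Object> streamsConfigs) {
        Map<String, Object> result = new HashMap<>();
        for (Map.Entry<String, Object> e : streamsConfigs.entrySet()) {
            if (!REGISTERED.contains(e.getKey())) {
                // unregistered configs pass through automatically
                result.put(e.getKey(), e.getValue());
            }
        }
        // Registered configs survive only when copied explicitly;
        // forgetting this copy is the kind of bug the PR fixes.
        result.put("acceptable.recovery.lag", streamsConfigs.get("acceptable.recovery.lag"));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> streams = Map.of(
                "acceptable.recovery.lag", "6",
                "max.warmup.replicas", "7",
                "__assignment.listener__", "listener");
        Map<String, Object> consumer = consumerConfigs(streams);
        System.out.println(consumer.containsKey("__assignment.listener__"));
        System.out.println(consumer.get("acceptable.recovery.lag"));
        System.out.println(consumer.containsKey("max.warmup.replicas"));
    }
}
```

Here "max.warmup.replicas" is registered but never copied, so it silently falls away, which is exactly the trap the reviewers are pointing out.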
Regarding KAFKA-10046, in current trunk we already have some logic that assumes the default partition grouper is always used, so I'd suggest we just bite the bullet and remove it in 2.6.
Thanks, @guozhangwang
@ableegoldman , it seems backwards to me also.
I added this constraint to mirror the constraint we already apply in StreamConfig. It's not critical, but I was disappointed that I had written a bunch of tests that included a technically invalid configuration.
Should we add the same for all the other configs?
IMO, we should check the limits for all configs. However, I am not sure if we should check probingRebalanceIntervalMs to be >= 60 * 1000L (as we do in StreamsConfig) or just >= 0.
We could; to be fair, the config parser already checks the limits, so it's not necessary for production code.
It only comes up when we manually create these internal objects for tests. I added this particular bound specifically because I had previously passed in what I thought was a dummy value, which turned out to be a misconfiguration that actually affected the behavior I was testing.
Offhand, it doesn't seem like the same thing would happen with probingRebalanceIntervalMs, so it doesn't seem like the check here would have the same benefit; WDYT?
Nevermind; I refactored the class to use the same config validation for passed-in arguments.
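As a rough illustration of sharing one validation between the config parser and directly passed-in arguments, here is a hedged sketch; the helper name and the 60-second lower bound mirror the values mentioned above, but the method is hypothetical, not the actual refactoring.

```java
// Hedged sketch: a single range check reused for both parsed configs and
// values passed in directly by tests, so a "dummy" test value can't
// silently violate the bound enforced in production.
public class ValidationSketch {
    static long validateProbingRebalanceIntervalMs(long value) {
        if (value < 60 * 1000L) {
            throw new IllegalArgumentException(
                "probing.rebalance.interval.ms must be at least 60000, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // A valid value passes through unchanged.
        System.out.println(validateProbingRebalanceIntervalMs(480_000L));
        // An out-of-range value is rejected instead of quietly accepted.
        try {
            validateProbingRebalanceIntervalMs(0L);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```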
I found this useful while debugging the system test.
Since we can't have zero warmups, we don't need this condition (that I added in #8696)
Fixing a double space we were printing when there was a followup. (It would say "with  followup", with two spaces.)
Expanding De Morgan's law at @cadonna's request (which I also appreciated).
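For reference, the expansion in question is the identity `!(a && b) == (!a || !b)`; this tiny check (not code from the PR) verifies it over all inputs.

```java
// De Morgan's law: !(a && b) is equivalent to (!a || !b).
// Exhaustively check both boolean inputs.
public class DeMorganDemo {
    public static void main(String[] args) {
        for (boolean a : new boolean[]{false, true}) {
            for (boolean b : new boolean[]{false, true}) {
                boolean folded = !(a && b);
                boolean expanded = !a || !b;
                if (folded != expanded) {
                    throw new AssertionError("mismatch for a=" + a + ", b=" + b);
                }
            }
        }
        System.out.println("equivalent");
    }
}
```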
The diff is misaligned. I removed shouldBeStickyForActiveAndStandbyTasksEvenIfNoWarmups and added shouldSkipWarmupsWhenAcceptableLagIsMax.
All these tests erroneously set "max warmups" to zero.
It was handy to be able to see the used config file while debugging.
Added this configuration to fix the flaky StreamsOptimizedTest.test_upgrade_optimized_topology
FWIW: Previously, it would always fail for me within the first one to three tests, the first one more often than not.
Added this test, and verified that it does indeed fail on trunk for the expected reason that the new configs were ignored and defaults were substituted instead.
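The shape of such a regression test can be sketched as follows; the names and plain-map stand-ins are hypothetical, standing in for the real Streams config translation, but they show why the test fails on trunk (the default leaks through) and passes with the fix (the configured value survives).

```java
import java.util.Map;

// Hedged sketch (hypothetical names, plain maps instead of Kafka classes):
// a regression test catches a config that silently falls back to its
// default instead of the value the user configured.
public class ConfigRegressionSketch {
    static final String DEFAULT_MAX_WARMUPS = "2";

    // Stand-in for the buggy translation: forgets to copy the config over.
    static String buggyPassThrough(Map<String, String> configured) {
        return DEFAULT_MAX_WARMUPS;
    }

    // Stand-in for the fixed translation: copies the configured value.
    static String fixedPassThrough(Map<String, String> configured) {
        return configured.getOrDefault("max.warmup.replicas", DEFAULT_MAX_WARMUPS);
    }

    public static void main(String[] args) {
        Map<String, String> configured = Map.of("max.warmup.replicas", "7");
        System.out.println(buggyPassThrough(configured));  // default leaks through
        System.out.println(fixedPassThrough(configured));  // configured value survives
    }
}
```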
Fair enough. I don't think it was meant as a cost saving thing, just to make it easier to understand when something did or did not have caught-up clients. If you find this logic easier to follow, go for it
Nice catch! One nit is that "active" alone is not sufficient for being considered caught-up. Can we rename the active condition to running or activeRunning, etc?
Should we move the handful of AssignorConfiguration related tests from StreamsPartitionAssignorTest to here?
Ah, unfortunately, that test relies on mocking package-private fields from another package. I'll just leave it alone for now.
Trying to avoid piling on even more unrelated questions to the thread below, but there's another config that would need to be passed in, the admin client timeout.
That said, can we just remove it? It only gets the timeout the admin is configured with anyway
Ack, I'll take a look at it.
It does wind up "surviving" into the config that we use in the assignor, but I'm not sure if it's on purpose or just lucky.
I'll postpone any existential questions, and just add it to the regression test.
Ah, it was also tested here: org.apache.kafka.streams.StreamsConfigTest#consumerConfigShouldContainAdminClientConfigsForRetriesAndRetryBackOffMsWithAdminPrefix
I'll open a separate PR to remove the extra timeout config
req: Please add verifications to StreamsConfigTest#consumerConfigMustContainStreamPartitionAssignorConfig()
Nice catch indeed! I agree with @ableegoldman about the renaming. I am in favour of activeRunning or activeAndRunning.
Thanks for the reviews, @ableegoldman and @cadonna! I believe I've addressed all your feedback.

This Java 14 failure is strange; I'm going to try again and see what we see.

Retest this please
I think this comment got lost in a discussion thread, but can we add a note to AssignorConfiguration pointing out that any Streams configs added here will need to be explicitly passed through? It seems like it's too easy to fall into this same trap again
Ah, right. Sure thing!
The test failures were the result of extra tests that were added on trunk after this branch. The branch builder does a merge with trunk before running the tests. I'm rebasing and fixing the tests.

The only failures were:
ableegoldman
left a comment
One comment on the TaskAssignorIntegrationTest but feel free to merge as-is
public static final String TIME = "__time__";

// This is settable in the main Streams config, but it's a private API for testing
public static final String ASSIGNMENT_LISTENER = "__asignment.listener__";
mkEntry(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, "6"),
mkEntry(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, "7"),
mkEntry(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, "480000"),
mkEntry(StreamsConfig.InternalConfig.ASSIGNMENT_LISTENER, configuredAssignmentListener),
Should we also validate that the task assignor gets passed through? We could even pass in a custom assignor and use that to verify the correct assignor configs got passed through.
Of course, reflection black magic-ry is just more fun 🙂
Good idea; I'm not sure why I didn't think to do this already. With my newfound understanding of this config translation logic, I'm confident that it gets copied over right now, because it's not a registered config, but a regression test would be nice.
I'll quickly follow up with a separate PR so that I can merge this one.
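The reviewer's suggestion can be sketched like this; all names here are hypothetical (a stand-in interface and config keys, not the real Kafka Streams internals), but it shows the pattern: configure a custom assignor by class name, instantiate it reflectively, and record the configs it receives so the test can assert on them.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hedged sketch of the suggested test (hypothetical names throughout):
// a recording assignor captures the configs it is handed, so we can
// verify the configured values actually reached it.
public class CustomAssignorSketch {
    interface TaskAssignor {
        void configure(Map<String, Object> configs);
    }

    static final AtomicReference<Map<String, Object>> RECEIVED = new AtomicReference<>();

    public static class RecordingAssignor implements TaskAssignor {
        @Override
        public void configure(Map<String, Object> configs) {
            RECEIVED.set(configs);
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> configs = Map.of(
                "internal.task.assignor.class", RecordingAssignor.class.getName(),
                "max.warmup.replicas", "7");

        // Instantiate the assignor from the configured class name.
        TaskAssignor assignor = (TaskAssignor) Class
                .forName((String) configs.get("internal.task.assignor.class"))
                .getDeclaredConstructor()
                .newInstance();
        assignor.configure(configs);

        // The test would assert the configured value arrived intact.
        System.out.println(RECEIVED.get().get("max.warmup.replicas"));
    }
}
```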
Sounds good. Thanks for the fix!
Also fixes a system test by configuring the HATA (HighAvailabilityTaskAssignor) to perform a one-shot balanced assignment.