KAFKA-10017: fix 2 issues in EosBetaUpgradeIntegrationTest #9733
mjsax merged 5 commits into apache:trunk
Conversation

@mjsax @ableegoldman @guozhangwang , could you help review this PR? Thanks.
mjsax left a comment
Thanks for the PR!
The build result is not available any longer. Where was the InvalidStateStoreException thrown? I am wondering because keysFromInstance should retry internally -- did we exceed the timeout?
We should already have a guard in place for multiple rebalances though, thus I am not sure atm that I understand the root cause of the issue.
@@ -256,8 +261,8 @@ public void shouldUpgradeFromEosAlphaToEosBeta() throws Exception {
    streams2Alpha.cleanUp();
    streams2Alpha.start();
    assignmentListener.waitForNextStableAssignment(MAX_WAIT_TIME_MS);
To avoid the issue you point out, we actually use this line. As a matter of fact, there is no guarantee that there won't be even more than two rebalances. \cc @ableegoldman who might be able to provide more input?
@mjsax
The stacktrace is as follows. The line numbers do not map to the master branch correctly, but you can tell what it is from the method names. It failed when trying to get all state data while checking whether the current stream is in RUNNING state, but it was rebalancing. And this exception won't trigger any retry.
Stacktrace
org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get state store store because the stream thread is PARTITIONS_ASSIGNED, not RUNNING
at org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:81)
at org.apache.kafka.streams.state.internals.WrappingStoreProvider.stores(WrappingStoreProvider.java:50)
at org.apache.kafka.streams.state.internals.CompositeReadOnlyKeyValueStore.all(CompositeReadOnlyKeyValueStore.java:119)
at org.apache.kafka.streams.integration.EosBetaUpgradeIntegrationTest.keysFromInstance(EosBetaUpgradeIntegrationTest.java:1112)
at org.apache.kafka.streams.integration.EosBetaUpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosBeta(EosBetaUpgradeIntegrationTest.java:494)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
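For context, a retry guard along these lines would hide the transient exception (a minimal sketch only; the helper name is hypothetical and the actual retry merged via #9688 may look different):

```java
import java.util.function.Supplier;
import org.apache.kafka.streams.errors.InvalidStateStoreException;

// Sketch of retrying a store query while a rebalance is in flight;
// the real retry merged via #9688 may differ.
private static <T> T queryWithRetry(final long timeoutMs, final Supplier<T> query) throws InterruptedException {
    final long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
        try {
            return query.get();
        } catch (final InvalidStateStoreException e) {
            // store not queryable yet (e.g. thread is PARTITIONS_ASSIGNED); retry until the deadline
            if (System.currentTimeMillis() >= deadline) {
                throw e;
            }
            Thread.sleep(100L);
        }
    }
}
```

The test would then call, e.g., `queryWithRetry(MAX_WAIT_TIME_MS, store::all)` instead of `store.all()` directly.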
And yes, we make sure the rebalance completed before checking the running state, but as I mentioned, there will be 2 rebalances (1 for the new member joining, 1 for the leader re-joining the group while Stable), and we only wait for 1 rebalance to complete, so there might be another rebalance later. The state transition logs of the 2 streams look like this:
stateTransitions1:
[KeyValue(CREATED, REBALANCING), KeyValue(REBALANCING, RUNNING), KeyValue(RUNNING, REBALANCING), KeyValue(REBALANCING, RUNNING)]
stateTransitions2:
[KeyValue(RUNNING, REBALANCING), KeyValue(REBALANCING, RUNNING), KeyValue(RUNNING, REBALANCING), KeyValue(REBALANCING, RUNNING)]
So, as you can see, we might enter the next step while we are still in step 2 (KeyValue(REBALANCING, RUNNING)), with another rebalance coming soon. That's why I wait explicitly for the transition pair [KeyValue(RUNNING, REBALANCING), KeyValue(REBALANCING, RUNNING)], as sketched below.
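For illustration, waiting for that trailing pair could look roughly like this (a sketch only; it assumes `stateTransitions` is the `List<KeyValue<State, State>>` the test already records, and the helper name is mine):

```java
import java.util.List;
import org.apache.kafka.streams.KafkaStreams.State;
import org.apache.kafka.streams.KeyValue;

// Sketch: block until the recorded transitions end with the "stable" pair
// [RUNNING -> REBALANCING, REBALANCING -> RUNNING]; helper name is hypothetical.
private static void waitForStableRunning(final List<KeyValue<State, State>> stateTransitions,
                                         final long timeoutMs) throws InterruptedException {
    final long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
        final int n = stateTransitions.size();
        if (n >= 2
                && stateTransitions.get(n - 2).equals(KeyValue.pair(State.RUNNING, State.REBALANCING))
                && stateTransitions.get(n - 1).equals(KeyValue.pair(State.REBALANCING, State.RUNNING))) {
            return;
        }
        Thread.sleep(100L);
    }
    throw new AssertionError("did not observe [RUNNING -> REBALANCING, REBALANCING -> RUNNING] in time");
}
```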
The line number is not mapped to master branch correctly, but you can know what it is from the method name.

Well, this seems to imply that the run was on an older version that did not contain the retry yet -- note, the retry was added as part of the latest PR for this test that we merged: https://github.com/apache/kafka/pull/9688/files#diff-86a5136ae170df067137442b5eae05fa5fd9d1e02aca85bac8a251b7d2557b0eR1089
Thus, I would assume that the whole test failure you observed was before the last fix.
there will be 2 rebalances
Correct, but we actually "cut off" the unstable rebalances (there might actually be more than 2...) via assignmentListener.waitForNextStableAssignment(MAX_WAIT_TIME_MS);
To me, it seems that you observed the test failure before the PR got merged -- otherwise the line numbers for keysFromInstance would match.
You're right! I didn't notice there is already a newly added retry in master. I'll revert my change for it. Thank you!
// The exception won't cause the test fail since we actually "expected" exception thrown and failed the stream.
// So, log to stderr for debugging when the exception is not what we expected
e.printStackTrace(System.err);
fail("Should only get one uncaught exception from Streams.");
Seems this fail did not work as expected? Otherwise the test would have failed all the time? Maybe we should rather set a boolean flag that we evaluate outside of the callback to let the test fail?
Also, we have one run with zero exceptions and one run with 2 exceptions (one of each exception type) -- not 4. Thus, we need to handle the error-injection run and the "clean run" differently, depending on the boolean test flag.
Seems this fail did not work as expected? Otherwise the test would have failed all the time?

Yes, the fail() throws an AssertionError to fail the stream, but failing the stream is exactly what we expected -- just not with our injected exception. So it won't fail the test.
we need to handle the error-injection run and the "clean run" differently, depending on the boolean test flag
I see, will update it. Thanks.
Well, it seems the AssertionError does not mark the test as failed, because it is thrown on a different thread? We would need to set a boolean flag, and add an assertThat to the main test code to see if the flag fired.
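Something along these lines is what I have in mind (a sketch only; `isInjectedException` is a placeholder for whatever check the test uses to recognize the injected exception):

```java
// Sketch: record unexpected exceptions on a volatile flag inside the callback,
// and assert on the main test thread. isInjectedException is hypothetical.
private volatile boolean hasUnexpectedError = false;

streams.setUncaughtExceptionHandler((final Thread thread, final Throwable throwable) -> {
    if (!isInjectedException(throwable)) {
        // keep the details for debugging, but do not call fail() here --
        // an AssertionError on the stream thread does not fail the test
        throwable.printStackTrace(System.err);
        hasUnexpectedError = true;
    }
});

// ... later, on the main test thread:
assertFalse("Should only get the injected exception from Streams.", hasUnexpectedError);
```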
@mjsax , as mentioned in #9690 (comment), I've updated the commit. What my change is:
So, if we only wait for the state to be RUNNING, the client might enter REBALANCING again later. Note: I only wait for the expected number of [REBALANCING -> RUNNING] transitions on the newly started stream, not on the already running one. Basically, the already running stream thread should have the state change: [RUNNING to PARTITIONS_REVOKED], [PARTITIONS_REVOKED to PARTITIONS_ASSIGNED] (unstable), [PARTITIONS_ASSIGNED to RUNNING], [RUNNING to PARTITIONS_ASSIGNED] (stable), [PARTITIONS_ASSIGNED to RUNNING].
Thank you.
Force-pushed from c032fd3 to ee037dd
private final static String APP_DIR_1 = "appDir1";
private final static String APP_DIR_2 = "appDir2";
private final static String UNEXPECTED_EXCEPTION_MSG = "Fail the test since we got an unexpected exception or" +
    "there are too many exceptions thrown, please check standard error log for more info.";
nit: missing space between or (line above) and there
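For example, just moving a space into the end of the first literal:

```suggestion
    private final static String UNEXPECTED_EXCEPTION_MSG = "Fail the test since we got an unexpected exception or " +
        "there are too many exceptions thrown, please check standard error log for more info.";
```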
private Throwable uncaughtException;

private int testNumber = 0;
private boolean hasUnexpectedError = false;
This variable should be declared volatile because the callback is executed on a different thread.
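For example:

```suggestion
    private volatile boolean hasUnexpectedError = false;
```

volatile guarantees that a write from the stream thread's callback is visible to the main test thread.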
mjsax left a comment
Thanks for updating the PR. I am still not sure that I understand your reasoning. Also, let's back up for a second.
You originally observed a test failure based on missing retries when trying to get the state store. This should be fixed already.
Did you encounter any other test failures? I could not reproduce any test failures locally, so I am still wondering why we would need the changes (besides the fix in the exception callback).
I see that you changed the StableAssignmentListener, but I don't understand why we need this: We first block until we get a stable assignment. For this case, we know that the state must be REBALANCING (because Kafka Streams would transit to REBALANCING before the StableAssignmentListener is called). Thus, after waitForStableAssignment returns, we only need to wait until we transit to state RUNNING -- I don't see any advantage in tracking the exact number of rebalances. Just waiting until the last state is RUNNING seems sufficient.
Do I miss something?
Again: did you observe more test failures that indicate that we need the proposed changes?
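In other words, something as simple as this should do (a sketch; `streams2Alpha` stands in for whichever client we just (re)started):

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.test.TestUtils;

// Sketch of the simpler approach: after the stable assignment, just poll
// until the client reports RUNNING -- no need to count rebalances.
assignmentListener.waitForNextStableAssignment(MAX_WAIT_TIME_MS);
TestUtils.waitForCondition(
    () -> streams2Alpha.state() == KafkaStreams.State.RUNNING,
    MAX_WAIT_TIME_MS,
    "Client did not transit to RUNNING after the stable assignment"
);
```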
@mjsax , thanks for your comments. Let me answer your questions below.

You originally observed a test failure based on missing retries when trying to get the state store. This should be fixed already.
--> Right! The failure when trying to get the state store is fixed now.

Did you encounter any other test failures?
--> Yes, I observed test failures on your PR to 2.6 (i.e. #9690 (comment)); this test failure should also happen in trunk.

Just waiting until the last state is RUNNING seems sufficient. Do I miss something?
--> No, that's not what I observed. What I observed is that the client can already transit to RUNNING after the first, "unstable" rebalance completes, and then rebalance again shortly after. And that's the reason why the PR to 2.6 #9690 keeps encountering test failures: we're still under the "unstable" assignment (that is, we just completed the first rebalance) when the test moves on. Also, I think we should apply this fix to the PR #9690 , too, to fix the failed test issue.

Thank you. Happy Holidays~
Thanks for the input! The test behavior is slightly different in 2.6.
assignmentListener.waitForNextStableAssignment(MAX_WAIT_TIME_MS);
expectedNumAssignments = assignmentListener.numTotalAssignments() - prevNumAssignments;
waitForNumRebalancingToRunning(stateTransitions2, expectedNumAssignments);
waitForRunning(stateTransitions1);
If your reasoning is right, do we need to use waitForNumRebalancingToRunning here, too? If there are multiple rebalances, both clients would participate?
Also, your PR does not update all stages when we wait for state changes. Why? Would we not need to apply the logic each time?
Thanks for your questions. I'll answer them below:

Also, your PR does not update all stages when we wait for state changes. Why? Would we not need to apply the logic each time?
--> Yes, I should do that. For now I focused only on the case of 1 newly started stream + 1 already running stream, which produces the most failures here. But you're right, I should apply the change in all stages.
If your reasoning is right, do we need to use waitForNumRebalancingToRunning here, too? If there are multiple rebalances, both clients would participate?
--> Good question! I explained this in a previous comment (#9733 (comment)), but I can explain again, since I know it wasn't clear before exactly why I made this change.
I only use waitForNumRebalancingToRunning for the newly started stream, not for the "already running stream", because I found that sometimes, if the stream runs fast enough, the "already running stream" might not record the expected number of [REBALANCING -> RUNNING] state transitions. The reason is this line:
[appDir2-StreamThread-1] State transition from PARTITIONS_ASSIGNED to PARTITIONS_ASSIGNED
Basically, the already running stream thread should have the state changes: [RUNNING to PARTITIONS_REVOKED], [PARTITIONS_REVOKED to PARTITIONS_ASSIGNED] (unstable), [PARTITIONS_ASSIGNED to RUNNING], [RUNNING to PARTITIONS_ASSIGNED] (stable), [PARTITIONS_ASSIGNED to RUNNING]. Because it needs one more PARTITIONS_REVOKED step, it might go through 2 consecutive PARTITIONS_ASSIGNED states (no RUNNING in between). And that's why the client doesn't transit to RUNNING as often as we expect.
Does that make sense?
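For completeness, here is roughly what I mean by waitForNumRebalancingToRunning (a sketch of the idea, not the exact code in the PR):

```java
import java.util.List;
import org.apache.kafka.streams.KafkaStreams.State;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.test.TestUtils;

// Sketch: wait until the client has recorded the expected number of
// REBALANCING -> RUNNING transitions; illustrative only.
private static void waitForNumRebalancingToRunning(final List<KeyValue<State, State>> stateTransitions,
                                                   final int expectedNum) throws InterruptedException {
    TestUtils.waitForCondition(
        () -> stateTransitions.stream()
                .filter(t -> t.key == State.REBALANCING && t.value == State.RUNNING)
                .count() >= expectedNum,
        MAX_WAIT_TIME_MS,
        "did not observe " + expectedNum + " REBALANCING -> RUNNING transitions"
    );
}
```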
@showuon Sorry for the late response. Pretty busy atm. We found that the fix for the exception handler contained in this PR is also blocking #9720 (\cc @wcarlson5) -- could you extract this fix into a separate PR that we can merge quickly, to unblock 9720? Otherwise we might need to extract your fix into 9720 to be able to move forward with it. Hope to cycle back to this PR soon. Thanks.
OK, working on it now.
Force-pushed from e5cac2d to f2f0967
@mjsax , I've reverted my change for the state change. Now, only the exception handler improvement remains. As for the state change fix: after the holidays, I think that if the test is not flaky now, we shouldn't change it and risk making it unreliable again. So, my thought is, we keep monitoring this test, and if it fails again, we can discuss it again. What do you think?
There's a Jenkins timeout build error in jdk8. I made a minor change to trigger the Jenkins build again, to make sure it's an infra issue, not my change.
All tests passed.
Thanks for updating the PR!
Agreed. We can always do a follow-up if needed.
The PR follows #9688, to make the EosBetaUpgradeIntegrationTest more stable. Fixed some issues:
https://ci-builds.apache.org/job/Kafka/job/kafka-trunk-jdk8/274/testReport/org.apache.kafka.streams.integration/EosBetaUpgradeIntegrationTest/shouldUpgradeFromEosAlphaToEosBeta_true_/
After investigation, I found it's because we actually rebalance twice after the new stream started: 1 for the new member joining, 1 for the leader re-joining the group during Stable. So, if we only wait for the state to be RUNNING, it might enter the REBALANCING state later, which causes us to fail to get the store successfully.

Committer Checklist (excluded from commit message)