SAMZA-1832: Fix race condition between StreamProcessor and SamzaContainerListener #660
Conversation
@prateekm @nickpan47 @shanthoosh please take a look.
```java
} else {
  LOGGER.info("Ignoring onNewJobModel invocation since the current state is {} and not {}.", state, State.IN_REBALANCE);
}

if (state.get() == IN_REBALANCE) {
```
Please correct me if I'm wrong here.

There are three actors that can update the StreamProcessor state: the DebounceThread, the UserThread, and the ContainerThread.

Scenario: the UserThread is trying to stop the StreamProcessor (`StreamProcessor.stop()`) while the DebounceThread is triggering `onNewJobModel` at the same time (say the StreamProcessor state is IN_REBALANCE):

Time 0: The debounce thread checks the StreamProcessor state and goes past the if guard in `onNewJobModel`.
Time 1: The user thread invokes `StateTransitionUtil.compareNotInAndSet`; that check passes as well, so the user thread proceeds and sets the StreamProcessor state to STOPPING.
Time 2: The user thread stops the container.
Time 3: The debounce thread recreates the container and launches it (most importantly, it updates the StreamProcessor state variables).
Time 4: The user thread stops the coordinator.
Time 5: The container callback for the container stopped at time 2 comes back and counts down the container shutdown latch.
Time 6: The `Coordinator.onFailure()` / `coordinatorListener.onStop()` callback triggered for the coordinator stopped at time 4 will not stop the container, since the current StreamProcessor state is STOPPING and the container shutdown latch value is zero.

After `StreamProcessor.stop()` returns with the StreamProcessor state set to STOPPED, we will still have a SamzaContainer running in the JVM (we could end up in a similar state via `containerListener.afterStop()` and `containerListener.afterFailure()` as well). This interleaved execution sequence could occur with different combinations of the StreamProcessor public methods, essentially resulting in state corruption. It is not sufficient to check a single state variable and then update other state in StreamProcessor.

Serialized execution in StreamProcessor (with explicit locks) would help us avoid getting into the above state and would keep things simple and easy to understand, as in the sketch below.
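A minimal sketch of what that serialized execution could look like, assuming a hypothetical `lifecycleLock` guarding all lifecycle transitions (the field, method bodies, and states here are illustrative, not the actual StreamProcessor code):

```java
import java.util.concurrent.locks.ReentrantLock;

public class StreamProcessorSketch {
  enum State { IN_REBALANCE, RUNNING, STOPPING, STOPPED }

  // Hypothetical: one lock serializing all lifecycle transitions, so the
  // debounce, user, and container threads can never interleave mid-transition.
  private final ReentrantLock lifecycleLock = new ReentrantLock();
  private State state = State.IN_REBALANCE;

  public void onNewJobModel() {
    lifecycleLock.lock();
    try {
      // The state check and the container restart now happen atomically:
      // a concurrent stop() must wait until this whole block completes.
      if (state == State.IN_REBALANCE) {
        // recreate and launch the container ...
        state = State.RUNNING;
      }
    } finally {
      lifecycleLock.unlock();
    }
  }

  public void stop() {
    lifecycleLock.lock();
    try {
      if (state != State.STOPPING && state != State.STOPPED) {
        state = State.STOPPING;
        // stop the container, then the coordinator ...
        state = State.STOPPED;
      }
    } finally {
      lifecycleLock.unlock();
    }
  }
}
```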
```java
}
processorListener.afterFailure(throwable);

if (state.get() == STOPPING) {
```
Just curious.

From the JIRA, it looks like the container status is set prematurely to failed or stopped while the container shutdown is still in progress. If incorrect container state is the actual cause, why didn't we go the route of fixing that directly? How does adding a new StreamProcessor state and changing the locking mechanism fix the problem?

Can you please share the rationale?
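For context, the failure mode the JIRA describes looks roughly like this (a simplified sketch, not the actual SamzaContainer code):

```java
// Simplified sketch of the reported problem; names and structure are
// illustrative, not the real SamzaContainer implementation.
class ContainerSketch {
  enum Status { RUNNING, FAILED, STOPPED }
  private volatile Status status = Status.RUNNING;

  void run() {
    try {
      // ... run loop ...
    } catch (Exception e) {
      // Status flips to FAILED here, *before* the shutdown sequence below
      // finishes, so hasStopped() already returns true while shutdown is
      // still in progress.
      status = Status.FAILED;
      shutdownTasksAndProducers();
    }
  }

  boolean hasStopped() {
    // Per the PR description: true for either FAILED or STOPPED.
    return status == Status.FAILED || status == Status.STOPPED;
  }

  private void shutdownTasksAndProducers() { /* ... */ }
}
```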
```java
 *
 * @return true if current state is not in the blacklisted values and transition was successful; false otherwise
 */
public static <T> boolean compareNotInAndSet(AtomicReference<T> reference, Collection<T> blacklistedValues, T update) {
```
Just curious.

Though it looks like we use an atomic reference in StreamProcessor, we lock implicitly in StateTransitionUtil. From what I can understand, this patch replaces a synchronized lock with a busy-spinning lock for performing state transitions on an object.

My concern with the spinlock is that it wastes CPU cycles while waiting to acquire the lock. Can you please share the value-add and benefits we get by replacing the synchronized lock with the spinning lock? If we need to lock, why can't we use an explicit lock?
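For reference, a CAS-based implementation matching the signature quoted above might look like the following sketch (my reconstruction, not necessarily the patch's exact code); the retry loop is the busy spin in question:

```java
import java.util.Collection;
import java.util.concurrent.atomic.AtomicReference;

final class StateTransitionUtilSketch {
  /**
   * Atomically sets {@code reference} to {@code update} unless its current
   * value is one of {@code blacklistedValues}. Retries on CAS failure, which
   * is where the spinning happens under contention.
   */
  static <T> boolean compareNotInAndSet(AtomicReference<T> reference,
                                        Collection<T> blacklistedValues,
                                        T update) {
    while (true) {
      T current = reference.get();
      if (blacklistedValues.contains(current)) {
        return false; // current state is blacklisted; no transition
      }
      if (reference.compareAndSet(current, update)) {
        return true; // transition succeeded
      }
      // Another thread changed the state between get() and the CAS; retry.
    }
  }
}
```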
Closing the PR since we have decided to pursue a simpler point fix w/ #673
The PR addresses the following issues:

- Race condition between `SamzaContainerListener` and `StreamProcessor` that results in an incorrect application status being propagated to the client. `SamzaContainer` sets its status flag to failed as soon as it runs into an exception in the run loop. This is problematic because it is possible for `StreamProcessor` to assume the container has shut down correctly even though the container is still in the shutdown sequence. `StreamProcessor` uses `container.hasStopped()` to determine the shutdown status of the container; internally, this method returns true if the status is either `FAILED` or `STOPPED`.
- `StreamProcessor` shutdown in others.
- `processorListener` when containerException is not null. It is possible for the Samza container to take longer than `task.shutdown.ms` to shut down, in which case we need to propagate a timeout exception to the `processorListener` as opposed to assuming the shutdown was successful. A sketch of that timeout handling follows this list.
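A sketch of the timeout handling the last point describes, assuming a hypothetical shutdown latch and a minimal listener interface (the class and parameter names are illustrative, not the actual StreamProcessor wiring):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: ShutdownSketch and ProcessorListener are hypothetical.
class ShutdownSketch {
  interface ProcessorListener {
    void afterStop();
    void afterFailure(Throwable t);
  }

  void awaitContainerShutdown(CountDownLatch containerShutdownLatch,
                              long taskShutdownMs,
                              ProcessorListener processorListener,
                              Throwable containerException) throws InterruptedException {
    // Wait at most task.shutdown.ms for the container to finish its shutdown
    // sequence instead of assuming hasStopped() implies a completed shutdown.
    if (!containerShutdownLatch.await(taskShutdownMs, TimeUnit.MILLISECONDS)) {
      // Shutdown did not complete in time: surface a timeout instead of
      // reporting a clean stop.
      processorListener.afterFailure(
          new TimeoutException("Container shutdown did not complete within " + taskShutdownMs + " ms"));
    } else if (containerException != null) {
      // Shutdown completed, but the container hit an exception: fail.
      processorListener.afterFailure(containerException);
    } else {
      processorListener.afterStop();
    }
  }
}
```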
SamzaContainerListenerandStreamProcessorthat results in incorrect application status being propagated to clientSamzaContainersets the status flag to failed as soon as it runs into an exception in the runloop. This is problematic because it is possible forStreamProcessorto assume the container has shutdown correctly even though container is still in the shutdown sequence.StreamProcessorusescontainer.hasStopped()to determine the shutdown status of the container. Internally, this method returns true if the status is eitherFAILEDorSTOPPED.StreamProcessorshutdown in others.processorListenerwhen containerException is not null. It is possible for the samza container to take longer thantask.shutdown.msto shutdown in which case, we need to propagate a timeout exception to theprocessorListeneras opposed assuming the shutdown was successful.