@mynameborat mynameborat commented Nov 7, 2019

Symptom: Duplicate processing, inconsistent checkpoints for inputs, and inconsistent changelog state.
Cause: A bug in the state machine inside the stream processor can leave processors running containers with an old job model version after rebalances in the Standalone deployment model.
Fix: We interrupt the container and wait for it to shut down gracefully within a timeout (task.shutdown.ms), and fail the stream processor if the container doesn't shut down within that timeout.
Tests: Added unit tests for StreamProcessor and SamzaContainer. Working on integration tests.
API Changes: The restore methods on TaskRestoreManager and StorageEngine now throw InterruptedException. Please refer to the Javadocs for additional implementation notes.
Upgrade Instructions: Standalone jobs that rely on an external monitoring service to restart the application should follow the External monitoring section in the document below to tune the debounce time to account for monitoring service latency and container startup time.
Usage Instructions: None
More details about the bug can be found here.
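
For readers who want a concrete picture of the fix described above, here is a minimal, self-contained sketch of the "interrupt, then wait with a timeout" pattern. It is not the code from this PR: ContainerShutdownSketch, FakeContainer, startDaemon, and interruptAndAwait are hypothetical names invented for illustration, and the shutdownMs parameter merely plays the role that task.shutdown.ms plays in the description. A container that honors the interrupt stops in time. One that swallows it does not, which is the case where the stream processor would be failed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class ContainerShutdownSketch {

    /** Hypothetical stand-in for a container whose run loop does blocking work (e.g. a state restore). */
    static final class FakeContainer implements Runnable {
        private final CountDownLatch shutdownLatch = new CountDownLatch(1);
        private final boolean honorsInterrupt;

        FakeContainer(boolean honorsInterrupt) {
            this.honorsInterrupt = honorsInterrupt;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    try {
                        Thread.sleep(50); // blocking call that throws InterruptedException
                    } catch (InterruptedException e) {
                        if (honorsInterrupt) {
                            Thread.currentThread().interrupt(); // restore the interrupt flag
                            return; // leave the run loop; the finally block signals shutdown
                        }
                        // A misbehaving container swallows the interrupt and keeps running.
                    }
                }
            } finally {
                shutdownLatch.countDown();
            }
        }

        /** Returns true if the run loop exited within the timeout. */
        boolean awaitShutdown(long timeoutMs) throws InterruptedException {
            return shutdownLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    /** Starts the container on a daemon thread so a stuck container cannot keep the JVM alive. */
    static Thread startDaemon(FakeContainer container) {
        Thread thread = new Thread(container, "fake-container");
        thread.setDaemon(true);
        thread.start();
        return thread;
    }

    /** Interrupts the container thread and waits up to shutdownMs; true means it stopped in time. */
    static boolean interruptAndAwait(Thread containerThread, FakeContainer container, long shutdownMs)
            throws InterruptedException {
        containerThread.interrupt();
        return container.awaitShutdown(shutdownMs);
    }

    public static void main(String[] args) throws InterruptedException {
        FakeContainer wellBehaved = new FakeContainer(true);
        boolean stopped = interruptAndAwait(startDaemon(wellBehaved), wellBehaved, 2000);
        System.out.println("well-behaved container stopped in time: " + stopped);

        FakeContainer stuck = new FakeContainer(false);
        boolean stuckStopped = interruptAndAwait(startDaemon(stuck), stuck, 200);
        if (!stuckStopped) {
            // In the PR's terms, this is where the stream processor would be failed
            // rather than starting a new container alongside the old one.
            System.out.println("stuck container did not stop within the timeout");
        }
    }
}
```

The sketch also hints at why restore methods declaring InterruptedException matters: the interrupt only takes effect if the blocking code propagates it instead of swallowing it.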

@mynameborat mynameborat changed the title SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance [WIP] SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance Nov 7, 2019

@dnishimura dnishimura left a comment

Thanks for this important fix! Just a few comments and questions.

@dnishimura dnishimura left a comment

LGTM! Thanks for the fix.

@mynameborat mynameborat changed the title [WIP] SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance Nov 13, 2019

@cameronlee314 cameronlee314 left a comment

My overall feedback as a separate reader:
Using InterruptedException seems like a somewhat "implicit" way of handling this case, so it's a bit hard to validate that all places have been updated correctly (and it might be hard to follow if anyone else has to update this flow in the future). I see that you have done good analysis of the alternative options, so I think it is reasonable to stick with this strategy overall, but if there is anything further you can think of to make it as explicit as possible, that would be helpful. You have already added some good things (e.g. updating the method signatures with throws InterruptedException, and the Javadocs), but I'm not sure whether there is anything else you can do.


@cameronlee314 cameronlee314 left a comment

Please update the description notes to reflect the API change for the InterruptedException.

@cameronlee314 cameronlee314 self-requested a review December 10, 2019 19:45
@mynameborat mynameborat merged commit 0436528 into apache:master Dec 10, 2019
rmatharu-zz pushed a commit to rmatharu-zz/samza that referenced this pull request Jan 21, 2020
…ped during a rebalance (apache#1213)

Stream processor should ensure previous container is stopped during a rebalance
lhaiesp pushed a commit that referenced this pull request Feb 13, 2020
…ped during a rebalance (#1213)

Stream processor should ensure previous container is stopped during a rebalance
@mynameborat mynameborat deleted the SAMZA-2305 branch March 7, 2020 06:25
5 participants