SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance #1213

mynameborat · 2019-11-07T01:02:44Z

Symptom: Duplicate processing, inconsistent checkpoints for inputs, and inconsistent changelog state.
Cause: A bug in the state machine inside the stream processor can leave processors running containers with an old job model version after a rebalance in the Standalone deployment model.
Fix: We interrupt the container and wait for it to shut down gracefully within a timeout (task.shutdown.ms). If the container doesn't shut down within the timeout, we fail the stream processor.
Tests: Added unit tests for StreamProcessor and SamzaContainer. Working on integration tests.
API Changes: The restore methods on TaskRestoreManager and StorageEngine now throw InterruptedException. Please refer to the Javadocs for additional implementation notes.
Upgrade Instructions: Standalone jobs that rely on an external monitoring service to restart the application should follow the External monitoring section in the document below to tune the debounce time to account for monitoring service latency and container startup time.
Usage Instructions: None
More details about the bug can be found here.
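
For readers following the Fix line above, the interrupt-then-wait pattern can be sketched roughly as follows. This is a minimal illustration and not the StreamProcessor code from this PR: the class `ContainerShutdownSketch`, its methods, and the stand-in "container" threads are all hypothetical, and the timeout argument merely plays the role that task.shutdown.ms plays in the description.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of "interrupt, then wait with a timeout"; not Samza code. */
final class ContainerShutdownSketch {

  /**
   * Interrupts the thread running the old container and waits up to timeoutMs
   * for it to exit. timeoutMs must be positive, since join(0) waits forever.
   * Returns true if the thread stopped in time, false otherwise.
   */
  static boolean interruptAndAwait(Thread containerThread, long timeoutMs) {
    containerThread.interrupt();
    try {
      containerThread.join(timeoutMs);
    } catch (InterruptedException e) {
      // Preserve the caller's interrupt status instead of swallowing it.
      Thread.currentThread().interrupt();
    }
    return !containerThread.isAlive();
  }

  /** A stand-in container that reacts to interruption and exits promptly. */
  static boolean demoGracefulStop(long timeoutMs) {
    Thread container = new Thread(() -> {
      try {
        Thread.sleep(Long.MAX_VALUE); // stand-in for a blocking run loop or restore
      } catch (InterruptedException e) {
        // Interrupted: fall through and let the thread exit.
      }
    }, "old-container");
    container.setDaemon(true);
    container.start();
    return interruptAndAwait(container, timeoutMs);
  }

  /** A stand-in container that ignores interrupts, so the wait times out. */
  static boolean demoStuckContainer(long timeoutMs) {
    AtomicBoolean release = new AtomicBoolean(false);
    Thread container = new Thread(() -> {
      while (!release.get()) {
        try {
          Thread.sleep(10);
        } catch (InterruptedException ignored) {
          // Deliberately swallows the interrupt to simulate a misbehaving container.
        }
      }
    }, "stuck-container");
    container.setDaemon(true);
    container.start();
    boolean stopped = interruptAndAwait(container, timeoutMs);
    release.set(true); // let the demo thread finish after the check
    return stopped;
  }
}
```

In this sketch, a `false` result corresponds to the case where the description says the stream processor is failed rather than allowed to start a new container next to the old one.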

dnishimura

Thanks for this important fix! Just a few comments and questions.

samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala

samza-core/src/main/scala/org/apache/samza/system/SystemProducers.scala

samza-core/src/main/java/org/apache/samza/processor/StreamProcessor.java

dnishimura

LGTM! Thanks for the fix.

samza-core/src/main/java/org/apache/samza/storage/StorageRecovery.java

samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala

samza-core/src/main/java/org/apache/samza/processor/StreamProcessor.java

cameronlee314

My overall feedback as a separate reader:
It seems like using InterruptedExceptions is kind of an "implicit" way of handling this case, so it's a bit hard to validate that all places have been updated correctly (and might be a bit hard to follow if anyone else has to update this flow in the future). I see that you have done good analysis on alternative options, so I think it is reasonable to stick with this strategy overall, but if there is anything further you can think of to make it as explicit as possible, it might be helpful. You have already added some good things (i.e. update method sig with throws InterruptedException, javadocs), but I'm not sure if there is anything else you can do.
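
One way to picture the "explicit" end of this trade-off is a restore loop that checks the interrupt status itself and declares `throws InterruptedException` in its signature. The sketch below is only an assumption about the general shape: `InterruptibleRestoreSketch` and its methods are hypothetical, the changelog is modeled as a plain iterator of strings, and the real StorageEngine and TaskRestoreManager signatures are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of an interruptible restore loop; not the Samza API. */
final class InterruptibleRestoreSketch {

  /**
   * Replays changelog entries into the store, checking for interruption
   * before each entry. Thread.interrupted() clears the flag, which matches
   * the usual convention when an InterruptedException is thrown.
   */
  static int restore(Iterator<String> changelog, List<String> store)
      throws InterruptedException {
    int restored = 0;
    while (changelog.hasNext()) {
      if (Thread.interrupted()) {
        throw new InterruptedException(
            "restore interrupted after " + restored + " entries");
      }
      store.add(changelog.next());
      restored++;
    }
    return restored;
  }

  /** Restores three entries with no interruption; returns -1 if interrupted. */
  static int demoUninterruptedRestore() {
    List<String> store = new ArrayList<>();
    try {
      return restore(Arrays.asList("k1", "k2", "k3").iterator(), store);
    } catch (InterruptedException e) {
      return -1;
    }
  }

  /** Sets the interrupt flag first, so restore should stop before any write. */
  static boolean demoInterruptedRestore() {
    List<String> store = new ArrayList<>();
    Thread.currentThread().interrupt();
    try {
      restore(Arrays.asList("k1", "k2").iterator(), store);
      return false; // restore unexpectedly ran to completion
    } catch (InterruptedException e) {
      return store.isEmpty();
    }
  }
}
```

Because the checked exception appears in the signature, every caller has to either propagate it or handle it, which is the compile-time visibility the comment above is asking about.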

samza-core/src/main/java/org/apache/samza/processor/StreamProcessor.java

samza-api/src/main/java/org/apache/samza/storage/StorageEngine.java

samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala

samza-core/src/main/scala/org/apache/samza/storage/ContainerStorageManager.java

samza-core/src/test/java/org/apache/samza/processor/TestStreamProcessor.java

cameronlee314

Please update the description notes to reflect the API change for the InterruptedException.

samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngine.scala

samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStorageEngine.scala

samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngine.scala

mynameborat changed the title ~~SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance~~ [WIP] SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance Nov 7, 2019

dnishimura suggested changes Nov 7, 2019

dnishimura approved these changes Nov 13, 2019

mynameborat changed the title ~~[WIP] SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance~~ SAMZA-2305: Stream processor should ensure previous container is stopped during a rebalance Nov 13, 2019

sborya reviewed Nov 18, 2019

sborya approved these changes Nov 20, 2019

cameronlee314 reviewed Nov 20, 2019

mynameborat added 5 commits December 9, 2019 00:05

SAMZA-2305: Stream processor should ensure previous container is stop…

80398b2

…ped during a rebalance

minor doc fixes

cb4f1bc

Address Dan's feedback

243b848

Address review comments

42b2a9a

Address Cameron's comments

15a56bf

mynameborat force-pushed the SAMZA-2305 branch from 59530a4 to 15a56bf Compare December 9, 2019 08:06

Removed assertion on isShutdown call on executor service

798325d

cameronlee314 reviewed Dec 10, 2019

samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngine.scala

Add unit test for storage engine restore

67c116d

cameronlee314 approved these changes Dec 10, 2019

samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStorageEngine.scala

Address review comments

6751b53

cameronlee314 self-requested a review December 10, 2019 19:45

cameronlee314 reviewed Dec 10, 2019

samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngine.scala

cameronlee314 approved these changes Dec 10, 2019

mynameborat merged commit 0436528 into apache:master Dec 10, 2019

lhaiesp pushed a commit that referenced this pull request Feb 13, 2020

SAMZA-2305: Stream processor should ensure previous container is stop…

27aa391

…ped during a rebalance (#1213) Stream processor should ensure previous container is stopped during a rebalance

mynameborat deleted the SAMZA-2305 branch March 7, 2020 06:25
