Conversation


@shanthoosh shanthoosh commented Jun 14, 2018

In the standalone model, a processor can leave and join the group at any point in time. This processor reshuffle, referred to as rebalancing, results in task (work) redistribution among the other available, live processors in the group.

Processor rebalancing in the existing standalone integration tests (JUnit tests) is accomplished through clean shutdown of the processors. However, in real production scenarios, processor rebalancing is typically triggered by unclean shutdown or full garbage collection (GC) of a processor.

To cover those scenarios, this patch adds the following integration tests:

  1. Force killing the leader processor of the group.
  2. Force killing a single follower in the group.
  3. Force killing multiple followers in the group.
  4. Force killing the leader and a follower in the group.
  5. Suspending and resuming the leader of the group.

Since the existing standalone integration tests cover event consumption/production after the rebalancing phase, these new tests only verify the coordination. We'll iterate on this initial suite and add tests whenever necessary.
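
For context, the kind of unclean shutdown and suspension these tests exercise can be simulated from a test harness with POSIX signals. The sketch below is illustrative only; the function names are hypothetical and are not the helpers added in this patch:

import os
import signal

def force_kill_processor(pid):
    ## Simulates an unclean shutdown: SIGKILL gives the processor no chance to leave the group cleanly.
    os.kill(pid, signal.SIGKILL)

def suspend_processor(pid):
    ## Simulates a long pause (e.g. a full GC): SIGSTOP freezes the process so it stops heartbeating.
    os.kill(pid, signal.SIGSTOP)

def resume_processor(pid):
    ## Resumes a previously suspended processor.
    os.kill(pid, signal.SIGCONT)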

@shanthoosh
Contributor Author

@vjagadish1989 @xinyuiscool
Please take a look when you get a chance.

@shanthoosh shanthoosh force-pushed the standalone_failure_tests branch from 1bdb1d5 to 8db964f on June 15, 2018 02:40

shanthoosh commented Jun 15, 2018

Verification:

i=0
while [ $i -lt 150 ]; do
    i=$((i + 1))
    echo "Run $i"
    ./bin/integration-tests.sh /tmp/samza-tests/ standalone-integration-tests --nopassword >> ~/test-logs-runTests-1_26
done

Result:

grep -i 'passed' ~/test-logs-runTests-1_26 | wc -l 
750

grep -i 'failed' ~/test-logs-runTests-1_26 | wc -l 
0

Though the tests were run 150 times, the result contains 750 'passed' lines, since each run prints 'passed' five times (once per test), like the following:

2018-06-15 10:28:30,870 zopkio.test_runner [INFO] test_kill_leader----passed
2018-06-15 10:28:30,870 zopkio.test_runner [INFO] test_kill_multiple_followers----passed
2018-06-15 10:28:30,870 zopkio.test_runner [INFO] test_kill_leader_and_a_follower----passed
2018-06-15 10:28:30,870 zopkio.test_runner [INFO] test_kill_one_follower----passed
2018-06-15 10:28:30,871 zopkio.test_runner [INFO] test_pause_resume_leader----passed


@xinyuiscool xinyuiscool left a comment


Minor suggestions.

*
* This runner class is built for standalone failure tests and is not recommended for general use.
*/
public class LocalApplicationRunnerMain {
Contributor

Please be specific about the application name and add 'Test' to it.

Contributor Author

Done.

/**
* Acts as a pass-through filter for all the events from an input stream.
*/
public class PassThroughStreamApplication implements StreamApplication {
Contributor

Please use a real app name, like StandaloneIntegrationTestKafkaApp or something

Contributor Author

Done.


logger = logging.getLogger(__name__)

class StreamProcessor:
Contributor

Rename to StreamProcessorLauncher?

job_model = zk_client.get_latest_job_model()
for processor_id, deployer in processors.iteritems():
    assert processor_id in job_model['containers'], 'Processor id: {0} does not exist in JobModel: {1}.'.format(processor_id, job_model)
assert leader_processor_id not in job_model['containers'], 'Leader processor: {0} exists in JobModel: {1}.'.format(leader_processor_id, job_model)
Contributor

Check also that we have a new leader id.
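
For illustration, that check might look like the following sketch, reusing the zk_client helper used elsewhere in these tests:

new_leader_processor_id = zk_client.get_leader_processor_id()
assert new_leader_processor_id is not None, 'No leader was elected after the old leader was killed.'
assert new_leader_processor_id != leader_processor_id, 'Leader did not change after killing the old leader: {0}.'.format(leader_processor_id)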

Contributor Author

Done.


job_model = zk_client.get_latest_job_model()
for processor_id, deployer in processors.iteritems():
    assert processor_id in job_model['containers'], 'Processor id: {0} does not exist in JobModel: {1}.'.format(processor_id, job_model)
Contributor

check the leader is the same.
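
A sketch of that check, again assuming the zk_client helpers used in this suite:

assert zk_client.get_leader_processor_id() == leader_processor_id, 'Leader changed unexpectedly; expected it to remain {0}.'.format(leader_processor_id)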

Contributor Author

Done.


@vjagadish1989 vjagadish1989 left a comment


approved, thanks!

import urllib
import os

TEST_INPUT_TOPIC = 'standaloneIntegrationTestKafkaInputTopic'
Contributor

nit: use underscores instead of camel-case for Kafka topics
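
For example, the topic constants could become (the exact names here are illustrative):

TEST_INPUT_TOPIC = 'standalone_integration_test_kafka_input_topic'
TEST_OUTPUT_TOPIC = 'standalone_integration_test_kafka_output_topic'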

Contributor Author

Done.

output, err = p.communicate()
logger.info("Output from delete kafka topic: {0}\nstdout: {1}\nstderr: {2}".format(topic_name, output, err))

def setup_suite():
Contributor

nit: move setup_suite to the top

Contributor Author

Done.

## Create input and output topics.
for topic in [TEST_INPUT_TOPIC, TEST_OUTPUT_TOPIC]:
    logger.info("Creating topic: {0}.".format(topic))
    _create_kafka_topic('localhost:2181', topic, 3, 1)
Contributor

We should start with a "clean slate" each time we start the suite and not rely on teardown being invoked reliably. For example, you could clear the Kafka directory during startup or use a unique topic name.
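
A minimal sketch of such a cleanup at the top of setup_suite; the data directories listed here are assumptions, since the actual paths depend on the Kafka/ZooKeeper configs the suite uses:

import os
import shutil

## Assumed locations of the local broker/ZooKeeper data; adjust to the paths the suite actually configures.
DATA_DIRS = ['/tmp/kafka-logs', '/tmp/zookeeper']

def _clean_data_dirs():
    ## Remove any state left behind by a previous run so the suite starts from a clean slate.
    for data_dir in DATA_DIRS:
        if os.path.exists(data_dir):
            shutil.rmtree(data_dir)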

Contributor Author

Makes sense. Clearing the Kafka directory during startup.

pids = []
if len(full_output) > 0:
    pids = [int(pid_str) for pid_str in full_output.split('\n') if pid_str.isdigit()]
return pids
Contributor

Does this return a single "pid" or a list of "pid"s? If there is one pid per "StreamProcessor", it would be simpler to return a single value here.
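
One possible shape of that simplification, as a sketch (the helper name is hypothetical; full_output is the ps output already used by this function):

def _parse_single_pid(full_output, process_name):
    ## Each StreamProcessor runs as a single JVM process, so at most one pid is expected.
    pids = [int(pid_str) for pid_str in full_output.split('\n') if pid_str.isdigit()]
    assert len(pids) <= 1, 'Expected at most one pid for {0}, found: {1}.'.format(process_name, pids)
    return pids[0] if pids else None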

event.wait(GROUP_COORDINATION_TIMEOUT * 2)

job_model = zk_client.get_latest_job_model()
for processor_id, deployer in processors.iteritems():
Contributor

you get better encapsulation by wrapping this up into zk_client.get_processors() instead of duplicating this here.
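
A sketch of what that helper on the ZooKeeper client could look like (the method name comes from the comment above; the JobModel layout matches the assertions in these tests):

def get_processors(self):
    ## Returns the processor ids present in the containers of the latest JobModel.
    job_model = self.get_latest_job_model()
    return job_model['containers'].keys()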

Contributor Author

Done.

__pump_messages_into_input_topic()
processors = __setup_processors()

leader_processor_id = zk_client.get_leader_processor_id()
Contributor

I'd love to validate the following (see the sketch below):
1. Before: there are 3 processors in the old JobModel; after killing one follower, there are 2 processors in the new JobModel.
2. There should be no unassigned partition. We should ensure this invariant always holds.
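
A sketch of the first check, using the 'containers' field already asserted on in these tests (the processor counts mirror the comment above):

job_model_before = zk_client.get_latest_job_model()
assert len(job_model_before['containers']) == 3, 'Expected 3 processors in the JobModel before the kill.'

## ... force kill one follower and wait for the rebalance to settle ...

job_model_after = zk_client.get_latest_job_model()
assert len(job_model_after['containers']) == 2, 'Expected 2 processors in the JobModel after the kill.'

The no-unassigned-partition check would additionally walk the task assignments nested inside each container entry; that part of the JobModel layout isn't shown in this diff, so it is left out of the sketch.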

Contributor Author

Agree. Added both assertions in all of the integration tests.

break

event = threading.Event()
zk_client.watch_job_model(job_model_watch(event = event, expected_processors=processors.keys()))
Contributor

while we validate the expected set of processors, it'd also be nice to validate their assignments

## Verifications after leader was suspended.
job_model = zk_client.get_latest_job_model()
for processor_id, deployer in processors.iteritems():
    assert processor_id in job_model['containers'], 'Processor id: {0} does not exist in containerModel: {1}.'.format(processor_id, job_model['containers'])
Contributor

This asserts that all processors except the leader are in the group. Should we also assert that the leader is not in the group?
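
That additional assertion might look like the following sketch, mirroring the leader check used in the kill-leader test above:

assert leader_processor_id not in job_model['containers'], 'Suspended leader: {0} still exists in containerModel: {1}.'.format(leader_processor_id, job_model['containers'])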

Contributor Author

Done.


@vjagadish1989 vjagadish1989 left a comment


approved. modulo previous comments.


@shanthoosh shanthoosh left a comment


Thanks for the review.

Done with all the comments.
