
Conversation

@m4dcoder (Contributor) commented Feb 11, 2019

Add retries in the scheduler handler to handle temporary DB connection failures. Refactor how threads exit so the process returns a proper exit code. Fixes #4539
@m4dcoder m4dcoder requested review from Kami and bigmstone February 11, 2019 22:52
@m4dcoder m4dcoder changed the title Refactor scheduler process to exit properly [WIP] Refactor scheduler process to exit properly Feb 11, 2019
@m4dcoder m4dcoder added the WIP label Feb 11, 2019
@Kami Kami added this to the 2.10.2 milestone Feb 12, 2019
eventlet.greenthread.sleep(cfg.CONF.scheduler.gc_interval)
self._handle_garbage_collection()

@retrying.retry(
Member

I'm personally still not too sure about this retry here. It seems like a one-off - aka we don't do it in other similar services.

Member

Besides that, the linking looks good to me; let's please just add some test cases for it.

Contributor Author

I think having the retries there is ok. For a single-server install w/o any complex service management, this rides out temporary hiccups with the MongoDB connection. If users want to fail fast, they can reconfigure the retries. If you're worried about consistency, we can revisit. After this issue, our service pattern needs an overhaul.
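
For illustration, a minimal sketch of what a config-driven retry around the DB-touching handler code can look like. The class name, option names, and exception type below are assumptions for the sketch, not necessarily what this PR adds.

# Sketch only: class name, option names, and the exception class are
# illustrative assumptions, not necessarily what this PR adds.
import pymongo.errors
import retrying


def _retry_on_connection_error(exc):
    # Retry only on transient MongoDB connectivity problems; anything else
    # should stay fatal.
    return isinstance(exc, pymongo.errors.ConnectionFailure)


class SchedulerHandler(object):

    @retrying.retry(
        retry_on_exception=_retry_on_connection_error,
        wait_fixed=3000,             # e.g. cfg.CONF.scheduler.retry_wait_msec
        stop_max_attempt_number=10,  # e.g. cfg.CONF.scheduler.retry_max_attempt
    )
    def _handle_garbage_collection(self):
        # Talks to MongoDB; short connection hiccups are retried instead of
        # crashing the scheduler process outright.
        pass

If users want to fail fast, dialing stop_max_attempt_number down to 1 (or whatever the real config option ends up being) restores the old behavior.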

@Kami (Member) commented Feb 12, 2019

I added some test cases in 442c57e.

It wasn't totally trivial to write them because we need to emulate the async nature of those services and throw an exception in a specific place (the process() method).

If we don't emulate that async nature and the exception is thrown inside start() rather than asynchronously after the blocking wait() is called, it will look like the service is working and exiting correctly, but it actually isn't.

Those tests, test_service_exits_correctly_on_fatal_exception_in_entrypoint_process specifically, uncover an issue with the code, aka the change doesn't fix the whole issue.

The problem lies in this line of code - https://github.com/StackStorm/st2/pull/4543/files#diff-4d1c13310fc6ebda8f5e5d2ec414c2f9R71.

Doing handler.wait() or entrypoint.wait() means the process will only exit if handler.wait() throws. If only the entrypoint throws, we will still be waiting on handler.wait() and the service will still be running with one of its threads dead (the entrypoint specifically).

In short - the same original issue still exists.

You can also replicate the issue manually in the same manner by adding raise Exception('') inside the SchedulerEntrypoint.process() method, running the scheduler manually, and scheduling an action execution.

Again, it's important that you add it there. If you add it inside start(), run(), or similar, the exception will be correctly propagated and the process will exit, because that happens before the blocking handler.wait() or entrypoint.wait() line is called.

@Kami (Member) commented Feb 12, 2019

EDIT: So I was actually wrong about the process() method - we have a try / except around process(), so throwing there will never be fatal.

We should be fine as long as we call handler.wait() first, because the SchedulerEntrypoint class uses consumers.MessageHandler, which already has a try / except around the process() method call (aka currently SchedulerEntrypoint exceptions will never propagate all the way up).

Having said that, the thread1.wait() or thread2.wait() pattern is still very dangerous / misleading, aka a ticking time bomb.

The .wait() call is always blocking, and the or clause makes it seem like both of those method calls finish immediately, but they don't.

It's the same as doing:

thread1.wait()  # blocks until thread1 finishes / returns
thread2.wait()  # blocks until thread2 finishes / returns

And it means we will always block until thread1 finishes before waiting and checking on thread2.

So if there is a chance that thread2 can exit / finish / throw an exception before thread1, and we want to consider that error as fatal and exit the whole service, we can't use such an approach.
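
To see the problem in isolation, here is a standalone eventlet sketch (not st2 code); the two functions are hypothetical stand-ins for the handler and entrypoint threads. The first wait() blocks forever even though the second thread has already died.

# Standalone illustration of the pitfall (not st2 code).
import eventlet


def long_lived_handler():
    # Stand-in for the handler thread, which keeps running happily.
    while True:
        eventlet.sleep(1)


def crashing_entrypoint():
    # Stand-in for the entrypoint thread, which dies almost immediately.
    eventlet.sleep(0.1)
    raise Exception('entrypoint is dead')


thread1 = eventlet.spawn(long_lived_handler)
thread2 = eventlet.spawn(crashing_entrypoint)

# thread2 dies after ~0.1s (eventlet prints its traceback), but this expression
# never gets past thread1.wait(), so the failure is never observed and the
# process keeps running.
thread1.wait() or thread2.wait()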

@m4dcoder (Contributor Author) commented

Yeah. That's why I commented that handler.wait needs to be evaluated first, since the entrypoint is more durable. I agree, eventlet gives us very few options to wait on multiple threads. Either we use link throughout to signal all other threads to exit, or we find another option which signals us if any thread exits (e.g. gevent.wait with count=1).
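
For reference, a rough standalone sketch of that link-based idea with eventlet (not the actual st2 refactor): link() callbacks signal an event as soon as either thread exits, the survivor is killed, and the failing thread's exception is re-raised so the process can exit with a proper code. run_handler and run_entrypoint are hypothetical stand-ins for the scheduler's two threads.

# Standalone sketch (not the actual st2 change): wake up as soon as *any*
# thread exits, then shut down the rest and propagate the failure.
import eventlet
from eventlet import event


def run_handler():
    # Hypothetical stand-in for the long-running handler thread.
    while True:
        eventlet.sleep(1)


def run_entrypoint():
    # Hypothetical stand-in for an entrypoint thread that dies early.
    eventlet.sleep(0.1)
    raise Exception('entrypoint is dead')


exit_event = event.Event()


def notify_exit(gt):
    # link() callback: runs when a thread finishes or dies.
    if not exit_event.ready():
        exit_event.send(gt)


handler_thread = eventlet.spawn(run_handler)
entrypoint_thread = eventlet.spawn(run_entrypoint)
handler_thread.link(notify_exit)
entrypoint_thread.link(notify_exit)

# Block until either thread exits, kill the survivor, then re-raise the
# failing thread's exception (if any) so the process exits non-zero.
first_to_exit = exit_event.wait()
for thread in (handler_thread, entrypoint_thread):
    if thread is not first_to_exit:
        thread.kill()
first_to_exit.wait()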

Regenerated the sample st2 config with the scheduler retry configuration options.
Add a unit test to cover failure in the handler cleanup. This should signal the run method to also pause and exit the scheduler handler process.
Add unit tests to cover the retries in the run and cleanup in the scheduler handler.
Add or move the parsing of test configs to the top of affected test modules and make sure the scheduler default config options do not conflict with test configs.
@m4dcoder m4dcoder changed the title [WIP] Refactor scheduler process to exit properly Refactor scheduler process to exit properly Feb 13, 2019
@m4dcoder m4dcoder removed the WIP label Feb 13, 2019
@Kami (Member) commented Feb 13, 2019

Let's please add a changelog entry. Besides that, LGTM 👍

@m4dcoder m4dcoder merged commit 8e0d659 into master Feb 13, 2019
@m4dcoder m4dcoder deleted the fix-scheduler-process branch February 13, 2019 19:05