
Conversation

@m4dcoder (Contributor) commented Feb 11, 2019

Add retries in the scheduler handler to handle temporary DB connection failures. Refactor how threads exit so the process returns a proper exit code. Fixes #4539
@m4dcoder m4dcoder requested review from Kami and bigmstone February 11, 2019 22:52
@m4dcoder m4dcoder changed the title Refactor scheduler process to exit properly [WIP] Refactor scheduler process to exit properly Feb 11, 2019
@m4dcoder m4dcoder added the WIP label Feb 11, 2019
@Kami Kami added this to the 2.10.2 milestone Feb 12, 2019
eventlet.greenthread.sleep(cfg.CONF.scheduler.gc_interval)
self._handle_garbage_collection()

@retrying.retry(
Member

I'm personally still not too sure about this retry here. It seems like a one-off - aka we don't do it in other similar services.

Member

Besides that, the linking looks good to me; let's please just add some test cases for it.

Contributor Author

I think having the retries there is ok. For a single-server install w/o any complex service management, this rides out temporary hiccups with the MongoDB connection. If users want to fail fast, they can reconfigure the retries. If you're worried about consistency, we can revisit. After this issue, our service pattern needs an overhaul.
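
For illustration, a minimal sketch of what a config-driven retry around the DB-touching handler code can look like. The class name, option names, and exception type below are assumptions for the sketch, not necessarily what this PR adds.

# Sketch only: class name, option names, and the exception class are
# illustrative assumptions, not necessarily what this PR adds.
import pymongo.errors
import retrying


def _retry_on_connection_error(exc):
    # Retry only on transient MongoDB connectivity problems; anything else
    # should stay fatal.
    return isinstance(exc, pymongo.errors.ConnectionFailure)


class SchedulerHandler(object):

    @retrying.retry(
        retry_on_exception=_retry_on_connection_error,
        wait_fixed=3000,             # e.g. cfg.CONF.scheduler.retry_wait_msec
        stop_max_attempt_number=10,  # e.g. cfg.CONF.scheduler.retry_max_attempt
    )
    def _handle_garbage_collection(self):
        # Talks to MongoDB; short connection hiccups are retried instead of
        # crashing the scheduler process outright.
        pass

If users want to fail fast, dialing stop_max_attempt_number down to 1 (or whatever the real config option ends up being) restores the old behavior.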

@Kami (Member) commented Feb 12, 2019

I added some test cases in 442c57e.

It wasn't totally trivial to write them because we need to emulate the async nature of those services and throw an exception in a specific place (the process() method).

If we don't emulate that async nature and the exception is thrown inside start() rather than asynchronously after the blocking wait() is called, it will look like the service is working and exiting correctly, but it actually isn't.

Those tests, test_service_exits_correctly_on_fatal_exception_in_entrypoint_process specifically, uncover an issue with the code, aka the change doesn't fix the whole issue.

The problem lies in this line of code - https://github.com/StackStorm/st2/pull/4543/files#diff-4d1c13310fc6ebda8f5e5d2ec414c2f9R71.

Doing handler.wait() or entrypoint.wait() means the process will only exit if handler.wait() throws. If only the entrypoint throws, we will still be waiting on handler.wait() and the service will still be running with one of its threads dead (the entrypoint specifically).

In short - the same original issue still exists.

You can also replicate the issue manually in the same manner by adding raise Exception('') inside the SchedulerEntrypoint.process() method, running the scheduler manually, and scheduling an action execution.

Again, it's important that you add it there. If you add it inside start(), run(), or similar, the exception will be correctly propagated and the process will exit, because that happens before the blocking handler.wait() or entrypoint.wait() line is called.

@Kami (Member) commented Feb 12, 2019

EDIT: So I was actually wrong about the process() method - we have a try / except around process(), so throwing there will never be fatal.

We should be fine as long as we call handler.wait() first, because the SchedulerEntrypoint class uses consumers.MessageHandler, which already has a try / except around the process() method call (aka currently SchedulerEntrypoint exceptions will never propagate all the way up).

Having said that, the thread1.wait() or thread2.wait() pattern is still very dangerous / misleading, aka a ticking time bomb.

The .wait() call is always blocking, and the or clause makes it seem like both of those method calls finish immediately, but they don't.

It's the same as doing:

thread1.wait()  # blocks until thread1 finishes / returns
thread2.wait()  # blocks until thread2 finishes / returns

And it means we will always block until thread1 finishes before waiting and checking on thread2.

So if there is a chance that thread2 can exit / finish / throw an exception before thread1, and we want to consider that error as fatal and exit the whole service, we can't use such an approach.
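
To see the problem in isolation, here is a standalone eventlet sketch (not st2 code); the two functions are hypothetical stand-ins for the handler and entrypoint threads. The first wait() blocks forever even though the second thread has already died.

# Standalone illustration of the pitfall (not st2 code).
import eventlet


def long_lived_handler():
    # Stand-in for the handler thread, which keeps running happily.
    while True:
        eventlet.sleep(1)


def crashing_entrypoint():
    # Stand-in for the entrypoint thread, which dies almost immediately.
    eventlet.sleep(0.1)
    raise Exception('entrypoint is dead')


thread1 = eventlet.spawn(long_lived_handler)
thread2 = eventlet.spawn(crashing_entrypoint)

# thread2 dies after ~0.1s (eventlet prints its traceback), but this expression
# never gets past thread1.wait(), so the failure is never observed and the
# process keeps running.
thread1.wait() or thread2.wait()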

@m4dcoder (Contributor Author) commented

Yeah. That's why I commented that handler.wait needs to be evaluated first, since the entrypoint is more durable. I agree, eventlet gives us very few options to wait on multiple threads. Either we use link throughout to signal all other threads to exit, or we find another option which signals us if any thread exits (e.g. gevent.wait with count=1).
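
For reference, a rough standalone sketch of that link-based idea with eventlet (not the actual st2 refactor): link() callbacks signal an event as soon as either thread exits, the survivor is killed, and the failing thread's exception is re-raised so the process can exit with a proper code. run_handler and run_entrypoint are hypothetical stand-ins for the scheduler's two threads.

# Standalone sketch (not the actual st2 change): wake up as soon as *any*
# thread exits, then shut down the rest and propagate the failure.
import eventlet
from eventlet import event


def run_handler():
    # Hypothetical stand-in for the long-running handler thread.
    while True:
        eventlet.sleep(1)


def run_entrypoint():
    # Hypothetical stand-in for an entrypoint thread that dies early.
    eventlet.sleep(0.1)
    raise Exception('entrypoint is dead')


exit_event = event.Event()


def notify_exit(gt):
    # link() callback: runs when a thread finishes or dies.
    if not exit_event.ready():
        exit_event.send(gt)


handler_thread = eventlet.spawn(run_handler)
entrypoint_thread = eventlet.spawn(run_entrypoint)
handler_thread.link(notify_exit)
entrypoint_thread.link(notify_exit)

# Block until either thread exits, kill the survivor, then re-raise the
# failing thread's exception (if any) so the process exits non-zero.
first_to_exit = exit_event.wait()
for thread in (handler_thread, entrypoint_thread):
    if thread is not first_to_exit:
        thread.kill()
first_to_exit.wait()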

Regenerated the sample st2 config with the scheduler retry configuration options.
Add a unit test to cover failure in the handler cleanup. This should signal the run method to also pause and exit the scheduler handler process.
Add unit tests to cover the retries in the run and cleanup in the scheduler handler.
Add or move the parsing of test configs to the top of affected test modules and make sure the scheduler default config options do not conflict with test configs.
@m4dcoder m4dcoder changed the title [WIP] Refactor scheduler process to exit properly Refactor scheduler process to exit properly Feb 13, 2019
@m4dcoder m4dcoder removed the WIP label Feb 13, 2019
@Kami (Member) commented Feb 13, 2019

Let's please add a changelog entry. Besides that, LGTM 👍

@m4dcoder m4dcoder merged commit 8e0d659 into master Feb 13, 2019
@m4dcoder m4dcoder deleted the fix-scheduler-process branch February 13, 2019 19:05