SUMMARY
Earlier this week we noticed that our instance had a number of executions stuck in the `requested` state, and none of them were ever executed. Restarting the host resolved the problem.
ISSUE TYPE
- Bug Report
STACKSTORM VERSION
$ st2 --version
st2 2.10.1, on Python 2.7.5
OS / ENVIRONMENT / INSTALL METHOD
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
# install method: puppet-st2
# single host installed "all in one" with NO MongoDB clustering/replica-set
# the MongoDB instance is a single instance that lives on the same host as StackStorm
STEPS TO REPRODUCE
We've been unable to come up with steps that reproduce the issue consistently. It was usually noticed overnight: cron rules stopped producing their expected output (emails and tickets), and the corresponding executions were stuck in the `requested` state. One way to spot the symptom is to query the API for executions in that state, as sketched below.
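A minimal monitoring sketch, assuming the st2 API's `/api/v1/executions` endpoint with a `status` filter; the hostname, token, and limit are placeholders for your deployment:

```python
# Hypothetical check: count executions sitting in the "requested" state.
# Host, auth token, and limit below are placeholders, not real values.
import requests

resp = requests.get(
    'https://st2.example.com/api/v1/executions',
    params={'status': 'requested', 'limit': 100},
    headers={'X-Auth-Token': 'MY_TOKEN'},
)
resp.raise_for_status()
stuck = resp.json()
print('%d execution(s) in the requested state' % len(stuck))
```

Run periodically (e.g. from cron or a monitoring agent), a non-zero count that keeps growing would have flagged this failure long before anyone noticed the missing emails and tickets.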
EXPECTED RESULTS
Executions run when invoked.
Logs
As stated above, restarting the host fixed the issue. Unfortunately, the engineer who found the issue didn't debug any further before rebooting, e.g. by restarting individual services to pinpoint which one was at fault.
Post-mortem, we found the following in /var/log/st2/st2scheduler.log:
2019-02-04 14:48:45,229 140516293681552 INFO scheduler [-] (PID=8536) Scheduler started.
2019-02-04 14:48:45,418 140516293681552 INFO consumers [-] Starting SchedulerEntrypoint...
2019-02-04 14:48:45,429 140516109106224 INFO mixins [-] Connected to amqp://xxx:**@127.0.0.1:5672//
2019-02-04 14:51:05,599 140516198248688 ERROR traceback [-] Traceback (most recent call last):
2019-02-04 14:51:05,599 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 109, in wait
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-] listener.cb(fileno)
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/eventlet/greenthread.py", line 219, in main
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-] result = function(*args, **kwargs)
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 67, in run
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-] execution_queue_item_db = self._get_next_execution()
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 125, in _get_next_execution
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-] execution_queue_item_db = ActionExecutionSchedulingQueue.query(**query).first()
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 294, in first
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-] result = queryset[0]
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 194, in __getitem__
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-] queryset._cursor[key],
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 610, in __getitem__
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-] for doc in clone:
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 1189, in next
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-] if len(self.__data) or self._refresh():
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 1104, in _refresh
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-] self.__send_message(q)
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 982, in __send_message
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-] helpers._check_command_response(first)
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-] File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/helpers.py", line 132, in _check_command_response
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-] raise NotMasterError(errmsg, response)
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-] NotMasterError: interrupted at shutdown
2019-02-04 14:51:05,605 140516198248688 ERROR hub [-] Removing descriptor: 9
2019-02-05 03:28:30,413 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.log" with mode "a"
2019-02-05 03:28:30,422 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.audit.log" with mode "a"
It appears that st2scheduler hit an unhandled MongoDB exception (the `NotMasterError: interrupted at shutdown` above, raised while mongod was shutting down), the greenthread running the scheduler's handler loop died, and the service stopped processing the execution queue even though the process itself stayed up.
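A possible hardening, shown as a minimal sketch rather than the actual st2 source (`get_next_execution`, `handle_execution`, and `poll_interval` are hypothetical stand-ins for the handler's internals), would be to treat database errors in the polling loop as transient instead of letting them kill the greenthread:

```python
# Minimal sketch (not the actual st2 source): a polling loop that survives
# transient MongoDB errors such as NotMasterError instead of letting the
# greenthread die. get_next_execution/handle_execution are hypothetical
# stand-ins for the scheduler handler's internals.
import logging
import time

import pymongo.errors

LOG = logging.getLogger(__name__)


def scheduler_loop(get_next_execution, handle_execution, poll_interval=0.1):
    while True:
        try:
            item = get_next_execution()
        except pymongo.errors.PyMongoError as e:
            # NotMasterError subclasses PyMongoError (via AutoReconnect),
            # so catching the base class logs the hiccup and retries on
            # the next tick instead of propagating out of the loop.
            LOG.warning('Transient DB error in scheduler loop: %s', e)
            time.sleep(poll_interval)
            continue
        if item is not None:
            handle_execution(item)
        else:
            time.sleep(poll_interval)
```

With something along these lines, the query would have been retried once MongoDB came back, rather than leaving every subsequent execution stuck in `requested` until a reboot.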
In the next log file (after rotation) you can see when we fixed the issue:
2019-02-05 03:28:31,993 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.log" with mode "a"
2019-02-05 03:28:31,996 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.audit.log" with mode "a"
2019-02-05 10:29:53,767 140516293681552 INFO scheduler [-] (PID=8536) Scheduler stopped.
2019-02-05 10:29:53,767 140516293681552 INFO consumers [-] Shutting down SchedulerEntrypoint...
2019-02-05 10:31:30,168 139946447700368 INFO __init__ [-] Connecting to database "st2" @ "127.0.0.1:27017" as user "xxx".
2019-02-05 10:31:30,769 139946447700368 INFO __init__ [-] Successfully connected to database "st2" @ "127.0.0.1:27017" as user "xxx".
2019-02-05 10:31:46,988 139946447700368 AUDIT runnersregistrar [-] RunnerType updated. RunnerType RunnerTypeDB(d