-
-
Notifications
You must be signed in to change notification settings - Fork 782
Description
SUMMARY
St2 services do not fully recover from network partitions or other network issues where packets sent to RabbitMQ were dropped.
STACKSTORM VERSION
st2 3.3dev (9ea4173), on Python 3.6.9
Earlier versions are affected as well.
OS, environment, install method
I came across this issue when doing some tests for StackStorm/st2-dockerfiles#28 and was able to reproduce it on plain Ubuntu VMs as well.
Steps to reproduce the problem
- Start all services
- Drop the packages being sent to the RabbitMQ instance
In my case I did this by prepending an iptables rule to the FORWARD chain:sudo iptables -I FORWARD 1 -d 172.23.0.3 -j DROP - Wait until the first exceptions occur and the connections attempts from the st2 services fail with a timeout. The log:
[--- st2api ---]2020-05-23 20:17:36,162 WARNING [-] Broker connection error, trying again in 2.0 seconds: timeout('timed out',).
[--- st2api ---]Traceback (most recent call last):
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 137, in _connect
[--- st2api ---] host, port, family, socket.SOCK_STREAM, SOL_TCP)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/support/greendns.py", line 537, in getaddrinfo
[--- st2api ---] raise socket.gaierror(socket.EAI_NONAME, 'No address found')
[--- st2api ---]socket.gaierror: [Errno -2] No address found
[--- st2api ---]
[--- st2api ---]During handling of the above exception, another exception occurred:
[--- st2api ---]
[--- st2api ---]Traceback (most recent call last):
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/utils/functional.py", line 343, in retry_over_time
[--- st2api ---] return fun(*args, **kwargs)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 283, in connect
[--- st2api ---] return self.connection
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 837, in connection
[--- st2api ---] self._connection = self._establish_connection()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 792, in _establish_connection
[--- st2api ---] conn = self.transport.establish_connection()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
[--- st2api ---] conn.connect()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 311, in connect
[--- st2api ---] self.transport.connect()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 77, in connect
[--- st2api ---] self._connect(self.host, self.port, self.connect_timeout)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 148, in _connect
[--- st2api ---] "failed to resolve broker hostname"))
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 161, in _connect
[--- st2api ---] self.sock.connect(sa)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 263, in connect
[--- st2api ---] self._trampoline(fd, write=True, timeout=timeout, timeout_exc=_timeout_exc)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 210, in _trampoline
[--- st2api ---] mark_as_closed=self._mark_as_closed)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/__init__.py", line 159, in trampoline
[--- st2api ---] return hub.switch()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
[--- st2api ---] return self.greenlet.switch()
[--- st2api ---]socket.timeout: timed out
[--- st2api ---]2020-05-23 20:17:36,180 WARNING [-] Broker connection error, trying again in 2.0 seconds: timeout('timed out',).
[--- st2api ---]Traceback (most recent call last):
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 137, in _connect
[--- st2api ---] host, port, family, socket.SOCK_STREAM, SOL_TCP)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/support/greendns.py", line 537, in getaddrinfo
[--- st2api ---] raise socket.gaierror(socket.EAI_NONAME, 'No address found')
[--- st2api ---]socket.gaierror: [Errno -2] No address found
[--- st2api ---]
[--- st2api ---]During handling of the above exception, another exception occurred:
[--- st2api ---]
[--- st2api ---]Traceback (most recent call last):
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/utils/functional.py", line 343, in retry_over_time
[--- st2api ---] return fun(*args, **kwargs)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 283, in connect
[--- st2api ---] return self.connection
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 837, in connection
[--- st2api ---] self._connection = self._establish_connection()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 792, in _establish_connection
[--- st2api ---] conn = self.transport.establish_connection()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
[--- st2api ---] conn.connect()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 311, in connect
[--- st2api ---] self.transport.connect()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 77, in connect
[--- st2api ---] self._connect(self.host, self.port, self.connect_timeout)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 148, in _connect
[--- st2api ---] "failed to resolve broker hostname"))
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 161, in _connect
[--- st2api ---] self.sock.connect(sa)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 263, in connect
[--- st2api ---] self._trampoline(fd, write=True, timeout=timeout, timeout_exc=_timeout_exc)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 210, in _trampoline
[--- st2api ---] mark_as_closed=self._mark_as_closed)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/__init__.py", line 159, in trampoline
[--- st2api ---] return hub.switch()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
[--- st2api ---] return self.greenlet.switch()
[--- st2api ---]socket.timeout: timed out
- Revert 3 and re-establish the connection between the st2 services and RabbitMQ again:
sudo iptables -D FORWARD -d 172.23.0.3 -j DROP
Expected Results
St2 services should work as intended as soon as they are able to connect to RabbitMQ again
Actual Results
The St2 services are able to establish the connection to the RabbitMQ service but can't serve any requests as the resource is locked from the previous lock which is orphaned now:
The appearing error:
[--- st2api ---]2020-05-23 20:21:38,654 INFO [-] Connected to amqp://guest:**@rabbitmq:5672//
[--- st2api ---]2020-05-23 20:21:38,665 INFO [-] Connected to amqp://guest:**@rabbitmq:5672//
[--- st2api ---]2020-05-23 20:21:38,665 WARNING [-] Connection to broker lost, trying to re-establish connection...
[--- st2api ---]Traceback (most recent call last):
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 175, in run
[--- st2api ---] for _ in self.consume(limit=None, **kwargs):
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 191, in consume
[--- st2api ---] with self.consumer_context(**kwargs) as (conn, channel, consumers):
[--- st2api ---] File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
[--- st2api ---] return next(self.gen)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 184, in consumer_context
[--- st2api ---] with self.Consumer() as (connection, channel, consumers):
[--- st2api ---] File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
[--- st2api ---] return next(self.gen)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 233, in Consumer
[--- st2api ---] with self._consume_from(*self.get_consumers(cls, channel)) as c:
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/services/triggerwatcher.py", line 77, in get_consumers
[--- st2api ---] callbacks=[self.process_task])]
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/messaging.py", line 386, in __init__
[--- st2api ---] self.revive(self.channel)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/messaging.py", line 408, in revive
[--- st2api ---] self.declare()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/messaging.py", line 421, in declare
[--- st2api ---] queue.declare()
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/entity.py", line 609, in declare
[--- st2api ---] self._create_queue(nowait=nowait, channel=channel)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/entity.py", line 618, in _create_queue
[--- st2api ---] self.queue_declare(nowait=nowait, passive=False, channel=channel)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/entity.py", line 653, in queue_declare
[--- st2api ---] nowait=nowait,
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/channel.py", line 1149, in queue_declare
[--- st2api ---] spec.Queue.DeclareOk, returns_tuple=True,
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/abstract_channel.py", line 88, in wait
[--- st2api ---] self.connection.drain_events(timeout=timeout)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
[--- st2api ---] while not self.blocking_read(timeout):
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
[--- st2api ---] return self.on_inbound_frame(frame)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
[--- st2api ---] callback(channel, method_sig, buf, None)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
[--- st2api ---] method_sig, payload, content,
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
[--- st2api ---] listener(*args)
[--- st2api ---] File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/channel.py", line 280, in _on_close
[--- st2api ---] reply_code, reply_text, (class_id, method_id), ChannelError,
[--- st2api ---]amqp.exceptions.ResourceLocked: Queue.declare: (405) RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'st2.trigger.watch.TimersController-4d50c0c7ad' in vhost '/'. It could be originally declared on another connection or the exclusive property value does not match that of the original dec...
I was able to reproduce this for different services and assume that the RabbitMQ client implementation is the same for all components. So it's likely that all services communicating with the broker are affected. Services that are affected for sure:
st2api
st2auth
st2sensorcontainer
Although this issue will be fixed with a restart of the affected st2 service it would be nice to have st2 handle issues like this gracefully so that we minimize the risk for unnoticed issues like this to appear and cause potential impact.