Worker doesn't reconnect to Faktory if the connection is reset #10

@valo

I have a single threaded worker defined as:

from concurrent.futures import ThreadPoolExecutor
from faktory import Worker

w = Worker(queues=['etherbi_decode'],
           concurrency=1,
           executor=ThreadPoolExecutor)

It works for 2-3 hours and then at some point falls into an infinite loop, no longer fetching new jobs or sending heartbeats to the Faktory server.

Logs from the worker before it falls into the infinite loop:

INFO:faktory.connection:Connecting to faktory-faktory:7419 (with password None)
ERROR:faktory.worker:Task failed: 50c676346b5a4307826de2245caf5593
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "etherbi/decoder/decoder_worker.py", line 225, in decode_bucket
    _enqueue_block_decoded_task(task_number, 'calculate_burn_rate', 'etherbi_burn_rate_calculation')
  File "etherbi/decoder/decoder_worker.py", line 187, in _enqueue_block_decoded_task
    if faktory_client.queue(task, queue=queue, args=[task_number]):
  File "/app/src/faktory/faktory/client.py", line 32, in queue
    self.connect()
  File "/app/src/faktory/faktory/client.py", line 21, in connect
    self.is_connected = self.faktory.connect()
  File "/app/src/faktory/faktory/_proto.py", line 64, in connect
    self.socket.connect((self.host, self.port))
socket.timeout: timed out
INFO:__main__:Transaction to 0xaf30d2a7e90d7dc361c8c4585e9bb7d2f6f15bc7 is recognized in block 3917149 on position 98.
INFO:__main__:Transaction to 0xaf30d2a7e90d7dc361c8c4585e9bb7d2f6f15bc7 is recognized in block 3917154 on position 2.
<SOME_WORK_RELATED_LOGS_AS_THE_ABOVE>
INFO:faktory.connection:Connecting to faktory-faktory:7419 (with password None)
INFO:faktory.connection:Disconnected
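
The traceback above ends in a socket.timeout raised from connect(), and nothing retries it. As a rough illustration of what reconnecting could look like, here is a minimal sketch of a retry wrapper with exponential backoff. The name connect_with_retry and its parameters are hypothetical and not part of the faktory package; the connect argument stands in for whatever callable opens the connection.

```python
import socket
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between failures.

    Hypothetical helper: `connect` is any zero-argument callable that
    opens the connection and raises OSError on failure.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except OSError:  # socket.timeout is a subclass of OSError
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

The sleep function is injectable only so the wrapper can be exercised without real delays; whether retrying belongs in the client or in the calling code is a design choice this sketch does not settle.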

Running sudo strace -p <PID> -e trace=network -f -s 10000 against the process, I get an infinite stream of:

[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 16254] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
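
A recvfrom call that returns 0 bytes on a TCP socket means the peer has closed the connection, so a read loop that keeps calling recv without checking for an empty result will spin forever, which matches the output above. The following is a minimal sketch of a line reader that treats an empty read as a closed connection. It is not the faktory package's actual code; read_line and ConnectionClosed are hypothetical names used for illustration.

```python
import socket


class ConnectionClosed(ConnectionError):
    """Raised when the peer closes the connection (recv returns b'')."""


def read_line(sock: socket.socket, buffer: bytearray) -> bytes:
    """Read one CRLF-terminated line, raising instead of spinning on EOF."""
    while b"\r\n" not in buffer:
        chunk = sock.recv(4096)
        if chunk == b"":
            # An empty read means the peer closed the socket; calling
            # recv again would just return b"" again, indefinitely.
            raise ConnectionClosed("peer closed the connection")
        buffer.extend(chunk)
    line, _, rest = bytes(buffer).partition(b"\r\n")
    buffer[:] = rest  # keep any bytes that arrived after the line
    return line
```

A caller could catch ConnectionClosed, mark the connection as disconnected, and reconnect before the next fetch or heartbeat.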

We use Python and Elixir clients against the same Faktory server, and so far this behavior has been observed only with the Python workers.

Worker version: c5cb89b
Server version: 0.7.0
