Member

@fjetter fjetter commented Feb 16, 2022

Not the most elegant solution but should do the trick

Contributor

github-actions bot commented Feb 17, 2022

Unit Test Results

       12 files  +       1         12 suites  +1   7h 27m 38s ⏱️ + 52m 50s
  2 610 tests +       3    2 529 ✔️ +       3    79 💤  -   2  2 +2 
15 584 runs  +1 393  14 706 ✔️ +1 476  876 💤  - 85  2 +2 

For more details on these failures, see this check.

Results for commit c6f8f6c. ± Comparison against base commit d2d76c0.

♻️ This comment has been updated with latest results.

@fjetter fjetter changed the title from "WIP Do not mark tests xfailed if cluster doesn't come up in time" to "Do not mark tests xfailed if cluster doesn't come up in time" Feb 18, 2022
@fjetter fjetter force-pushed the xfail_worker_scheduler_error_startup branch from 87454e5 to c6f8f6c Compare February 18, 2022 13:19
Member Author

fjetter commented Feb 18, 2022

It appears something is actually wrong. The ubu/py3.9 job has been running for more than an hour without producing any stdout.

Member Author

fjetter commented Feb 18, 2022

The timed-out job actually finished running the test suite as expected, without problems and all green: https://github.com/dask/distributed/runs/5248080147?check_suite_focus=true

However, it hangs during teardown:

= 1344 passed, 59 skipped, 1201 deselected, 9 xfailed, 2 xpassed, 17 warnings, 80 leaked in 1714.67s (0:28:34) =
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/selector_events.py", line 918, in write
    n = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/weakref.py", line 667, in _exitfunc
    f()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/asyncssh/process.py", line 1342, in kill
    self._chan.kill()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/asyncssh/channel.py", line 1433, in kill
    self.send_signal('KILL')
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/asyncssh/channel.py", line 1399, in send_signal
    self._send_request(b'signal', String(signal))
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/asyncssh/channel.py", line 716, in _send_request
    self.send_packet(MSG_CHANNEL_REQUEST, String(request),
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/asyncssh/channel.py", line 710, in send_packet
    self._conn.send_packet(pkttype, payload, handler=self)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/asyncssh/connection.py", line 1572, in send_packet
    self._send(packet + mac)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/asyncssh/connection.py", line 1349, in _send
    self._transport.write(data)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/selector_events.py", line 924, in write
    self._fatal_error(exc, 'Fatal write error on socket transport')
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/selector_events.py", line 719, in _fatal_error
    self._force_close(exc)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/selector_events.py", line 731, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 746, in call_soon
    self._check_closed()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Error: The operation was canceled.

Anybody seen this before?

Member Author

fjetter commented Feb 21, 2022

This stuck test job appears to be an old problem; see #2925.
I'm not sure why we're hitting it again now.

See ronf/asyncssh#112.

See #3885 for an attempt to implement the hotfix.

What's happening is that the weakref finalizer of an SSH process tries to close it, but that call apparently never finishes. This behaviour confuses me, since based on the above issue I would expect it to return but leave a zombie process around.
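
A minimal sketch of the error at the bottom of the traceback, assuming only the standard library: a `weakref.finalize` callback that schedules work on an event loop which has already been closed. `FakeProcess`, `kill` and `run_finalizer_after_loop_close` are hypothetical names for illustration and are not asyncssh's classes or functions. The sketch reproduces the `RuntimeError`, not the hang itself.

```python
import asyncio
import weakref


class FakeProcess:
    """Hypothetical stand-in for an SSH process handle (not asyncssh's class)."""


def kill(loop):
    # Like the last frames of the traceback: scheduling a callback on a
    # loop that is already closed raises RuntimeError.
    loop.call_soon(lambda: None)


def run_finalizer_after_loop_close():
    loop = asyncio.new_event_loop()
    proc = FakeProcess()
    # Register a finalizer that needs the loop, as a process handle might.
    finalizer = weakref.finalize(proc, kill, loop)
    loop.close()
    try:
        # Invoke the finalizer by hand; at interpreter exit weakref's
        # exit hook would call it in a similar way.
        finalizer()
    except RuntimeError as exc:
        return str(exc)
    return None
```

Calling `run_finalizer_after_loop_close()` returns the message of the `RuntimeError` raised by the closed loop, which matches the final line of the traceback above.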

Member Author

fjetter commented Feb 22, 2022

I'll move on with this change, since I believe raising the exceptions explicitly is valuable and I don't see how the SSH problem could be connected to the changes I'm proposing. If it happens again after merging, we can revert.
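
A rough sketch of what "raising the exceptions explicitly" could look like, assuming a startup helper built on `asyncio.wait_for`. This is not the code merged in this PR; `start_cluster_or_raise`, `start` and `timeout` are hypothetical names chosen for illustration.

```python
import asyncio


async def start_cluster_or_raise(start, timeout):
    """Await the hypothetical coroutine function `start`, raising on timeout."""
    try:
        return await asyncio.wait_for(start(), timeout)
    except asyncio.TimeoutError as exc:
        # Raise explicitly so the test errors out instead of being
        # reported as xfailed.
        raise TimeoutError(f"Cluster did not come up within {timeout}s") from exc
```

With a helper like this, a slow startup surfaces as a test error with a clear message rather than being hidden behind an expected-failure marker.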

@fjetter fjetter merged commit c7ed6c2 into dask:main Feb 22, 2022