Conversation

@sjperkins (Member) commented May 9, 2022

In the log output (e.g. https://github.com/dask/distributed/runs/6348164203?check_suite_focus=true for commit d7549b0), the following errors occur but are silently ignored:

2022-05-09T08:38:04.1870734Z 2022-05-09 08:36:42,873 - distributed.core - ERROR - reschedule() got an unexpected keyword argument 'stimulus_id'
2022-05-09T08:38:04.1870845Z Traceback (most recent call last):
2022-05-09T08:38:04.1871030Z   File "/home/runner/work/distributed/distributed/distributed/core.py", line 651, in handle_stream
2022-05-09T08:38:04.1871136Z     handler(**merge(extra, msg))
2022-05-09T08:38:04.1871388Z TypeError: reschedule() got an unexpected keyword argument 'stimulus_id'
2022-05-09T08:38:04.1871819Z 2022-05-09 08:36:42,873 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:33919', name: 0, status: running, memory: 0, processing: 12>
2022-05-09T08:38:04.1872097Z 2022-05-09 08:36:42,873 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:33919
2022-05-09T08:38:04.1872525Z 2022-05-09 08:36:42,878 - distributed.core - ERROR - Exception while handling op register-worker
2022-05-09T08:38:04.1872639Z Traceback (most recent call last):
2022-05-09T08:38:04.1872833Z   File "/home/runner/work/distributed/distributed/distributed/core.py", line 585, in handle_comm
2022-05-09T08:38:04.1872910Z     result = await result
2022-05-09T08:38:04.1873097Z   File "/home/runner/work/distributed/distributed/distributed/utils.py", line 759, in wrapper
2022-05-09T08:38:04.1873201Z     return await func(*args, **kwargs)
2022-05-09T08:38:04.1873400Z   File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 3829, in add_worker
2022-05-09T08:38:04.1873572Z     await self.handle_worker(comm=comm, worker=address, stimulus_id=stimulus_id)
2022-05-09T08:38:04.1873775Z   File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 4924, in handle_worker
2022-05-09T08:38:04.1873926Z     await self.handle_stream(comm=comm, extra={"worker": worker})
2022-05-09T08:38:04.1874122Z   File "/home/runner/work/distributed/distributed/distributed/core.py", line 651, in handle_stream
2022-05-09T08:38:04.1874210Z     handler(**merge(extra, msg))
2022-05-09T08:38:04.1874465Z TypeError: reschedule() got an unexpected keyword argument 'stimulus_id'

worker_state_machine.RescheduleMsg always sends a stimulus_id, so the scheduler-side reschedule handler should accept it.
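
A minimal standalone sketch of the failure mode (illustrative names and values, not the actual distributed code): core.handle_stream expands each stream message into keyword arguments with handler(**merge(extra, msg)), as the traceback above shows, so any field the handler does not accept raises TypeError.

    def reschedule(key=None, worker=None):
        # Scheduler-side stream handler before the fix: no stimulus_id parameter.
        print(f"would reschedule {key} on {worker}")

    # Roughly the shape of the message RescheduleMsg produces (values made up).
    msg = {"key": "x", "worker": "tcp://127.0.0.1:33919", "stimulus_id": "reschedule-123"}

    try:
        reschedule(**msg)
    except TypeError as exc:
        print(exc)  # reschedule() got an unexpected keyword argument 'stimulus_id'

    # Widening the signature, as this PR does, lets the same message through.
    def reschedule_fixed(key=None, worker=None, stimulus_id=None):
        print(f"would reschedule {key} on {worker} (stimulus_id={stimulus_id})")

    reschedule_fixed(**msg)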

  • Tests added / passed
  • Passes pre-commit run --all-files

@sjperkins requested a review from crusaderky May 9, 2022 10:56
@github-actions bot (Contributor) commented May 9, 2022

Unit Test Results

    16 files  ±0        16 suites  ±0    6h 59m 45s ⏱️ -45m 51s
 2 769 tests  +2      2 688 ✔️ +2        78 💤 -2      3 ❌ +2
21 300 runs  -798    20 310 ✔️ -766     987 💤 -34     3 ❌ +2

For more details on these failures, see this check.

Results for commit f2f3a39. ± Comparison against base commit 8411c2d.

♻️ This comment has been updated with latest results.

distributed/scheduler.py

     transition_story = story

-    def reschedule(self, key=None, worker=None):
+    def reschedule(self, key=None, worker=None, stimulus_id=None):
A Collaborator commented on the changed lines:

Not counting tests, I can see that this method is invoked only:

  • from the worker, through RPC
  • directly, from stealing.py

In both cases a stimulus_id is available, so you could make stimulus_id mandatory.
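
A minimal sketch of that suggestion, using the parameter names from the diff above (hypothetical code, not the actual Scheduler class): making stimulus_id keyword-only with no default forces every caller, RPC or direct, to supply it.

    class Scheduler:
        def reschedule(self, key=None, worker=None, *, stimulus_id):
            # stimulus_id has no default, so both the worker RPC path and the
            # direct calls from stealing.py must pass it explicitly.
            ...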

@crusaderky (Collaborator) commented May 9, 2022

So the Scheduler.reschedule method is never successful when invoked from the worker, and yet no tests break.
In other words, at some point the status of the task stopped transitioning from processing to released on the scheduler when it transitions from executing/long-running to rescheduled on the worker, and nobody noticed.

What's the impact of having it work again? Could it be that the scheduler-side transition was unneeded to begin with? Or (at the other extreme) would this fix a potential deadlock?

I think understanding this is way more important than the fix itself (which is straightforward).

CC @fjetter @graingert @gjoseph92
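
For background, the usual worker-side trigger for this path is the distributed.Reschedule exception: raising it inside a task makes the worker send a reschedule message (worker_state_machine.RescheduleMsg, per the description above) to the scheduler. A rough sketch, with an illustrative condition:

    from distributed import Reschedule

    def maybe_move(x, threshold=10):
        # Raising Reschedule asks the scheduler to release the task and run it
        # again, possibly on another worker: the executing -> rescheduled
        # transition discussed above.
        if x > threshold:
            raise Reschedule()
        return x + 1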

@crusaderky (Collaborator) commented:

With your latest changes, the two calls from stealing.py won't work anymore - and yet all tests still pass.

@crusaderky (Collaborator) commented:

Related: #6332

@crusaderky (Collaborator) commented:

Superseded by #6339

@crusaderky closed this May 13, 2022
@crusaderky (Collaborator) commented:

Follow up on #6340
