Conversation

@mrocklin
Member

This is a fork of #4432. I'm just raising a PR in order to use CI for testing in a low-battery situation. Please ignore.

fjetter and others added 2 commits January 15, 2021 17:57
Also, check that we don't release keys with dependents
@mrocklin
Member Author

Tests seem to pass. I'm going to rerun them a couple of times.

@mrocklin
Member Author

cc @fjetter

@fjetter
Member

fjetter commented Jan 18, 2021

Even the failures on Travis look promising. I'll give it a spin on some real-world stress situations and report back.

@fjetter
Member

fjetter commented Jan 18, 2021

There are still failures popping up connected to dependency fetching (the worker becomes a zombie afterwards):

distributed.worker - ERROR - "('split-shuffle-1-8f785f8d95bb8c8bb1512110c0dd56f7', 19, (5, 5))"
Traceback (most recent call last):
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 1913, in ensure_communicating
    worker, dep.key
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 2017, in select_keys_for_gather
    total_bytes = self.tasks[dep].get_nbytes()
KeyError: "('split-shuffle-1-8f785f8d95bb8c8bb1512110c0dd56f7', 19, (5, 5))"

Haven't investigated yet, but I suspect this key (or a related one) was stolen and that the release_key call in Worker.steal_request is removing the dependency from Worker.tasks.

I tried at some point to modify

    if key in self.tasks:
        self.tasks.pop(key)

so that the task is only deleted from the internal dict if there are no dependents left, similar to the way Worker.data is handled a few lines above.
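Roughly what I tried (a sketch only, not the actual patch; it assumes the worker-side TaskState keeps a dependents set the same way it keeps dependencies):

    # inside Worker.release_key: only forget the task if nothing on this
    # worker still depends on it, mirroring how Worker.data is treated above
    ts = self.tasks.get(key)
    if ts is not None and not ts.dependents:
        del self.tasks[key]

That avoids the KeyError above, but I'm not sure keeping the task around is semantically the right thing to do.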
All of this led me to question the behaviour of Worker.release_key in the first place:

  • What does it mean to release a task on a worker?
  • Does the worker employ some ref counting?
  • When is a task forgotten? Is this even a thing for the worker?
  • Is the internal release key the same as the external release key (triggered via the stream handler release-tasks or the op handler delete-data)?

@fjetter
Member

fjetter commented Jan 18, 2021

Similarly, in a different run I received an error in update_who_has, very likely due to the same issue:

Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f69d39240b8>>, <Task finished coro=<Worker.gather_dep() done, defined at /mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py:2032> exception=KeyError("('split-shuffle-1-7238acd8eecbac6a1b1a244be44527cf', 19, (2, 3))",)>)
Traceback (most recent call last):
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 2178, in gather_dep
    await self.query_who_has(dep.key)
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 2250, in query_who_has
    self.update_who_has(response)
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 2259, in update_who_has
    self.tasks[dep].who_has.update(workers)
KeyError: "('split-shuffle-1-7238acd8eecbac6a1b1a244be44527cf', 19, (2, 3))"

@mrocklin
Member Author

I believe that @fjetter has continued this on his PR. Closing.

@mrocklin closed this Jan 18, 2021