Closed
Description
- The scheduler sends acquire-replicas to the worker.
- Before the task can transition to flight, the task is forgotten on the scheduler. This in turn triggers a release-key command from the scheduler to all workers that hold the task in memory; the worker that is still fetching the key is not among them. This is unlike a compute-task request, where the workers that are fetching the key are tracked on the scheduler side and receive a release-key command too.
- The task eventually transitions fetch->flight->missing on the fetching worker.
- The task remains stuck forever in the missing state on the worker that was fetching it. The worker will pester the scheduler every second with find_missing.
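The asymmetry between the two request types can be illustrated with a toy model. This is only a sketch of the bookkeeping described above, not the scheduler's code: `release_key_recipients` and its two arguments are hypothetical names.

```python
def release_key_recipients(holders, tracked_fetchers):
    """Workers that would be told to release a key when it is forgotten.

    Hypothetical model: `holders` have the key in memory, and
    `tracked_fetchers` are fetching workers the scheduler knows about.
    """
    return set(holders) | set(tracked_fetchers)


# compute-task: the fetching worker "b" is tracked, so it is notified too.
compute_task = release_key_recipients({"a"}, {"b"})

# acquire-replicas: "b" is fetching but untracked, so only "a" is notified
# and "b" keeps the task.
acquire_replicas = release_key_recipients({"a"}, set())
```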
Reproducer:
import asyncio

from distributed.utils_test import (
    freeze_data_fetching,
    gen_cluster,
    inc,
    wait_for_state,
)


@gen_cluster(client=True)
async def test_forget_acquire_replicas(c, s, a, b):
    """
    1. The scheduler sends acquire-replicas to the worker
    2. Before the task can transition to flight, the task is forgotten on the scheduler
       and on the peer workers *holding the replicas*.
       This is unlike in a compute-task command, where the workers that are fetching
       the key are tracked on the scheduler side and receive a release-key command too.
    3. The task eventually transitions fetch->flight->missing
    4. Test that the task is eventually forgotten everywhere.
    """
    x = c.submit(inc, 2, key="x", workers=[a.address])
    await x
    # Keep b from starting the transfer, so "x" stays in fetch state
    with freeze_data_fetching(b, jump_start=True):
        s.request_acquire_replicas(b.address, ["x"], stimulus_id="one")
        await wait_for_state("x", "fetch", b)
        x.release()
        # Wait until the scheduler and the worker holding the replica forget "x"
        while "x" in s.tasks or "x" in a.tasks:
            await asyncio.sleep(0.01)

    # b should forget "x" as well; this loop currently never exits
    while "x" in b.tasks:
        await asyncio.sleep(0.01)

The above test times out in the last loop, because "x" never leaves b.tasks.
Proposed design
At the moment, the scheduler silently ignores keys it does not know about in the request-refresh-who-has message.
The scheduler should instead respond stating "I don't know about this key, you should forget about it too". This would trigger the key to be forgotten on the worker too.
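A minimal sketch of the proposed exchange, assuming a simplified model rather than the actual scheduler and worker state machines: `ToyScheduler`, `ToyWorker`, the `forgotten` reply field, and all method names are hypothetical and do not exist in distributed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the scheduler: key -> addresses holding it."""

    who_has: dict[str, set[str]] = field(default_factory=dict)

    def handle_refresh_who_has(self, keys: list[str]) -> dict:
        # Known keys get their current holders; unknown keys are reported
        # back explicitly instead of being silently ignored.
        known = {k: set(self.who_has[k]) for k in keys if k in self.who_has}
        forgotten = [k for k in keys if k not in self.who_has]
        return {"who_has": known, "forgotten": forgotten}


@dataclass
class ToyWorker:
    """Hypothetical stand-in for the worker: key -> task state."""

    tasks: dict[str, str] = field(default_factory=dict)

    def missing_keys(self) -> list[str]:
        return [k for k, state in self.tasks.items() if state == "missing"]

    def handle_refresh_reply(self, reply: dict) -> None:
        # Keys with at least one holder can be fetched again.
        for key, holders in reply["who_has"].items():
            if holders and self.tasks.get(key) == "missing":
                self.tasks[key] = "fetch"
        # Keys the scheduler no longer knows about are forgotten here too.
        for key in reply["forgotten"]:
            self.tasks.pop(key, None)


scheduler = ToyScheduler(who_has={"y": {"tcp://a"}})
worker = ToyWorker(tasks={"x": "missing", "y": "missing"})
reply = scheduler.handle_refresh_who_has(worker.missing_keys())
worker.handle_refresh_reply(reply)
```

In this model the worker drops "x" after a single round trip instead of asking about it every second, while "y", which the scheduler still knows, goes back to fetch.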
Blockers
CC @fjetter