AMM can leave a forgotten task forever in missing state #6479

@crusaderky

  1. The scheduler sends acquire-replicas to the worker.
  2. Before the task can transition to flight, the task is forgotten on the scheduler. This triggers a release-key command from the scheduler to all workers that hold the task in memory, but not to the worker that is still fetching it. This is unlike a compute-task request, where the workers that are fetching the key are tracked on the scheduler side and receive a release-key command too.
  3. The task eventually transitions fetch->flight->missing.
  4. The task remains stuck forever in the missing state on the worker that was fetching it. The worker will pester the scheduler every second with find_missing.

Reproducer:

import asyncio

# Test helpers, assumed to come from distributed.utils_test
from distributed.utils_test import (
    freeze_data_fetching,
    gen_cluster,
    inc,
    wait_for_state,
)


@gen_cluster(client=True)
async def test_forget_acquire_replicas(c, s, a, b):
    """
    1. The scheduler sends acquire-replicas to the worker
    2. Before the task can transition to flight, the task is forgotten on the scheduler
       and on the peer workers *holding the replicas*.
       This is unlike in a compute-task command, where the workers that are fetching
       the key are tracked on the scheduler side and receive a release-key command too.
    3. The task eventually transitions fetch->flight->missing
    4. Test that the task is eventually forgotten everywhere.
    """
    x = c.submit(inc, 2, key="x", workers=[a.address])
    await x
    with freeze_data_fetching(b, jump_start=True):
        s.request_acquire_replicas(b.address, ["x"], stimulus_id="one")
        await wait_for_state("x", "fetch", b)
        x.release()
        while "x" in s.tasks or "x" in a.tasks:
            await asyncio.sleep(0.01)

    while "x" in b.tasks:
        await asyncio.sleep(0.01)

The above test times out in the final loop: "x" never leaves b.tasks.

Proposed design

At the moment, the scheduler silently ignores keys it does not know about in the request-refresh-who-has message.
The scheduler should instead respond with "I don't know about this key, you should forget about it too". This would cause the worker to forget the key as well.
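
A minimal sketch of the proposed exchange, using toy classes rather than the real scheduler and worker. `ToyScheduler`, `ToyWorker`, the methods `refresh_who_has` and `find_missing` as written here, and the `report_unknown` flag are all hypothetical names chosen for illustration; none of them is the distributed API, and the real message format may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the scheduler; not the distributed API."""

    # key -> set of addresses of workers holding a replica in memory
    who_has: dict = field(default_factory=dict)

    def refresh_who_has(self, keys, report_unknown):
        """Answer a worker's request-refresh-who-has message.

        Returns (who_has, forget). With report_unknown=False, keys that the
        scheduler no longer knows are silently left out of the reply, which
        models the behavior described in this issue.
        """
        known = {k: set(self.who_has[k]) for k in keys if k in self.who_has}
        forget = []
        if report_unknown:
            forget = [k for k in keys if k not in self.who_has]
        return known, forget


@dataclass
class ToyWorker:
    """Hypothetical stand-in for the worker state machine."""

    # key -> task state, e.g. "fetch" or "missing"
    tasks: dict = field(default_factory=dict)

    def find_missing(self, scheduler, report_unknown):
        """Ask the scheduler about all tasks currently in missing state."""
        missing = [k for k, state in self.tasks.items() if state == "missing"]
        who_has, forget = scheduler.refresh_who_has(missing, report_unknown)
        for key in who_has:
            self.tasks[key] = "fetch"  # a replica is known again: retry
        for key in forget:
            del self.tasks[key]  # the scheduler forgot the key: so do we


scheduler = ToyScheduler()  # "x" has already been forgotten here
worker = ToyWorker(tasks={"x": "missing"})

# Current behavior: the unknown key is ignored, so "x" stays missing.
worker.find_missing(scheduler, report_unknown=False)
stuck = dict(worker.tasks)

# Proposed behavior: the scheduler reports the unknown key and "x" is dropped.
worker.find_missing(scheduler, report_unknown=True)
cleaned = dict(worker.tasks)
```

In this toy model, the first call leaves `"x"` in `worker.tasks` no matter how often it is repeated, which mirrors the endless find_missing polling; the second call removes it in a single round trip.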

Blockers

CC @fjetter
