
Conversation

@hendrikmakait
Member

@hendrikmakait hendrikmakait commented Jan 19, 2023

Supersedes #7487 and finishes it up.

  • Tests added / passed
  • Passes pre-commit run --all-files

(f3.key, "resumed", "released", "cancelled", {}),
(f3.key, "cancelled", "waiting", "executing", {}),
(f3.key, "executing", "error", "error", {}),
# FIXME: (distributed#7489)
Member Author

Instead of accepting the erred task, the scheduler should reject the result and reschedule the computation (#7489)
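
For context, a conceptual sketch of the idea referenced here, assuming a scheduler-side run_id check as described in #7489. All names (SchedulerTaskState, handle_task_erred) are hypothetical illustrations, not distributed's actual scheduler code:

from dataclasses import dataclass


@dataclass
class SchedulerTaskState:
    # Hypothetical stand-in for the scheduler's per-task record
    key: str
    run_id: int
    state: str = "processing"


def handle_task_erred(ts: SchedulerTaskState, msg_run_id: int) -> str:
    """Sketch: decide how to react to a worker reporting an erred task."""
    if msg_run_id != ts.run_id:
        # Stale result from an earlier attempt: reject the error and
        # reschedule the computation instead of accepting the erred state.
        return "reschedule"
    ts.state = "erred"
    return "erred"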

@hendrikmakait hendrikmakait marked this pull request as draft January 19, 2023 15:36
@hendrikmakait hendrikmakait marked this pull request as ready for review January 20, 2023 12:22
@hendrikmakait hendrikmakait self-assigned this Jan 20, 2023
@github-actions
Contributor

github-actions bot commented Jan 20, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       24 files  ±0        24 suites  ±0   10h 27m 31s ⏱️ +26m 45s
  3 329 tests  +2     3 220 ✔️ +1     105 💤 -1    4 ❌ +2
 39 252 runs  +24   37 357 ✔️ +22   1 890 💤 -1    5 ❌ +3

For more details on these failures, see this check.

Results for commit 3a655f1. ± Comparison against base commit fae59c4.

♻️ This comment has been updated with latest results.

@fjetter
Member

fjetter commented Jan 20, 2023

The codecov is interesting. There are some indirect code coverage changes reported. Basically _transition_resumed_waiting is not covered anymore. That's interesting 🤔

Might of course just be flaky, but there is also a real possibility that this code path is no longer reachable. I was hoping that, with the consistency guarantees the run_id provides, the cancelled/resumed states would no longer be required. This may be the first glimpse of that.

@fjetter
Member

fjetter commented Jan 20, 2023

I can confirm that distributed/tests/test_cancelled_state.py::test_resumed_cancelled_handle_compute[True-True] is triggering this code path on main

@fjetter
Member

fjetter commented Jan 20, 2023

https://app.codecov.io/gh/dask/distributed/blob/main/distributed/worker_state_machine.py

Parts of this transition were already uncovered:
[screenshot: codecov coverage view of _transition_resumed_waiting]

From what I can tell,

  • the first branch (initial state resumed(executing->fetch)) is only covered by distributed/tests/test_cancelled_state.py::test_resumed_cancelled_handle_compute[True-True]
  • the second one (initial state resumed(long-running->fetch)) is covered by the tests in distributed/tests/test_cancelled_state.py::test_secede_cancelled_or_resumed_workerstate
  • the third one was never covered. I went back ~6 months, to where the code looked very different, but similar code sections were already skipped back then.

@fjetter
Member

fjetter commented Jan 20, 2023

The last code branch is indeed impossible. It could only trigger if the scheduler asked a worker to compute a task twice without any additional intermediate messages.
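
To make that concrete, here is an illustrative sketch, in the same low-level WorkerState style as the test further down in this thread; the exact event sequence and the claim that only a back-to-back second compute request would reach the third branch are assumptions drawn from the comment above, not a test from this PR:

from distributed.worker_state_machine import ComputeTaskEvent, FreeKeysEvent

# Hypothetical sketch; ws is the usual WorkerState test fixture and ws2 a
# second worker address, as in the test below.
ws2 = "127.0.0.1:2"
ws.handle_stimulus(
    # y depends on x, so x is fetched from ws2 -> x ends up in flight
    ComputeTaskEvent.dummy("y", who_has={"x": [ws2]}, stimulus_id="s1"),
    # y is freed while x is still in flight -> x becomes cancelled(flight)
    FreeKeysEvent(keys=["y"], stimulus_id="s2"),
    # the scheduler now asks this worker to compute x itself
    # -> x becomes resumed(flight->waiting)
    ComputeTaskEvent.dummy("x", stimulus_id="s3"),
)
# Only a second compute request for x, with no FreeKeysEvent in between, would
# recommend "waiting" again and reach the third branch:
ws.handle_stimulus(ComputeTaskEvent.dummy("x", stimulus_id="s4"))

Since the scheduler never sends two compute-task messages for the same key back-to-back, that final stimulus cannot occur in practice.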

@fjetter
Member

fjetter commented Jan 20, 2023

This is a low-level test that covers the above branches and shows what is happening and why that is OK. This is effectively the scenario you are describing in #7490 (comment), and I believe this is the only way to trigger it.
This transition path becomes impossible if FreeKeysEvent(keys=["y"], stimulus_id="s4") is allowed to release x, i.e. if we remove resumed from PROCESSING.

import pytest

from distributed.worker_state_machine import (
    ComputeTaskEvent,
    Execute,
    FreeKeysEvent,
    SecedeEvent,
)


@pytest.mark.parametrize("secede", [True, False])
def test_compute_free_fetch_compute(ws, secede):
    ws2 = "127.0.0.1:2"
    instructions = ws.handle_stimulus(ComputeTaskEvent.dummy("x", stimulus_id="s1"))
    # Note: A future implementation could also allow the task to be executed
    # again. Right now, the scheduler should reschedule the task because of the
    # wrong run_id.
    if secede:
        ws.handle_stimulus(
            SecedeEvent(
                key="x",
                compute_duration=1.0,
                stimulus_id="secede",
            )
        )
    assert len(instructions) == 1
    assert isinstance(instructions[0], Execute)
    instructions = ws.handle_stimulus(
        # x is released for whatever reason (e.g. client cancellation)
        FreeKeysEvent(keys=["x"], stimulus_id="s2"),
        # x was computed somewhere else
        ComputeTaskEvent.dummy("y", who_has={"x": [ws2]}, stimulus_id="s3"),
        # x was lost / no known replicas, therefore y is cancelled
        FreeKeysEvent(keys=["y"], stimulus_id="s4"),
        ComputeTaskEvent.dummy("x", stimulus_id="s5"),
    )
    assert len(ws.tasks) == 1
    assert ws.tasks["x"].state == ("executing" if not secede else "long-running")

Member

@fjetter fjetter left a comment

good to go once CI is done and green-ish

Comment on lines -2213 to -2253
def _transition_resumed_waiting(
    self, ts: TaskState, *, stimulus_id: str
) -> RecsInstrs:
    """
    See also
    --------
    _transition_cancelled_fetch
    _transition_cancelled_or_resumed_long_running
    _transition_cancelled_waiting
    _transition_resumed_fetch
    """
    # None of the exit events of execute or gather_dep recommend a transition to
    # waiting
    assert not ts.done
    if ts.previous == "executing":
        assert ts.next == "fetch"
        # We're back where we started. We should forget about the entire
        # cancellation attempt
        ts.state = "executing"
        ts.next = None
        ts.previous = None
        return {}, []

    elif ts.previous == "long-running":
        assert ts.next == "fetch"
        # Same as executing, and in addition send the LongRunningMsg in arrears
        # Note that, if the task seceded before it was cancelled, this will cause
        # the message to be sent twice.
        ts.state = "long-running"
        ts.next = None
        ts.previous = None
        smsg = LongRunningMsg(
            key=ts.key, compute_duration=None, stimulus_id=stimulus_id
        )
        return {}, [smsg]

    else:
        assert ts.previous == "flight"
        assert ts.next == "waiting"
        return {}, []

Member

<3

@hendrikmakait
Member Author

I haven't seen test_memory flake before, but CI looks green-ish, and failures appear to be unrelated.

Member

@fjetter fjetter left a comment

A couple of nits but the PR can go in

Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
