Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Jan 18, 2023

There appears to be a race where a task is left dangling, i.e. we produce a memory leak
if a task is being fetched while the downstream task is being cancelled.

Came up during #7486

TODO: Test

@fjetter
Copy link
Member Author

fjetter commented Jan 18, 2023

Might actually not be related to the cancelled/resumed states at all but rather that scheduler and worker states drift apart s.t. the scheduler never has a chance to submit another free-keys

@fjetter
Copy link
Member Author

fjetter commented Jan 18, 2023

Actually the case I'm looking at is even more interesting.

  • Worker fetches dependency DepA for KeyA and sends an AddKeys to scheduler
  • Scheduler asks Worker to forget/cancel KeyA, i.e. it is transferred to state cancelled(executing)
  • AddKeys is received on Scheduler but it is no longer supposed to be in memory, i.e. DepA is flagged as a redundant copy and Worker is asked to forget it again
  • Worker rejects release of this task and submits another AddKeys
  • The scheduler, again, notices that the key is no longer in memory and asks the worker again to release it

The message exchange between worker and scheduler repeats in an endless loop until either DepA is computed again somewhere else OR KeyA finishes executing. Depending on what happens first, the outcome is different.

  • If KeyA finishes first DepA is no longer protected and the ping pong ends there by releasing all data properly
  • If DepA finishes on another worker first, the data of an old run is retained but should be released at a later date.

I might be missing something... the dump looked as if the task was genuinely leaking. Either way, it might be time to get rid of the is_protected check. Will need to dig a bit

@github-actions
Copy link
Contributor

github-actions bot commented Jan 18, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       24 files  ±  0         24 suites  ±0   10h 25m 4s ⏱️ + 20m 17s
  3 326 tests  -   1    3 216 ✔️  -   4     105 💤  - 1    5 +  4 
39 216 runs   - 12  37 277 ✔️  - 58  1 890 💤  - 2  49 +48 

For more details on these failures, see this check.

Results for commit 411ddfb. ± Comparison against base commit fae59c4.

♻️ This comment has been updated with latest results.

@fjetter fjetter changed the title Remove cancelled and resumed from PROCESSING Do not allow for a worker to reject a drop replica request Jan 18, 2023
@hendrikmakait
Copy link
Member

Superseded by #7490

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants