
Conversation

@fjetter (Contributor) commented Jun 7, 2022

Closes #615

The comment suggests that these cancellations are required to delete data from the dask cluster, but they are not necessary, for a couple of reasons:

  • Cancellation itself is only needed if you want to abort an actively running graph, e.g. to stop all submitted futures before they finish. We only reach this code path once everything has completed successfully.
  • What you are most likely looking for instead is del future or future.release(), which tells the scheduler that the future is no longer required so its data can be deleted.
  • Even an explicit release of the data is not required here. We are already at the end of the function, and as soon as we exit the local context the futures are garbage collected; dask dereferences them automatically and releases the data (see the sketch below).
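
To make the distinction concrete, here is a minimal sketch, not STUMPY code: the local client and the toy task are assumptions purely for illustration. It shows the three options in order: cancelling, releasing explicitly, and simply letting the future go out of scope.

from distributed import Client

client = Client(processes=False)  # small local cluster, for illustration only

future = client.submit(sum, [1, 2, 3])
result = future.result()  # at this point the graph has completed successfully

# future.cancel() only matters while work is still running; it aborts the task.
# To tell the scheduler that a finished result is no longer needed, release it:
future.release()  # same effect as: del future

# Neither call is strictly required at the end of a function: once the last
# reference to `future` goes away, it is garbage collected and dask releases
# the underlying data by itself.

client.close()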

@seanlaw (Contributor) commented Jun 7, 2022

@fjetter Thank you for looking into this. Do you know what would happen to the data if, say, stumped is called multiple times with the same inputs? Could this cause a problem (i.e., the garbage collector hasn't deleted the data yet and dask somehow still references the same inputs)? I only ask because, in the past, I did not scatter the data with hash=False:

dask_client.scatter(
    some_array, broadcast=True
)

Then, dask got into a state where some_array could still be referenced by multiple calls to the same STUMPY function (i.e., stumped), but the data was corrupted (likely due to some race condition where the hash was not unique and was still referencing data in a bad state). Adding hash=False resolved the issue, which is also why I chose to cancel the data, to be extra safe. I only did this to be explicit but, perhaps, it is not needed?
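
For reference, the fix described above amounts to adding hash=False to the same call (same variable names as in the snippet above):

dask_client.scatter(
    some_array, broadcast=True, hash=False
)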

@codecov-commenter commented Jun 7, 2022

Codecov Report

Merging #617 (e152986) into main (89487ae) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #617      +/-   ##
==========================================
- Coverage   99.89%   99.89%   -0.01%     
==========================================
  Files          80       80              
  Lines       11356    11312      -44     
==========================================
- Hits        11344    11300      -44     
  Misses         12       12              
Impacted Files     | Coverage Δ
-------------------|------------------
stumpy/aamped.py   | 100.00% <ø> (ø)
stumpy/maamped.py  | 100.00% <ø> (ø)
stumpy/mstumped.py | 100.00% <ø> (ø)
stumpy/stumped.py  | 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fjetter (Contributor, Author) commented Jun 7, 2022

With hash=False you will generate a random key, i.e. even if you were to call scatter with identical data twice, we would store the data twice; dask would not know that it is duplicated and would treat every instance as unique. With this you would never reuse any data and would re-scatter it every time.
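
A minimal sketch of that behaviour (the toy array and local client are assumptions for illustration):

import numpy as np
from distributed import Client

client = Client(processes=False)
x = np.arange(10)

# With hash=False every scatter call gets a fresh random key, so the same
# array is stored twice and is never shared between calls.
f1 = client.scatter(x, hash=False)
f2 = client.scatter(x, hash=False)
print(f1.key == f2.key)  # False

client.close()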

It is possible that data from the previous run is still around for a while before the next run starts, but this will be cleaned up eventually. I am 100% sure that the runs won't mix, even without cancellation.


If you are using hash=True (the default), scatter will hash the data using tokenize (an md5 hash) and dask will recognize that it is the same data. IIUC, we haven't implemented a shortcut that skips the re-upload, i.e. you would not gain a lot by using this.
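
As a quick illustration of what hash=True does under the hood (this is just dask's tokenize behaviour, not code from this PR):

import numpy as np
from dask.base import tokenize

a = np.arange(1_000_000)
b = a.copy()

# Identical contents produce identical tokens, which is how dask recognizes
# re-scattered data as the same key when hash=True.
assert tokenize(a) == tokenize(b)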

The race condition I see is not necessarily due to hash collisions but rather in how we track tasks. I could see some potential for problems with our internal state. However, we have tests that cover this edge case, so there may be something else going on; I don't know.

@seanlaw (Contributor) commented Jun 7, 2022

@fjetter Thank you for the explanation. Also, it looks like all of the tests have passed and, indeed, after re-timing the test, the performance regression appears to be resolved. I appreciate your help and attention on this!

@seanlaw merged commit 7cdeabb into stumpy-dev:main on Jun 7, 2022

Development

Successfully merging this pull request may close these issues: Dask/Distributed Performance Regression
