
Conversation

@fjetter (Contributor) commented Jun 7, 2022

Closes #615

The comment suggests that these cancellations are required to delete data from the dask cluster, but they are not necessary, for a couple of reasons:

  • Cancellation itself is only needed if you want to abort an actively running graph, e.g. to stop all submitted futures before they finish. We only reach this code path once everything has completed successfully.
  • What you are most likely looking for instead is del future or future.release(), which tells the scheduler that the future is no longer required so its data can be deleted.
  • Even an explicit release of the data is not required here. We are already at the end of the function, and as soon as we exit the local context the futures are garbage collected; dask dereferences them automatically and releases the data (see the sketch below).
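
To make the distinction concrete, here is a minimal sketch, not STUMPY code: the local client and the toy task are assumptions purely for illustration. It shows the three options in order: cancelling, releasing explicitly, and simply letting the future go out of scope.

from distributed import Client

client = Client(processes=False)  # small local cluster, for illustration only

future = client.submit(sum, [1, 2, 3])
result = future.result()  # at this point the graph has completed successfully

# future.cancel() only matters while work is still running; it aborts the task.
# To tell the scheduler that a finished result is no longer needed, release it:
future.release()  # same effect as: del future

# Neither call is strictly required at the end of a function: once the last
# reference to `future` goes away, it is garbage collected and dask releases
# the underlying data by itself.

client.close()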

@seanlaw (Contributor) commented Jun 7, 2022

@fjetter Thank you for looking into this. Do you know what would happen to the data if, say, stumped is called multiple times with the same inputs? Could this cause a problem (i.e., the garbage collector hasn't deleted the data yet and dask somehow still references the same inputs)? I only ask because, in the past, I did not scatter the data with hash=False:

dask_client.scatter(
    some_array, broadcast=True
)

Then, dask got into a state where some_array could still be referenced by multiple calls to the same STUMPY function (i.e., stumped), but the data was corrupted (likely due to some race condition where the hash was not unique and was still referencing data in a bad state). Adding hash=False resolved the issue, which is also why I chose to cancel the data, to be extra safe. I only did this to be explicit but, perhaps, it is not needed?
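
For reference, the fix described above amounts to adding hash=False to the same call (same variable names as in the snippet above):

dask_client.scatter(
    some_array, broadcast=True, hash=False
)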

@codecov-commenter commented Jun 7, 2022

Codecov Report

Merging #617 (e152986) into main (89487ae) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #617      +/-   ##
==========================================
- Coverage   99.89%   99.89%   -0.01%     
==========================================
  Files          80       80              
  Lines       11356    11312      -44     
==========================================
- Hits        11344    11300      -44     
  Misses         12       12              
Impacted Files     | Coverage Δ
-------------------|------------------
stumpy/aamped.py   | 100.00% <ø> (ø)
stumpy/maamped.py  | 100.00% <ø> (ø)
stumpy/mstumped.py | 100.00% <ø> (ø)
stumpy/stumped.py  | 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fjetter (Contributor, Author) commented Jun 7, 2022

With hash=False you will generate a random key, i.e. even if you were to call scatter with identical data twice, we would store the data twice; dask would not know that it is duplicated and would treat every instance as unique. With this you would never reuse any data and would re-scatter it every time.
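
A minimal sketch of that behaviour (the toy array and local client are assumptions for illustration):

import numpy as np
from distributed import Client

client = Client(processes=False)
x = np.arange(10)

# With hash=False every scatter call gets a fresh random key, so the same
# array is stored twice and is never shared between calls.
f1 = client.scatter(x, hash=False)
f2 = client.scatter(x, hash=False)
print(f1.key == f2.key)  # False

client.close()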

It is possible that data from the previous run is still around for a while before the next run starts, but this will be cleaned up eventually. I am 100% sure that the runs won't mix, even without cancellation.


If you are using hash=True (the default), scatter will hash the data using tokenize (an md5 hash) and dask will recognize that it is the same data. IIUC, we haven't implemented a shortcut that skips the re-upload, i.e. you would not gain a lot by using this.
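
As a quick illustration of what hash=True does under the hood (this is just dask's tokenize behaviour, not code from this PR):

import numpy as np
from dask.base import tokenize

a = np.arange(1_000_000)
b = a.copy()

# Identical contents produce identical tokens, which is how dask recognizes
# re-scattered data as the same key when hash=True.
assert tokenize(a) == tokenize(b)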

The race condition I see is not necessarily due to hash collisions but rather in how we track tasks. I could see some potential for problems with our internal state. However, we have tests that cover this edge case, so there may be something else going on; I don't know.

@seanlaw (Contributor) commented Jun 7, 2022

@fjetter Thank you for the explanation. Also, it looks like all of the tests have passed and, indeed, after re-timing the test, the performance regression appears to be resolved. I appreciate your help and attention on this!

@seanlaw merged commit 7cdeabb into stumpy-dev:main on Jun 7, 2022

Development

Successfully merging this pull request may close these issues: Dask/Distributed Performance Regression
