Conversation
Is it possible to reproduce it in unit tests (can we try)? I think it is important to have good tests there to avoid regressions in the future. My guess is that we can intentionally create a highly concurrent workload that fails frequently (by increasing the number of object manager threads and reducing the object store memory size, using a smaller chunk size across many objects).
I have tried with many chunks being pushed, but it did not reproduce the issue, probably because the issue arises only with failures. I can add that test. An additional step would be to add a Ray internal config option that fails 50% of create buffer requests. Does that sound OK?
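To make the proposal concrete, here is a minimal sketch of fault injection at a fixed rate. It is not Ray code: `FlakyBufferPool`, `CreateBufferError`, `failure_rate`, and `seed` are all hypothetical names chosen for illustration, and the real change would presumably live in the object manager and read its rate from an internal config option.

```python
import random


class CreateBufferError(Exception):
    """Hypothetical error standing in for a failed create-buffer request."""


class FlakyBufferPool:
    """Toy buffer pool that fails a configurable fraction of create requests."""

    def __init__(self, failure_rate=0.5, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be in [0, 1]")
        self.failure_rate = failure_rate
        # A private RNG keeps runs reproducible when a seed is given.
        self._rng = random.Random(seed)
        self.buffers = {}

    def create_buffer(self, object_id, size):
        # Inject the failure before touching any state, so a failed
        # request leaves nothing behind.
        if self._rng.random() < self.failure_rate:
            raise CreateBufferError(f"injected failure for {object_id}")
        if object_id in self.buffers:
            raise CreateBufferError(f"{object_id} already exists")
        self.buffers[object_id] = bytearray(size)
        return self.buffers[object_id]
```

Setting `failure_rate=0.5` corresponds to the 50% figure suggested above; `0.0` disables injection, so the same code path can be used in normal runs.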
@mwtian when you say failure, is it a creation failure because the object is already created (or because of a lack of memory)? Isn't it reproducible just by increasing and decreasing the config at ray/src/ray/common/ray_config_def.h, line 205 in c4bc05b?
(If it doesn't work, I think adding artificial failures is not a bad idea.)
With the original change, there wouldn't be any create buffer failure due to the object already existing. I'm not 100% sure what failures are encountered during buffer creation in nightly tests; they seem to happen rarely. Will add a unit test.
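A unit test along these lines could be shaped roughly as follows. This is only a sketch under stated assumptions: `run_stress` and all of its parameters are hypothetical, the worker body is a placeholder for the real create-buffer call, and the failure is simulated with a per-worker random draw. The part worth keeping is the deadline on `join`, which turns a stuck thread into a reported count instead of a hung test.

```python
import random
import threading
import time


def run_stress(num_workers=8, requests_per_worker=100, failure_rate=0.5,
               timeout_s=10.0, seed=0):
    """Run concurrent requests with injected failures.

    Returns (succeeded, failed, stalled_workers).
    """
    lock = threading.Lock()
    counts = {"ok": 0, "failed": 0}

    def worker(worker_id):
        # Per-worker RNG so the failure pattern does not depend on scheduling.
        rng = random.Random(seed * 1_000_003 + worker_id)
        for _ in range(requests_per_worker):
            # Placeholder for the real request; here it only decides the outcome.
            failed = rng.random() < failure_rate
            with lock:
                counts["failed" if failed else "ok"] += 1

    threads = [threading.Thread(target=worker, args=(i,), daemon=True)
               for i in range(num_workers)]
    for t in threads:
        t.start()

    # Join against one shared deadline; anything still alive afterwards
    # is treated as stalled.
    deadline = time.monotonic() + timeout_s
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    stalled = sum(t.is_alive() for t in threads)
    return counts["ok"], counts["failed"], stalled
```

A test would then assert that `stalled` is zero and that the two counters add up to `num_workers * requests_per_worker`.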
ericl left a comment
Nice! Seems like quite a subtle bug.
We can do this as a follow-up; merging to get more data from nightly tests.
Why are these changes needed?
This PR re-applies d12e35c and fixes the issue discovered after the original, since-reverted commit.
#18955 contains the background information of the original commit.
The original commit can cause threads to get stuck under the following condition:
Eventually an object transfer would not complete, likely related to more threads being stuck in a limbo state like request 3. Hence the test stalled.
The original change and its fix in this PR passed 3 consecutive dask_on_ray_large_scale_test_no_spilling runs. For now we will rely on this nightly test to catch similar issues in the future. If we can inject failures into buffer creation, this issue might be reproducible in unit tests too.
Related issue number
#18062
Checks
I've run scripts/format.sh to lint the changes in this PR.