
Call TriggerGlobalGC when the plasma store is full#7337

Merged
edoakes merged 11 commits into ray-project:master from edoakes:gc-on-full
Feb 27, 2020

Conversation

@edoakes (Collaborator) commented Feb 26, 2020

Why are these changes needed?

If the plasma store is full, it may be due to cyclic references to ObjectIDs in Python somewhere on the cluster. This change causes workers to trigger a global GC in an attempt to collect such references and free up space before giving up when the plasma store is full.
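To illustrate the failure mode, here is a minimal pure-Python sketch (no Ray required; the `Holder` class is a hypothetical stand-in for an object holding an ObjectID): reference counting alone cannot reclaim a cycle, so the pinned object survives until the cycle collector runs.

```python
import gc
import weakref


class Holder:
    """Hypothetical stand-in for a Python object holding a Ray ObjectID."""

    def __init__(self):
        self.cycle = self  # self-reference forms a cycle


h = Holder()
probe = weakref.ref(h)
del h                        # refcount stays nonzero: the cycle pins the object
assert probe() is not None   # still alive, so its ObjectID would stay pinned too
gc.collect()                 # the cycle collector breaks the cycle
assert probe() is None       # now freed, releasing the plasma object
```

This is why the raylet asks workers to run a global GC before concluding the store is truly out of space.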


@edoakes (Collaborator, Author) commented Feb 26, 2020

Dependent on #7328.

@AmplabJenkins

Can one of the admins verify this patch?

Review thread on this diff excerpt:

      << MemoryUsageString() << "\nWaiting " << delay
      << "ms for space to free up...";
  if (on_store_full_) {
    on_store_full_();
Contributor:

Should we call this once or multiple times? I guess it's possible you might need a series of gc calls, though I'm not sure about the scenario.

Collaborator (Author):

Let's call it multiple times for now to cover all scenarios. This is throttled to once per heartbeat interval across the cluster anyway.
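The call-on-every-failed-attempt behavior discussed above can be sketched as follows (a minimal Python sketch with hypothetical names; the real logic lives in the raylet's C++ object manager, and the GC broadcast itself is throttled separately):

```python
import time


def create_with_retries(try_create, on_store_full=None,
                        delay_ms=100, max_retries=5):
    """Sketch: on each failed attempt, fire the store-full callback
    (which would broadcast a global GC) and wait before retrying."""
    for _ in range(max_retries):
        if try_create():
            return True
        if on_store_full is not None:
            on_store_full()  # e.g. trigger a cluster-wide gc.collect()
        time.sleep(delay_ms / 1000.0)
    return try_create()  # one last attempt after the final wait
```

Firing the callback on every attempt is safe precisely because the actual GC broadcast is rate-limited to once per heartbeat interval.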

@ericl (Contributor) left a comment:

Can we add an end-to-end Python test for this, similar to the one in the ref count tests?

@ericl (Contributor) commented Feb 26, 2020

Oh nevermind, I see it now.

@ericl (Contributor) commented Feb 26, 2020

Here's another test you can add that introduces cyclic refs in both the worker and driver:

import ray
import gc
import random
import numpy as np


ray.init(num_cpus=1, object_store_memory=2000 * 1024 * 1024)


def block():
    # 100 MiB payload so the store fills up quickly.
    return np.zeros(100 * 1024 * 1024, dtype=np.uint8)


class CyclicRef:
    def __init__(self, v):
        self.value = v
        self.cyclic_ref = self  # self-reference: only gc.collect() reclaims it


refs = [
    CyclicRef(ray.put(block())) for _ in range(10)
]


@ray.remote
def f(x):
    # Create a cyclic reference in the worker that pins x until a GC runs.
    y = CyclicRef(x)
    # gc.collect()
    return x


for j in range(1000):
    i = random.randint(0, len(refs) - 1)
    res = f.remote(refs[i].value)
    refs[i] = CyclicRef(res)
    ray.get(res)
    print("processing", j)
    # gc.collect()

A couple of other suggestions:

  • only print "gc collect freed ..." in raylet.pyx if the number of freed refs is > 0, to avoid spam
  • only print the object store memory message if all retries failed
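The first suggestion amounts to a small logging guard like this (a hypothetical sketch, not the actual raylet.pyx code):

```python
def report_gc(num_freed, log=print):
    """Sketch of the suggested guard: stay quiet when the collection
    freed nothing, so periodic GC triggers don't spam the logs."""
    if num_freed > 0:
        log("gc collect freed {} refs".format(num_freed))
```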

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22458/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22456/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22467/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22460/

@edoakes edoakes merged commit bd9411f into ray-project:master Feb 27, 2020
ffbin pushed a commit to antgroup/ant-ray that referenced this pull request Mar 20, 2020

Labels: None yet

Projects: None yet

3 participants