[workflow] Fast workflow indexing #24767
Conversation
Sorry for the late review, and thanks for working on this! Overall, I'm fine with the protocol. My concern is that this protocol forces readers to be single-threaded as well, which doesn't seem to work in the current system. One option is to make reads work without fixing the half-finished updates, and only fix them when there is a write.
Let me try to implement another protocol using key creation time. I think it might address the multi-threading issues.
@iycheng ready for review again
```diff
 def cancel(workflow_id: str) -> None:
     try:
         workflow_manager = get_management_actor()
-        ray.get(workflow_manager.cancel_workflow.remote(workflow_id))
     except ValueError:
         wf_store = workflow_storage.get_workflow_storage(workflow_id)
         wf_store.save_workflow_meta(WorkflowMetaData(WorkflowStatus.CANCELED))
         # TODO(suquark): Here we update workflow status "offline", so it is likely
         # thread-safe because there is no workflow management actor updating the
         # workflow concurrently. But we should be careful if we are going to
         # update more workflow status offline in the future.
         wf_store.update_workflow_status(WorkflowStatus.CANCELED)
         return
+    ray.get(workflow_manager.cancel_workflow.remote(workflow_id))
```
Should we move ray.get back into the try block?
No, I think we put ray.get in the try block accidentally. ray.get cannot raise ValueError; only get_management_actor raises ValueError.
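A minimal sketch of the exception scoping being discussed (not the PR's real code; the in-memory stand-ins for the actor lookup and the remote call are assumptions for illustration): only the actor lookup can raise ValueError when the management actor is missing, so it alone belongs in the try block, and the call on the actor stays outside, where its errors cannot be mistaken for "actor not found".

```python
def get_management_actor(actors):
    # Stand-in for the real actor lookup (e.g. ray.get_actor), which
    # raises ValueError when the named actor does not exist.
    if "workflow_manager" not in actors:
        raise ValueError("management actor is not running")
    return actors["workflow_manager"]

def cancel(workflow_id, actors):
    try:
        manager = get_management_actor(actors)
    except ValueError:
        # Offline path: update the workflow status directly in storage.
        return ("offline", workflow_id)
    # Online path: intentionally outside the try block, so an error from
    # the cancel call itself is never treated as "actor not found".
    return ("online", manager(workflow_id))
```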
```python
        return WorkflowStatus(metadata["status"])
    return WorkflowStatus.NONE

def list_workflow(self) -> List[Tuple[str, WorkflowStatus]]:
```
I think we need an extra parameter here: list_workflow(self, status=None).
If status is set, we'll only check the dirty directory and the specified status directory. I'm OK with another PR to fix this.
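A hedged sketch of the suggested signature (the parameter name status=None comes from the comment above; the in-memory index and dirty set, and the use of plain strings for statuses, are assumptions for illustration): with a filter set, only the requested status directory needs to be scanned instead of every status directory, and dirty entries are skipped until repaired.

```python
from typing import Dict, List, Optional, Set, Tuple

def list_workflow(
    index: Dict[str, Set[str]],   # status name -> workflow ids under that directory
    dirty: Set[str],              # ids whose status update is still in progress
    status: Optional[str] = None,
) -> List[Tuple[str, str]]:
    # With a filter, only one status directory is consulted; without one,
    # every status directory must be scanned.
    statuses = [status] if status is not None else list(index)
    results = []
    for s in statuses:
        for wf_id in index.get(s, ()):
            if wf_id in dirty:
                continue  # status cannot be trusted until repaired
            results.append((wf_id, s))
    return sorted(results)
```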
```python
if s != prev_status:
    self._storage.delete(
        self._key_workflow_with_status(workflow_id, s)
    )
```
No need to update this PR, but do you think in the future we could put the status in the dirty flag and only delete that?
It seems there is a bug here: I think we should always set the flag. Another thing is that I think we need to push the status filter into the storage layer, so that we don't need to read the status of successful workflows, which is not useful for resume_all/list_all with a filter. I'm OK with this PR and having another one for this optimization.
@iycheng I think we did not set the dirty flag because in that branch we have already detected the dirty flag. Since workflow status updating is single-threaded, there is no need to create it again. (Creating it again also would not work in the concurrent case, because another, faster process could delete the newly created flag anyway; the order of creates and deletes from different processes can be arbitrary.)
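The single-writer update sequence being discussed can be sketched as follows (a minimal illustration, not the PR's actual code: the key tuples, the repairing flag, and the plain set used as the store are all assumptions). The dirty flag brackets the non-atomic steps, and the repair path reuses the flag that was already found instead of recreating it.

```python
from typing import Optional, Set, Tuple

Key = Tuple[str, ...]

def update_status(
    store: Set[Key],
    workflow_id: str,
    new_status: str,
    old_status: Optional[str],
    repairing: bool = False,
) -> None:
    if not repairing:
        # Mark the update as in progress. During a repair the flag already
        # exists (that is how the broken update was detected), so it is
        # deliberately not created again.
        store.add(("dirty", workflow_id))
    store.add(("status", new_status, workflow_id))          # create the new index key
    if old_status is not None:
        store.discard(("status", old_status, workflow_id))  # delete the old index key
    store.discard(("dirty", workflow_id))                   # update complete
```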
Got it, and thanks for the explanation.
@iycheng I just updated the PR and added support for the status filter. I also fixed a bug: in the original
Force-pushed from 4fca233 to 7171aa1
CI failures seem unrelated. I'll merge this PR.
Why are these changes needed?
This PR enables indexing of workflow status, so listing workflows and their statuses is much faster.

The indexing is done by creating keys under the corresponding status directories. For example, the RUNNING directory contains one key (named with the workflow id) for every workflow that is currently running.

One issue is that the cluster or workflow may crash while the status is being updated, leaving the status inconsistent: we have to create the new key, delete the old key, and update the workflow metadata, and these actions cannot be combined into a single atomic operation. We use a special directory to mark that a status update is underway. This makes it possible to detect unfinished status updates and fix them. (See examples in the newly added tests.)
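The detection-and-repair step described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the on-disk layout (status/&lt;STATUS&gt;/&lt;workflow_id&gt; index keys, a dirty/&lt;workflow_id&gt; flag, and a &lt;workflow_id&gt;/meta.json file) and the function name repair_dirty are assumptions for the example, and it treats the workflow metadata as the source of truth when reconciling index keys.

```python
import json
import os

def repair_dirty(root: str) -> None:
    """Reconcile index keys for every workflow whose status update was
    interrupted, i.e. whose flag still exists under the dirty directory.
    The layout and names here are assumptions for this sketch."""
    dirty_dir = os.path.join(root, "dirty")
    status_dir = os.path.join(root, "status")
    for wf_id in os.listdir(dirty_dir):
        # Treat the workflow metadata as the authoritative status.
        with open(os.path.join(root, wf_id, "meta.json")) as f:
            true_status = json.load(f)["status"]
        # Delete index keys that disagree with the metadata...
        for status in os.listdir(status_dir):
            key = os.path.join(status_dir, status, wf_id)
            if status != true_status and os.path.exists(key):
                os.remove(key)
        # ...ensure the correct index key exists...
        open(os.path.join(status_dir, true_status, wf_id), "w").close()
        # ...and clear the dirty flag only once the repair is complete.
        os.remove(os.path.join(dirty_dir, wf_id))
```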
Checks
I've run scripts/format.sh to lint the changes in this PR.