[workflow] Workflow queue by suquark · Pull Request #24697 · ray-project/ray

suquark · 2022-05-11T19:12:22Z

Why are these changes needed?

Implementation of #24029

Here are the behaviors this PR supports and tests:

The queued workflow has a PENDING status
When a running workflow completes, a pending workflow from the queue starts running.
If the workflow queue is full, submitting a workflow raises queue.Full("Workflow queue has been full") immediately.
Submitting a workflow would not be blocked when the workflow is queued (pended).
workflow.run() or ray.get(workflow.async_run()) would block on the pending workflow, until the workflow resumes running and finishes.
ray.workflow.get_output_async(workflow_id) would not be blocked when the workflow is queued (pended).
ray.workflow.get_output(workflow_id) would block on the pending workflow, until the workflow resumes running and finishes.
Both pending and running workflows could be resumed, if the system was down accidentally.
When both running and pending workflows are resumed with workflow.resume_all(), running workflows have the higher priority (i.e. the pending workflows would still likely be pending).

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

suquark · 2022-05-11T19:16:09Z

@iycheng @ericl This would be the workflow queue API. See if you like it.

The API requires users to specify max_concurrent_jobs and max_queueing_jobs in workflow.init. Once specified, these options would be permanent, you cannot change them unless tearing down the whole Ray cluster.

And due to previous limitations in our workflow, a workflow is only considered "finished" after users get the result of the workflow. A backend running workflow will never finish and will occupy max_concurrent_jobs forever unless tearing down the whole Ray cluster. Major changes are required to fix this, involving letting the workflow management actor to scheduling all workflow tasks and update workflow status.

python/ray/workflow/api.py

ericl

Do we need to add a new PENDING/QUEUED workflow state?

suquark · 2022-05-14T00:11:52Z

I think adding a new PENDING status is good, otherwise it would also be harder to unittest this PR.

This PR is blocked by #24767 for workflow status updating (adding new status).

python/ray/workflow/api.py

python/ray/workflow/workflow_access.py

fishbone · 2022-05-14T05:43:04Z

Also, how about resuming all? It seems that the queue reconstructed doesn't take order into consideration.

suquark · 2022-05-25T23:54:06Z

too messy to merge. let me squash and rebase it. the comment are easy to locate in the code

suquark · 2022-05-29T04:12:33Z

@iycheng ready for review

python/ray/workflow/workflow_storage.py

python/ray/workflow/workflow_access.py

python/ray/workflow/tests/test_workflow_queuing.py

fishbone

Overall it looks good. Please check the comments.

suquark · 2022-07-06T21:56:19Z

Some tests will be affected by #26318, so I will retest them later.

doc/source/workflows/management.rst

python/ray/workflow/workflow_access.py

fishbone

Overall it looks good and the new engine simplifies a lot of things. Could you please also address the comments I left in the previous reviews?

fishbone

Commented.

One question about resume: for pending one's, when they are added back, are they still in the same order as before?

I am ok with either right now. But please add comments/doc about this as well.

python/ray/workflow/tests/test_workflow_queuing.py

suquark · 2022-07-09T00:23:58Z

CI failures are not related. I'll merge this PR.

* master: (42 commits) [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (ray-project#26408) [Doc] Fix docs feedback button (ray-project#26402) [core][1/2] Improve liveness check in GCS (ray-project#26405) [RLlib] Checkpoint and restore connectors. (ray-project#26253) [Workflow] Minor refactoring of workflow exceptions (ray-project#26398) [workflow] Workflow queue (ray-project#24697) [RLlib] Minor simplification of code. (ray-project#26312) [AIR] Update TensorflowPredictor to new API (ray-project#26215) [RLlib] Make Dataset reader default reader and enable CRR to use dataset (ray-project#26304) [runtime_env] [doc] Remove outdated info about "isolated" environment (ray-project#26314) [Doc] Fix rate-the-docs plugin (ray-project#26384) [Docs] [Serve] Has a consistent landing page style (ray-project#26029) [dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (ray-project#26297) [tune] Use `Checkpoint.to_bytes()` for store_to_object (ray-project#25805) [tune] Fix `SyncerCallback` having a size limit (ray-project#26371) [air] Serialize additional files in dict checkpoints turned dir checkpoints (ray-project#26351) [Docs] Add "rate the docs" plugin for feedback on docs (ray-project#26330) [Doc] Fix actor example (ray-project#26381) Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (ray-project#26366) [Datasets] Update docs for drop_columns and fix typos (ray-project#26317) ...

* implement workflow queue Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>

suquark assigned fishbone and ericl May 11, 2022

ericl reviewed May 11, 2022

View reviewed changes

python/ray/workflow/api.py Outdated Show resolved Hide resolved

ericl reviewed May 11, 2022

View reviewed changes

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 11, 2022

suquark marked this pull request as ready for review May 14, 2022 00:10

suquark requested review from fishbone and stephanie-wang as code owners May 14, 2022 00:10

suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 14, 2022

fishbone reviewed May 14, 2022

View reviewed changes

python/ray/workflow/api.py Outdated Show resolved Hide resolved

fishbone reviewed May 14, 2022

View reviewed changes

python/ray/workflow/workflow_access.py Outdated Show resolved Hide resolved

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 16, 2022

suquark force-pushed the workflow_queue branch from bb3e7be to 3ca3c93 Compare May 25, 2022 23:58

suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 29, 2022

suquark requested a review from fishbone May 29, 2022 03:11

suquark requested a review from maxpumperla as a code owner May 29, 2022 04:11