[workflow] Workflow queue #24697
|
@iycheng @ericl This would be the workflow queue API. See if you like it. The API requires users to specify And due to previous limitations in our workflow, a workflow is only considered "finished" after users get the result of the workflow. A backend running workflow will never finish and will occupy |
ericl
left a comment
Do we need to add a new PENDING/QUEUED workflow state?
|
I think adding a new PENDING status is good, otherwise it would also be harder to unittest this PR. This PR is blocked by #24767 for workflow status updating (adding new status). |
|
Also, how about resuming all? It seems that the reconstructed queue doesn't take order into consideration. |
|
Too messy to merge. Let me squash and rebase it. The comments are easy to locate in the code. |
|
@iycheng ready for review |
fishbone
left a comment
Overall it looks good. Please check the comments.
|
Some tests will be affected by #26318, so I will retest them later. |
fishbone
left a comment
Overall it looks good and the new engine simplifies a lot of things. Could you please also address the comments I left in the previous reviews?
fishbone
left a comment
Commented.
One question about resume: for pending ones, when they are added back, are they still in the same order as before?
I am ok with either right now. But please add comments/doc about this as well.
|
CI failures are not related. I'll merge this PR. |
* master: (42 commits)
  [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (ray-project#26408)
  [Doc] Fix docs feedback button (ray-project#26402)
  [core][1/2] Improve liveness check in GCS (ray-project#26405)
  [RLlib] Checkpoint and restore connectors. (ray-project#26253)
  [Workflow] Minor refactoring of workflow exceptions (ray-project#26398)
  [workflow] Workflow queue (ray-project#24697)
  [RLlib] Minor simplification of code. (ray-project#26312)
  [AIR] Update TensorflowPredictor to new API (ray-project#26215)
  [RLlib] Make Dataset reader default reader and enable CRR to use dataset (ray-project#26304)
  [runtime_env] [doc] Remove outdated info about "isolated" environment (ray-project#26314)
  [Doc] Fix rate-the-docs plugin (ray-project#26384)
  [Docs] [Serve] Has a consistent landing page style (ray-project#26029)
  [dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (ray-project#26297)
  [tune] Use `Checkpoint.to_bytes()` for store_to_object (ray-project#25805)
  [tune] Fix `SyncerCallback` having a size limit (ray-project#26371)
  [air] Serialize additional files in dict checkpoints turned dir checkpoints (ray-project#26351)
  [Docs] Add "rate the docs" plugin for feedback on docs (ray-project#26330)
  [Doc] Fix actor example (ray-project#26381)
  Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (ray-project#26366)
  [Datasets] Update docs for drop_columns and fix typos (ray-project#26317)
  ...
* implement workflow queue Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Why are these changes needed?
Implementation of #24029
Here are the behaviors this PR supports and tests:
- `PENDING` status
- `queue.Full("Workflow queue has been full")` immediately.
- `workflow.run()` or `ray.get(workflow.async_run())` would block on the pending workflow, until the workflow resumes running and finishes.
- `ray.workflow.get_output_async(workflow_id)` would not be blocked when the workflow is queued (pended).
- `ray.workflow.get_output(workflow_id)` would block on the pending workflow, until the workflow resumes running and finishes.
- `workflow.resume_all()`, running workflows have the higher priority (i.e. the pending workflows would still likely be pending).
Related issue number
Checks
`scripts/format.sh` to lint the changes in this PR.
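For readers who want a concrete picture of the queueing behaviors listed under "Why are these changes needed?", here is a minimal toy model in plain Python. It is only a sketch and not the code in this PR: the class `WorkflowQueueSketch`, its methods, and the parameter names `max_running` and `max_pending` are hypothetical, and the FIFO ordering of pending workflows is an assumption (the review thread leaves the ordering after resume open). The only things taken from the description are the `PENDING` status, the `queue.Full("Workflow queue has been full")` error, and the rule that running workflows take priority over pending ones on resume.

```python
import queue
from collections import deque


class WorkflowQueueSketch:
    """Toy model of a bounded workflow queue (hypothetical, not Ray's code)."""

    def __init__(self, max_running, max_pending):
        self.max_running = max_running
        self.max_pending = max_pending
        self.running = set()
        self.pending = deque()  # FIFO order is an assumption of this sketch

    def submit(self, workflow_id):
        # Run immediately while there is a free running slot.
        if len(self.running) < self.max_running:
            self.running.add(workflow_id)
            return "RUNNING"
        # No slot and no room to wait: fail immediately.
        if len(self.pending) >= self.max_pending:
            raise queue.Full("Workflow queue has been full")
        self.pending.append(workflow_id)
        return "PENDING"

    def finish(self, workflow_id):
        # Free the slot and promote the oldest pending workflow, if any.
        self.running.remove(workflow_id)
        if self.pending:
            promoted = self.pending.popleft()
            self.running.add(promoted)
            return promoted
        return None

    def resume_all(self, was_running, was_pending):
        # Previously running workflows are resubmitted first, so previously
        # pending ones are likely to stay pending. queue.Full propagates if
        # the limits are too small for everything being resumed.
        statuses = {}
        for workflow_id in list(was_running) + list(was_pending):
            statuses[workflow_id] = self.submit(workflow_id)
        return statuses


q = WorkflowQueueSketch(max_running=1, max_pending=1)
print(q.submit("wf-1"))  # RUNNING
print(q.submit("wf-2"))  # PENDING
try:
    q.submit("wf-3")
except queue.Full as exc:
    print("rejected:", exc)
print(q.finish("wf-1"))  # wf-2
```

In this model a caller blocking on a pending workflow would simply wait until `finish` promotes it; the sketch does not model the blocking `get_output` calls themselves.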