Skip to content

[workflow] Workflow queue#24697

Merged
suquark merged 10 commits intoray-project:masterfrom
suquark:workflow_queue
Jul 9, 2022
Merged

[workflow] Workflow queue#24697
suquark merged 10 commits intoray-project:masterfrom
suquark:workflow_queue

Conversation

@suquark
Copy link
Member

@suquark suquark commented May 11, 2022

Why are these changes needed?

Implementation of #24029

Here are the behaviors this PR supports and tests:

  • The queued workflow has a PENDING status
  • When a running workflow completes, a pending workflow from the queue starts running.
  • If the workflow queue is full, submitting a workflow raises queue.Full("Workflow queue has been full") immediately.
  • Submitting a workflow would not be blocked when the workflow is queued (pended).
  • workflow.run() or ray.get(workflow.async_run()) would block on the pending workflow, until the workflow resumes running and finishes.
  • ray.workflow.get_output_async(workflow_id) would not be blocked when the workflow is queued (pended).
  • ray.workflow.get_output(workflow_id) would block on the pending workflow, until the workflow resumes running and finishes.
  • Both pending and running workflows could be resumed, if the system was down accidentally.
  • When both running and pending workflows are resumed with workflow.resume_all(), running workflows have the higher priority (i.e. the pending workflows would still likely be pending).

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@suquark
Copy link
Member Author

suquark commented May 11, 2022

@iycheng @ericl This would be the workflow queue API. See if you like it.

The API requires users to specify max_concurrent_jobs and max_queueing_jobs in workflow.init. Once specified, these options would be permanent, you cannot change them unless tearing down the whole Ray cluster.

And due to previous limitations in our workflow, a workflow is only considered "finished" after users get the result of the workflow. A backend running workflow will never finish and will occupy max_concurrent_jobs forever unless tearing down the whole Ray cluster. Major changes are required to fix this, involving letting the workflow management actor to scheduling all workflow tasks and update workflow status.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a new PENDING/QUEUED workflow state?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 11, 2022
@suquark suquark marked this pull request as ready for review May 14, 2022 00:10
@suquark suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 14, 2022
@suquark
Copy link
Member Author

suquark commented May 14, 2022

I think adding a new PENDING status is good, otherwise it would also be harder to unittest this PR.

This PR is blocked by #24767 for workflow status updating (adding new status).

@fishbone
Copy link
Contributor

Also, how about resuming all? It seems that the queue reconstructed doesn't take order into consideration.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 16, 2022
@suquark
Copy link
Member Author

suquark commented May 25, 2022

too messy to merge. let me squash and rebase it. the comment are easy to locate in the code

@suquark suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 29, 2022
@suquark suquark requested a review from fishbone May 29, 2022 03:11
@suquark suquark requested a review from maxpumperla as a code owner May 29, 2022 04:11
@suquark
Copy link
Member Author

suquark commented May 29, 2022

@iycheng ready for review

Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good. Please check the comments.

@ericl ericl removed their assignment Jun 1, 2022
@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 3, 2022
@suquark
Copy link
Member Author

suquark commented Jul 6, 2022

Some tests will be affected by #26318, so I will retest them later.

Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good and the new engine simplifies a lot of things. Could you please also address the comments I left in the previous reviews?

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 7, 2022
@suquark suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 8, 2022
@suquark suquark requested a review from fishbone July 8, 2022 01:49
Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented.

One question about resume: for pending one's, when they are added back, are they still in the same order as before?

I am ok with either right now. But please add comments/doc about this as well.

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 8, 2022
@suquark suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 8, 2022
@suquark
Copy link
Member Author

suquark commented Jul 9, 2022

CI failures are not related. I'll merge this PR.

@suquark suquark merged commit b0e913f into ray-project:master Jul 9, 2022
@suquark suquark deleted the workflow_queue branch July 9, 2022 00:24
truelegion47 pushed a commit to truelegion47/ray that referenced this pull request Jul 9, 2022
* master: (42 commits)
  [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (ray-project#26408)
  [Doc] Fix docs feedback button (ray-project#26402)
  [core][1/2] Improve liveness check in GCS  (ray-project#26405)
  [RLlib] Checkpoint and restore connectors. (ray-project#26253)
  [Workflow] Minor refactoring of workflow exceptions (ray-project#26398)
  [workflow] Workflow queue (ray-project#24697)
  [RLlib] Minor simplification of code. (ray-project#26312)
  [AIR] Update TensorflowPredictor to new API (ray-project#26215)
  [RLlib] Make Dataset reader default reader and enable CRR to use dataset (ray-project#26304)
  [runtime_env] [doc] Remove outdated info about "isolated" environment (ray-project#26314)
  [Doc] Fix rate-the-docs plugin (ray-project#26384)
  [Docs] [Serve] Has a consistent landing page style (ray-project#26029)
  [dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (ray-project#26297)
  [tune] Use `Checkpoint.to_bytes()` for store_to_object (ray-project#25805)
  [tune] Fix `SyncerCallback` having a size limit (ray-project#26371)
  [air] Serialize additional files in dict checkpoints turned dir checkpoints (ray-project#26351)
  [Docs] Add "rate the docs" plugin for feedback on docs (ray-project#26330)
  [Doc] Fix actor example (ray-project#26381)
  Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (ray-project#26366)
  [Datasets] Update docs for drop_columns and fix typos (ray-project#26317)
  ...
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
* implement workflow queue

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants