Currently, dstack's retry policy is very limited: it only applies to interrupted spot jobs, even though its description says the run is retried on failure. To make the retry policy useful, it should cover common use cases, including the following:
- When I run a production service, I want to always restart the job if it fails for any reason (possibly with some large duration limit).
- When I run a one-time task, I want to retry provisioning only to wait for capacity. I don't want the job to be restarted if there is a problem with my code.
The current retry policy specification looks like this:
retry_policy:
  retry: true
  duration: 1h
We should introduce new values for retry:
- retry: always – always retries the job unless it is explicitly stopped
- retry: no-capacity – retries on no capacity or interruption, but not if the job itself failed
- retry: never – the default
So retry_policy could look like this:
retry_policy:
  retry: always
  duration: 1h
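The intended behavior of the three values can be sketched as a small decision function. This is only an illustration of the proposal, not dstack's implementation: `RetryMode`, `Termination`, and `should_retry` are hypothetical names, and the set of termination reasons is an assumption made for the example.

```python
from enum import Enum


class RetryMode(str, Enum):
    # The three values proposed for `retry`.
    ALWAYS = "always"
    NO_CAPACITY = "no-capacity"
    NEVER = "never"


class Termination(str, Enum):
    # Hypothetical termination reasons, named for this sketch only.
    NO_CAPACITY = "no_capacity"
    INTERRUPTED = "interrupted"
    JOB_FAILED = "job_failed"
    STOPPED_BY_USER = "stopped_by_user"


def should_retry(mode: RetryMode, reason: Termination) -> bool:
    """Return True if a job that ended for `reason` should be resubmitted."""
    # An explicit stop is never retried, even with `always`.
    if reason is Termination.STOPPED_BY_USER:
        return False
    if mode is RetryMode.ALWAYS:
        return True
    if mode is RetryMode.NO_CAPACITY:
        # Retry only when the job could not get or keep an instance.
        return reason in (Termination.NO_CAPACITY, Termination.INTERRUPTED)
    # RetryMode.NEVER
    return False
```

With this sketch, a production service would use `always`, while a one-time task would use `no-capacity` so that a bug in user code does not trigger a restart.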
To select a retry policy from the CLI, the --retry option could accept the same values:
dstack run . --retry=always --retry-duration=1h
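As a rough sketch of how such options might be parsed, the snippet below uses Python's standard `argparse`. It is not dstack's actual CLI code: `build_parser` and `parse_duration` are hypothetical helpers, and the accepted duration units (`s`, `m`, `h`, `d`) are an assumption based on the `1h` example.

```python
import argparse
import re
from datetime import timedelta

# Assumed unit suffixes; only `1h` appears in the proposal itself.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '1h' or '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser covering only the two retry options.
    parser = argparse.ArgumentParser(prog="dstack run")
    parser.add_argument(
        "--retry",
        choices=["always", "no-capacity", "never"],
        default="never",
    )
    parser.add_argument("--retry-duration", type=parse_duration, default=None)
    return parser


args = build_parser().parse_args(["--retry=always", "--retry-duration=1h"])
print(args.retry, args.retry_duration)
```

Restricting `--retry` with `choices` makes an unknown value fail at parse time rather than silently falling back to the default.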
The semantics of duration should also be changed and clarified. Currently, the duration is calculated from the job submission time, so a service that has already been running longer than the duration can never be retried. It should instead be calculated from the last failure time (or from the job submission time for new jobs) so that the retry policy can be used to restart production services.
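The difference between the two ways of measuring duration can be illustrated with a minimal sketch. The function name `retry_window_open` and its parameters are hypothetical, and the sketch leaves open whether a failed retry attempt should itself reset the last failure time.

```python
from datetime import datetime, timedelta
from typing import Optional


def retry_window_open(
    now: datetime,
    submitted_at: datetime,
    last_failure_at: Optional[datetime],
    duration: timedelta,
) -> bool:
    """Return True while retries are still allowed under the proposed semantics."""
    # Count from the last failure; a job that has never failed
    # falls back to its submission time.
    start = last_failure_at if last_failure_at is not None else submitted_at
    return now - start <= duration


# A service submitted on day 0 fails after running for 10 days.
submitted = datetime(2024, 1, 1)
failed = submitted + timedelta(days=10)
now = failed + timedelta(minutes=30)

# Proposed semantics: 30 minutes into a 1h window, so retries continue.
print(retry_window_open(now, submitted, failed, timedelta(hours=1)))
# Measuring from submission time instead: the 1h window closed long ago.
print(retry_window_open(now, submitted, None, timedelta(hours=1)))
```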