
fix: wait for Swarm task convergence after service update to prevent orphan containers#3809

Closed

jaimehgb wants to merge 1 commit into Dokploy:canary from jaimehgb:quiver


Conversation

@jaimehgb
Contributor

Summary

After service.update(), Docker Swarm's start-first update order can leave orphan containers running indefinitely when multiple deployments happen in quick succession. This PR adds a convergence wait that polls docker.listTasks() until the service has settled to a single running task before returning from mechanizeDockerContainer.

  • Adds waitForServiceConvergence() helper that polls task state after each service update (see the sketch after this list)
  • Graceful 120s timeout — worst case falls back to current behavior (no blocked deploys)
  • Single change point catches all deploy paths (regular, preview, rebuild)
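A minimal sketch of what such a helper could look like, assuming a dockerode client. Only the helper name and the 120s timeout come from this PR's description; the poll interval and the rest are assumptions, not the actual diff:

```typescript
import type Docker from "dockerode";

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch only: the helper name and 120s timeout are from the PR description;
// the poll interval and implementation details are assumptions.
export const waitForServiceConvergence = async (
  docker: Docker,
  serviceName: string,
  timeoutMs = 120_000,
  pollIntervalMs = 2_000,
) => {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Docker's /tasks endpoint supports `service` and `desired-state` filters,
    // so tasks already marked for SHUTDOWN never show up in this list.
    const tasks = await docker.listTasks({
      filters: { service: [serviceName], "desired-state": ["running"] },
    });
    const running = tasks.filter(
      (task: { Status?: { State?: string } }) => task.Status?.State === "running",
    );
    // At most one running task means the old task has drained (or the new one
    // never became healthy); either way there is nothing left to wait for.
    if (running.length <= 1) return;
    await sleep(pollIntervalMs);
  }
  console.warn(
    `[Dokploy] Service ${serviceName} did not converge within ${timeoutMs / 1000}s`,
  );
};
```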

Root Cause

Docker Swarm's start-first update operates in two phases:

  1. Start new task → wait for it to reach RUNNING
  2. Shut down old task → set its desired state to SHUTDOWN
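This two-phase order is what a `start-first` update policy in the service spec requests. A minimal, illustrative fragment (not Dokploy's actual spec):

```typescript
// Illustrative Swarm service spec fragment. With Order "start-first", Swarm
// starts the replacement task (phase 1) before it sets the old task's
// desired state to SHUTDOWN (phase 2).
const updateConfig = {
  Parallelism: 1,
  Order: "start-first",
};
```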

When a second service.update() arrives while the first is mid-transition, the SwarmKit UpdateSupervisor cancels the in-progress update. If the cancellation hits between phase 1 and phase 2, the old task's desired state is never set to SHUTDOWN — and since Swarm actively maintains tasks whose desired state is RUNNING (task_model.md), the orphan persists indefinitely.

T0: service.update() → API returns immediately (async)
    → SwarmKit starts Update A in background

T1: Update A creates new task (Task-B), starts it
    → Task-A (old) still running, Task-B starting

T2: service.update() called again → SwarmKit CANCELS Update A
    → Update A's goroutine exits before shutting down Task-A

T3: Update B sees Task-B as "current", creates Task-C
    → Task-A is now ORPHANED — no update tracks it

T4: Update B completes
    → Task-C running, Task-B shut down, Task-A still running ← orphan!

This is a well-documented behavior in Docker Swarm.

Why This Happens in Dokploy

Dokploy's BullMQ deploy queue has concurrency 1, so jobs run serially. But mechanizeDockerContainer calls service.update() and returns immediately — it doesn't wait for Swarm to finish draining the old task. When the next queued deploy runs its service.update(), the previous update is still mid-transition, triggering the orphan race.

The POST /services/{id}/update API is explicitly asynchronous — it returns HTTP 200 after recording the update in the Raft store, not after task convergence.
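In dockerode terms, that means the resolved promise is only an acknowledgement, roughly like this (illustrative call shape, not Dokploy's exact code; `appName` is a placeholder):

```typescript
const service = docker.getService(appName);
const { Spec, Version } = await service.inspect();
// Resolves once the update is recorded in the Raft store (HTTP 200), while
// SwarmKit keeps rolling tasks in the background.
await service.update({ ...Spec, version: Version.Index });
```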

The Fix

After service.update(), poll docker.listTasks() filtered by service and desired-state: running. Wait until only 1 task has Status.State === "running" (meaning the old task has been drained). If convergence doesn't happen within 120 seconds, log a warning and return — the deploy completes anyway, falling back to current behavior.

Since the queue has concurrency 1, this naturally serializes the Swarm update lifecycle: the next deploy won't start its service.update() until the current one has converged.
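Put together, the single change point would look roughly like this (hypothetical placement inside mechanizeDockerContainer; the actual diff may differ):

```typescript
await service.update({ ...Spec, version: Version.Index });
// Block this BullMQ job until the old task drains. With worker concurrency 1,
// the next deploy's service.update() cannot land mid-rollout.
await waitForServiceConvergence(docker, appName);
```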

Edge cases handled

| Scenario | Behavior |
| --- | --- |
| Broken deploy (new task never healthy) | New task stays in starting/failed state, never reaches running → loop sees ≤1 running task → exits immediately |
| Timeout (120s) | Logs warning, returns → same behavior as today |
| New service (first deploy) | Uses createService path (catch block) — no convergence wait needed |
| Remote servers (SSH) | Works transparently — docker instance from getRemoteDocker() already handles SSH tunneling |

Related Issues

#1669, #2223, #2911, #2150

Test Plan

  • Build the project — verify TypeScript compiles
  • Deploy an app, push 3-4 rapid commits, check docker ps -f 'name=<appName>' — should see at most 2 containers (1 old + 1 starting) at any point, never 3+
  • Deploy a broken image (bad health check) — verify the convergence loop exits quickly and doesn't block subsequent deploys
  • Check deploy logs for [Dokploy] Service X did not converge messages on timeout
  • Verify preview deployments also clean up correctly

fix: wait for Swarm task convergence after service update to prevent orphan containers

Docker Swarm's `start-first` update order operates in two phases: start new
task, then shut down old task. When `service.update()` is called again before
the first update completes, SwarmKit cancels the in-progress update — and the
old task's shutdown phase is skipped, leaving orphan containers running
indefinitely.

This adds a convergence wait after `service.update()` that polls
`docker.listTasks()` until only one task remains running (or a 120s timeout).
Since the BullMQ deploy queue has concurrency 1, this naturally prevents
rapid consecutive updates from creating orphans.

Relates to Dokploy#1669, Dokploy#2223, Dokploy#2911, Dokploy#2150
github-actions (bot) closed this on Feb 26, 2026
@hl9020
Contributor

hl9020 commented Mar 17, 2026

Why is this closed without action?

@jaimehgb
Contributor Author

@hl9020 it was autoclosed because it didn't pass the LLM-slop filter 😅
The same PR, which survived, can be found here: #3810

