
fix: wait for Swarm task convergence after service update to prevent orphan containers#3809

Closed

jaimehgb wants to merge 1 commit into Dokploy:canary from jaimehgb:quiver


Conversation

@jaimehgb
Contributor

Summary

After service.update(), Docker Swarm's start-first update order can leave orphan containers running indefinitely when multiple deployments happen in quick succession. This PR adds a convergence wait that polls docker.listTasks() until the service has settled to a single running task before returning from mechanizeDockerContainer.

  • Adds waitForServiceConvergence() helper that polls task state after each service update (see the sketch after this list)
  • Graceful 120s timeout — worst case falls back to current behavior (no blocked deploys)
  • Single change point catches all deploy paths (regular, preview, rebuild)
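A minimal sketch of what such a helper could look like, assuming a dockerode client. Only the helper name and the 120s timeout come from this PR's description; the poll interval and the rest are assumptions, not the actual diff:

```typescript
import type Docker from "dockerode";

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch only: the helper name and 120s timeout are from the PR description;
// the poll interval and implementation details are assumptions.
export const waitForServiceConvergence = async (
  docker: Docker,
  serviceName: string,
  timeoutMs = 120_000,
  pollIntervalMs = 2_000,
) => {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Docker's /tasks endpoint supports `service` and `desired-state` filters,
    // so tasks already marked for SHUTDOWN never show up in this list.
    const tasks = await docker.listTasks({
      filters: { service: [serviceName], "desired-state": ["running"] },
    });
    const running = tasks.filter(
      (task: { Status?: { State?: string } }) => task.Status?.State === "running",
    );
    // At most one running task means the old task has drained (or the new one
    // never became healthy); either way there is nothing left to wait for.
    if (running.length <= 1) return;
    await sleep(pollIntervalMs);
  }
  console.warn(
    `[Dokploy] Service ${serviceName} did not converge within ${timeoutMs / 1000}s`,
  );
};
```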

Root Cause

Docker Swarm's start-first update operates in two phases:

  1. Start new task → wait for it to reach RUNNING
  2. Shut down old task → set its desired state to SHUTDOWN
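This two-phase order is what a `start-first` update policy in the service spec requests. A minimal, illustrative fragment (not Dokploy's actual spec):

```typescript
// Illustrative Swarm service spec fragment. With Order "start-first", Swarm
// starts the replacement task (phase 1) before it sets the old task's
// desired state to SHUTDOWN (phase 2).
const updateConfig = {
  Parallelism: 1,
  Order: "start-first",
};
```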

When a second service.update() arrives while the first is mid-transition, the SwarmKit UpdateSupervisor cancels the in-progress update. If the cancellation hits between phase 1 and phase 2, the old task's desired state is never set to SHUTDOWN — and since Swarm actively maintains tasks whose desired state is RUNNING (task_model.md), the orphan persists indefinitely.

T0: service.update() → API returns immediately (async)
    → SwarmKit starts Update A in background

T1: Update A creates new task (Task-B), starts it
    → Task-A (old) still running, Task-B starting

T2: service.update() called again → SwarmKit CANCELS Update A
    → Update A's goroutine exits before shutting down Task-A

T3: Update B sees Task-B as "current", creates Task-C
    → Task-A is now ORPHANED — no update tracks it

T4: Update B completes
    → Task-C running, Task-B shut down, Task-A still running ← orphan!

This is a well-documented behavior in Docker Swarm.

Why This Happens in Dokploy

Dokploy's BullMQ deploy queue has concurrency 1, so jobs run serially. But mechanizeDockerContainer calls service.update() and returns immediately — it doesn't wait for Swarm to finish draining the old task. When the next queued deploy runs its service.update(), the previous update is still mid-transition, triggering the orphan race.

The POST /services/{id}/update API is explicitly asynchronous — it returns HTTP 200 after recording the update in the Raft store, not after task convergence.
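In dockerode terms, that means the resolved promise is only an acknowledgement, roughly like this (illustrative call shape, not Dokploy's exact code; `appName` is a placeholder):

```typescript
const service = docker.getService(appName);
const { Spec, Version } = await service.inspect();
// Resolves once the update is recorded in the Raft store (HTTP 200), while
// SwarmKit keeps rolling tasks in the background.
await service.update({ ...Spec, version: Version.Index });
```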

The Fix

After service.update(), poll docker.listTasks() filtered by service and desired-state: running. Wait until only 1 task has Status.State === "running" (meaning the old task has been drained). If convergence doesn't happen within 120 seconds, log a warning and return — the deploy completes anyway, falling back to current behavior.

Since the queue has concurrency 1, this naturally serializes the Swarm update lifecycle: the next deploy won't start its service.update() until the current one has converged.
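Put together, the single change point would look roughly like this (hypothetical placement inside mechanizeDockerContainer; the actual diff may differ):

```typescript
await service.update({ ...Spec, version: Version.Index });
// Block this BullMQ job until the old task drains. With worker concurrency 1,
// the next deploy's service.update() cannot land mid-rollout.
await waitForServiceConvergence(docker, appName);
```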

Edge cases handled

| Scenario | Behavior |
| --- | --- |
| Broken deploy (new task never healthy) | New task stays in starting/failed state, never reaches running → loop sees ≤1 running task → exits immediately |
| Timeout (120s) | Logs warning, returns → same behavior as today |
| New service (first deploy) | Uses createService path (catch block) — no convergence wait needed |
| Remote servers (SSH) | Works transparently — docker instance from getRemoteDocker() already handles SSH tunneling |

Related Issues

#1669, #2223, #2911, #2150

Test Plan

  • Build the project — verify TypeScript compiles
  • Deploy an app, push 3-4 rapid commits, check docker ps -f 'name=<appName>' — should see at most 2 containers (1 old + 1 starting) at any point, never 3+
  • Deploy a broken image (bad health check) — verify the convergence loop exits quickly and doesn't block subsequent deploys
  • Check deploy logs for [Dokploy] Service X did not converge messages on timeout
  • Verify preview deployments also clean up correctly

fix: wait for Swarm task convergence after service update to prevent orphan containers

Docker Swarm's `start-first` update order operates in two phases: start new
task, then shut down old task. When `service.update()` is called again before
the first update completes, SwarmKit cancels the in-progress update — and the
old task's shutdown phase is skipped, leaving orphan containers running
indefinitely.

This adds a convergence wait after `service.update()` that polls
`docker.listTasks()` until only one task remains running (or a 120s timeout).
Since the BullMQ deploy queue has concurrency 1, this naturally prevents
rapid consecutive updates from creating orphans.

Relates to Dokploy#1669, Dokploy#2223, Dokploy#2911, Dokploy#2150
github-actions (bot) closed this on Feb 26, 2026
@hl9020
Contributor

hl9020 commented Mar 17, 2026

Why is this closed without action?

@jaimehgb
Contributor Author

@hl9020 it was autoclosed because it didn't pass the LLM-slop filter 😅
The same PR, which survived, can be found here: #3810

