Skip to content

failTask double-emits task_failed (and over-decrements concurrency) on Durable Execution step retry #55

@scoropeza

Description

@scoropeza

Follow-up from PR #52 — surfaced during hands-on deploy-validation. Tracked here so the fix can land on its own merits, separately from the interactive-agents rev-6 narrative.

Functional description

When a task fails (e.g. the user's prompt is blocked by the safety filter), the system is supposed to emit exactly one task_failed event. In practice, a single logical failure can emit 5-6 task_failed events over a ~60 s window.

User-visible impact:

  1. GitHub comment spam — once the fanout dispatcher is wired to GitHub (landed in PR feat(interactive-agents): async-only background task UX + Cedar HITL design #52), each task_failed event produces a PATCH against the PR comment. One guardrail-blocked task produced 5 redundant GitHub writes in testing.
  2. Concurrency counter drift — the user's concurrency counter is decremented once per emission, so one failed task may be counted as N completions. The user's "tasks currently running" count drifts away from reality (bounded but real).
  3. Event-stream noise — watch / CLI / dashboard consumers see a stream of duplicate task_failed events for one logical failure.

Observed in testing: task 01KQT96W0B8J48HDWGF4YA9V7D (PR #52 Scenario 7-extended take 3): 6 task_failed rows in TaskEventsTable for one guardrail-blocked submission, timestamps spread across 60 s with the classic Durable Execution retry backoff (1 s / 7 s / 17 s / 8 s / 23 s).

Technical root cause

failTask in cdk/src/handlers/shared/orchestrator.ts (lines 618-637) emits task_failed and calls decrementConcurrency unconditionally, even when transitionTask raises ConditionalCheckFailedException because the task is already in FAILED state:

// cdk/src/handlers/shared/orchestrator.ts:618-637 (simplified)
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, {...});
} catch (err) {
  logger.warn('Failed to transition task to FAILED', {...});
  // ← swallowed; no re-throw, no guard flag
}
await emitTaskEvent(taskId, 'task_failed', { error_message });   // ← always fires
if (releaseConcurrency) await decrementConcurrency(userId);      // ← always fires

The caller at cdk/src/handlers/orchestrate-task.ts:117-119 catches, calls failTask, then throw err to signal step failure to the @aws/durable-execution-sdk-js runtime. The SDK retries context.step('hydrate-context', ...) with exponential backoff. Each retry re-enters failTask → re-emits and re-decrements, because the ConditionalCheckFailed on the transition is swallowed without gating downstream side effects.

CloudWatch evidence (abridged):

20:00:02  Content blocked by guardrail … PROMPT_ATTACK LOW
20:00:04  WARN Failed to transition task to FAILED … conditional request failed
20:00:11  WARN Failed to transition task to FAILED …
20:00:28  WARN Failed to transition task to FAILED …
20:00:36  WARN Failed to transition task to FAILED …
20:00:59  WARN Failed to transition task to FAILED …

Proposed fix

Gate the side effects in failTask behind a transitioned flag, mirroring the pattern already used at orchestrator.ts:478-499 (the heartbeat-stale path):

let transitioned = false;
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, {...});
  transitioned = true;
} catch (err) {
  logger.warn('Failed to transition task to FAILED', {...});
}
if (transitioned) {
  await emitTaskEvent(taskId, 'task_failed', { error_message });
  if (releaseConcurrency) await decrementConcurrency(userId);
}

This is a missed consistency fix — the sibling code path already uses exactly this shape. PR #52's fanout-to-GitHub wiring just made the downstream effects visible on a real PR.

Acceptance criteria

  • failTask runs its emit + decrement side effects only when transitionTask succeeded.
  • Unit test: call failTask twice for the same task, assert exactly one emitTaskEvent + one decrementConcurrency invocation (the second call's ConditionalCheckFailed is benignly swallowed).
  • Integration assertion: a guardrail-blocked path produces exactly one task_failed row in TaskEventsTable under Durable Execution step retry, not N.
  • Audit the sibling call site at orchestrator.ts:155 (session-start) for the same pattern and fix if applicable.

Out of scope

  • Changing the Durable Execution retry policy itself (the retries are legitimate — the fix is idempotent side effects, not "don't retry").
  • Concurrency reconciler tuning (follow-up if concurrency drift is observed in production after this fix).

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions