Skip to content

fix(triggers): suppress redundant progress-comment DELETE after gadget cleanup#1222

Merged
zbigniewsobiecki merged 3 commits intodevfrom
fix/progress-comment-double-delete
Apr 29, 2026
Merged

fix(triggers): suppress redundant progress-comment DELETE after gadget cleanup#1222
zbigniewsobiecki merged 3 commits intodevfrom
fix/progress-comment-double-delete

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

  • Plan 017/3 of spec 017 (router-side silent-failure hardening). Closes failure mode C from the 2026-04-29 log audit.
  • ~72 WARN entries/day on cascade-router (Failed to delete progress comment after agent success, all 404 from GitHub) caused by the post-agent cleanup hook redundantly deleting comments that in-run gadgets had already disposed of. No functional impact (comment IS deleted), but the noise dominated WARN volume and obscured real failures.
  • Three changes: (1) new explicit initialCommentIdConsumed: boolean on session state, set by deleteInitialComment and clearInitialComment; (2) post-agent hook gates the deletion path (including the legacy agentInput.ackCommentId fallback) on !consumed; (3) githubClient.deletePRComment treats HTTP 404 as success (RFC-7231 idempotency, defense in depth) and logs at DEBUG. Other HTTP errors continue to throw.

Test plan

  • +7 tests on SessionStateData.initialCommentIdConsumed (default value, recordInitialComment doesn't flip, deleteInitialComment success/error semantics, clearInitialComment also flips, init resets).
  • +3 tests on deleteProgressCommentOnSuccess (consumed-bypass, headline regression pin against ackCommentId fallback, legacy fallback preservation).
  • +4 tests on deletePRComment 404 idempotency (404→DEBUG no-throw, 5xx/401/network throw regression pins).
  • npm test (full unit suite): 470 files / 8649 tests passing (+14 from baseline).
  • npm run typecheck clean.
  • npm run lint clean.
  • Pre-push lefthook (unit + integration) passed.
  • Post-deploy: 24h log volume of Failed to delete progress comment after agent success on cascade-router drops to <1/24h under normal operation.

🤖 Generated with Claude Code

zbigniewsobiecki and others added 3 commits April 29, 2026 10:21
Three production hardening fixes derived from the 2026-04-29 24h log/webhook
audit. Discovered alongside PR #1220 (workItemId-on-respond-to-* fix).

Failure modes:
  A) Linear PM-ack silently skipped on PM-focused agents (24/day)
  B) Pipeline-capacity gate fails open on every PM status-changed (32/day)
  C) Progress-comment double-delete race produces 404 log spam (72/day)

Decomposed per spec strategic decision #8 into three independent plans
(none blocks another):
  1-pm-ack-coverage  (consolidate dispatch via manifest registry)
  2-capacity-gate-pm-scope  (shared adapter helper, fail-closed semantics)
  3-progress-comment-double-delete  (consumed-flag, gated fallback, 404-DEBUG)

This commit ships scaffolding only. Plan 3's implementation lands in this
branch; plans 1 and 2 follow as their own PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t mid-run cleanup

Plan 017/3 (progress-comment-double-delete). Closes failure mode C from spec
017's 24h log audit on 2026-04-29.

Root cause: deleteProgressCommentOnSuccess (post-agent hook) read
sessionState.initialCommentId, fell back to result.agentInput.ackCommentId
when session state was empty, and issued a DELETE. But "session state cleared
by an in-run gadget" was indistinguishable from "session state never
populated", so the fallback fired and re-deleted comments that gadgets had
already disposed of mid-run. GitHub returned 404 and produced 72 WARN entries
per day on cascade-router (no functional impact — the comment WAS deleted —
but the noise dominated WARN volume and obscured real failures).

Three changes:

1. New `initialCommentIdConsumed: boolean` flag on SessionStateData (default
   false). Both `deleteInitialComment` (gadget-driven) and
   `clearInitialComment` (sidecar-driven) set it to true after disposing of
   the comment. Distinguishes "had a comment, now gone" from "never had one".

2. `deleteProgressCommentOnSuccess` reads the flag first; if true, skips the
   entire deletion path including the legacy `agentInput.ackCommentId`
   fallback. The legacy fallback continues to work for paths that never
   populated session state (consumed stays false).

3. `githubClient.deletePRComment` treats HTTP 404 as success-equivalent under
   RFC-7231 idempotency: returns without throwing, logs at DEBUG with the
   comment id. Other HTTP errors (5xx, 401, network) continue to throw. Defense
   in depth: even if a future regression of similar shape recurs, the WARN
   volume stays clean while a DEBUG breadcrumb persists for audit.

Test coverage: +7 tests in sessionState (new `initialCommentIdConsumed flag`
describe block), +3 in ack-comments (consumed-bypass + headline regression
pin + legacy-fallback preservation), +4 in client (404 → DEBUG + 5xx/401/network
rejection regression pins). Full suite: 470 files / 8649 tests passing
(+14 from baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit c3f29d9 into dev Apr 29, 2026
8 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/progress-comment-double-delete branch April 29, 2026 10:33
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant