Skip to content

RFC: ADR-010 — Error recovery and rollback protocol #141

@scottschreckengaust

Description

@scottschreckengaust

Summary

Define the protocol for how agents and humans respond when work breaks something — whether a broken deploy, a merged bug, a corrupted state, or a failed stacked PR chain. Specifies when to revert vs. fix forward, who decides, and how to prevent cascading damage.

Use case and motivation

Problem today:

  1. When a merged PR breaks something, the response is ad-hoc — sometimes a revert, sometimes a hotfix, sometimes nothing until someone notices.
  2. Agents operating autonomously may merge code that passes CI but breaks integration (state machines, cross-service contracts, runtime behavior).
  3. No protocol defines the decision tree for "revert immediately" vs. "fix forward quickly" vs. "investigate then decide."
  4. Cascading failures from stacked PRs: if PR 3 of 8 broke something, what happens to PRs 4-8?

After this ADR:

  • Clear decision tree for revert vs. fix-forward
  • Defined blast radius assessment before any recovery action
  • Stacked PR chain recovery protocol
  • Automated detection where possible; human escalation where necessary

Proposed protocol

Decision tree

Broken thing detected
├─ Is production affected? (users impacted NOW)
│  ├─ Yes → REVERT immediately, investigate after
│  └─ No → Continue assessment
├─ Is the fix obvious and < 30 min?
│  ├─ Yes → Fix forward (new PR, not amend)
│  └─ No → REVERT, then fix properly
├─ Is it a stacked PR chain?
│  ├─ Yes → Pause all dependent PRs, fix the base
│  └─ No → Standard fix/revert
└─ Is the scope of damage unclear?
   ├─ Yes → REVERT (safe default), investigate
   └─ No → Targeted fix

Revert protocol

  1. Create a revert commit (not force-push) — preserves history
  2. Open an issue documenting: what broke, why CI didn't catch it, what the fix needs
  3. The original author (or any available contributor) owns the fix
  4. The fix goes through normal review (no rushing, no skipping gates)

Fix-forward protocol

  1. Only if the fix is obvious, small, and low-risk
  2. Must still go through PR + review (no direct pushes to main)
  3. Document in the PR: "Fix-forward for breakage introduced in #N"
  4. If the fix introduces NEW complexity — revert instead

Stacked PR chain recovery

  1. Identify which PR in the stack introduced the breakage
  2. Pause/close all PRs above it
  3. Fix the base PR
  4. Rebase and re-evaluate all dependent PRs (some may need redesign, not just rebase)
  5. Re-run CI on each before re-opening

What agents must NEVER do during recovery

  • Force-push to shared branches
  • Delete branches with others' work
  • Amend published commits
  • Skip review "because it's urgent"
  • Self-approve a revert

Open questions

  1. Should reverts be auto-generated by CI when a specific check fails post-merge?
  2. Who has revert authority — any contributor, or only admins?
  3. How do we detect "production affected" for a project that's self-hosted differently per user?
  4. Should there be a "incident commander" role for complex failures?

Alignment

Aligns with Observability and safe deploy and Platform maturity.

Dependencies


  • RFC PR: (pending approval)
  • Approved by: (pending)
  • Reviewed by: (pending)

Metadata

Metadata

Assignees

No one assigned

    Labels

    RFC-proposalRequest for Comments: design proposaldocumentationImprovements or additions to documentation

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions