Summary
Define the protocol for how agents and humans respond when work breaks something — whether a broken deploy, a merged bug, a corrupted state, or a failed stacked PR chain. Specifies when to revert vs. fix forward, who decides, and how to prevent cascading damage.
Use case and motivation
Problem today:
- When a merged PR breaks something, the response is ad-hoc — sometimes a revert, sometimes a hotfix, sometimes nothing until someone notices.
- Agents operating autonomously may merge code that passes CI but breaks integration (state machines, cross-service contracts, runtime behavior).
- No protocol defines the decision tree for "revert immediately" vs. "fix forward quickly" vs. "investigate then decide."
- Cascading failures from stacked PRs: if PR 3 of 8 broke something, what happens to PRs 4-8?
After this ADR:
- Clear decision tree for revert vs. fix-forward
- Defined blast radius assessment before any recovery action
- Stacked PR chain recovery protocol
- Automated detection where possible; human escalation where necessary
Proposed protocol
Decision tree
Broken thing detected
├─ Is production affected? (users impacted NOW)
│ ├─ Yes → REVERT immediately, investigate after
│ └─ No → Continue assessment
├─ Is the fix obvious and < 30 min?
│ ├─ Yes → Fix forward (new PR, not amend)
│ └─ No → REVERT, then fix properly
├─ Is it a stacked PR chain?
│ ├─ Yes → Pause all dependent PRs, fix the base
│ └─ No → Standard fix/revert
└─ Is the scope of damage unclear?
├─ Yes → REVERT (safe default), investigate
└─ No → Targeted fix
Revert protocol
- Create a revert commit (not force-push) — preserves history
- Open an issue documenting: what broke, why CI didn't catch it, what the fix needs
- The original author (or any available contributor) owns the fix
- The fix goes through normal review (no rushing, no skipping gates)
Fix-forward protocol
- Only if the fix is obvious, small, and low-risk
- Must still go through PR + review (no direct pushes to main)
- Document in the PR: "Fix-forward for breakage introduced in #N"
- If the fix introduces NEW complexity — revert instead
Stacked PR chain recovery
- Identify which PR in the stack introduced the breakage
- Pause/close all PRs above it
- Fix the base PR
- Rebase and re-evaluate all dependent PRs (some may need redesign, not just rebase)
- Re-run CI on each before re-opening
What agents must NEVER do during recovery
- Force-push to shared branches
- Delete branches with others' work
- Amend published commits
- Skip review "because it's urgent"
- Self-approve a revert
Open questions
- Should reverts be auto-generated by CI when a specific check fails post-merge?
- Who has revert authority — any contributor, or only admins?
- How do we detect "production affected" for a project that's self-hosted differently per user?
- Should there be a "incident commander" role for complex failures?
Alignment
Aligns with Observability and safe deploy and Platform maturity.
Dependencies
- RFC PR: (pending approval)
- Approved by: (pending)
- Reviewed by: (pending)
Summary
Define the protocol for how agents and humans respond when work breaks something — whether a broken deploy, a merged bug, a corrupted state, or a failed stacked PR chain. Specifies when to revert vs. fix forward, who decides, and how to prevent cascading damage.
Use case and motivation
Problem today:
After this ADR:
Proposed protocol
Decision tree
Revert protocol
Fix-forward protocol
Stacked PR chain recovery
What agents must NEVER do during recovery
Open questions
Alignment
Aligns with Observability and safe deploy and Platform maturity.
Dependencies