RFC: ADR-010 — Error recovery and rollback protocol

## Summary

Define the protocol for how agents and humans respond when work breaks something — whether a broken deploy, a merged bug, a corrupted state, or a failed stacked PR chain. Specifies when to revert vs. fix forward, who decides, and how to prevent cascading damage.

## Use case and motivation

**Problem today:**
1. When a merged PR breaks something, the response is ad-hoc — sometimes a revert, sometimes a hotfix, sometimes nothing until someone notices.
2. Agents operating autonomously may merge code that passes CI but breaks integration (state machines, cross-service contracts, runtime behavior).
3. No protocol defines the decision tree for "revert immediately" vs. "fix forward quickly" vs. "investigate then decide."
4. Cascading failures from stacked PRs: if PR 3 of 8 broke something, what happens to PRs 4-8?

**After this ADR:**
- Clear decision tree for revert vs. fix-forward
- Defined blast radius assessment before any recovery action
- Stacked PR chain recovery protocol
- Automated detection where possible; human escalation where necessary

## Proposed protocol

### Decision tree

```
Broken thing detected
├─ Is production affected? (users impacted NOW)
│  ├─ Yes → REVERT immediately, investigate after
│  └─ No → Continue assessment
├─ Is the fix obvious and < 30 min?
│  ├─ Yes → Fix forward (new PR, not amend)
│  └─ No → REVERT, then fix properly
├─ Is it a stacked PR chain?
│  ├─ Yes → Pause all dependent PRs, fix the base
│  └─ No → Standard fix/revert
└─ Is the scope of damage unclear?
   ├─ Yes → REVERT (safe default), investigate
   └─ No → Targeted fix
```

### Revert protocol

1. Create a revert commit (not force-push) — preserves history
2. Open an issue documenting: what broke, why CI didn't catch it, what the fix needs
3. The original author (or any available contributor) owns the fix
4. The fix goes through normal review (no rushing, no skipping gates)

### Fix-forward protocol

1. Only if the fix is obvious, small, and low-risk
2. Must still go through PR + review (no direct pushes to main)
3. Document in the PR: "Fix-forward for breakage introduced in #N"
4. If the fix introduces NEW complexity — revert instead

### Stacked PR chain recovery

1. Identify which PR in the stack introduced the breakage
2. Pause/close all PRs above it
3. Fix the base PR
4. Rebase and re-evaluate all dependent PRs (some may need redesign, not just rebase)
5. Re-run CI on each before re-opening

### What agents must NEVER do during recovery

- Force-push to shared branches
- Delete branches with others' work
- Amend published commits
- Skip review "because it's urgent"
- Self-approve a revert

## Open questions

1. Should reverts be auto-generated by CI when a specific check fails post-merge?
2. Who has revert authority — any contributor, or only admins?
3. How do we detect "production affected" for a project that's self-hosted differently per user?
4. Should there be a "incident commander" role for complex failures?

## Alignment

Aligns with **Observability and safe deploy** and **Platform maturity**.

## Dependencies

- ADR-003 (#134) — governance (no bypasses during recovery)
- ADR-001 (stacked PRs) — chain recovery protocol
- ADR-009 — security (revert authority tied to role)

---

* RFC PR: (pending approval)
* Approved by: (pending)
* Reviewed by: (pending)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: ADR-010 — Error recovery and rollback protocol #141

Summary

Use case and motivation

Proposed protocol

Decision tree

Revert protocol

Fix-forward protocol

Stacked PR chain recovery

What agents must NEVER do during recovery

Open questions

Alignment

Dependencies

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

RFC: ADR-010 — Error recovery and rollback protocol #141

Description

Summary

Use case and motivation

Proposed protocol

Decision tree

Revert protocol

Fix-forward protocol

Stacked PR chain recovery

What agents must NEVER do during recovery

Open questions

Alignment

Dependencies

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions