diff --git a/AGENTS.md b/AGENTS.md
index 0346ac3..02296af 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -39,6 +39,8 @@ Handler entry tests: `cdk/test/handlers/orchestrate-task.test.ts`, `create-task.
 
 ### Common mistakes
 
+- **Starting implementation without an approved GitHub issue** — Conversational approval ("yes, do it", "go ahead", "start with X") is NOT governance approval. The correct sequence is: create a GitHub issue with acceptance criteria → get the `approved` label from an admin → self-assign → comment "Starting implementation" → then begin work. Even if the user explicitly directs the work in conversation, create the durable artifact (issue) first. See [ADR-003](./docs/decisions/003-contribution-governance.md).
+- **Creating branches without an issue reference** — Branch names must follow the pattern `(feat|fix|chore|docs)/-short-description`. A branch without an issue number is unauthorized work. Example: `feat/148-operational-knowledge-stack`.
 - Editing **`docs/src/content/docs/`** instead of **`docs/guides/`** or **`docs/design/`** — content is generated; sync from sources.
 - Adding or editing files in **`docs/design/`** or **`docs/guides/`** without running **`cd docs && node scripts/sync-starlight.mjs`** — CI will reject ("Fail build on mutation") because the Starlight mirror files in `docs/src/content/docs/` are stale. Always commit the regenerated mirrors alongside source changes.
 - Changing **`cdk/.../types.ts`** without updating **`cli/src/types.ts`** — CLI and API drift.
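Reviewer note: the branch-name rule added to AGENTS.md can be checked mechanically. The following is a minimal TypeScript sketch, not the project's hook. The regex, the `BranchRef` type, and `parseBranchName` are hypothetical names introduced here, and the numeric issue segment is inferred from the example `feat/148-operational-knowledge-stack` rather than from a stated pattern.

```typescript
// Hypothetical sketch: this regex is inferred from the example branch names
// in the diff; it is not taken from the project's hooks.
const BRANCH_PATTERN = /^(feat|fix|chore|docs)\/(\d+)-([a-z0-9]+(?:-[a-z0-9]+)*)$/;

interface BranchRef {
  type: string;  // feat, fix, chore, or docs
  issue: number; // issue number embedded in the branch name
  slug: string;  // short description after the issue number
}

// Returns the parsed parts, or null when the branch name does not carry
// an issue reference in the assumed position.
function parseBranchName(branch: string): BranchRef | null {
  const match = BRANCH_PATTERN.exec(branch);
  if (match === null) {
    return null;
  }
  const [, type, issue, slug] = match;
  if (type === undefined || issue === undefined || slug === undefined) {
    return null;
  }
  return { type, issue: Number(issue), slug };
}

console.log(parseBranchName("feat/148-operational-knowledge-stack"));
console.log(parseBranchName("feat/operational-knowledge-stack"));
```

The first call yields a parsed reference to issue 148; the second returns `null` because no issue number follows the slash. A hook built on this would still need to confirm that the issue exists and carries the `approved` label, which this sketch does not attempt.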
diff --git a/docs/decisions/003-contribution-governance.md b/docs/decisions/003-contribution-governance.md
index 537b502..91761e7 100644
--- a/docs/decisions/003-contribution-governance.md
+++ b/docs/decisions/003-contribution-governance.md
@@ -11,6 +11,10 @@ The rules below define how any contributor — human or AI — picks up, owns, a
 
 ## Decision
 
+### No branches without an Issue
+
+Every feature branch references an issue in its name (e.g., `feat/123-short-description` or `fix/456-bug-name`). A branch without an issue reference is unauthorized work. This prevents the failure mode where work is started "just to explore" and then snowballs into a PR without governance.
+
 ### No PRs without an Issue
 
 Every PR references an issue. The issue provides rationale, sufficient context for the solution to be obvious, and verifiable acceptance criteria.
@@ -27,9 +31,9 @@ Issues align to the [product roadmap](https://github.com/aws-samples/sample-auto
 
 Only permitted users can mark an issue `approved` — a GitHub Actions workflow validates that the label applicant is authorized. An issue is not workable until it is both approved and assigned. After approval, the issue is considered scope-frozen: further revisions that change deliverables require re-approval.
 
-### Self-assignment on start
+### Assignments
 
-Unassigned means available. On starting work, self-assign. Multiple assignees (>1) require intentionality verification.
+Unassigned means available. Assignment may happen via self-assignment, directed assignment by another agent/human, or priority-based pickup (inspect open tasks for highest priority + earliest predecessor). Multiple assignees (>1) require intentionality verification.
 
 ### Issue body as primary directive
 
@@ -47,10 +51,16 @@ Before implementation, the assigned contributor must:
 
 **Priority evaluation:** Identify priority (`p0`/`p1`/`p2`).
If asked to work a lower-priority item while higher-priority items are unassigned, challenge: "Should I work on #X (p0) instead?"
 
-**Predecessor validation:** If predecessors are incomplete, unassigned, and not in a stacked PR — challenge: "Steps 1-3 are incomplete. Starting step 4 may cause rework."
+**Predecessor validation (GraphQL dependency graph is authoritative):**
+- Query the issue's `blockedBy` field via GraphQL — if any blocking issue is open, this issue is **not ready** (hard gate)
+- Check `parent`/`subIssues` ordering — verify prior siblings are complete or in-flight
+- Reconcile graph vs. prose — graph is authoritative for enforcement; prose explains rationale
+- If predecessors are incomplete, unassigned, and not in a stacked PR — challenge: "Steps 1-3 are incomplete. Starting step 4 may cause rework."
 
 **Cross-reference audit:** Search open issues for duplicates. Search open PRs (including drafts) for conflicts. Flag overlaps. Check the full dependency graph. Forward-look into downstream actions to ensure alignment.
 
+**Dependency graph maintenance:** When creating/modifying issues with dependencies, use GraphQL mutations (`addBlockedBy`, `addSubIssue`) to maintain the machine-enforceable graph. Update prose to explain rationale. If they diverge, fix the wrong one (usually prose — graph is set programmatically).
+
 **Final gate:** If all checks pass, comment "Starting implementation."
 
 ### Identity and attribution
@@ -65,6 +75,36 @@ Provide progress signals at checkpoints. If blocked or abandoning, comment and u
 
 CI passes before requesting review. After merge, verify acceptance criteria and close. Create follow-up issues for discovered work before closing.
 
+### Conversational approval is NOT issue approval
+
+A user saying "yes, do it" or "go ahead" in a conversation does NOT satisfy the governance gate. The correct response to conversational approval is:
+
+1. Create an issue with acceptance criteria
+2. Request the `approved` label from an admin
+3.
Self-assign once approved
+4. Then begin implementation
+
+**Known failure mode:** Agents interpret conversational momentum ("Yes start with X") as authorization to skip issue creation. This is the most common governance bypass — it feels like permission because the user explicitly directed the work, but the governance requires a *durable, reviewable artifact* (the issue), not a transient conversation.
+
+**Why this matters:** Conversations are ephemeral. Issues are auditable. If an agent creates work based on a conversation and that conversation is lost (context compaction, session end), no record exists of what was authorized, what the acceptance criteria were, or why the work was started.
+
+### Enforcement mechanisms (planned)
+
+Prose governance is necessary but insufficient. The following enforcement points are planned to prevent bypass progressively. Mechanisms are deployed incrementally — see #186 for implementation tracking.
+
+| Mechanism | Layer | What it catches | Status |
+|-----------|-------|-----------------|--------|
+| AGENTS.md directive | Agent prompt | Explicit instruction: "Do NOT begin implementation without an approved issue, even if the user says 'go ahead' in conversation" | Implemented |
+| Branch name convention | Git workflow | Branch must match `(feat|fix|chore|docs)/-*` — rejects branches without issue reference | Planned |
+| Commit-msg hook (Tier 0) | Pre-commit | Rejects commits without `Refs #N` or `Fixes #N` | Planned |
+| Pre-push hook (Tier 1) | Pre-push | Validates referenced issue exists and has `approved` label via `gh` API | Planned |
+| Claude Code hook (`PreToolUse: Write`) | Agent runtime | Blocks file creation in governed paths without declared issue context | Planned |
+| Skill gate: `pickup-issue` | Agent workflow | Agent must invoke before implementation — hard-fails without valid issue | Planned |
+
+**Transition:** Branch naming and commit-msg rules apply to branches created after the corresponding hooks are
deployed. Existing branches (including this PR's) pre-date enforcement.
+
+**Progressive enforcement:** Start with the commit-msg hook (cheapest, catches all contributors). Add pre-push validation next. Skill gates enforce at the agent-workflow level (see ADR-012, proposed, for the skill model).
+
 ## Consequences
 
 - (+) Prevents duplicate effort — assignment signals ownership
@@ -72,13 +112,18 @@ CI passes before requesting review. After merge, verify acceptance criteria and
 - (+) Prevents rework — predecessor validation catches out-of-order work
 - (+) Issue body stays current — threads are folded back
 - (+) Cross-reference audit catches duplicates early
+- (+) Enforcement mechanisms catch bypass at multiple points
 - (-) Pre-start overhead for small tasks
 - (-) Requires discipline to fold threads into body
+- (-) Commit-msg hook adds friction for rapid iteration on approved work
 - (!) Assumes priority labels exist and are maintained
 - (!) Self-assignment is not atomic — concurrent agents may race; mitigate by verifying assignment after claiming via refresh
+- (!)
Conversational approval bypass is the most common failure — enforcement must be structural, not behavioral
 
 ## References
 
 - Issue #134 — full RFC with open questions and automation requirements
 - Roadmap: Scale and collaboration (Agent swarm, Multi-user and teams)
 - ADR-001 — delivery methodology referenced by completion rules
+- ADR-012 (proposed) — operational knowledge stack; planned enforcement via skill gates
+- ADR-013 (proposed) — tiered validation; planned enforcement hooks at Tier 0 and Tier 1
diff --git a/docs/decisions/005-feedback-loop.md b/docs/decisions/005-feedback-loop.md
new file mode 100644
index 0000000..540b40b
--- /dev/null
+++ b/docs/decisions/005-feedback-loop.md
@@ -0,0 +1,68 @@
+# ADR-005: Feedback loop — PR reviews propagate to issues and ADRs
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+PR review comments are addressed locally (fix the code) but systemic issues they reveal are not propagated upstream. A reviewer says "this approach is wrong" but the issue still says "use this approach." ADRs are treated as immutable when they should be living decisions that evolve with implementation experience.
+
+Without a feedback protocol, review insights are lost, issue bodies rot, and architectural mistakes persist across stacked PR chains.
+
+## Decision
+
+### Review comment classification
+
+| Type | Action | Propagates to |
+|------|--------|---------------|
+| Nit (style, naming) | Fix in PR | Nothing |
+| Bug (logic error) | Fix in PR | Nothing (unless systemic) |
+| Design concern | Pause PR; evaluate | Issue body |
+| Architecture challenge | Pause PR; escalate | ADR (supersede? amend?) |
+| Scope question | Clarify | Issue body |
+| Blocker (won't approve as-is) | Pause PR | Issue body |
+
+### Upstream propagation
+
+When a review surfaces a design concern or architecture challenge:
+
+1. **Pause** — Do not force-merge. Do not continue stacked PRs above this one.
+2. **Assess** — Does this invalidate the issue's approach?
The ADR's decision?
+3. **Propagate** — Update the relevant upstream document (issue body, ADR, stacked PR dependents).
+4. **Resolve** — Revise the approach, defend with evidence, or cancel the work.
+5. **Resume** — Once resolved, unblock the PR and dependents.
+
+### ADR evolution
+
+| Trigger | Response |
+|---------|----------|
+| Implementation reveals the decision doesn't work | New RFC proposing a successor ADR |
+| Reviewer challenges the architectural premise | `**UNRESOLVED**` on the issue; pause |
+| New information makes the decision obsolete | Successor ADR with `Supersedes: ADR-NNN` |
+| Decision works but needs refinement | Amend via PR (minor, no new ADR) |
+
+Never silently ignore a challenged decision.
+
+### Stacked PR chain revision
+
+When feedback on PR N invalidates PRs N+1 through N+M:
+1. Comment on all affected PRs
+2. Do not rebase dependent PRs until the base is stable
+3. If architectural: re-evaluate whether the remaining stack is valid
+4. If redesign needed: close dependent PRs, revise issue, re-plan
+
+## Consequences
+
+- (+) Review insights propagate to architectural decisions
+- (+) Issue bodies stay current with implementation learnings
+- (+) ADRs evolve rather than silently becoming outdated
+- (+) Stacked PR chains have a defined recovery protocol
+- (-) Adds process overhead to reviews (classification step)
+- (-) Pausing stacked chains delays delivery
+- (!)
Requires discipline to actually propagate feedback upstream
+
+## References
+
+- Issue #136 — full RFC with open questions
+- ADR-003 — governance (issue body as source of truth)
+- ADR-001 — stacked PRs (chain revision protocol)
diff --git a/docs/decisions/006-feature-flags.md b/docs/decisions/006-feature-flags.md
new file mode 100644
index 0000000..979187c
--- /dev/null
+++ b/docs/decisions/006-feature-flags.md
@@ -0,0 +1,82 @@
+# ADR-006: Feature flags for concurrent development
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+Multiple agents working on related features in the same area must serialize — one waits for the other to merge. Incomplete features either block the main branch or require long-lived branches that diverge. SRE needs kill switches without reverting commits.
+
+Feature flags enable trunk-based development where incomplete work merges safely behind toggles, and concurrent contributors avoid blocking each other.
+
+## Decision
+
+### When to use flags
+
+| Situation | Use a flag? |
+|-----------|-------------|
+| Feature spans multiple PRs, incomplete state is unsafe | Yes |
+| Two contributors touch the same module for different purposes | Yes |
+| SRE needs a kill switch for a new capability | Yes |
+| Simple refactor with no behavioral change | No |
+| Bug fix | No |
+| One-PR feature, complete on merge | No |
+
+### Flag ownership
+
+- Every flag has an owner (the issue that introduced it)
+- Every flag has an expiration (the issue/PR that removes it)
+- Flags without a removal plan are rejected in review
+
+### Separation of concerns
+
+- **Planners** decide which features get flags (issue/RFC level)
+- **Implementors** add/use flags in code (PR level)
+- **SRE/operators** toggle flags in production (runtime level)
+- **No self-approval** — the person who introduces a flag cannot approve its removal
+
+### Flag lifecycle
+
+1. **Proposed** — issue identifies the need for a flag
+2.
**Introduced** — PR adds the flag (default: off)
+3. **Active** — feature behind flag is in development
+4. **Verified** — feature complete, flag toggled on in testing
+5. **Permanent** — flag removed, feature is always-on (or removed entirely)
+
+### Lifecycle metadata
+
+Each flag must track:
+
+| Field | Required | Source |
+|-------|----------|--------|
+| Flag name | Yes | Code constant |
+| Purpose / linked issue | Yes | Issue reference |
+| First merge date | Yes | Auto from git log |
+| Max lifetime | Yes | Declared at creation (default: 4 weeks) |
+| Expected removal date | Yes | first_merge + max_lifetime |
+| Actual removal date | — | Auto when flag deleted |
+| Days active | — | Computed |
+
+### Maximum lifetime
+
+Flags must be removed within the declared max lifetime (default: 4 weeks) of the feature being verified. The max lifetime can be overridden per-flag with justification in the issue. Stale flags are treated as technical debt and surfaced in periodic reviews.
+
+### Mechanism constraint
+
+Flags MUST be resolvable at synth time for infrastructure flags and at runtime for behavior flags. The specific storage mechanism (CDK context, DynamoDB, SSM Parameter Store, env vars) is context-dependent and follows from this split — it is not prescribed by this ADR.
+
+## Consequences
+
+- (+) Concurrent work proceeds without blocking
+- (+) Trunk-based development: main stays deployable
+- (+) SRE can disable features without code changes
+- (+) Partial features merge safely
+- (-) Flag management overhead
+- (-) Combinatorial testing complexity if many flags exist simultaneously
+- (!) Maximum lifetime must be enforced or flags accumulate indefinitely
+
+## References
+
+- Issue #137 — full RFC with open questions on mechanism (CDK context vs. DynamoDB vs.
env vars)
+- ADR-003 — governance (flag introduction requires approval)
+- ADR-005 — feedback loop (reviewer may flag-gate a feature during review)
diff --git a/docs/decisions/007-knowledge-acquisition.md b/docs/decisions/007-knowledge-acquisition.md
new file mode 100644
index 0000000..8f9f7fd
--- /dev/null
+++ b/docs/decisions/007-knowledge-acquisition.md
@@ -0,0 +1,79 @@
+# ADR-007: Knowledge acquisition through progressive failure
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+Agents with fresh context (tabula rasa) attempt to follow documentation and hit gaps they cannot resolve. These gaps are silently worked around (agent asks a human) rather than systematically fixed. The system cannot self-improve its onboarding because failures are not captured.
+
+Knowledge acquisition starts from zero. Each iteration creates the roadmap to better knowledge by discovering gaps through actual failures.
+
+## Decision
+
+### Zero-context execution attempts
+
+Periodically, an agent with no project memory attempts to follow guides end-to-end. The agent follows ONLY what is written — no inference, no training data knowledge, no asking colleagues.
+
+### Failure capture protocol
+
+At each failure point, the agent:
+1. **Stops** — does not attempt to work around or guess
+2. **Documents** — creates an issue: which document, which step, what was missing
+3. **Continues** — attempts the next step (if possible) to find additional gaps
+
+### Retrospectives
+
+After completing a task, project milestone, or sprint, agents produce a retrospective artifact:
+- What worked well (patterns to repeat)
+- What failed or caused friction (patterns to avoid)
+- Actionable experiments for future workflows
+
+Retrospectives are a first-class knowledge artifact — they feed into documentation improvements, inform ADR amendments, and surface systemic issues that individual task failures cannot.
+
+### Knowledge artifacts (interim)
+
+Until documentation meets ADR-004, agents may create ephemeral artifacts:
+- Semantic indices of the codebase (call graphs, dependency maps)
+- Annotated walkthroughs of successful executions
+- "What I learned" summaries after completing a task
+- Retrospectives (see above)
+
+These are scaffolding that informs documentation improvements, not documentation themselves.
+
+### Maturity model
+
+| Level | State | Agent capability |
+|-------|-------|-----------------|
+| 0 | No docs | Cannot start; files issue for missing docs |
+| 1 | Partial docs | Follows docs, stops at gaps, files issues |
+| 2 | Complete docs (ADR-004) | Completes end-to-end without help |
+| 3 | Self-improving | Detects drift between docs and code, auto-files issues |
+
+### The self-improvement loop
+
+```
+Agent starts fresh → follows docs → hits failure →
+  files issue → issue gets fixed → next agent goes further →
+  hits next failure → files issue → ...
+  until end-to-end works from zero context
+```
+
+This runs continuously because code changes outpace documentation and different agent implementations fail at different points.
+
+## Consequences
+
+- (+) Documentation gaps become bugs with reproduction steps
+- (+) Priority ordering emerges naturally (most common failures surface first)
+- (+) The system self-improves without human identification of gaps
+- (+) Creates a natural definition of "docs are done" (Level 2 achieved)
+- (-) Generates issue volume that needs triage
+- (-) Requires periodic investment in zero-context test runs
+- (!)
The gap between Level 1 and Level 2 may be large — patience required
+
+## References
+
+- Issue #138 — full RFC with open questions
+- ADR-004 — defines the quality target (tabula rasa test)
+- ADR-003 — governance for issues filed by failing agents
+- ADR-008 — Level 4 Definition of Done depends on this protocol
diff --git a/docs/decisions/008-definition-of-done.md b/docs/decisions/008-definition-of-done.md
new file mode 100644
index 0000000..e552ec8
--- /dev/null
+++ b/docs/decisions/008-definition-of-done.md
@@ -0,0 +1,82 @@
+# ADR-008: Definition of Done (progressive maturity)
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+"Done" is implicit and varies by contributor. Some consider a passing build sufficient; others expect documentation, tests, and deployment verification. Agents have no unambiguous checklist to know they have completed work. Over-engineering "done" early blocks velocity; under-defining it ships incomplete work.
+
+The definition must be progressive — rising as the project matures — so it does not block early momentum but ensures quality at scale.
+
+## Decision
+
+### Progressive levels
+
+**Level 1 — Basic (minimum viable):**
+- Code compiles without errors
+- Existing tests pass (no regressions)
+- New code has tests (unit level minimum)
+- Linting passes
+- PR description explains what and why
+- Linked issue exists
+
+**Level 2 — Standard (current project default):**
+- All of Level 1
+- Pre-commit hooks pass
+- CDK synth succeeds (if infrastructure changes)
+- Security scans pass (no new HIGH/CRITICAL findings)
+- Documentation updated if behavior changes
+- Starlight mirrors synced (if docs changed)
+
+**Level 3 — Rigorous (critical paths):**
+- All of Level 2
+- Integration or E2E test covers the happy path
+- Error paths tested
+- Reviewer approved (human or qualified agent)
+- Deployed to ephemeral stack and smoke-tested (if infrastructure)
+- ADR written (if architectural decision made)
+
+**Level 4 — Self-verifying (future target):**
+- All of Level 3
+- Tabula rasa agent can replicate the outcome using only docs
+- CI includes behavioral verification
+- Documentation drift detection passes
+
+### Default level by issue type
+
+| Issue type | Default level |
+|-----------|---------------|
+| Bug fix | Level 2 |
+| New feature | Level 2-3 (based on blast radius) |
+| Infrastructure/IAM change | Level 3 |
+| Documentation only | Level 1 |
+| Security fix | Level 3 |
+| RFC/ADR implementation | Level 2 + ADR written |
+
+Issues may override by specifying `Done: Level N` in the body.
+
+### Verification responsibility
+
+| Level | Who verifies |
+|-------|-------------|
+| 1 | CI (automated) |
+| 2 | CI + self-check by implementor |
+| 3 | CI + reviewer + implementor |
+| 4 | CI + reviewer + independent agent |
+
+## Consequences
+
+- (+) Agents have an unambiguous completion checklist
+- (+) Quality bar rises as the project matures
+- (+) Over-engineering is prevented (Level 1 for simple docs changes)
+- (+) Critical paths get rigorous verification (Level 3)
+- (-) Requires labeling or explicit level assignment per issue
+- (-) Level 4 is aspirational and depends on ADR-007 (knowledge acquisition)
+- (!) The project must eventually graduate from Level 2 to Level 3 default
+
+## References
+
+- Issue #139 — full RFC with open questions
+- ADR-003 — governance (defines when to start; this defines when to stop)
+- ADR-007 — knowledge acquisition (Level 4 depends on tabula rasa verification)
diff --git a/docs/decisions/009-security-posture-dev-agents.md b/docs/decisions/009-security-posture-dev-agents.md
new file mode 100644
index 0000000..6a67fd7
--- /dev/null
+++ b/docs/decisions/009-security-posture-dev-agents.md
@@ -0,0 +1,73 @@
+# ADR-009: Security posture and blast radius for development-time agents
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+The existing `SECURITY.md` covers runtime agent execution (inside MicroVMs). It does not cover **development-time agents** — those writing code, creating PRs, and modifying infrastructure in this repository. A development-time agent operates with the credentials of whoever invoked it, creating a risk of self-approval, policy modification, and unbounded blast radius.
+
+The core principle: **planners and implementors must be separated by context and ideally by identity.
No self-approval.**
+
+## Decision
+
+### Role separation
+
+| Role | Can do | Cannot do |
+|------|--------|-----------|
+| **Planner** | Create/edit issues, write RFCs/ADRs, define roadmap and revisit vision | Write code, push branches, approve PRs |
+| **Implementor** | Write code, create PRs, push branches, run tests | Approve own PRs, merge own PRs, modify CI/security config |
+| **Reviewer** | Approve PRs, request changes, merge, suggest code (no commits) | Write code on the same PR being reviewed |
+| **Admin** | All of the above + modify policies, approve issues | Still requires 2P for policy changes |
+
+### Blast radius classification
+
+| Action | Risk | Gate |
+|--------|------|------|
+| Edit code in existing patterns | Low | CI + peer review |
+| Add new dependency | Medium | Security scan + review |
+| Modify IAM policy / security config | High | 2P review + admin approval |
+| Modify CI/CD workflow | High | 2P review + admin approval |
+| Modify branch protection / approval rules | Critical | Admin-only + audit trail |
+| Modify governance ADRs | Critical | Admin-only + 2P review |
+| Delete or force-push protected branches | Critical | Never automated; human-only |
+
+### 2P (two-person) review
+
+For High and Critical actions:
+- The author cannot be one of the two approvers
+- At least one approver must be a human
+- Approvals reference the specific risk being accepted
+
+### No self-approval (structural)
+
+- Branch protection requires review from someone other than the pusher
+- Approval cannot come from the last committer on the branch
+- If an agent plans AND implements, review must come from an identity that did neither
+- The identity that writes code cannot approve or merge it
+
+### Credential scoping
+
+| Agent context | Minimum credentials |
+|---------------|-------------------|
+| Planning (issues, RFCs) | GitHub Issues write, read-only repo |
+| Implementation (code, PRs) | Repo write, PR create, no merge capability |
+| Review | PR
review write, no push capability |
+| Deployment | Separate deploy key, environment approval gate |
+
+## Consequences
+
+- (+) Prevents self-approval of dangerous changes
+- (+) Blast radius is explicit and enforceable
+- (+) Role separation enables audit trail
+- (+) 2P review catches compromised or confused agents
+- (-) Credential management complexity increases
+- (-) Small tasks require multi-identity orchestration
+- (!) Personal PATs grant all permissions — structural enforcement requires GitHub Apps or fine-grained tokens
+
+## References
+
+- Issue #140 — full RFC with open questions
+- `docs/design/SECURITY.md` — runtime agent security (complementary)
+- Cedar HITL gates (PR #88) — runtime tool-call governance
+- ADR-003 — governance (approval gates enforced here technically)
diff --git a/docs/decisions/010-error-recovery.md b/docs/decisions/010-error-recovery.md
new file mode 100644
index 0000000..dd3883e
--- /dev/null
+++ b/docs/decisions/010-error-recovery.md
@@ -0,0 +1,69 @@
+# ADR-010: Error recovery and rollback protocol
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+When merged code breaks something, the response is ad-hoc. Agents operating autonomously may merge code that passes CI but breaks integration. No protocol defines when to revert vs. fix forward, who decides, or how stacked PR chains recover.
+
+## Decision
+
+### Decision tree
+
+```
+Broken thing detected
+├─ Production affected (users impacted NOW)?
+│  └─ Yes → REVERT immediately, investigate after
+├─ Fix obvious and < 30 minutes?
+│  └─ Yes → Fix forward (new PR, not amend)
+├─ Stacked PR chain?
+│  └─ Yes → Pause dependent PRs, fix the base
+└─ Scope of damage unclear?
+   └─ Yes → REVERT (safe default), then investigate
+```
+
+### Revert protocol
+
+1. Create a revert commit (not force-push) — preserves history
+2. Open an issue: what broke, why CI did not catch it, what the fix needs
+3.
The fix goes through normal review (no rushing, no skipping gates)
+
+### Fix-forward protocol
+
+1. Only if the fix is obvious, small, and low-risk
+2. Must still go through PR + review
+3. If the fix introduces new complexity — revert instead
+
+### Stacked PR chain recovery
+
+1. Identify which PR introduced the breakage
+2. Pause/close all PRs above it
+3. Fix the base PR
+4. Rebase and re-evaluate dependent PRs
+5. Re-run CI on each before re-opening
+
+### Agents must NEVER do during recovery
+
+- Force-push to shared branches
+- Delete branches with others' work
+- Amend published commits
+- Skip review "because it's urgent"
+- Self-approve a revert
+
+## Consequences
+
+- (+) Clear decision tree prevents analysis paralysis during incidents
+- (+) Revert-first default limits blast radius
+- (+) Stacked chain recovery is defined (not improvised)
+- (+) History is preserved (revert commits, not force-push)
+- (-) Reverts create noise in git history
+- (-) Fix-forward temptation may lead to rushed fixes
+- (!) "Production affected" requires definition per deployment (self-hosted varies)
+
+## References
+
+- Issue #141 — full RFC with open questions
+- ADR-003 — governance (no bypasses during recovery)
+- ADR-001 — stacked PRs (chain recovery protocol)
+- ADR-009 — security (revert authority tied to role)
diff --git a/docs/decisions/011-conflict-resolution.md b/docs/decisions/011-conflict-resolution.md
new file mode 100644
index 0000000..067d05e
--- /dev/null
+++ b/docs/decisions/011-conflict-resolution.md
@@ -0,0 +1,64 @@
+# ADR-011: Conflict resolution protocol
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+Multiple concurrent contributors — human or AI — will propose incompatible approaches, create merge conflicts, and disagree on design. Without a defined escalation path, work stalls or the loudest voice wins.
+
+## Decision
+
+### Escalation ladder
+
+```
+Level 1: Contributor discussion (PR comments, issue thread)
+  ↓ (no resolution within 2 interactions)
+Level 2: Request additional reviewer (fresh perspective)
+  ↓ (still no resolution)
+Level 3: Competing proposals in the issue body (explicit trade-off comparison)
+  ↓ (still no resolution)
+Level 4: Admin decision (binding, documented in issue body)
+```
+
+### Decision criteria
+
+When comparing approaches, evaluate on:
+1. **Correctness** — does it solve the stated problem?
+2. **Simplicity** — fewer moving parts wins when correctness is equal
+3. **Consistency** — follows existing codebase patterns?
+4. **Reversibility** — can we change our mind later?
+5. **Blast radius** — what breaks if this is wrong?
+
+### Merge conflict ownership
+
+| Situation | Who resolves |
+|-----------|-------------|
+| Two PRs modify same file, one merged first | Second PR's author rebases |
+| Stacked PR conflict from lower change | Lower PR author notifies; upper PRs rebase after stable |
+| Concurrent agents modified same module | First to merge wins; second adapts |
+| Architectural conflict (both valid) | Escalate to Level 3 |
+
+### Human vs. agent disagreement
+
+- Agents present evidence (code, tests, measurements) not authority
+- Humans can override but must document why
+- Agents do not repeatedly argue a rejected point
+- If an agent believes a human decision causes harm (security, data loss), it escalates to admin
+
+## Consequences
+
+- (+) Disagreements have a defined path to resolution
+- (+) Merge conflicts have clear ownership
+- (+) Competing approaches are compared on criteria, not authority
+- (+) Admin decision is the final backstop (no infinite loops)
+- (-) Escalation takes time; may slow delivery
+- (-) Level 3 (written trade-off) requires effort
+- (!)
Must not become a veto mechanism for slow contributors
+
+## References
+
+- Issue #142 — full RFC with open questions
+- ADR-003 — governance (issue body as resolution record)
+- ADR-005 — feedback loop (reviewer disagreements feed into this)
+- ADR-009 — security (authority levels for decisions)
diff --git a/docs/decisions/012-operational-knowledge-stack.md b/docs/decisions/012-operational-knowledge-stack.md
new file mode 100644
index 0000000..273b6e3
--- /dev/null
+++ b/docs/decisions/012-operational-knowledge-stack.md
@@ -0,0 +1,265 @@
+# ADR-012: Operational knowledge as a three-layer stack (Decision → Guide → Skill)
+
+**Status:** proposed
+**Date:** 2026-05-19
+
+## Context
+
+Several ADRs in this repository contain operational runbook material embedded directly in the decision record. ADR-003 (contribution governance) prescribes a full pre-start review checklist. ADR-010 (error recovery) defines a decision tree and step-by-step protocols. ADR-008 (definition of done) provides per-issue-type checklists.
+
+This creates three problems:
+
+1. **Stale procedures** — Teams hesitate to update ADRs for minor procedural tweaks (timing thresholds, label names), so runbooks drift from practice.
+2. **Agent execution gap** — Agents must parse prose ADRs, extract the operational steps, and interpret judgment calls. The ADR format is optimized for decision rationale, not execution.
+3. **Persona mismatch** — A planner reading ADR-003 for the governance philosophy gets bogged down in GraphQL query syntax. An implementor executing the pre-start checklist must skip rationale paragraphs to find the steps.
+
+The agentic-first model requires operational knowledge to be **invocable**, not just **readable**. An agent should execute a governance workflow the same way it invokes a tool — with defined inputs, gates, and outputs.
+ +## Decision + +### Three-layer operational knowledge stack + +Every operational procedure identified in an ADR is decomposed into three layers: + +``` +┌─────────────────────────────────────────┐ +│ Layer 1: ADR (Decision Record) │ Immutable-ish +│ WHY we do it this way │ Changes: decision is superseded +│ Consumer: architects, future deciders │ +└─────────────────────────┬───────────────┘ + │ references +┌─────────────────────────▼───────────────┐ +│ Layer 2: Guide (Reference Document) │ Living document +│ WHAT to do, organized by persona │ Changes: process is refined +│ Consumer: humans + agents needing │ +│ context │ +└─────────────────────────┬───────────────┘ + │ operationalized by +┌─────────────────────────▼───────────────┐ +│ Layer 3: Skill (Executable Runbook) │ Versioned, invocable +│ HOW to execute, with gates and outputs │ Changes: implementation shifts +│ Consumer: agents during execution │ +└─────────────────────────────────────────┘ +``` + +### Layer definitions + +**Layer 1 — ADR (Decision Record)** + +- Records the architectural or process decision and its rationale +- States WHAT was decided and WHY +- Does NOT contain step-by-step procedures (those belong in Layer 2/3) +- References the guide(s) that operationalize the decision +- Changes only when the decision itself is superseded or amended + +**Layer 2 — Guide (Reference Document)** + +- Lives in `docs/guides/` +- Organized by persona (planner, implementor, reviewer, admin) +- Contains the WHAT and WHEN — what to do in which situations +- Includes context that helps humans (and agents needing background) understand the workflow +- References the ADR for justification +- Links to the skill(s) that mechanize the workflow +- Changes when the process is refined + +**Layer 3 — Skill (Executable Runbook)** + +- Lives as a Claude Code skill (or plugin skill) — invocable by name +- Encodes the HOW — the mechanical execution with explicit gates, inputs, outputs +- Structured as bounded, invocable 
units with clear entry/exit criteria +- An agent invokes the skill rather than parsing the guide/ADR +- References the guide for context when judgment is needed +- Changes when implementation details shift + +### Reference direction + +References point upward for context and justification; the ADR additionally points down to its guide so readers can find the workflow: + +- Skill → references Guide (for context) +- Guide → references ADR (for justification) +- ADR → references Guide (for operationalization, "see Guide X for the workflow") + +This means a change at any layer triggers review of layers below: + +- ADR amended → review Guide → review Skill +- Guide refined → review Skill +- Skill updated → no upstream change needed (unless the procedure itself changed) + +### When a layer is NOT needed + +| Situation | Layers needed | +|-----------|---------------| +| Pure policy decision (no steps to follow) | ADR only | +| Decision with human-executed steps (rare, non-repeatable) | ADR + Guide | +| Decision with agent-executable procedure | ADR + Guide + Skill | +| Lightweight procedure (< 3 steps, no gates) | ADR + Guide (skill is overhead) | + +### ADR content rules (post-adoption) + +After adoption, ADRs: +- **MUST** contain: Context, Decision (the choice made), Consequences, References +- **MUST NOT** contain: Step-by-step procedures, checklists with >3 items, decision trees with branches, protocol sequences +- **SHOULD** contain: A one-paragraph summary of the operational approach (enough to understand without reading the guide) +- **SHOULD** reference: The guide that operationalizes the decision + +Existing ADRs are updated incrementally (not rewritten) — operational content is extracted, and a reference to the new guide/skill is added. 
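
The MUST-contain rule above lends itself to a mechanical check. The following is a minimal sketch, assuming the four required sections appear as level-2 Markdown headings with exactly those names; the function name `missing_sections` and the sample ADR text are illustrative only.

```python
import re

# Section names taken from the MUST rule above.
REQUIRED_SECTIONS = ("Context", "Decision", "Consequences", "References")


def missing_sections(adr_markdown: str) -> list[str]:
    """Return the required level-2 headings that the ADR text does not contain."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^## +(.+?) *$", adr_markdown, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Made-up ADR body used only to exercise the check.
sample = """# ADR-999: Example

## Context
Why the decision is needed.

## Decision
What was chosen.

## Consequences
- (+) something
"""

print(missing_sections(sample))
```

A check like this covers only the MUST list; the MUST NOT rules (procedures, long checklists, decision trees) need judgment or a more careful parser and are not attempted here.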
+ +### Skill structure requirements + +Skills that operationalize ADRs must: +- State which ADR/guide they implement (in frontmatter or header) +- Define explicit gates (conditions that MUST be true to proceed) +- Define explicit outputs (what the skill produces on completion) +- Be independently invocable (no implicit state from prior skills) +- Fail loudly at gates (not silently skip) + +## Example: ADR-003 decomposition + +ADR-003 (Contribution Governance) is the first ADR to be decomposed under this pattern because it is the most frequently executed procedure and the dependency root for other governance ADRs. + +### Current state (ADR-003 contains everything) + +ADR-003 currently holds: +- The decision to govern contributions (rationale) ✓ belongs in ADR +- Pre-start review checklist (8 mechanical steps) ✗ belongs in Guide + Skill +- Priority evaluation procedure ✗ belongs in Guide + Skill +- Predecessor validation with GraphQL queries ✗ belongs in Skill +- Cross-reference audit steps ✗ belongs in Guide + Skill +- Work-in-progress discipline rules ✗ belongs in Guide +- Completion and handoff procedure ✗ belongs in Guide + Skill + +### Target state (three layers) + +**Layer 1 — ADR-003 (slimmed)** + +Retains: +- Context (why governance is needed for async agents) +- Decision summary: "Every contribution follows: issue → approval → assignment → pre-start validation → implementation → completion" +- The principles: no PRs without issues, issue quality bar, admin approval gate, no self-approval, GraphQL as authoritative dependency source +- Consequences +- Reference: "See `docs/guides/CONTRIBUTOR_WORKFLOW.md` for the full workflow" + +Removes (extracted to Guide/Skill): +- The detailed pre-start review checklist +- GraphQL query specifics +- Step-by-step completion protocol + +**Layer 2 — `docs/guides/CONTRIBUTOR_WORKFLOW.md`** + +Organized by persona: + +```markdown +# Contributor Workflow + +> Operationalizes [ADR-003](../decisions/003-contribution-governance.md) 
+ +## For Planners +- Issue quality bar (what makes an issue "ready") +- Approval process +- Priority labeling +- Dependency graph maintenance + +## For Implementors +- How to pick up an issue +- Pre-start review (summary — invoke skill for execution) +- Work-in-progress signals +- Completion criteria (references ADR-008 guide) + +## For Reviewers +- Review comment classification (references ADR-005 guide) +- When to block vs. approve +- Propagation responsibilities +``` + +**Layer 3 — Skills (invocable by agents)** + +| Skill | Inputs | Gates | Outputs | +|-------|--------|-------|---------| +| `pickup-issue` | Issue number | Issue approved, unassigned, no unresolved conflicts, predecessors complete (GraphQL check) | Assignment confirmed, "Starting implementation" comment | +| `validate-dependencies` | Issue number | GraphQL `blockedBy` returns no open blockers | Dependency report (clear / blocked with reason) | +| `complete-work` | Issue number, PR number | CI passes, DoD level met (ADR-008), no stale assignments | Completion comment, follow-up issues created | +| `cross-reference-audit` | Issue number | No duplicate issues, no conflicting open PRs | Audit report (clear / conflicts listed) | + +Each skill is a bounded unit. An agent picking up work invokes `pickup-issue` — it doesn't read ADR-003 and improvise. + +## Why prose alone fails: observed failure mode + +This ADR was itself initially created in violation of ADR-003. The agent (author) had ADR-003 loaded in context, analyzed it, called it "ready for contributing" — then immediately began implementation without creating an issue, requesting approval, or self-assigning. + +**The rationalization chain:** +1. "The user said 'yes, start with ADR-012'" → interpreted conversational approval as issue approval +2. "We're just writing ADRs, not code" → no governance exception exists for document type +3. "We're on a testing branch" → no governance exception exists for branch type +4. 
"Momentum — we're exploring" → governance exists precisely to interrupt unstructured momentum + +**What this proves:** An agent with full knowledge of the governance rules will still bypass them when the rules are prose-only. The agent *understood* ADR-003 intellectually but had no structural enforcement preventing violation. Reading a rule is not the same as being gated by it. + +**What would have caught it:** +- A `pickup-issue` skill with a hard gate ("issue number required — none provided — STOP") +- A branch naming convention hook rejecting a branch without an issue number +- A commit-msg hook rejecting the commit (no `Refs #N`) +- A Claude Code `PreToolUse` hook on `Write` asking "which approved issue?" + +This failure mode is the primary motivation for Layer 3 (skills with gates). Prose governance (Layer 1) establishes the rule. Guides (Layer 2) explain how to follow it. But only executable skills with hard gates (Layer 3) *enforce* it at the point of action. + +## Migration plan + +### Phase 1: Establish pattern (this ADR) + +- Adopt this ADR +- No existing ADRs are modified yet (operational content stays in place until guides/skills exist) + +### Phase 2: Decompose ADR-003 (proof of concept) + +- Create `docs/guides/CONTRIBUTOR_WORKFLOW.md` +- Create skills: `pickup-issue`, `validate-dependencies`, `complete-work`, `cross-reference-audit` +- Slim ADR-003 to decision + rationale + reference to guide +- Validate: an agent can invoke the skills and complete the governance workflow + +### Phase 3: Decompose remaining ADRs (incremental) + +Priority order (by execution frequency and mechanical content): + +| ADR | Guide | Skills | +|-----|-------|--------| +| 010 (Error Recovery) | `ERROR_RECOVERY.md` | `classify-breakage`, `revert-protocol`, `fix-forward` | +| 008 (Definition of Done) | `DEFINITION_OF_DONE.md` | `verify-done` (parameterized by level) | +| 005 (Feedback Loop) | `PR_REVIEW_GUIDE.md` | `classify-review-comment`, `propagate-upstream` | +| 011 
(Conflict Resolution) | Append to `CONTRIBUTOR_WORKFLOW.md` | `resolve-conflict` (escalation ladder) | + +ADRs without operational content (001, 002, 004, 006, 007, 009) remain unchanged. + +### Phase 4: Plugin marketplace (future) + +Skills become shareable across projects: +- Fork governance skills for team-specific thresholds +- Compose skills from multiple ADRs into project-specific workflows +- Version skills independently from the ADRs that justify them + +## Consequences + +- (+) ADRs stay stable as decision records — not burdened with procedure maintenance +- (+) Guides serve the human reader organized by what they need to do +- (+) Skills make agents execute consistently — no prose interpretation, no drift +- (+) Change cadence is appropriate per layer — procedures evolve without "amending an ADR" +- (+) The three layers serve different consumers without redundancy +- (+) Skills are testable — you can verify an agent follows the procedure correctly +- (+) Hard gates in skills prevent the "understood but violated" failure mode +- (-) Three artifacts per procedure increases maintenance surface +- (-) Migration of existing ADRs requires effort +- (-) Skill development requires understanding the skill format and tooling +- (!) Reference chain integrity must be maintained — a broken link between layers means drift goes undetected +- (!) Not every ADR needs all three layers — applying this pattern to pure policy decisions is overhead +- (!) 
Without Layer 3 enforcement, Layers 1 and 2 are advisory-only — agents WILL rationalize bypasses + +## References + +- Issue #148 — implementation tracking for this ADR +- ADR-003 — first decomposition target (contribution governance); enforcement mechanisms added +- ADR-004 — documentation quality standard (guides must meet tabula rasa test) +- ADR-007 — knowledge acquisition (skills enable Level 3 self-improving) +- ADR-008 — definition of done (skill `verify-done` is a natural fit) +- ADR-010 — error recovery (decision tree is a natural skill) +- ADR-013 (proposed) — tiered validation pyramid; depends on this ADR for skill-based agent interaction with validation tiers +- [agentskills.io](https://agentskills.io/) — skill marketplace concept for shareable operational knowledge +- Claude Code plugin/skill format — the implementation vehicle for Layer 3 diff --git a/docs/decisions/013-tiered-validation-pyramid.md b/docs/decisions/013-tiered-validation-pyramid.md new file mode 100644 index 0000000..0c81750 --- /dev/null +++ b/docs/decisions/013-tiered-validation-pyramid.md @@ -0,0 +1,213 @@ +# ADR-013: Tiered validation pyramid for agentic-first development + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +The current validation architecture has two operational tiers: + +- **Pre-commit hooks** (< 5s) — formatting, secrets scan, file-level linting +- **Remote CI** (5–20 min) — full build, test, synth, security scans, deploy verification + +The gap between these tiers is significant. When an agent (or human) makes a change that passes pre-commit but fails in CI, the feedback loop is: + +``` +Write code → commit → push → wait 5-20 min → CI fails → + read failure → fix → commit → push → wait 5-20 min → ... +``` + +For a human, this is annoying. 
For an autonomous agent, this is catastrophic: + +- **Compute waste** — the agent idles or context-switches while waiting for remote results +- **Context loss** — by the time CI reports back, the agent may have compacted context or moved on +- **Cascade failures** — in a stacked PR chain (ADR-001), a CI failure on PR 1 blocks PRs 2–N, multiplying the wait +- **Cost amplification** — each round-trip costs inference tokens for the agent to re-read the failure, re-analyze, and re-attempt + +The root cause: there is no **Tier 2** — a local, fast, high-fidelity validation layer that catches integration-level issues *before* pushing to remote. + +### What exists today + +| Tier | Time | What it catches | Gap | +|------|------|-----------------|-----| +| Pre-commit (Tier 0) | < 5s | Formatting, secrets, trailing whitespace | None — works well | +| mise build (Tier 1) | 30–90s | Compile, unit tests, CDK synth, docs sync, linting | Partial — available but not gated on push | +| Remote CI (Tier 3) | 5–20 min | Full matrix, security, E2E, deploy | Authoritative but slow | +| **Local integration (Tier 2)** | — | **Does not exist** | Integration-level validation without remote round-trip | + +### Agentic-first motivation + +In a repo where agents run autonomously (ABCA's own design goal), validation speed directly determines: + +- **Throughput** — an agent with 30s feedback loops delivers 10–20x more iterations per hour than one with 15-minute loops +- **Quality** — fast feedback enables test-driven approaches; slow feedback encourages "push and pray" +- **Cost** — fewer remote CI runs, fewer wasted inference tokens on retry cycles +- **Autonomy** — an agent that can self-validate locally needs fewer human interventions + +## Decision + +### The validation pyramid + +``` + ┌─────────┐ + │ Tier 3 │ Remote CI (authoritative) + │ 5-20min │ Full matrix, deploy, E2E + ─┴─────────┴─ + ┌─────────────┐ + │ Tier 2 │ Local sandbox (high-fidelity) + │ 1-5 min │ Integration, ephemeral 
stack + ─┴─────────────┴─ + ┌─────────────────┐ + │ Tier 1 │ Local build (fast check) + │ 30-90s │ Compile, unit test, synth + ─┴─────────────────┴─ + ┌─────────────────────┐ + │ Tier 0 │ Pre-commit (gate) + │ < 5s │ Format, lint, secrets + └─────────────────────┘ +``` + +Each tier is **necessary but not sufficient** — passing a lower tier is required before attempting the next. Higher tiers never repeat work done by lower tiers. + +### Tier definitions + +**Tier 0 — Pre-commit (< 5s, gates every commit)** + +- Trailing whitespace, end-of-file fix +- Merge conflict markers +- Secrets scan (gitleaks) +- ESLint (file-level, staged files only) +- Docs sync check (no stale mirrors) +- YAML/JSON syntax validation + +Status: **Implemented** (prek hooks) + +**Tier 1 — Local build (30–90s, gates push)** + +- TypeScript compilation (all packages) +- Unit test suite (Jest) +- CDK synth (CloudFormation template generation) +- Agent quality checks (Python linting, type checking) +- Docs site build (astro check) +- Type sync drift (CDK ↔ CLI types in sync) +- Constants drift (cross-language contract check) + +Status: **Partially implemented** — available as `mise run build` but not enforced as a push gate. Agents can invoke this but often skip it. + +Requirement: Make `mise run build` (or a subset) the pre-push gate. Consider splitting into `mise run check:fast` (compile + lint, 30s) and `mise run check:full` (compile + test + synth, 90s). + +**Tier 2 — Local sandbox (1–5 min, on-demand before PR)** + +This tier does not exist today. 
It should provide: + +- Container-based integration tests against mocked AWS services (LocalStack or moto) +- CDK deploy to a local/ephemeral sandbox (validate IAM, resource creation without real cloud) +- Agent runtime smoke test (run the agent pipeline against a test repo in a local container) +- Cross-package integration (API → handler → agent contract verification) +- Policy validation (Cedar policy evaluation against test fixtures) + +Status: **Gap — does not exist.** This is the primary investment needed. + +Progressive build-out: + +| Phase | Capability | Mechanism | Catches | +|-------|-----------|-----------|---------| +| 2a | Container integration tests | `mise run test:integration` → Docker Compose + LocalStack | AWS API call failures, DynamoDB schema issues, SQS message format | +| 2b | Agent pipeline smoke | `mise run test:agent-smoke` → build agent container, run against fixture repo | Agent crashes, tool failures, prompt regressions | +| 2c | Ephemeral stack deploy | `mise run deploy:ephemeral` → CDK deploy to a disposable environment with auto-destroy | IAM permission gaps (ADR-002 preflight), resource wiring, real API behavior | +| 2d | Full local sandbox | `mise run sandbox` → MicroVM matching prod topology | End-to-end flow in production-equivalent isolation | + +**Tier 3 — Remote CI (5–20 min, authoritative, gates merge)** + +- Full test matrix (multiple Node versions if applicable) +- Security scans (Semgrep SAST, OSV deps, Grype container, Retire.js, zizmor) +- CDK diff against deployed stack +- Multi-account deployment verification +- E2E tests against real AWS services +- Performance/cost regression checks +- Documentation mutation check (fail if Starlight mirrors are stale) + +Status: **Implemented** (GitHub Actions). This remains the authoritative gate for merge. 
+ +### Enforcement model + +| Event | Required tier | Enforcement | +|-------|--------------|-------------| +| `git commit` | Tier 0 | Pre-commit hook (prek) | +| `git push` | Tier 1 | Pre-push hook | +| PR created/updated | Tier 3 | GitHub Actions required status checks | +| Agent self-validation (before PR) | Tier 1 + Tier 2 (when available) | Skill-driven (agent invokes `validate-locally`) | +| Merge | Tier 3 passed + reviewer approved | Branch protection | + +### Agent interaction model + +Agents interact with validation tiers through skills (depends on ADR-012 for the skill model): + +``` +Agent completes implementation + → invokes `validate-locally` skill + → skill runs Tier 1 (`mise run check:full`) + → if Tier 2 available: runs Tier 2 (`mise run test:integration`) + → reports: PASS (safe to push) / FAIL (fix before push, here's why) + → agent fixes failures locally (fast loop) + → pushes only when local validation passes + → Tier 3 runs remotely (confirmatory, not exploratory) +``` + +The critical shift: **Tier 3 becomes confirmatory, not exploratory.** Agents should not discover failures in remote CI — they should confirm that locally-validated work passes the authoritative gate. 
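
The flow above could be sketched roughly as follows. This is an assumption-laden illustration, not the `validate-locally` skill itself: the function name, return shape, and injectable `run` parameter are hypothetical, and `mise run check:full` and `mise run test:integration` are the proposed task names from this ADR, which may not exist yet. The warn-only phase for Tier 2 is not modeled.

```python
import subprocess
from typing import Callable

# Task names as proposed in this ADR; they are assumptions until implemented.
TIER1_CMD = ["mise", "run", "check:full"]
TIER2_CMD = ["mise", "run", "test:integration"]


def run_command(cmd: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def validate_locally(
    tier2_available: bool,
    run: Callable[[list[str]], bool] = run_command,
) -> tuple[bool, str]:
    """Run Tier 1, then Tier 2 if available; return (passed, reason)."""
    if not run(TIER1_CMD):
        # Tier 2 is never attempted when Tier 1 fails.
        return False, "Tier 1 failed: fix before push"
    if tier2_available and not run(TIER2_CMD):
        return False, "Tier 2 failed: fix before push"
    return True, "safe to push"
```

Passing a fake `run` function makes the ordering testable without invoking `mise`, which is one way a skill like this could be verified in isolation.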
+ +### Investment priority + +The gap analysis dictates priority: + +| Priority | Investment | Impact | +|----------|-----------|--------| +| P0 | Enforce Tier 1 as pre-push gate | Eliminates "pushed without building" class of CI failures | +| P1 | `mise run test:integration` (Tier 2a — LocalStack) | Eliminates 60%+ of CI-only failures (AWS API contract mismatches) | +| P2 | Agent smoke test (Tier 2b) | Catches agent runtime regressions before PR | +| P3 | Ephemeral stack deploy (Tier 2c) | Catches IAM/wiring issues that only surface in real deployment | +| P4 | Full local sandbox (Tier 2d) | Production-equivalent local validation (long-term target) | + +### Design constraints + +- **Tier 2 must not require cloud credentials for basic operation** — agents running in isolation (MicroVM, CI runner) need to validate without AWS access. LocalStack/moto fills this. +- **Tier 2 must be optional until stable** — a failing Tier 2 should warn, not block, during build-out. Once stable, it becomes a gate. +- **Tier 2 must be cacheable** — container images, LocalStack state, and fixture repos should be cached between runs. An agent shouldn't rebuild the world every time. +- **No tier should duplicate work from a lower tier** — if Tier 0 checks formatting, Tier 1 does not re-check it. If Tier 1 runs unit tests, Tier 3 does not re-run them (it may run *additional* tests but not the same ones). + +### Escape hatches + +| Situation | Allowed bypass | +|-----------|---------------| +| Hotfix with production down | Skip Tier 2, expedite Tier 3 review | +| Documentation-only change | Tier 0 + Tier 1 (synth not needed) | +| Dependency bump (Dependabot) | Tier 0 + Tier 3 (CI validates compatibility) | +| Agent cannot run Tier 2 (tooling unavailable) | Push with Tier 1 only, note in PR that Tier 2 was skipped | + +Escape hatches must be explicit (noted in PR description, not silent). 
+ +## Consequences + +- (+) Agent feedback loops drop from 15 minutes to 30–90 seconds for most issues +- (+) Remote CI failure rate drops — issues caught locally before push +- (+) Agents can self-validate autonomously without waiting for external systems +- (+) Investment is progressive — each tier delivers value independently +- (+) Clear ownership: Tier 0–2 are developer/agent responsibility; Tier 3 is platform responsibility +- (+) Cost reduction — fewer CI minutes wasted on obviously-broken pushes +- (-) Tier 2 infrastructure requires maintenance (LocalStack config, container images, fixtures) +- (-) Local machine requirements increase (Docker, disk space for containers) +- (-) Tier 2 may diverge from real AWS behavior — LocalStack is not 100% faithful +- (-) Pre-push gate adds 30–90s to every push (mitigation: `mise run check:fast` for safe paths) +- (!) LocalStack fidelity gaps must be documented — when Tier 2 passes but Tier 3 fails, document the divergence and add it to Tier 2's scope +- (!) 
Tier 2 "optional until stable" phase must have defined graduation criteria, or it stays optional forever + +## References + +- Issue #149 — implementation tracking for this ADR +- ADR-002 — bootstrap policies (Tier 2c validates IAM preflight locally) +- ADR-008 — definition of done (tier requirements per DoD level) +- ADR-012 (prerequisite) — operational knowledge stack; this ADR depends on 012's skill model for agent interaction with validation tiers +- Current hooks: `.pre-commit-config.yaml` (Tier 0 implementation) +- Current build: `mise.toml` root + package-level configs (Tier 1 implementation) +- LocalStack: https://localstack.cloud (candidate for Tier 2a) +- Firecracker MicroVMs: https://firecracker-microvm.github.io (candidate for Tier 2d) diff --git a/docs/src/content/docs/decisions/003-contribution-governance.md b/docs/src/content/docs/decisions/003-contribution-governance.md index 722c3db..5453488 100644 --- a/docs/src/content/docs/decisions/003-contribution-governance.md +++ b/docs/src/content/docs/decisions/003-contribution-governance.md @@ -15,6 +15,10 @@ The rules below define how any contributor — human or AI — picks up, owns, a ## Decision +### No branches without an Issue + +Every feature branch references an issue in its name (e.g., `feat/123-short-description` or `fix/456-bug-name`). A branch without an issue reference is unauthorized work. This prevents the failure mode where work is started "just to explore" and then snowballs into a PR without governance. + ### No PRs without an Issue Every PR references an issue. The issue provides rationale, sufficient context for the solution to be obvious, and verifiable acceptance criteria. @@ -31,9 +35,9 @@ Issues align to the [product roadmap](https://github.com/aws-samples/sample-auto Only permitted users can mark an issue `approved` — a GitHub Actions workflow validates that the label applicant is authorized. An issue is not workable until it is both approved and assigned. 
After approval, the issue is considered scope-frozen: further revisions that change deliverables require re-approval. -### Self-assignment on start +### Assignments -Unassigned means available. On starting work, self-assign. Multiple assignees (>1) require intentionality verification. +Unassigned means available. Assignment may happen via self-assignment, directed assignment by another agent/human, or priority-based pickup (inspect open tasks for highest priority + earliest predecessor). Multiple assignees (>1) require intentionality verification. ### Issue body as primary directive @@ -51,10 +55,16 @@ Before implementation, the assigned contributor must: **Priority evaluation:** Identify priority (`p0`/`p1`/`p2`). If asked to work a lower-priority item while higher-priority items are unassigned, challenge: "Should I work on #X (p0) instead?" -**Predecessor validation:** If predecessors are incomplete, unassigned, and not in a stacked PR — challenge: "Steps 1-3 are incomplete. Starting step 4 may cause rework." +**Predecessor validation (GraphQL dependency graph is authoritative):** +- Query the issue's `blockedBy` field via GraphQL — if any blocking issue is open, this issue is **not ready** (hard gate) +- Check `parent`/`subIssues` ordering — verify prior siblings are complete or in-flight +- Reconcile graph vs. prose — graph is authoritative for enforcement; prose explains rationale +- If predecessors are incomplete, unassigned, and not in a stacked PR — challenge: "Steps 1-3 are incomplete. Starting step 4 may cause rework." **Cross-reference audit:** Search open issues for duplicates. Search open PRs (including drafts) for conflicts. Flag overlaps. Check the full dependency graph. Forward-look into downstream actions to ensure alignment. +**Dependency graph maintenance:** When creating/modifying issues with dependencies, use GraphQL mutations (`addBlockedBy`, `addSubIssue`) to maintain the machine-enforceable graph. Update prose to explain rationale. 
If they diverge, fix the wrong one (usually prose — graph is set programmatically). + **Final gate:** If all checks pass, comment "Starting implementation." ### Identity and attribution @@ -69,6 +79,36 @@ Provide progress signals at checkpoints. If blocked or abandoning, comment and u CI passes before requesting review. After merge, verify acceptance criteria and close. Create follow-up issues for discovered work before closing. +### Conversational approval is NOT issue approval + +A user saying "yes, do it" or "go ahead" in a conversation does NOT satisfy the governance gate. The correct response to conversational approval is: + +1. Create an issue with acceptance criteria +2. Request the `approved` label from an admin +3. Self-assign once approved +4. Then begin implementation + +**Known failure mode:** Agents interpret conversational momentum ("Yes start with X") as authorization to skip issue creation. This is the most common governance bypass — it feels like permission because the user explicitly directed the work, but the governance requires a *durable, reviewable artifact* (the issue), not a transient conversation. + +**Why this matters:** Conversations are ephemeral. Issues are auditable. If an agent creates work based on a conversation and that conversation is lost (context compaction, session end), no record exists of what was authorized, what the acceptance criteria were, or why the work was started. + +### Enforcement mechanisms (planned) + +Prose governance is necessary but insufficient. The following enforcement points are planned to prevent bypass progressively. Mechanisms are deployed incrementally — see #186 for implementation tracking. 
+ +| Mechanism | Layer | What it catches | Status | +|-----------|-------|-----------------|--------| +| AGENTS.md directive | Agent prompt | Explicit instruction: "Do NOT begin implementation without an approved issue, even if the user says 'go ahead' in conversation" | Implemented | +| Branch name convention | Git workflow | Branch must match `(feat|fix|chore|docs)/-*` — rejects branches without issue reference | Planned | +| Commit-msg hook (Tier 0) | Pre-commit | Rejects commits without `Refs #N` or `Fixes #N` | Planned | +| Pre-push hook (Tier 1) | Pre-push | Validates referenced issue exists and has `approved` label via `gh` API | Planned | +| Claude Code hook (`PreToolUse: Write`) | Agent runtime | Blocks file creation in governed paths without declared issue context | Planned | +| Skill gate: `pickup-issue` | Agent workflow | Agent must invoke before implementation — hard-fails without valid issue | Planned | + +**Transition:** Branch naming and commit-msg rules apply to branches created after the corresponding hooks are deployed. Existing branches (including this PR's) pre-date enforcement. + +**Progressive enforcement:** Start with the commit-msg hook (cheapest, catches all contributors). Add pre-push validation next. Skill gates enforce at the agent-workflow level (see ADR-012, proposed, for the skill model). + ## Consequences - (+) Prevents duplicate effort — assignment signals ownership @@ -76,13 +116,18 @@ CI passes before requesting review. After merge, verify acceptance criteria and - (+) Prevents rework — predecessor validation catches out-of-order work - (+) Issue body stays current — threads are folded back - (+) Cross-reference audit catches duplicates early +- (+) Enforcement mechanisms catch bypass at multiple points - (-) Pre-start overhead for small tasks - (-) Requires discipline to fold threads into body +- (-) Commit-msg hook adds friction for rapid iteration on approved work - (!) Assumes priority labels exist and are maintained - (!) 
Self-assignment is not atomic — concurrent agents may race; mitigate by verifying assignment after claiming via refresh +- (!) Conversational approval bypass is the most common failure — enforcement must be structural, not behavioral ## References - Issue #134 — full RFC with open questions and automation requirements - Roadmap: Scale and collaboration (Agent swarm, Multi-user and teams) - ADR-001 — delivery methodology referenced by completion rules +- ADR-012 (proposed) — operational knowledge stack; planned enforcement via skill gates +- ADR-013 (proposed) — tiered validation; planned enforcement hooks at Tier 0 and Tier 1 diff --git a/docs/src/content/docs/decisions/005-feedback-loop.md b/docs/src/content/docs/decisions/005-feedback-loop.md new file mode 100644 index 0000000..174713f --- /dev/null +++ b/docs/src/content/docs/decisions/005-feedback-loop.md @@ -0,0 +1,72 @@ +--- +title: 005 feedback loop +--- + +# ADR-005: Feedback loop — PR reviews propagate to issues and ADRs + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +PR review comments are addressed locally (fix the code) but systemic issues they reveal are not propagated upstream. A reviewer says "this approach is wrong" but the issue still says "use this approach." ADRs are treated as immutable when they should be living decisions that evolve with implementation experience. + +Without a feedback protocol, review insights are lost, issue bodies rot, and architectural mistakes persist across stacked PR chains. + +## Decision + +### Review comment classification + +| Type | Action | Propagates to | +|------|--------|---------------| +| Nit (style, naming) | Fix in PR | Nothing | +| Bug (logic error) | Fix in PR | Nothing (unless systemic) | +| Design concern | Pause PR; evaluate | Issue body | +| Architecture challenge | Pause PR; escalate | ADR (supersede? amend?) 
| +| Scope question | Clarify | Issue body | +| Blocker (won't approve as-is) | Pause PR | Issue body | + +### Upstream propagation + +When a review surfaces a design concern or architecture challenge: + +1. **Pause** — Do not force-merge. Do not continue stacked PRs above this one. +2. **Assess** — Does this invalidate the issue's approach? The ADR's decision? +3. **Propagate** — Update the relevant upstream document (issue body, ADR, stacked PR dependents). +4. **Resolve** — Revise the approach, defend with evidence, or cancel the work. +5. **Resume** — Once resolved, unblock the PR and dependents. + +### ADR evolution + +| Trigger | Response | +|---------|----------| +| Implementation reveals the decision doesn't work | New RFC proposing a successor ADR | +| Reviewer challenges the architectural premise | `**UNRESOLVED**` on the issue; pause | +| New information makes the decision obsolete | Successor ADR with `Supersedes: ADR-NNN` | +| Decision works but needs refinement | Amend via PR (minor, no new ADR) | + +Never silently ignore a challenged decision. + +### Stacked PR chain revision + +When feedback on PR N invalidates PRs N+1 through N+M: +1. Comment on all affected PRs +2. Do not rebase dependent PRs until the base is stable +3. If architectural: re-evaluate whether the remaining stack is valid +4. If redesign needed: close dependent PRs, revise issue, re-plan + +## Consequences + +- (+) Review insights propagate to architectural decisions +- (+) Issue bodies stay current with implementation learnings +- (+) ADRs evolve rather than silently becoming outdated +- (+) Stacked PR chains have a defined recovery protocol +- (-) Adds process overhead to reviews (classification step) +- (-) Pausing stacked chains delays delivery +- (!) 
Requires discipline to actually propagate feedback upstream + +## References + +- Issue #136 — full RFC with open questions +- ADR-003 — governance (issue body as source of truth) +- ADR-001 — stacked PRs (chain revision protocol) diff --git a/docs/src/content/docs/decisions/006-feature-flags.md b/docs/src/content/docs/decisions/006-feature-flags.md new file mode 100644 index 0000000..da778eb --- /dev/null +++ b/docs/src/content/docs/decisions/006-feature-flags.md @@ -0,0 +1,86 @@ +--- +title: 006 feature flags +--- + +# ADR-006: Feature flags for concurrent development + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +Multiple agents working on related features in the same area must serialize — one waits for the other to merge. Incomplete features either block the main branch or require long-lived branches that diverge. SRE needs kill switches without reverting commits. + +Feature flags enable trunk-based development where incomplete work merges safely behind toggles, and concurrent contributors avoid blocking each other. + +## Decision + +### When to use flags + +| Situation | Use a flag? | +|-----------|-------------| +| Feature spans multiple PRs, incomplete state is unsafe | Yes | +| Two contributors touch the same module for different purposes | Yes | +| SRE needs a kill switch for a new capability | Yes | +| Simple refactor with no behavioral change | No | +| Bug fix | No | +| One-PR feature, complete on merge | No | + +### Flag ownership + +- Every flag has an owner (the issue that introduced it) +- Every flag has an expiration (the issue/PR that removes it) +- Flags without a removal plan are rejected in review + +### Separation of concerns + +- **Planners** decide which features get flags (issue/RFC level) +- **Implementors** add/use flags in code (PR level) +- **SRE/operators** toggle flags in production (runtime level) +- **No self-approval** — the person who introduces a flag cannot approve its removal + +### Flag lifecycle + +1. 
**Proposed** — issue identifies the need for a flag +2. **Introduced** — PR adds the flag (default: off) +3. **Active** — feature behind flag is in development +4. **Verified** — feature complete, flag toggled on in testing +5. **Permanent** — flag removed, feature is always-on (or removed entirely) + +### Lifecycle metadata + +Each flag must track: + +| Field | Required | Source | +|-------|----------|--------| +| Flag name | Yes | Code constant | +| Purpose / linked issue | Yes | Issue reference | +| First merge date | Yes | Auto from git log | +| Max lifetime | Yes | Declared at creation (default: 4 weeks) | +| Expected removal date | Yes | first_merge + max_lifetime | +| Actual removal date | — | Auto when flag deleted | +| Days active | — | Computed | + +### Maximum lifetime + +Flags must be removed within the declared max lifetime (default: 4 weeks) of the feature being verified. The max lifetime can be overridden per-flag with justification in the issue. Stale flags are treated as technical debt and surfaced in periodic reviews. + +### Mechanism constraint + +Flags MUST be resolvable at synth time for infrastructure flags and at runtime for behavior flags. The specific storage mechanism (CDK context, DynamoDB, SSM Parameter Store, env vars) is context-dependent and follows from this split — it is not prescribed by this ADR. + +## Consequences + +- (+) Concurrent work proceeds without blocking +- (+) Trunk-based development: main stays deployable +- (+) SRE can disable features without code changes +- (+) Partial features merge safely +- (-) Flag management overhead +- (-) Combinatorial testing complexity if many flags exist simultaneously +- (!) Maximum lifetime must be enforced or flags accumulate indefinitely + +## References + +- Issue #137 — full RFC with open questions on mechanism (CDK context vs. DynamoDB vs. 
env vars) +- ADR-003 — governance (flag introduction requires approval) +- ADR-005 — feedback loop (reviewer may flag-gate a feature during review) diff --git a/docs/src/content/docs/decisions/007-knowledge-acquisition.md b/docs/src/content/docs/decisions/007-knowledge-acquisition.md new file mode 100644 index 0000000..b137b2c --- /dev/null +++ b/docs/src/content/docs/decisions/007-knowledge-acquisition.md @@ -0,0 +1,83 @@ +--- +title: 007 knowledge acquisition +--- + +# ADR-007: Knowledge acquisition through progressive failure + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +Agents with fresh context (tabula rasa) attempt to follow documentation and hit gaps they cannot resolve. These gaps are silently worked around (agent asks a human) rather than systematically fixed. The system cannot self-improve its onboarding because failures are not captured. + +Knowledge acquisition starts from zero. Each iteration creates the roadmap to better knowledge by discovering gaps through actual failures. + +## Decision + +### Zero-context execution attempts + +Periodically, an agent with no project memory attempts to follow guides end-to-end. The agent follows ONLY what is written — no inference, no training data knowledge, no asking colleagues. + +### Failure capture protocol + +At each failure point, the agent: +1. **Stops** — does not attempt to work around or guess +2. **Documents** — creates an issue: which document, which step, what was missing +3. 
**Continues** — attempts the next step (if possible) to find additional gaps + +### Retrospectives + +After completing a task, project milestone, or sprint, agents produce a retrospective artifact: +- What worked well (patterns to repeat) +- What failed or caused friction (patterns to avoid) +- Actionable experiments for future workflows + +Retrospectives are a first-class knowledge artifact — they feed into documentation improvements, inform ADR amendments, and surface systemic issues that individual task failures cannot. + +### Knowledge artifacts (interim) + +Until documentation meets ADR-004, agents may create ephemeral artifacts: +- Semantic indices of the codebase (call graphs, dependency maps) +- Annotated walkthroughs of successful executions +- "What I learned" summaries after completing a task +- Retrospectives (see above) + +These are scaffolding that informs documentation improvements, not documentation themselves. + +### Maturity model + +| Level | State | Agent capability | +|-------|-------|-----------------| +| 0 | No docs | Cannot start; files issue for missing docs | +| 1 | Partial docs | Follows docs, stops at gaps, files issues | +| 2 | Complete docs (ADR-004) | Completes end-to-end without help | +| 3 | Self-improving | Detects drift between docs and code, auto-files issues | + +### The self-improvement loop + +``` +Agent starts fresh → follows docs → hits failure → + files issue → issue gets fixed → next agent goes further → + hits next failure → files issue → ... + until end-to-end works from zero context +``` + +This runs continuously because code changes outpace documentation and different agent implementations fail at different points. 
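
The loop above can be sketched as a small simulation. This is a minimal illustration, not part of the ADR: the `GuideStep` and `GapIssue` shapes, the `blocksProgress` flag, and all function names are hypothetical, and "fixing an issue" is modelled as simply marking the step documented.

```typescript
// Illustrative sketch only: every type and function name here is hypothetical.
interface GuideStep {
  doc: string;
  step: number;
  documented: boolean;     // false marks a gap the agent cannot resolve from the text
  blocksProgress: boolean; // assumption: some gaps prevent attempting later steps
}

interface GapIssue {
  doc: string;
  step: number;
  title: string;
}

// One zero-context run: stop at each gap, document it, continue if possible.
function zeroContextRun(steps: GuideStep[]): GapIssue[] {
  const issues: GapIssue[] = [];
  for (const s of steps) {
    if (s.documented) continue;
    issues.push({ doc: s.doc, step: s.step, title: `Docs gap: ${s.doc} step ${s.step}` });
    if (s.blocksProgress) break; // cannot attempt later steps in this run
  }
  return issues;
}

// One reading of the maturity table (levels 0 to 3).
function maturityLevel(steps: GuideStep[], detectsDrift: boolean): 0 | 1 | 2 | 3 {
  if (steps.length === 0 || steps.every((s) => !s.documented)) return 0;
  if (steps.some((s) => !s.documented)) return 1;
  return detectsDrift ? 3 : 2;
}

// Repeat runs, fixing every filed issue between runs, until a run is clean.
function improveUntilClean(initial: GuideStep[], maxRuns = 10): { runs: number; steps: GuideStep[] } {
  let steps = initial.map((s) => ({ ...s }));
  for (let runs = 1; runs <= maxRuns; runs++) {
    const issues = zeroContextRun(steps);
    if (issues.length === 0) return { runs, steps };
    steps = steps.map((s) =>
      issues.some((i) => i.doc === s.doc && i.step === s.step) ? { ...s, documented: true } : s,
    );
  }
  return { runs: maxRuns, steps };
}
```

In this model a blocking gap hides later gaps until it is fixed, which is why several runs may be needed before a fresh agent completes end-to-end.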
+ +## Consequences + +- (+) Documentation gaps become bugs with reproduction steps +- (+) Priority ordering emerges naturally (most common failures surface first) +- (+) The system self-improves without human identification of gaps +- (+) Creates a natural definition of "docs are done" (Level 2 achieved) +- (-) Generates issue volume that needs triage +- (-) Requires periodic investment in zero-context test runs +- (!) The gap between Level 1 and Level 2 may be large — patience required + +## References + +- Issue #138 — full RFC with open questions +- ADR-004 — defines the quality target (tabula rasa test) +- ADR-003 — governance for issues filed by failing agents +- ADR-008 — Level 4 Definition of Done depends on this protocol diff --git a/docs/src/content/docs/decisions/008-definition-of-done.md b/docs/src/content/docs/decisions/008-definition-of-done.md new file mode 100644 index 0000000..caeda51 --- /dev/null +++ b/docs/src/content/docs/decisions/008-definition-of-done.md @@ -0,0 +1,86 @@ +--- +title: 008 definition of done +--- + +# ADR-008: Definition of Done (progressive maturity) + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +"Done" is implicit and varies by contributor. Some consider a passing build sufficient; others expect documentation, tests, and deployment verification. Agents have no unambiguous checklist to know they have completed work. Over-engineering "done" early blocks velocity; under-defining it ships incomplete work. + +The definition must be progressive — rising as the project matures — so it does not block early momentum but ensures quality at scale. 
+ +## Decision + +### Progressive levels + +**Level 1 — Basic (minimum viable):** +- Code compiles without errors +- Existing tests pass (no regressions) +- New code has tests (unit level minimum) +- Linting passes +- PR description explains what and why +- Linked issue exists + +**Level 2 — Standard (current project default):** +- All of Level 1 +- Pre-commit hooks pass +- CDK synth succeeds (if infrastructure changes) +- Security scans pass (no new HIGH/CRITICAL findings) +- Documentation updated if behavior changes +- Starlight mirrors synced (if docs changed) + +**Level 3 — Rigorous (critical paths):** +- All of Level 2 +- Integration or E2E test covers the happy path +- Error paths tested +- Reviewer approved (human or qualified agent) +- Deployed to ephemeral stack and smoke-tested (if infrastructure) +- ADR written (if architectural decision made) + +**Level 4 — Self-verifying (future target):** +- All of Level 3 +- Tabula rasa agent can replicate the outcome using only docs +- CI includes behavioral verification +- Documentation drift detection passes + +### Default level by issue type + +| Issue type | Default level | +|-----------|---------------| +| Bug fix | Level 2 | +| New feature | Level 2-3 (based on blast radius) | +| Infrastructure/IAM change | Level 3 | +| Documentation only | Level 1 | +| Security fix | Level 3 | +| RFC/ADR implementation | Level 2 + ADR written | + +Issues may override by specifying `Done: Level N` in the body. 
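
As a rough sketch of how the default table and the `Done: Level N` override could be resolved in code: the function name, the `IssueType` labels, and the `highBlastRadius` parameter are assumptions made for illustration, and the ADR does not prescribe how blast radius is judged.

```typescript
// Hypothetical names throughout; the ADR does not define an API for this.
type IssueType = "bug" | "feature" | "infrastructure" | "docs" | "security" | "rfc";

interface DoneRequirement {
  level: number;        // Definition-of-Done level, 1 through 4
  adrRequired: boolean; // true for RFC/ADR implementation work
}

// Defaults copied from the "Default level by issue type" table.
const DEFAULT_LEVEL: Record<IssueType, number> = {
  bug: 2,
  feature: 2, // raised to 3 below when blast radius is high
  infrastructure: 3,
  docs: 1,
  security: 3,
  rfc: 2,
};

function resolveDoneLevel(
  type: IssueType,
  issueBody: string,
  highBlastRadius = false, // assumption: the caller has already judged blast radius
): DoneRequirement {
  let level = DEFAULT_LEVEL[type];
  if (type === "feature" && highBlastRadius) level = 3;

  // An explicit "Done: Level N" in the issue body overrides the default.
  const override = /Done:\s*Level\s*([1-4])/.exec(issueBody);
  if (override) level = Number(override[1]);

  return { level, adrRequired: type === "rfc" };
}
```

Under these assumptions, a documentation-only issue whose body contains `Done: Level 3` resolves to Level 3 rather than the default Level 1.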
+ +### Verification responsibility + +| Level | Who verifies | +|-------|-------------| +| 1 | CI (automated) | +| 2 | CI + self-check by implementor | +| 3 | CI + reviewer + implementor | +| 4 | CI + reviewer + independent agent | + +## Consequences + +- (+) Agents have an unambiguous completion checklist +- (+) Quality bar rises as the project matures +- (+) Over-engineering is prevented (Level 1 for simple docs changes) +- (+) Critical paths get rigorous verification (Level 3) +- (-) Requires labeling or explicit level assignment per issue +- (-) Level 4 is aspirational and depends on ADR-007 (knowledge acquisition) +- (!) The project must eventually graduate from Level 2 to Level 3 default + +## References + +- Issue #139 — full RFC with open questions +- ADR-003 — governance (defines when to start; this defines when to stop) +- ADR-007 — knowledge acquisition (Level 4 depends on tabula rasa verification) diff --git a/docs/src/content/docs/decisions/009-security-posture-dev-agents.md b/docs/src/content/docs/decisions/009-security-posture-dev-agents.md new file mode 100644 index 0000000..7fa62f8 --- /dev/null +++ b/docs/src/content/docs/decisions/009-security-posture-dev-agents.md @@ -0,0 +1,77 @@ +--- +title: 009 security posture dev agents +--- + +# ADR-009: Security posture and blast radius for development-time agents + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +The existing `SECURITY.md` covers runtime agent execution (inside MicroVMs). It does not cover **development-time agents** — those writing code, creating PRs, and modifying infrastructure in this repository. A development-time agent operates with the credentials of whoever invoked it, creating a risk of self-approval, policy modification, and unbounded blast radius. + +The core principle: **planners and implementors must be separated by context and ideally by identity. 
No self-approval.** + +## Decision + +### Role separation + +| Role | Can do | Cannot do | +|------|--------|-----------| +| **Planner** | Create/edit issues, write RFCs/ADRs, define roadmap and revisit vision | Write code, push branches, approve PRs | +| **Implementor** | Write code, create PRs, push branches, run tests | Approve own PRs, merge own PRs, modify CI/security config | +| **Reviewer** | Approve PRs, request changes, merge, suggest code (no commits) | Write code on the same PR being reviewed | +| **Admin** | All of the above + modify policies, approve issues | Still requires 2P for policy changes | + +### Blast radius classification + +| Action | Risk | Gate | +|--------|------|------| +| Edit code in existing patterns | Low | CI + peer review | +| Add new dependency | Medium | Security scan + review | +| Modify IAM policy / security config | High | 2P review + admin approval | +| Modify CI/CD workflow | High | 2P review + admin approval | +| Modify branch protection / approval rules | Critical | Admin-only + audit trail | +| Modify governance ADRs | Critical | Admin-only + 2P review | +| Delete or force-push protected branches | Critical | Never automated; human-only | + +### 2P (two-person) review + +For High and Critical actions: +- The author cannot be one of the two approvers +- At least one approver must be a human +- Approvals reference the specific risk being accepted + +### No self-approval (structural) + +- Branch protection requires review from someone other than the pusher +- Approval cannot come from the last committer on the branch +- If an agent plans AND implements, review must come from an identity that did neither +- The identity that writes code cannot approve or merge it + +### Credential scoping + +| Agent context | Minimum credentials | +|---------------|-------------------| +| Planning (issues, RFCs) | GitHub Issues write, read-only repo | +| Implementation (code, PRs) | Repo write, PR create, no merge capability | +| Review | PR 
review write, no push capability | +| Deployment | Separate deploy key, environment approval gate | + +## Consequences + +- (+) Prevents self-approval of dangerous changes +- (+) Blast radius is explicit and enforceable +- (+) Role separation enables audit trail +- (+) 2P review catches compromised or confused agents +- (-) Credential management complexity increases +- (-) Small tasks require multi-identity orchestration +- (!) Personal PATs grant all permissions — structural enforcement requires GitHub Apps or fine-grained tokens + +## References + +- Issue #140 — full RFC with open questions +- `docs/design/SECURITY.md` — runtime agent security (complementary) +- Cedar HITL gates (PR #88) — runtime tool-call governance +- ADR-003 — governance (approval gates enforced here technically) diff --git a/docs/src/content/docs/decisions/010-error-recovery.md b/docs/src/content/docs/decisions/010-error-recovery.md new file mode 100644 index 0000000..d16c44b --- /dev/null +++ b/docs/src/content/docs/decisions/010-error-recovery.md @@ -0,0 +1,73 @@ +--- +title: 010 error recovery +--- + +# ADR-010: Error recovery and rollback protocol + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +When merged code breaks something, the response is ad-hoc. Agents operating autonomously may merge code that passes CI but breaks integration. No protocol defines when to revert vs. fix forward, who decides, or how stacked PR chains recover. + +## Decision + +### Decision tree + +``` +Broken thing detected +├─ Production affected (users impacted NOW)? +│ └─ Yes → REVERT immediately, investigate after +├─ Fix obvious and < 30 minutes? +│ └─ Yes → Fix forward (new PR, not amend) +├─ Stacked PR chain? +│ └─ Yes → Pause dependent PRs, fix the base +└─ Scope of damage unclear? + └─ Yes → REVERT (safe default), then investigate +``` + +### Revert protocol + +1. Create a revert commit (not force-push) — preserves history +2. 
Open an issue: what broke, why CI did not catch it, what the fix needs +3. The fix goes through normal review (no rushing, no skipping gates) + +### Fix-forward protocol + +1. Only if the fix is obvious, small, and low-risk +2. Must still go through PR + review +3. If the fix introduces new complexity — revert instead + +### Stacked PR chain recovery + +1. Identify which PR introduced the breakage +2. Pause/close all PRs above it +3. Fix the base PR +4. Rebase and re-evaluate dependent PRs +5. Re-run CI on each before re-opening + +### Agents must NEVER do during recovery + +- Force-push to shared branches +- Delete branches with others' work +- Amend published commits +- Skip review "because it's urgent" +- Self-approve a revert + +## Consequences + +- (+) Clear decision tree prevents analysis paralysis during incidents +- (+) Revert-first default limits blast radius +- (+) Stacked chain recovery is defined (not improvised) +- (+) History is preserved (revert commits, not force-push) +- (-) Reverts create noise in git history +- (-) Fix-forward temptation may lead to rushed fixes +- (!) "Production affected" requires definition per deployment (self-hosted varies) + +## References + +- Issue #141 — full RFC with open questions +- ADR-003 — governance (no bypasses during recovery) +- ADR-001 — stacked PRs (chain recovery protocol) +- ADR-009 — security (revert authority tied to role) diff --git a/docs/src/content/docs/decisions/011-conflict-resolution.md b/docs/src/content/docs/decisions/011-conflict-resolution.md new file mode 100644 index 0000000..b9068b6 --- /dev/null +++ b/docs/src/content/docs/decisions/011-conflict-resolution.md @@ -0,0 +1,68 @@ +--- +title: 011 conflict resolution +--- + +# ADR-011: Conflict resolution protocol + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +Multiple concurrent contributors — human or AI — will propose incompatible approaches, create merge conflicts, and disagree on design. 
Without a defined escalation path, work stalls or the loudest voice wins. + +## Decision + +### Escalation ladder + +``` +Level 1: Contributor discussion (PR comments, issue thread) + ↓ (no resolution within 2 interactions) +Level 2: Request additional reviewer (fresh perspective) + ↓ (still no resolution) +Level 3: Competing proposals in the issue body (explicit trade-off comparison) + ↓ (still no resolution) +Level 4: Admin decision (binding, documented in issue body) +``` + +### Decision criteria + +When comparing approaches, evaluate on: +1. **Correctness** — does it solve the stated problem? +2. **Simplicity** — fewer moving parts wins when correctness is equal +3. **Consistency** — follows existing codebase patterns? +4. **Reversibility** — can we change our mind later? +5. **Blast radius** — what breaks if this is wrong? + +### Merge conflict ownership + +| Situation | Who resolves | +|-----------|-------------| +| Two PRs modify same file, one merged first | Second PR's author rebases | +| Stacked PR conflict from lower change | Lower PR author notifies; upper PRs rebase after stable | +| Concurrent agents modified same module | First to merge wins; second adapts | +| Architectural conflict (both valid) | Escalate to Level 3 | + +### Human vs. agent disagreement + +- Agents present evidence (code, tests, measurements) not authority +- Humans can override but must document why +- Agents do not repeatedly argue a rejected point +- If an agent believes a human decision causes harm (security, data loss), it escalates to admin + +## Consequences + +- (+) Disagreements have a defined path to resolution +- (+) Merge conflicts have clear ownership +- (+) Competing approaches are compared on criteria, not authority +- (+) Admin decision is the final backstop (no infinite loops) +- (-) Escalation takes time; may slow delivery +- (-) Level 3 (written trade-off) requires effort +- (!) 
Must not become a veto mechanism for slow contributors + +## References + +- Issue #142 — full RFC with open questions +- ADR-003 — governance (issue body as resolution record) +- ADR-005 — feedback loop (reviewer disagreements feed into this) +- ADR-009 — security (authority levels for decisions) diff --git a/docs/src/content/docs/decisions/012-operational-knowledge-stack.md b/docs/src/content/docs/decisions/012-operational-knowledge-stack.md new file mode 100644 index 0000000..5c6ab7a --- /dev/null +++ b/docs/src/content/docs/decisions/012-operational-knowledge-stack.md @@ -0,0 +1,269 @@ +--- +title: 012 operational knowledge stack +--- + +# ADR-012: Operational knowledge as a three-layer stack (Decision → Guide → Skill) + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +Several ADRs in this repository contain operational runbook material embedded directly in the decision record. ADR-003 (contribution governance) prescribes a full pre-start review checklist. ADR-010 (error recovery) defines a decision tree and step-by-step protocols. ADR-008 (definition of done) provides per-issue-type checklists. + +This creates three problems: + +1. **Stale procedures** — Teams hesitate to update ADRs for minor procedural tweaks (timing thresholds, label names), so runbooks drift from practice. +2. **Agent execution gap** — Agents must parse prose ADRs, extract the operational steps, and interpret judgment calls. The ADR format is optimized for decision rationale, not execution. +3. **Persona mismatch** — A planner reading ADR-003 for the governance philosophy gets bogged down in GraphQL query syntax. An implementor executing the pre-start checklist must skip rationale paragraphs to find the steps. + +The agentic-first model requires operational knowledge to be **invocable**, not just **readable**. An agent should execute a governance workflow the same way it invokes a tool — with defined inputs, gates, and outputs. 
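
To make "invocable, with defined inputs, gates, and outputs" concrete, here is one possible shape for such a unit. It is a sketch under stated assumptions: the `Skill` and `Gate` interfaces, `invokeSkill`, and the `claimIssue` example are all hypothetical and do not describe the Claude Code skill format or any existing tool.

```typescript
// Hypothetical shapes for illustration; not an existing skill or plugin API.
interface Gate<I> {
  name: string;
  check: (input: I) => boolean; // must be true for the skill to proceed
}

interface Skill<I, O> {
  name: string;
  gates: Gate<I>[];
  execute: (input: I) => O;
}

// Run every gate first and fail loudly at the first one that does not hold.
function invokeSkill<I, O>(skill: Skill<I, O>, input: I): O {
  for (const gate of skill.gates) {
    if (!gate.check(input)) {
      throw new Error(`${skill.name}: gate "${gate.name}" failed, stopping`);
    }
  }
  return skill.execute(input);
}

// Example input and output for a hypothetical issue-claiming skill.
interface ClaimInput {
  issueNumber: number;
  approved: boolean;
  assignees: string[];
}

interface ClaimOutput {
  comment: string;
}

const claimIssue: Skill<ClaimInput, ClaimOutput> = {
  name: "claim-issue",
  gates: [
    { name: "issue is approved", check: (i) => i.approved },
    { name: "issue is unassigned", check: (i) => i.assignees.length === 0 },
  ],
  execute: (i) => ({ comment: `Starting implementation on #${i.issueNumber}` }),
};
```

The point of the structure is that the gate check is not left to the agent's reading of prose: an unapproved issue makes `invokeSkill` throw before any work starts.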
+ +## Decision + +### Three-layer operational knowledge stack + +Every operational procedure identified in an ADR is decomposed into three layers: + +``` +┌─────────────────────────────────────────┐ +│ Layer 1: ADR (Decision Record) │ Immutable-ish +│ WHY we do it this way │ Changes: decision is superseded +│ Consumer: architects, future deciders │ +└─────────────────────────┬───────────────┘ + │ references +┌─────────────────────────▼───────────────┐ +│ Layer 2: Guide (Reference Document) │ Living document +│ WHAT to do, organized by persona │ Changes: process is refined +│ Consumer: humans + agents needing │ +│ context │ +└─────────────────────────┬───────────────┘ + │ operationalized by +┌─────────────────────────▼───────────────┐ +│ Layer 3: Skill (Executable Runbook) │ Versioned, invocable +│ HOW to execute, with gates and outputs │ Changes: implementation shifts +│ Consumer: agents during execution │ +└─────────────────────────────────────────┘ +``` + +### Layer definitions + +**Layer 1 — ADR (Decision Record)** + +- Records the architectural or process decision and its rationale +- States WHAT was decided and WHY +- Does NOT contain step-by-step procedures (those belong in Layer 2/3) +- References the guide(s) that operationalize the decision +- Changes only when the decision itself is superseded or amended + +**Layer 2 — Guide (Reference Document)** + +- Lives in `docs/guides/` +- Organized by persona (planner, implementor, reviewer, admin) +- Contains the WHAT and WHEN — what to do in which situations +- Includes context that helps humans (and agents needing background) understand the workflow +- References the ADR for justification +- Links to the skill(s) that mechanize the workflow +- Changes when the process is refined + +**Layer 3 — Skill (Executable Runbook)** + +- Lives as a Claude Code skill (or plugin skill) — invocable by name +- Encodes the HOW — the mechanical execution with explicit gates, inputs, outputs +- Structured as bounded, invocable 
units with clear entry/exit criteria +- An agent invokes the skill rather than parsing the guide/ADR +- References the guide for context when judgment is needed +- Changes when implementation details shift + +### Reference direction + +References point upward for context and justification, and the ADR carries one downward pointer to the guide that operationalizes it: + +- Skill → references Guide (for context) +- Guide → references ADR (for justification) +- ADR → references Guide (for operationalization, "see Guide X for the workflow") + +This means a change at any layer triggers review of layers below: + +- ADR amended → review Guide → review Skill +- Guide refined → review Skill +- Skill updated → no upstream change needed (unless the procedure itself changed) + +### When a layer is NOT needed + +| Situation | Layers needed | +|-----------|---------------| +| Pure policy decision (no steps to follow) | ADR only | +| Decision with human-executed steps (rare, non-repeatable) | ADR + Guide | +| Decision with agent-executable procedure | ADR + Guide + Skill | +| Lightweight procedure (< 3 steps, no gates) | ADR + Guide (skill is overhead) | + +### ADR content rules (post-adoption) + +After adoption, ADRs: +- **MUST** contain: Context, Decision (the choice made), Consequences, References +- **MUST NOT** contain: Step-by-step procedures, checklists with >3 items, decision trees with branches, protocol sequences +- **SHOULD** contain: A one-paragraph summary of the operational approach (enough to understand without reading the guide) +- **SHOULD** reference: The guide that operationalizes the decision + +Existing ADRs are updated incrementally (not rewritten) — operational content is extracted, and a reference to the new guide/skill is added. 
+ +### Skill structure requirements + +Skills that operationalize ADRs must: +- State which ADR/guide they implement (in frontmatter or header) +- Define explicit gates (conditions that MUST be true to proceed) +- Define explicit outputs (what the skill produces on completion) +- Be independently invocable (no implicit state from prior skills) +- Fail loudly at gates (not silently skip) + +## Example: ADR-003 decomposition + +ADR-003 (Contribution Governance) is the first ADR to be decomposed under this pattern because it is the most frequently executed procedure and the dependency root for other governance ADRs. + +### Current state (ADR-003 contains everything) + +ADR-003 currently holds: +- The decision to govern contributions (rationale) ✓ belongs in ADR +- Pre-start review checklist (8 mechanical steps) ✗ belongs in Guide + Skill +- Priority evaluation procedure ✗ belongs in Guide + Skill +- Predecessor validation with GraphQL queries ✗ belongs in Skill +- Cross-reference audit steps ✗ belongs in Guide + Skill +- Work-in-progress discipline rules ✗ belongs in Guide +- Completion and handoff procedure ✗ belongs in Guide + Skill + +### Target state (three layers) + +**Layer 1 — ADR-003 (slimmed)** + +Retains: +- Context (why governance is needed for async agents) +- Decision summary: "Every contribution follows: issue → approval → assignment → pre-start validation → implementation → completion" +- The principles: no PRs without issues, issue quality bar, admin approval gate, no self-approval, GraphQL as authoritative dependency source +- Consequences +- Reference: "See `docs/guides/CONTRIBUTOR_WORKFLOW.md` for the full workflow" + +Removes (extracted to Guide/Skill): +- The detailed pre-start review checklist +- GraphQL query specifics +- Step-by-step completion protocol + +**Layer 2 — `docs/guides/CONTRIBUTOR_WORKFLOW.md`** + +Organized by persona: + +```markdown +# Contributor Workflow + +> Operationalizes [ADR-003](/architecture/003-contribution-governance) + 
+## For Planners +- Issue quality bar (what makes an issue "ready") +- Approval process +- Priority labeling +- Dependency graph maintenance + +## For Implementors +- How to pick up an issue +- Pre-start review (summary — invoke skill for execution) +- Work-in-progress signals +- Completion criteria (references ADR-008 guide) + +## For Reviewers +- Review comment classification (references ADR-005 guide) +- When to block vs. approve +- Propagation responsibilities +``` + +**Layer 3 — Skills (invocable by agents)** + +| Skill | Inputs | Gates | Outputs | +|-------|--------|-------|---------| +| `pickup-issue` | Issue number | Issue approved, unassigned, no unresolved conflicts, predecessors complete (GraphQL check) | Assignment confirmed, "Starting implementation" comment | +| `validate-dependencies` | Issue number | GraphQL `blockedBy` returns no open blockers | Dependency report (clear / blocked with reason) | +| `complete-work` | Issue number, PR number | CI passes, DoD level met (ADR-008), no stale assignments | Completion comment, follow-up issues created | +| `cross-reference-audit` | Issue number | No duplicate issues, no conflicting open PRs | Audit report (clear / conflicts listed) | + +Each skill is a bounded unit. An agent picking up work invokes `pickup-issue` — it doesn't read ADR-003 and improvise. + +## Why prose alone fails: observed failure mode + +This ADR was itself initially created in violation of ADR-003. The agent (author) had ADR-003 loaded in context, analyzed it, called it "ready for contributing" — then immediately began implementation without creating an issue, requesting approval, or self-assigning. + +**The rationalization chain:** +1. "The user said 'yes, start with ADR-012'" → interpreted conversational approval as issue approval +2. "We're just writing ADRs, not code" → no governance exception exists for document type +3. "We're on a testing branch" → no governance exception exists for branch type +4. 
"Momentum — we're exploring" → governance exists precisely to interrupt unstructured momentum + +**What this proves:** An agent with full knowledge of the governance rules will still bypass them when the rules are prose-only. The agent *understood* ADR-003 intellectually but had no structural enforcement preventing violation. Reading a rule is not the same as being gated by it. + +**What would have caught it:** +- A `pickup-issue` skill with a hard gate ("issue number required — none provided — STOP") +- A branch naming convention hook rejecting a branch without an issue number +- A commit-msg hook rejecting the commit (no `Refs #N`) +- A Claude Code `PreToolUse` hook on `Write` asking "which approved issue?" + +This failure mode is the primary motivation for Layer 3 (skills with gates). Prose governance (Layer 1) establishes the rule. Guides (Layer 2) explain how to follow it. But only executable skills with hard gates (Layer 3) *enforce* it at the point of action. + +## Migration plan + +### Phase 1: Establish pattern (this ADR) + +- Adopt this ADR +- No existing ADRs are modified yet (operational content stays in place until guides/skills exist) + +### Phase 2: Decompose ADR-003 (proof of concept) + +- Create `docs/guides/CONTRIBUTOR_WORKFLOW.md` +- Create skills: `pickup-issue`, `validate-dependencies`, `complete-work`, `cross-reference-audit` +- Slim ADR-003 to decision + rationale + reference to guide +- Validate: an agent can invoke the skills and complete the governance workflow + +### Phase 3: Decompose remaining ADRs (incremental) + +Priority order (by execution frequency and mechanical content): + +| ADR | Guide | Skills | +|-----|-------|--------| +| 010 (Error Recovery) | `ERROR_RECOVERY.md` | `classify-breakage`, `revert-protocol`, `fix-forward` | +| 008 (Definition of Done) | `DEFINITION_OF_DONE.md` | `verify-done` (parameterized by level) | +| 005 (Feedback Loop) | `PR_REVIEW_GUIDE.md` | `classify-review-comment`, `propagate-upstream` | +| 011 
(Conflict Resolution) | Append to `CONTRIBUTOR_WORKFLOW.md` | `resolve-conflict` (escalation ladder) | + +ADRs without operational content (001, 002, 004, 006, 007, 009) remain unchanged. + +### Phase 4: Plugin marketplace (future) + +Skills become shareable across projects: +- Fork governance skills for team-specific thresholds +- Compose skills from multiple ADRs into project-specific workflows +- Version skills independently from the ADRs that justify them + +## Consequences + +- (+) ADRs stay stable as decision records — not burdened with procedure maintenance +- (+) Guides serve the human reader organized by what they need to do +- (+) Skills make agents execute consistently — no prose interpretation, no drift +- (+) Change cadence is appropriate per layer — procedures evolve without "amending an ADR" +- (+) The three layers serve different consumers without redundancy +- (+) Skills are testable — you can verify an agent follows the procedure correctly +- (+) Hard gates in skills prevent the "understood but violated" failure mode +- (-) Three artifacts per procedure increases maintenance surface +- (-) Migration of existing ADRs requires effort +- (-) Skill development requires understanding the skill format and tooling +- (!) Reference chain integrity must be maintained — a broken link between layers means drift goes undetected +- (!) Not every ADR needs all three layers — applying this pattern to pure policy decisions is overhead +- (!) 
Without Layer 3 enforcement, Layers 1 and 2 are advisory-only — agents WILL rationalize bypasses + +## References + +- Issue #148 — implementation tracking for this ADR +- ADR-003 — first decomposition target (contribution governance); enforcement mechanisms added +- ADR-004 — documentation quality standard (guides must meet tabula rasa test) +- ADR-007 — knowledge acquisition (skills enable Level 3 self-improving) +- ADR-008 — definition of done (skill `verify-done` is a natural fit) +- ADR-010 — error recovery (decision tree is a natural skill) +- ADR-013 (proposed) — tiered validation pyramid; depends on this ADR for skill-based agent interaction with validation tiers +- [agentskills.io](https://agentskills.io/) — skill marketplace concept for shareable operational knowledge +- Claude Code plugin/skill format — the implementation vehicle for Layer 3 diff --git a/docs/src/content/docs/decisions/013-tiered-validation-pyramid.md b/docs/src/content/docs/decisions/013-tiered-validation-pyramid.md new file mode 100644 index 0000000..610abcc --- /dev/null +++ b/docs/src/content/docs/decisions/013-tiered-validation-pyramid.md @@ -0,0 +1,217 @@ +--- +title: 013 tiered validation pyramid +--- + +# ADR-013: Tiered validation pyramid for agentic-first development + +**Status:** proposed +**Date:** 2026-05-19 + +## Context + +The current validation architecture has two operational tiers: + +- **Pre-commit hooks** (< 5s) — formatting, secrets scan, file-level linting +- **Remote CI** (5–20 min) — full build, test, synth, security scans, deploy verification + +The gap between these tiers is significant. When an agent (or human) makes a change that passes pre-commit but fails in CI, the feedback loop is: + +``` +Write code → commit → push → wait 5-20 min → CI fails → + read failure → fix → commit → push → wait 5-20 min → ... +``` + +For a human, this is annoying. 
For an autonomous agent, this is catastrophic: + +- **Compute waste** — the agent idles or context-switches while waiting for remote results +- **Context loss** — by the time CI reports back, the agent may have compacted context or moved on +- **Cascade failures** — in a stacked PR chain (ADR-001), a CI failure on PR 1 blocks PRs 2–N, multiplying the wait +- **Cost amplification** — each round-trip costs inference tokens for the agent to re-read the failure, re-analyze, and re-attempt + +The root cause: there is no **Tier 2** — a local, fast, high-fidelity validation layer that catches integration-level issues *before* pushing to remote. + +### What exists today + +| Tier | Time | What it catches | Gap | +|------|------|-----------------|-----| +| Pre-commit (Tier 0) | < 5s | Formatting, secrets, trailing whitespace | None — works well | +| mise build (Tier 1) | 30–90s | Compile, unit tests, CDK synth, docs sync, linting | Partial — available but not gated on push | +| Remote CI (Tier 3) | 5–20 min | Full matrix, security, E2E, deploy | Authoritative but slow | +| **Local integration (Tier 2)** | — | **Does not exist** | Integration-level validation without remote round-trip | + +### Agentic-first motivation + +In a repo where agents run autonomously (ABCA's own design goal), validation speed directly determines: + +- **Throughput** — an agent with 30s feedback loops delivers 10–20x more iterations per hour than one with 15-minute loops +- **Quality** — fast feedback enables test-driven approaches; slow feedback encourages "push and pray" +- **Cost** — fewer remote CI runs, fewer wasted inference tokens on retry cycles +- **Autonomy** — an agent that can self-validate locally needs fewer human interventions + +## Decision + +### The validation pyramid + +``` + ┌─────────┐ + │ Tier 3 │ Remote CI (authoritative) + │ 5-20min │ Full matrix, deploy, E2E + ─┴─────────┴─ + ┌─────────────┐ + │ Tier 2 │ Local sandbox (high-fidelity) + │ 1-5 min │ Integration, ephemeral 
stack + ─┴─────────────┴─ + ┌─────────────────┐ + │ Tier 1 │ Local build (fast check) + │ 30-90s │ Compile, unit test, synth + ─┴─────────────────┴─ + ┌─────────────────────┐ + │ Tier 0 │ Pre-commit (gate) + │ < 5s │ Format, lint, secrets + └─────────────────────┘ +``` + +Each tier is **necessary but not sufficient** — passing a lower tier is required before attempting the next. Higher tiers never repeat work done by lower tiers. + +### Tier definitions + +**Tier 0 — Pre-commit (< 5s, gates every commit)** + +- Trailing whitespace, end-of-file fix +- Merge conflict markers +- Secrets scan (gitleaks) +- ESLint (file-level, staged files only) +- Docs sync check (no stale mirrors) +- YAML/JSON syntax validation + +Status: **Implemented** (prek hooks) + +**Tier 1 — Local build (30–90s, gates push)** + +- TypeScript compilation (all packages) +- Unit test suite (Jest) +- CDK synth (CloudFormation template generation) +- Agent quality checks (Python linting, type checking) +- Docs site build (astro check) +- Type sync drift (CDK ↔ CLI types in sync) +- Constants drift (cross-language contract check) + +Status: **Partially implemented** — available as `mise run build` but not enforced as a push gate. Agents can invoke this but often skip it. + +Requirement: Make `mise run build` (or a subset) the pre-push gate. Consider splitting into `mise run check:fast` (compile + lint, 30s) and `mise run check:full` (compile + test + synth, 90s). + +**Tier 2 — Local sandbox (1–5 min, on-demand before PR)** + +This tier does not exist today. 
It should provide: + +- Container-based integration tests against mocked AWS services (LocalStack or moto) +- CDK deploy to a local/ephemeral sandbox (validate IAM, resource creation without real cloud) +- Agent runtime smoke test (run the agent pipeline against a test repo in a local container) +- Cross-package integration (API → handler → agent contract verification) +- Policy validation (Cedar policy evaluation against test fixtures) + +Status: **Gap — does not exist.** This is the primary investment needed. + +Progressive build-out: + +| Phase | Capability | Mechanism | Catches | +|-------|-----------|-----------|---------| +| 2a | Container integration tests | `mise run test:integration` → Docker Compose + LocalStack | AWS API call failures, DynamoDB schema issues, SQS message format | +| 2b | Agent pipeline smoke | `mise run test:agent-smoke` → build agent container, run against fixture repo | Agent crashes, tool failures, prompt regressions | +| 2c | Ephemeral stack deploy | `mise run deploy:ephemeral` → CDK deploy to a disposable environment with auto-destroy | IAM permission gaps (ADR-002 preflight), resource wiring, real API behavior | +| 2d | Full local sandbox | `mise run sandbox` → MicroVM matching prod topology | End-to-end flow in production-equivalent isolation | + +**Tier 3 — Remote CI (5–20 min, authoritative, gates merge)** + +- Full test matrix (multiple Node versions if applicable) +- Security scans (Semgrep SAST, OSV deps, Grype container, Retire.js, zizmor) +- CDK diff against deployed stack +- Multi-account deployment verification +- E2E tests against real AWS services +- Performance/cost regression checks +- Documentation mutation check (fail if Starlight mirrors are stale) + +Status: **Implemented** (GitHub Actions). This remains the authoritative gate for merge. 
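The ordering rule behind the tier definitions above (a tier is attempted only after every lower tier has passed) can be illustrated with a small sketch. The `Tier` dataclass and the `next_tier_to_run` helper are hypothetical names, not an existing module in this repo; the time budgets are the upper bounds of the ranges stated in this ADR.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Tier:
    level: int
    name: str
    max_seconds: int  # upper bound of the time budget stated in this ADR


# Budgets copied from the tier definitions; the structure itself is an assumption.
TIERS = [
    Tier(0, "pre-commit", 5),
    Tier(1, "local build", 90),
    Tier(2, "local sandbox", 300),
    Tier(3, "remote CI", 1200),
]


def next_tier_to_run(passed: Set[int]) -> Optional[Tier]:
    """Return the lowest tier that has not passed yet, or None if all passed.

    A higher tier marked as passed does not count while a lower one is missing.
    """
    for tier in TIERS:
        if tier.level not in passed:
            return tier
    return None
```

For example, `next_tier_to_run({0, 2})` returns Tier 1: a Tier 2 pass is not accepted as a substitute for the local build.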
+ +### Enforcement model + +| Event | Required tier | Enforcement | +|-------|--------------|-------------| +| `git commit` | Tier 0 | Pre-commit hook (prek) | +| `git push` | Tier 1 | Pre-push hook | +| PR created/updated | Tier 3 | GitHub Actions required status checks | +| Agent self-validation (before PR) | Tier 1 + Tier 2 (when available) | Skill-driven (agent invokes `validate-locally`) | +| Merge | Tier 3 passed + reviewer approved | Branch protection | + +### Agent interaction model + +Agents interact with validation tiers through skills (depends on ADR-012 for the skill model): + +``` +Agent completes implementation + → invokes `validate-locally` skill + → skill runs Tier 1 (`mise run check:full`) + → if Tier 2 available: runs Tier 2 (`mise run test:integration`) + → reports: PASS (safe to push) / FAIL (fix before push, here's why) + → agent fixes failures locally (fast loop) + → pushes only when local validation passes + → Tier 3 runs remotely (confirmatory, not exploratory) +``` + +The critical shift: **Tier 3 becomes confirmatory, not exploratory.** Agents should not discover failures in remote CI — they should confirm that locally-validated work passes the authoritative gate. 
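The `validate-locally` flow above can be sketched as a small function. This is a minimal illustration, not the skill's implementation: the function name, the `run` callback, and the `tier2_available` flag are assumptions, and `mise run check:full` and `mise run test:integration` are the task names proposed in this ADR, which may not exist yet.

```python
import subprocess
from typing import Callable, List, Tuple

# Task names proposed in this ADR; they may not be defined in mise.toml yet.
TIER1_CMD = ["mise", "run", "check:full"]
TIER2_CMD = ["mise", "run", "test:integration"]


def run_command(cmd: List[str]) -> bool:
    """Run a command and report whether it exited with status 0.

    Raises FileNotFoundError if the executable is not installed.
    """
    return subprocess.run(cmd).returncode == 0


def validate_locally(
    tier2_available: bool,
    run: Callable[[List[str]], bool] = run_command,
) -> Tuple[bool, str]:
    """Run Tier 1, then Tier 2 when available; return (ok, report)."""
    if not run(TIER1_CMD):
        return False, "FAIL: Tier 1 failed, fix before push"
    if not tier2_available:
        return True, "PASS: Tier 1 only, Tier 2 was not run"
    if not run(TIER2_CMD):
        return False, "FAIL: Tier 2 failed, fix before push"
    return True, "PASS: Tier 1 and Tier 2 passed, safe to push"
```

Passing `run` as a parameter keeps the sketch testable without invoking `mise`. While Tier 2 is still being built out, the second failure branch would plausibly return a warning rather than a hard failure.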
+ +### Investment priority + +The gap analysis dictates priority: + +| Priority | Investment | Impact | +|----------|-----------|--------| +| P0 | Enforce Tier 1 as pre-push gate | Eliminates "pushed without building" class of CI failures | +| P1 | `mise run test:integration` (Tier 2a — LocalStack) | Eliminates 60%+ of CI-only failures (AWS API contract mismatches) | +| P2 | Agent smoke test (Tier 2b) | Catches agent runtime regressions before PR | +| P3 | Ephemeral stack deploy (Tier 2c) | Catches IAM/wiring issues that only surface in real deployment | +| P4 | Full local sandbox (Tier 2d) | Production-equivalent local validation (long-term target) | + +### Design constraints + +- **Tier 2 must not require cloud credentials for basic operation** — agents running in isolation (MicroVM, CI runner) need to validate without AWS access. LocalStack/moto fills this. +- **Tier 2 must be optional until stable** — a failing Tier 2 should warn, not block, during build-out. Once stable, it becomes a gate. +- **Tier 2 must be cacheable** — container images, LocalStack state, and fixture repos should be cached between runs. An agent shouldn't rebuild the world every time. +- **No tier should duplicate work from a lower tier** — if Tier 0 checks formatting, Tier 1 does not re-check it. If Tier 1 runs unit tests, Tier 3 does not re-run them (it may run *additional* tests but not the same ones). + +### Escape hatches + +| Situation | Allowed bypass | +|-----------|---------------| +| Hotfix with production down | Skip Tier 2, expedite Tier 3 review | +| Documentation-only change | Tier 0 + Tier 1 (synth not needed) | +| Dependency bump (Dependabot) | Tier 0 + Tier 3 (CI validates compatibility) | +| Agent cannot run Tier 2 (tooling unavailable) | Push with Tier 1 only, note in PR that Tier 2 was skipped | + +Escape hatches must be explicit (noted in PR description, not silent). 
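The escape-hatch table above, together with the rule that bypasses must be explicit, can be expressed as a small lookup. This is a sketch under assumptions: the situation keys, the `required_tiers` helper, and the `Bypass: <situation>` line in the PR description are hypothetical conventions, and the tier sets are one reading of the table rather than an authoritative mapping.

```python
from typing import Dict, FrozenSet

ALL_TIERS: FrozenSet[int] = frozenset({0, 1, 2, 3})

# One reading of the escape-hatch table; the keys are hypothetical identifiers.
ESCAPE_HATCHES: Dict[str, FrozenSet[int]] = {
    "hotfix": frozenset({0, 1, 3}),             # skip Tier 2
    "docs_only": frozenset({0, 1}),             # Tier 0 + Tier 1
    "dependency_bump": frozenset({0, 3}),       # Tier 0 + Tier 3
    "tier2_unavailable": frozenset({0, 1, 3}),  # Tier 2 skipped, Tier 3 still gates merge
}


def required_tiers(situation: str, pr_description: str) -> FrozenSet[int]:
    """Return the tiers that must pass for a change.

    A bypass applies only when the PR description declares it explicitly
    with a line of the form "Bypass: <situation>" (a hypothetical convention).
    """
    if situation not in ESCAPE_HATCHES:
        return ALL_TIERS
    declared = any(
        line.strip() == f"Bypass: {situation}"
        for line in pr_description.splitlines()
    )
    return ESCAPE_HATCHES[situation] if declared else ALL_TIERS
```

An undeclared bypass falls back to the full set of tiers, which mirrors the requirement that escape hatches are never silent.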
+ +## Consequences + +- (+) Agent feedback loops drop from 15 minutes to 30–90 seconds for most issues +- (+) Remote CI failure rate drops — issues caught locally before push +- (+) Agents can self-validate autonomously without waiting for external systems +- (+) Investment is progressive — each tier delivers value independently +- (+) Clear ownership: Tier 0–2 are developer/agent responsibility; Tier 3 is platform responsibility +- (+) Cost reduction — fewer CI minutes wasted on obviously broken pushes +- (-) Tier 2 infrastructure requires maintenance (LocalStack config, container images, fixtures) +- (-) Local machine requirements increase (Docker, disk space for containers) +- (-) Tier 2 may diverge from real AWS behavior — LocalStack is not 100% faithful +- (-) Pre-push gate adds 30–90s to every push (mitigation: `mise run check:fast` for safe paths) +- (!) LocalStack fidelity gaps must be documented — when Tier 2 passes but Tier 3 fails, document the divergence and add it to Tier 2's scope +- (!) The Tier 2 "optional until stable" phase must have defined graduation criteria, or it stays optional forever + +## References + +- Issue #149 — implementation tracking for this ADR +- ADR-002 — bootstrap policies (Tier 2c validates IAM preflight locally) +- ADR-008 — definition of done (tier requirements per DoD level) +- ADR-012 (prerequisite) — operational knowledge stack; this ADR depends on 012's skill model for agent interaction with validation tiers +- Current hooks: `.pre-commit-config.yaml` (Tier 0 implementation) +- Current build: `mise.toml` root + package-level configs (Tier 1 implementation) +- LocalStack: https://localstack.cloud (candidate for Tier 2a) +- Firecracker MicroVMs: https://firecracker-microvm.github.io (candidate for Tier 2d)