End-to-end install integration test: local + Actions modes, Copilot CLI as the agent, targets a scratch repo

## Summary

Build an end-to-end integration test that exercises the `install.md` Quick Start flow against a real GitHub repository, verifies everything is wired up correctly, and cleans up after itself. The test uses **Copilot CLI** as the agent that reads and follows the install instructions, so regressions in `install.md` or in the compile/init flow show up as test failures rather than silent rot.

Two modes:

1. **Local mode** — a shell script a maintainer runs on their workstation. Primary use.
2. **Actions mode** — a `workflow_dispatch` trigger in this repo that runs the same flow on a GitHub-hosted runner and reports pass/fail. **Not part of CI** — manually invoked only.

## Target repo

Runs against **`mrjf/autoloop-test`** (a private, long-lived repo provisioned with a base state on `main`). The base state contains enough content for autoloop to have a plausible program target:

- `README.md` — explains the repo's purpose as the install-integration target.
- `src/minimize.py` — a naive minimizer for a well-known optimization function (Rastrigin); the thing autoloop iterates on.
- `src/evaluate.py` — runs the minimizer, emits `{"metric": <float>}` on stdout.
- `tests/test_minimize.py` — correctness pin for the minimizer.

Every test run starts by **hard-resetting `mrjf/autoloop-test` to the base state** (the SHA of `main` captured at startup), runs the install and the verification, then hard-resets again so the next run starts from a known-clean state. The repo is long-lived and only its state is transient, so no repo is created or deleted per run.

## What the test verifies

### Phase 1 — install (required)

The agent follows `install.md` and produces:

- `gh aw` extension installed on the runner.
- `gh aw init` completed (`.gitattributes`, dispatcher, copilot-setup-steps).
- Autoloop workflow files copied: `.github/workflows/autoloop.md`, `.github/workflows/sync-branches.md` (or, when #52 lands, only `autoloop.md`), `.github/workflows/shared/`.
- Lock files generated: `.github/workflows/autoloop.lock.yml` (and `sync-branches.lock.yml` if still present).
- Issue template present: `.github/ISSUE_TEMPLATE/autoloop-program.md`.
- Programs directory present: `.autoloop/programs/`.
- Install PR opened against `main`.
- Compilation is **idempotent**: re-running `gh aw compile autoloop` does not change the lock file (sha256 before == sha256 after).
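
The file-presence and idempotency assertions in this list could be sketched roughly as below. The function names are hypothetical (the contents of `verify-phase1.sh` are not specified here), and the sketch assumes either `sha256sum` or `shasum` is available on the machine.

```bash
#!/usr/bin/env bash
# Sketch of Phase 1 assertions. All function names are hypothetical.

# Print the SHA-256 of a file, using whichever tool is available.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Fail if any of the given paths is missing; report every missing path.
assert_paths_exist() {
  local missing=0 path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "MISSING: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Run a command and fail if it changes the given file.
# Usage: assert_idempotent <file> <command> [args...]
assert_idempotent() {
  local file=$1 before after
  shift
  [ -f "$file" ] || { echo "MISSING: $file" >&2; return 1; }
  before=$(file_sha256 "$file")
  "$@" || return 1
  after=$(file_sha256 "$file")
  if [ "$before" != "$after" ]; then
    echo "NOT IDEMPOTENT: $file changed after: $*" >&2
    return 1
  fi
}
```

In the driver this might be called as `assert_idempotent .github/workflows/autoloop.lock.yml gh aw compile autoloop`, run from the root of the cloned target repo.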

### Phase 2 — program creation (required)

After the install PR is merged:

- Agent creates an autoloop program from the issue template, either as a file-based program in `.autoloop/programs/<name>/program.md` or as an issue-based program carrying the `autoloop-program` label.
- The program targets `src/minimize.py` with `src/evaluate.py` as the evaluation script.
- The program is discovered by the scheduler on a manually-triggered workflow run (`gh workflow run autoloop.lock.yml -f program=<name>`).
- The first iteration runs to completion (accepted OR rejected OR errored — all three are valid "it ran" outcomes).
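
Waiting for the triggered run and classifying its outcome could look something like the sketch below. `poll_until` and `iteration_ran` are hypothetical helper names, and the outcome strings `accepted`, `rejected`, and `errored` are assumed spellings taken from the bullet above; the real state-file format is not specified here.

```bash
#!/usr/bin/env bash
# Sketch: poll a command until it succeeds or a deadline passes.
# Usage: poll_until <timeout_seconds> <interval_seconds> <command> [args...]
poll_until() {
  local timeout=$1 interval=$2 deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "TIMEOUT after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Phase 2 counts accepted, rejected, and errored as "the iteration ran".
# The outcome strings are assumptions, not a documented format.
iteration_ran() {
  case "$1" in
    accepted|rejected|errored) return 0 ;;
    *) return 1 ;;
  esac
}
```

In the driver, the polled command could be a small function that checks whether `gh run view "$run_id" --json status --jq .status` prints `completed`. `gh run watch` is an alternative that blocks until the run finishes.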

### Phase 3 — teardown (required)

- Hard-reset `main` on the test repo to a saved base-state SHA.
- Force-push.
- Delete any `autoloop/*` branches created during the test.
- Close any issues and PRs created during the test.
- Delete the `memory/autoloop` branch if created.
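
The branch cleanup could be sketched as a filter plus a delete loop, as below. The function names are hypothetical, and the delete step assumes an authenticated `gh` with write access to the target repo, so only the filter is safe to run offline.

```bash
#!/usr/bin/env bash
# Sketch: pick the branches teardown should delete.
# Reads one branch name per line on stdin; prints the test-created ones.
# `main` never matches, so the base branch is never deleted.
select_test_branches() {
  local branch
  while IFS= read -r branch; do
    case "$branch" in
      autoloop/*|memory/autoloop) printf '%s\n' "$branch" ;;
    esac
  done
}

# Delete each selected branch through the REST "delete a reference" endpoint.
# Not executed here: it needs network access and an authenticated gh.
# Usage: delete_test_branches <owner/repo>
delete_test_branches() {
  local repo=$1 branch
  gh api "repos/$repo/branches" --paginate --jq '.[].name' \
    | select_test_branches \
    | while IFS= read -r branch; do
        gh api -X DELETE "repos/$repo/git/refs/heads/$branch"
      done
}
```

Keeping the filter separate from the delete call makes the selection rule easy to check without touching the real repo.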

## Local mode

Script lives at `tests/install-integration/run.sh`. Usage:

```bash
# from the autoloop repo root:
./tests/install-integration/run.sh
```

Requirements:

- `gh` authenticated as a user with write access to `mrjf/autoloop-test`.
- `copilot` CLI on PATH.
- `python3` available.
- `INSTALL_TEST_REPO` env var (optional; defaults to `mrjf/autoloop-test`) to point the test at a different target.
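
A pre-flight check for these requirements might look like the following sketch. `require_commands` is a hypothetical helper name; the default repo value comes from the list above.

```bash
#!/usr/bin/env bash
# Sketch of the pre-flight check. The function name is hypothetical.

# Default the target repo, allowing an override from the environment.
INSTALL_TEST_REPO="${INSTALL_TEST_REPO:-mrjf/autoloop-test}"

# Fail if any required command is missing from PATH; list all that are missing.
require_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "PRE-FLIGHT: '$cmd' not found on PATH" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The driver could then start with `require_commands gh copilot python3 && gh auth status`, since `gh auth status` exits non-zero when no account is logged in.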

Behavior:

```
1. Pre-flight: verify gh auth, copilot on PATH, python3 on PATH.
2. Capture the current base-state SHA of the target repo's main branch.
3. Reset target repo to base state (no-op on first run; discards prior test debris on subsequent runs).
4. Clone target locally to a temp dir.
5. Feed install.md to copilot CLI with a tight prompt ("follow these steps exactly, do not improvise, report the install PR URL").
6. Phase 1 verification (file presence, lock idempotency, PR exists).
7. Merge the install PR via `gh pr merge --auto --squash`, wait for merge to land.
8. Create a program via the issue template (or file-based — prefer file-based for determinism).
9. Trigger the autoloop workflow: `gh workflow run autoloop.lock.yml -f program=<name>`. Poll until completion.
10. Phase 2 verification (program discovered, iteration ran to completion, state file written to the memory/autoloop branch).
11. Teardown: reset to base state, close test issues/PRs, delete autoloop branches.
12. Report PASS/FAIL with a summary.
```

Exit non-zero on any failed assertion. `trap 'teardown' EXIT` ensures cleanup even on abort, unless the maintainer has asked to keep the failure state (see Acceptance).
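
One way the exit handler could combine cleanup with the keep-state option is sketched below. `should_teardown` and `on_exit` are hypothetical names, `teardown` stands in for whatever wraps `teardown.sh`, and the sketch assumes `KEEP_STATE_ON_FAILURE` carries the string `true` or `false`.

```bash
#!/usr/bin/env bash
# Sketch: decide at exit whether teardown should run.
# Usage: should_teardown <exit_status> <keep_state_on_failure>
should_teardown() {
  local status=$1 keep=${2:-false}
  # Keep the failure state only when the run failed AND the caller asked for it.
  if [ "$status" -ne 0 ] && [ "$keep" = "true" ]; then
    return 1
  fi
  return 0
}

# EXIT handler: remember the script's exit status, clean up if allowed,
# then exit with the original status so a failed run still reports failure.
on_exit() {
  local status=$?
  if should_teardown "$status" "${KEEP_STATE_ON_FAILURE:-false}"; then
    teardown   # hypothetical function, e.g. a wrapper around teardown.sh
  else
    echo "Keeping failure state for inspection" >&2
  fi
  exit "$status"
}

# In the driver, after defining teardown:
#   trap on_exit EXIT
```

A passing run always tears down, so the keep option cannot leave the target repo dirty after success.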

## Actions mode

A workflow file at `.github/workflows/install-integration-test.yml` with `workflow_dispatch`:

```yaml
name: Install Integration Test
on:
  workflow_dispatch:
    inputs:
      keep_state_on_failure:
        description: "Leave test repo in failure state for inspection"
        type: boolean
        default: false

jobs:
  install-integration:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Install gh aw extension
        run: gh extension install github/gh-aw
        env:
          GH_TOKEN: ${{ secrets.INSTALL_TEST_TOKEN }}
      - name: Install Copilot CLI
        run: |
          # whatever the current install path is for copilot CLI
      - name: Run integration test
        run: ./tests/install-integration/run.sh
        env:
          GH_TOKEN: ${{ secrets.INSTALL_TEST_TOKEN }}
          INSTALL_TEST_REPO: mrjf/autoloop-test
          KEEP_STATE_ON_FAILURE: ${{ inputs.keep_state_on_failure }}
```

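Requires a repo secret `INSTALL_TEST_TOKEN`: a PAT with write access to `mrjf/autoloop-test` (`repo` scope; a classic PAT also needs the `workflow` scope, because the install pushes files under `.github/workflows/`). This token is how Actions mode authenticates to the target repo, since the default `GITHUB_TOKEN` is scoped to the repository the workflow runs in and cannot access other repos.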

Actions mode is **not on any schedule and not on PRs**. It exists so maintainers can click "Run workflow" from the web UI and get a full end-to-end pass/fail in ~15–25 minutes without having to set up `copilot` locally.

## Files to ship in this repo

```
tests/install-integration/
├── run.sh                          # the driver (local + Actions)
├── prompt.md                       # the Copilot CLI prompt (external for easy editing)
├── verify-phase1.sh                # file-presence + idempotency assertions
├── verify-phase2.sh                # program discovery + first iteration assertions
└── teardown.sh                     # reset-to-base + close issues + delete branches

.github/workflows/
└── install-integration-test.yml    # workflow_dispatch wrapper around run.sh
```

Keep the script small and readable; the value is in the flow, not the test framework.

## Copilot CLI prompt sketch

Separate file (`tests/install-integration/prompt.md`) so it's editable without touching the driver. Something like:

```
You are installing autoloop into a freshly-reset GitHub repository.

Your working directory is the root of that repository, cloned locally. The
repository contains only the base fixtures (`README.md`, `.gitignore`, `src/`, and `tests/`).

Follow the install instructions at the URL below, EXACTLY AS WRITTEN. Execute
each step using shell commands. Do not skip steps. Do not improvise. Do not
optimize or "improve" the instructions.

When you finish: print a single line `INSTALL_PR=<url>` with the URL of the
PR you opened in step 5. Then stop.

Install instructions: https://github.com/githubnext/autoloop/blob/main/install.md
```

The driver captures stdout and greps for `INSTALL_PR=` to get the PR URL, which it verifies in Phase 1 and merges before Phase 2.
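
Extracting the URL from the captured output could be sketched as below. `extract_install_pr` is a hypothetical name, and the pattern assumes the agent prints a plain `https://github.com/<owner>/<repo>/pull/<number>` URL at the start of a line with no surrounding quotes or backticks.

```bash
#!/usr/bin/env bash
# Sketch: pull the PR URL out of the agent's captured stdout.
# Reads the transcript on stdin; prints the URL from the last INSTALL_PR= line.
extract_install_pr() {
  local url
  url=$(grep -E '^INSTALL_PR=https://github\.com/[^/]+/[^/]+/pull/[0-9]+[[:space:]]*$' \
        | tail -n 1 \
        | sed -e 's/^INSTALL_PR=//' -e 's/[[:space:]]*$//')
  if [ -z "$url" ]; then
    echo "No INSTALL_PR=<url> line found in agent output" >&2
    return 1
  fi
  printf '%s\n' "$url"
}
```

Failing loudly when the line is absent turns "the agent never opened a PR" into a clear assertion failure instead of an empty variable further down the script.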

## Base state on `mrjf/autoloop-test`

Initial content pushed to `main` (by a maintainer, one-time setup — **not part of the test**):

- `README.md` — documents the repo's purpose and warns that `main` is force-pushed by the integration test.
- `src/minimize.py` — naive Rastrigin minimizer; the optimization target.
- `src/evaluate.py` — runs `minimize.py`, emits `{"metric": <value>}` (lower is better).
- `tests/test_minimize.py` — pins correctness of the minimizer's signature and a smoke case.
- `.gitignore` — standard Python ignores.

The test script captures `origin/main`'s SHA at startup and resets to exactly that SHA at teardown. No assumption about what's in it beyond "it's a valid baseline." Future expansion (more fixtures, different language) only requires updating the base state on the test repo — the driver doesn't need changes.
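
The capture-then-reset behavior could be sketched with plain `git` as below. The function names are hypothetical, and the sketch assumes a local clone whose `origin` remote is the target repo and whose object store already contains the base commit (true for a clone made at startup).

```bash
#!/usr/bin/env bash
# Sketch: capture the base-state SHA and later force main back to it.
# Both functions operate on a local clone whose "origin" is the target repo.

# Print the SHA that origin's main branch currently points at.
# Usage: capture_base_sha <clone_dir>
capture_base_sha() {
  local clone=$1
  git -C "$clone" ls-remote origin refs/heads/main | cut -f1
}

# Force origin's main back to the saved SHA, discarding anything pushed since.
# Usage: reset_to_base <clone_dir> <base_sha>
reset_to_base() {
  local clone=$1 base_sha=$2
  git -C "$clone" push --quiet --force origin "$base_sha:refs/heads/main"
}
```

Because the reset pushes a SHA rather than a branch name, it does not depend on what the local `main` points at when teardown runs.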

## Failure modes the test should catch

1. `install.md` instructions silently rot (a step references a file or command that no longer exists).
2. `gh aw compile` becomes non-idempotent (re-running changes the lock file).
3. A new file is added to autoloop's `workflows/` that `install.md` doesn't mention copying.
4. Copilot CLI changes in a way that breaks "follow a numbered list of shell commands."
5. The issue template's front-matter drifts from what the scheduler parses.
6. The first iteration of a freshly-created program cannot complete (missing permissions, missing secrets, workflow compile error).

None of these are caught by any other existing test.

## Out of scope

- CI scheduling. Integration tests with real repos + LLM calls don't belong on every PR. Manual-dispatch only.
- Testing Copilot CLI itself. We treat Copilot as a black box — if it can't follow clear instructions, the test fails and we investigate separately.
- Repo creation per run. Using a long-lived target repo with reset-to-base semantics is cheaper and avoids accumulating abandoned repos across runs.

## Acceptance

- `tests/install-integration/run.sh` exists, passes locally when run against `mrjf/autoloop-test`.
- `.github/workflows/install-integration-test.yml` runs the same script in Actions and reports PASS when triggered.
- `mrjf/autoloop-test` is in the expected base state after a passing run.
- If the test fails, a maintainer can pass `--keep` (local) or set the input `keep_state_on_failure=true` (Actions) to inspect the failure state before teardown runs.

## Related

- `install.md` — the document under test. Every regression in install caught here is a regression caught before it hits consumers.
- #46 — versioning. When `install.md` starts copying a `VERSION` file (per #46), the test adds a Phase 1 assertion that `.github/autoloop-version` exists in the install PR's diff.
- #52 — remove sync-branches. When that lands, the test drops its `sync-branches.lock.yml` assertions.
- #54 — install CI-readiness guidance. The test validates that a fresh install produces a state where that guidance's acceptance criteria hold ("no-op PR passes CI").

