End-to-end install integration test: local + Actions modes, Copilot CLI as the agent, targets a scratch repo #55

@mrjf

Summary

Build an end-to-end integration test that exercises the install.md Quick Start flow against a real GitHub repository, verifies everything is wired up correctly, and cleans up after itself. The test uses Copilot CLI as the agent that reads and follows the install instructions, so regressions in install.md or in the compile/init flow show up as test failures rather than silent rot.

Two modes:

  1. Local mode — a shell script a maintainer runs on their workstation. Primary use.
  2. Actions mode — a workflow_dispatch trigger in this repo that runs the same flow on a GitHub-hosted runner and reports pass/fail. Not part of CI — manually invoked only.

Target repo

Runs against mrjf/autoloop-test (private, already provisioned; empty apart from a base state on main). The base state will contain enough content for autoloop to have a plausible program target:

  • README.md — explains the repo's purpose as the install-integration target.
  • src/minimize.py — a naive minimizer for a well-known optimization function (Rastrigin); the thing autoloop iterates on.
  • src/evaluate.py — runs the minimizer, emits {"metric": <float>} on stdout.
  • tests/test_minimize.py — correctness pin for the minimizer.

Every test run starts by hard-resetting mrjf/autoloop-test to origin/main (the base state), then runs the install, verifies, then hard-resets again so the next run starts from a known-clean state. No repo creation/deletion per run — the repo is long-lived; the state is transient.

What the test verifies

Phase 1 — install (required)

The agent follows install.md and produces:

  • gh aw extension installed on the runner.
  • gh aw init completed (.gitattributes, dispatcher, copilot-setup-steps).
  • Autoloop workflow files copied: .github/workflows/autoloop.md, .github/workflows/sync-branches.md, and .github/workflows/shared/. Once #52 (Remove sync-branches workflow, made redundant by the per-iteration Step 3 ahead/behind logic) lands, sync-branches.md drops out of this list.
  • Lock files generated: .github/workflows/autoloop.lock.yml (and sync-branches.lock.yml if still present).
  • Issue template present: .github/ISSUE_TEMPLATE/autoloop-program.md.
  • Programs directory present: .autoloop/programs/.
  • Install PR opened against main.
  • Lock file is idempotent: re-running gh aw compile autoloop does not change the file (sha256 before == sha256 after).
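
The idempotency check in the last bullet could be sketched roughly as follows. The helper names `sha256_of` and `assert_idempotent` are hypothetical, not part of any existing script; the only assumption about the environment is that either `sha256sum` or `shasum` is on PATH.

```shell
# Print the SHA-256 of a file, using whichever hashing tool is available.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# assert_idempotent FILE CMD [ARGS...]
# Runs CMD and fails if FILE's hash is different afterwards.
assert_idempotent() {
  local file="$1" before after
  shift
  before=$(sha256_of "$file") || return 1
  "$@" >/dev/null || return 1
  after=$(sha256_of "$file") || return 1
  if [ "$before" != "$after" ]; then
    echo "FAIL: $file changed after running: $*" >&2
    return 1
  fi
  echo "OK: $file unchanged"
}
```

In verify-phase1.sh this might be called as `assert_idempotent .github/workflows/autoloop.lock.yml gh aw compile autoloop`.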

Phase 2 — program creation (required)

After the install PR is merged:

  • Agent creates an autoloop program, either file-based (.autoloop/programs/<name>/program.md) or issue-based (an issue opened from the issue template, carrying the autoloop-program label).
  • The program targets src/minimize.py with src/evaluate.py as the evaluation script.
  • The program is discovered by the scheduler on a manually-triggered workflow run (gh workflow run autoloop.lock.yml -f program=<name>).
  • The first iteration runs to completion (accepted OR rejected OR errored — all three are valid "it ran" outcomes).

Phase 3 — teardown (required)

  • Hard-reset main on the test repo to a saved base-state SHA.
  • Force-push.
  • Delete any autoloop/* branches created during the test.
  • Close any issues and PRs created during the test.
  • Delete the memory/autoloop branch if created.
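
The branch-deletion part of teardown can be done with plain git, as in this minimal sketch. `delete_remote_branches` is a hypothetical helper name; closing issues and PRs would need `gh` and is not covered here.

```shell
# delete_remote_branches REMOTE REF_PATTERN
# Deletes every branch on REMOTE whose full ref name matches REF_PATTERN,
# e.g. 'refs/heads/autoloop/*'. Does nothing if no branch matches.
delete_remote_branches() {
  local remote="$1" pattern="$2" ref
  git ls-remote --heads "$remote" "$pattern" | cut -f2 | while read -r ref; do
    git push --quiet "$remote" --delete "$ref"
  done
}
```

From inside the temporary clone, teardown might call `delete_remote_branches origin 'refs/heads/autoloop/*'` and then `delete_remote_branches origin 'refs/heads/memory/autoloop'`. Using the full ref name in the pattern keeps the first call from touching memory/autoloop.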

Local mode

Script lives at tests/install-integration/run.sh. Usage:

# from the autoloop repo root:
./tests/install-integration/run.sh

Requirements:

  • gh authenticated as a user with write access to mrjf/autoloop-test.
  • copilot CLI on PATH.
  • python3 available.
  • INSTALL_TEST_REPO env var (default mrjf/autoloop-test) in case someone wants to point at a different target.
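
A pre-flight check for these requirements might look like the sketch below. The function names `missing_commands` and `preflight` are hypothetical; `gh auth status` is used only for its exit code.

```shell
# missing_commands CMD...
# Reports each CMD that is not on PATH; returns non-zero if any is missing.
missing_commands() {
  local cmd rc=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      rc=1
    fi
  done
  return "$rc"
}

preflight() {
  missing_commands gh copilot python3 || return 1
  # `gh auth status` exits non-zero when no valid login is configured.
  gh auth status >/dev/null 2>&1 || { echo "gh is not authenticated" >&2; return 1; }
  # Fall back to the default target when the variable is unset or empty.
  : "${INSTALL_TEST_REPO:=mrjf/autoloop-test}"
}
```

Reporting every missing tool in one pass, rather than stopping at the first, saves a maintainer from fixing them one run at a time.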

Behavior:

1. Pre-flight: verify gh auth, copilot on PATH, python3 on PATH.
2. Capture the current base-state SHA of the target repo's main branch.
3. Reset target repo to base state (no-op on first run; discards prior test debris on subsequent runs).
4. Clone target locally to a temp dir.
5. Feed install.md to copilot CLI with a tight prompt ("follow these steps exactly, do not improvise, report the install PR URL").
6. Phase 1 verification (file presence, lock idempotency, PR exists).
7. Merge the install PR via `gh pr merge --auto --squash`, wait for merge to land.
8. Create a program via the issue template (or file-based — prefer file-based for determinism).
9. Trigger the autoloop workflow: `gh workflow run autoloop.lock.yml -f program=<name>`. Poll until completion.
10. Phase 2 verification (program discovered, iteration ran to completion, state file written to memory/autoloop).
11. Teardown: reset to base state, close test issues/PRs, delete autoloop branches.
12. Report PASS/FAIL with a summary.
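
Steps 7 and 9 both wait on GitHub, so the driver will probably want one small polling helper. The sketch below is an assumption about how that could look: `wait_until` and `pr_merged` are hypothetical names, and the timeout is approximate because it counts sleep intervals rather than wall-clock time.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds; returns 1 once the timeout is exceeded.
wait_until() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  until "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -gt "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example predicate for step 7: succeeds once the install PR is merged.
# Assumes INSTALL_PR holds the PR URL reported by the agent.
pr_merged() {
  [ "$(gh pr view "$INSTALL_PR" --json state --jq .state)" = "MERGED" ]
}
```

Step 7 could then be `wait_until 600 10 pr_merged`. For step 9, `gh run watch <run-id> --exit-status` may be enough on its own once the run ID is known.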

Exit non-zero on any failed assertion. trap 'teardown' EXIT ensures cleanup even on abort.
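
One way to combine the EXIT trap with the keep-state option is sketched below. `on_exit` is a hypothetical handler name, and `teardown` here is only a stand-in for the real reset logic.

```shell
# Stand-in for the real teardown (reset repo, close issues/PRs, delete branches).
teardown() { echo "teardown: resetting target repo"; }

# EXIT handler: preserves the script's exit code and honours the keep flag.
on_exit() {
  local rc=$?
  if [ "$rc" -ne 0 ] && [ "${KEEP_STATE_ON_FAILURE:-false}" = "true" ]; then
    echo "keeping failure state for inspection (exit $rc)" >&2
  else
    teardown
  fi
  exit "$rc"
}

# In run.sh the handler would be installed once, near the top:
#   trap on_exit EXIT
```

Capturing `$?` on the first line of the handler matters: any later command would overwrite it, and the script would otherwise exit with teardown's status instead of the test's.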

Actions mode

A workflow file at .github/workflows/install-integration-test.yml with workflow_dispatch:

name: Install Integration Test
on:
  workflow_dispatch:
    inputs:
      keep_state_on_failure:
        description: "Leave test repo in failure state for inspection"
        type: boolean
        default: false

jobs:
  install-integration:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Install gh aw extension
        run: gh extension install github/gh-aw
        env:
          GH_TOKEN: ${{ secrets.INSTALL_TEST_TOKEN }}
      - name: Install Copilot CLI
        run: |
          # whatever the current install path is for copilot CLI
      - name: Run integration test
        run: ./tests/install-integration/run.sh
        env:
          GH_TOKEN: ${{ secrets.INSTALL_TEST_TOKEN }}
          INSTALL_TEST_REPO: mrjf/autoloop-test
          KEEP_STATE_ON_FAILURE: ${{ inputs.keep_state_on_failure }}

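Requires a repo secret INSTALL_TEST_TOKEN: a PAT with repo scope that can write to mrjf/autoloop-test. Actions mode uses this token to authenticate to the target repo, because the default GITHUB_TOKEN is scoped to the repository running the workflow and cannot access other repositories. Because the install pushes files under .github/workflows/, a classic PAT will likely also need the workflow scope.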

Actions mode is not on any schedule and not on PRs. It exists so maintainers can click "Run workflow" from the web UI and get a full end-to-end pass/fail in ~15–25 minutes without having to set up copilot locally.

Files to ship in this repo

tests/install-integration/
├── run.sh                          # the driver (local + Actions)
├── prompt.md                       # the Copilot CLI prompt (external for easy editing)
├── verify-phase1.sh                # file-presence + idempotency assertions
├── verify-phase2.sh                # program discovery + first iteration assertions
└── teardown.sh                     # reset-to-base + close issues + delete branches

.github/workflows/
└── install-integration-test.yml    # workflow_dispatch wrapper around run.sh

Keep the script small and readable; the value is in the flow, not the test framework.

Copilot CLI prompt sketch

Separate file (tests/install-integration/prompt.md) so it's editable without touching the driver. Something like:

You are installing autoloop into a freshly-reset GitHub repository.

Your working directory is the root of that repository, cloned locally. The
repository is empty except for the base fixtures in `src/` and `tests/`.

Follow the install instructions at the URL below, EXACTLY AS WRITTEN. Execute
each step using shell commands. Do not skip steps. Do not improvise. Do not
optimize or "improve" the instructions.

When you finish: print a single line `INSTALL_PR=<url>` with the URL of the
PR you opened in step 5. Then stop.

Install instructions: https://github.com/githubnext/autoloop/blob/main/install.md

The driver captures stdout and greps for INSTALL_PR= to get the URL of the install PR, which it needs to verify the PR in Phase 1 and to merge it before Phase 2.
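
Extracting the URL might look like the following sketch. `extract_install_pr` is a hypothetical helper, and the `https://github.com/<owner>/<repo>/pull/<n>` shape it matches is an assumption about what the agent prints; the prompt itself only specifies `INSTALL_PR=<url>`.

```shell
# extract_install_pr LOGFILE
# Prints the URL from the last INSTALL_PR=<url> line; fails if none is found.
extract_install_pr() {
  local url
  url=$(grep -oE '^INSTALL_PR=https://github\.com/[^/[:space:]]+/[^/[:space:]]+/pull/[0-9]+' "$1" \
        | tail -n 1 | sed 's/^INSTALL_PR=//') || true
  if [ -z "$url" ]; then
    echo "no INSTALL_PR= line found in agent output" >&2
    return 1
  fi
  printf '%s\n' "$url"
}
```

Matching a full PR URL instead of any text after `INSTALL_PR=` means a chatty or malformed agent response fails the test early instead of feeding garbage into `gh pr merge`.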

Base state on mrjf/autoloop-test

Initial content pushed to main (by a maintainer, one-time setup — not part of the test):

  • README.md — documents the repo's purpose and warns that main is force-pushed by the integration test.
  • src/minimize.py — naive Rastrigin minimizer; the optimization target.
  • src/evaluate.py — runs minimize.py, emits {"metric": <value>} (lower is better).
  • tests/test_minimize.py — pins correctness of the minimizer's signature and a smoke case.
  • .gitignore — standard Python ignores.

The test script captures origin/main's SHA at startup and resets to exactly that SHA at teardown. No assumption about what's in it beyond "it's a valid baseline." Future expansion (more fixtures, different language) only requires updating the base state on the test repo — the driver doesn't need changes.
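
Capturing and restoring the base SHA needs only two git commands, as in this minimal sketch. `remote_sha` and `reset_remote_branch` are hypothetical helper names, and the sketch assumes force-pushes to main are permitted on the target repo.

```shell
# remote_sha REMOTE BRANCH
# Prints the commit SHA that BRANCH currently points at on REMOTE.
remote_sha() {
  git ls-remote "$1" "refs/heads/$2" | cut -f1
}

# reset_remote_branch REMOTE BRANCH SHA
# Forces BRANCH on REMOTE back to SHA. The commit must exist in the local
# clone, which holds when the clone was made after the SHA was captured.
reset_remote_branch() {
  git push --quiet --force "$1" "$3:refs/heads/$2"
}
```

The driver would call `BASE_SHA=$(remote_sha origin main)` at startup and `reset_remote_branch origin main "$BASE_SHA"` during teardown.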

Failure modes the test should catch

  1. install.md instructions silently rot (a step references a file or command that no longer exists).
  2. gh aw compile becomes non-idempotent (re-running changes the lock file).
  3. A new file is added to autoloop's workflows/ that install.md doesn't mention copying.
  4. Copilot CLI changes in a way that breaks "follow a numbered list of shell commands."
  5. The issue template's front-matter drifts from what the scheduler parses.
  6. The first iteration of a freshly-created program cannot complete (missing permissions, missing secrets, workflow compile error).

No existing test catches any of these.

Out of scope

  • CI scheduling. Integration tests with real repos + LLM calls don't belong on every PR. Manual-dispatch only.
  • Testing Copilot CLI itself. We treat Copilot as a black box — if it can't follow clear instructions, the test fails and we investigate separately.
  • Repo creation per run. Using a long-lived target repo with reset-to-base semantics is cheaper and avoids accumulating abandoned repos across runs.

Acceptance

  • tests/install-integration/run.sh exists, passes locally when run against mrjf/autoloop-test.
  • .github/workflows/install-integration-test.yml runs the same script in Actions and reports PASS when triggered.
  • mrjf/autoloop-test is in the expected base state after a passing run.
  • If the test fails, a maintainer can pass --keep (local) or set the input keep_state_on_failure=true (Actions) to inspect the failure state before teardown runs.
