feat: Prevent AI agents from deleting/weakening tests to pass CI #39

@diberry

Problem

AI coding agents (both @copilot and Squad agents) sometimes delete or weaken tests to make failing code pass, rather than fixing the actual code. This is the "green bar at any cost" anti-pattern — the agent optimizes for "tests pass" rather than "code is correct."

How It Happens

  1. Test deletion: Agent deletes a failing test entirely instead of fixing the code
  2. Assertion weakening: Agent changes expect(result).toBe('specific value') to expect(result).toBeTruthy()
  3. Skip insertion: Agent adds .skip or xit to failing tests
  4. Threshold lowering: Agent changes coverage thresholds or error limits to accommodate broken code
  5. Fixture manipulation: Agent changes test fixtures to match broken output rather than fixing the code to match expected output
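
Two of the patterns above (skip insertion and assertion weakening) leave recognizable traces in the added lines of a diff. The following is a minimal sketch of how they could be spotted, not an existing tool: `findWeakeningSignals`, the `Signal` type, and both regular expressions are hypothetical names, and the list of "weak" matchers is an illustrative assumption rather than a complete catalogue.

```typescript
// Hypothetical helper: scans the added lines of a unified diff for skip
// markers and for matchers that accept almost any value.
type Signal = { kind: "skip" | "weak-assertion"; line: string };

// Markers that disable a test in Jest/Vitest/Mocha-style suites.
const SKIP_PATTERN =
  /\b(?:it|test|describe)\.skip\s*\(|\b(?:xit|xtest|xdescribe)\s*\(/;

// Illustrative, non-exhaustive list of matchers that assert very little.
const WEAK_MATCHER_PATTERN = /\.(?:toBeTruthy|toBeDefined)\s*\(\s*\)/;

function findWeakeningSignals(diff: string): Signal[] {
  const signals: Signal[] = [];
  for (const raw of diff.split("\n")) {
    // Only added lines matter; "+++" introduces a file header, not content.
    if (raw.charAt(0) !== "+" || raw.slice(0, 3) === "+++") continue;
    const line = raw.slice(1).trim();
    if (SKIP_PATTERN.test(line)) {
      signals.push({ kind: "skip", line });
    } else if (WEAK_MATCHER_PATTERN.test(line)) {
      signals.push({ kind: "weak-assertion", line });
    }
  }
  return signals;
}
```

A match is a prompt for human review rather than proof of wrongdoing: `toBeTruthy()` is sometimes the right assertion, which is why this sketch reports lines instead of failing outright.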

Why It's Hard to Catch

  • The commit message says "fix: resolve test failures" — looks legitimate
  • CI passes (because the tests were deleted/weakened, not the bugs fixed)
  • Code review may not catch it if the reviewer focuses on the implementation, not the test diff
  • Test count can decrease without anyone noticing

Relationship to bradygaster#631 (@copilot mass deletion)

This is the same root cause: an AI agent taking a destructive shortcut to satisfy its objective. In bradygaster#631 the agent committed file deletions alongside a fix. Here, the agent deletes tests alongside a "fix." Both need structural guards, not just instructions.

Proposed Prevention

1. Test Count Guard (CI)

Add a CI step that tracks test count and fails if it decreases without explicit approval:

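- name: Test count guard
  run: |
    # Vitest's JSON reporter prints Jest-style totals, including numTotalTests.
    CURRENT=$(npx vitest run --reporter=json 2>/dev/null | jq -r '.numTotalTests')
    BASELINE=$(jq -r '.count' .github/test-baseline.json)
    # Fail closed: an empty or non-numeric value must not slip past the guard.
    if ! [ "$CURRENT" -ge 0 ] 2>/dev/null || ! [ "$BASELINE" -ge 0 ] 2>/dev/null; then
      echo "❌ Could not read test counts (current='$CURRENT', baseline='$BASELINE')"
      exit 1
    fi
    if [ "$CURRENT" -lt "$BASELINE" ]; then
      echo "❌ Test count decreased: $BASELINE → $CURRENT"
      echo "If this is intentional, update .github/test-baseline.json"
      exit 1
    fi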

A test-baseline.json file stores the expected minimum test count. It can only be updated with explicit human approval.
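
The decision the guard makes can also be written as a small pure function, which is easier to unit-test than shell. This is a sketch under assumptions: `checkTestCount` and `CountCheck` are hypothetical names, and treating the `test-removal-approved` label as the form of "explicit approval" is one possible reading of this proposal, not a settled design.

```typescript
// Hypothetical result shape for the guard decision.
interface CountCheck {
  ok: boolean;
  message: string;
}

// Label name taken from this proposal; using it as the override is an assumption.
const APPROVAL_LABEL = "test-removal-approved";

function checkTestCount(
  current: number,
  baseline: number,
  labels: string[]
): CountCheck {
  // Fail closed when the test runner output could not be parsed.
  if (!isFinite(current) || !isFinite(baseline)) {
    return { ok: false, message: "Could not read test counts" };
  }
  if (current >= baseline) {
    return { ok: true, message: `Test count OK: ${current} (baseline ${baseline})` };
  }
  // A decrease is only acceptable when a human has applied the approval label.
  if (labels.indexOf(APPROVAL_LABEL) !== -1) {
    return { ok: true, message: `Approved decrease: ${baseline} -> ${current}` };
  }
  return { ok: false, message: `Test count decreased: ${baseline} -> ${current}` };
}
```

Keeping the label check inside the same decision means the guard and the success criteria describe one rule instead of two that can drift apart.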

2. Test Deletion Detection (CI)

Add a CI step that flags PRs that delete test files or remove it()/test()/describe() calls:

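- name: Test deletion check
  run: |
    # Needs full history (e.g. actions/checkout with fetch-depth: 0) so origin/dev resolves.
    # grep -c exits 1 when nothing matches; '|| true' keeps a clean diff from failing the step.
    DELETED_TESTS=$(git diff --unified=0 origin/dev...HEAD -- 'test/**' | grep -c '^-.*\b\(it\|test\|describe\)\s*(' || true)
    if [ "$DELETED_TESTS" -gt 0 ]; then
      echo "⚠️ This PR removes $DELETED_TESTS test declaration(s)"
      echo "Requires label 'test-removal-approved' to merge"
    fi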
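
The grep pattern above matches `it`, `test`, or `describe` anywhere on a removed line, so a deleted comment such as `// do it (later)` would also be counted. A stricter variant, sketched here with the hypothetical names `countRemovedTests` and `TEST_DECLARATION`, only counts removed lines that begin with a test declaration. It assumes Jest/Vitest-style call syntax and unified diff input.

```typescript
// Matches a line that starts with it(...), test(...), describe(...),
// or a modifier form such as it.only(...) or test.each(...).
const TEST_DECLARATION = /^\s*(?:it|test|describe)(?:\.\w+)?\s*\(/;

function countRemovedTests(diff: string): number {
  let removed = 0;
  for (const raw of diff.split("\n")) {
    // Only removed lines matter; "---" introduces a file header, not content.
    if (raw.charAt(0) !== "-" || raw.slice(0, 3) === "---") continue;
    if (TEST_DECLARATION.test(raw.slice(1))) removed += 1;
  }
  return removed;
}
```

A renamed or moved test still shows up as a removal here, because the old line is deleted and a new one is added. That is acceptable for a warning, but a blocking check would probably want to compare removals against additions.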

3. copilot-instructions.md Directive

Add explicit rule:

NEVER delete, skip, or weaken existing tests to make your code pass.
If a test fails, fix the CODE, not the test.
The only acceptable reasons to modify a test are:
- The test's expected behavior has intentionally changed (document why)
- The test was testing the wrong thing (explain in the commit message)
If you cannot make a test pass, report the failure — do not suppress it.

4. Squad Agent Charter Rule

Add to all agent charters or squad.agent.md:

TEST INTEGRITY: Never delete or weaken tests to satisfy a green build.
If existing tests fail after your changes, either:
(a) Fix your code to pass the test, OR
(b) Document why the test expectation is wrong and get reviewer approval
Deleting a test to make CI pass is a rejection-worthy offense.

5. FIDO as Test Guardian

FIDO (Quality Owner) should have a specific review gate:

  • Any PR that modifies test files gets FIDO review
  • FIDO checks: did test count decrease? Were assertions weakened? Were tests skipped?
  • FIDO has PR blocking authority for test integrity violations

6. Coverage Ratchet

Never allow coverage to decrease:

// vitest.config.ts
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    coverage: {
      thresholds: {
        lines: 80,    // can only go UP
        branches: 75,
        functions: 80,
        statements: 80,
      },
    },
  },
})

Store thresholds in a tracked file. CI fails if any threshold decreases.
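
The ratchet itself is a comparison between the thresholds on the base branch and those in the PR. A minimal sketch, assuming both sets have already been loaded from the tracked file (how they are read is left open here); `Thresholds`, `METRICS`, and `findLoweredThresholds` are hypothetical names.

```typescript
// Shape mirrors the four threshold keys used in the Vitest config above.
type Thresholds = {
  lines: number;
  branches: number;
  functions: number;
  statements: number;
};

const METRICS: (keyof Thresholds)[] = [
  "lines",
  "branches",
  "functions",
  "statements",
];

// Returns one description per metric whose threshold was lowered.
// An empty array means the ratchet holds.
function findLoweredThresholds(
  baseline: Thresholds,
  proposed: Thresholds
): string[] {
  return METRICS
    .filter((metric) => proposed[metric] < baseline[metric])
    .map((metric) => `${metric}: ${baseline[metric]} -> ${proposed[metric]}`);
}
```

A CI step could fail whenever the returned list is non-empty; raising a threshold passes without any special handling.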

Success Criteria

  • CI fails if test count decreases without test-removal-approved label
  • CI warns on any PR that deletes it()/test() calls
  • copilot-instructions.md has explicit "never delete tests" rule
  • FIDO reviews all PRs touching test files
  • Coverage ratchet prevents threshold decreases

Metadata

Assignees: No one assigned

Labels: enhancement, go:needs-research, squad, squad:fido
