fix: filter AWF infrastructure lines from engine failure context#25314

Merged
pelikhan merged 3 commits into main from
copilot/fix-workflow-engine-termination
Apr 8, 2026

Conversation

Contributor

Copilot AI commented Apr 8, 2026

During the Apr 8 systemic outage, 13+ workflows failed with exit code 1 before producing any agent output. The buildEngineFailureContext fallback was showing AWF container lifecycle messages as "Last agent output" in failure issues — confusing and useless for diagnosis:

Container awf-squid  Removed
[SUCCESS] Containers stopped successfully
[WARN] Command completed with exit code: 1
Process exiting with code: 1

Changes

  • Infrastructure line filter: Uses the shared AWF_INFRA_LINE_RE constant (extracted into log_parser_shared.cjs) in buildEngineFailureContext. Infrastructure lines are stripped before the fallback tail is selected, so only actual engine output appears.

  • Shared constant: AWF_INFRA_LINE_RE is now defined once in log_parser_shared.cjs and imported by both parse_copilot_log.cjs and handle_agent_failure.cjs, eliminating duplication and preventing future drift.

  • Startup-failure detection: When the log contains only infrastructure lines — the engine exited before producing anything — a dedicated message is shown instead of the generic "terminated unexpectedly" + useless tail. The message is engine-aware: copilot engine failures reference the GitHub Copilot status page; all other engines (claude, codex, custom) reference a generic provider status page:

    ⚠️ Engine Failure: The copilot engine terminated before producing output.
    The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.

  • No change to issue creation logic: Failure issues are still created in all cases; only the diagnostic context surfaced in those issues is improved.

  • 5 new tests covering: infra-only log → startup-failure message; mixed log → infra lines excluded from tail; [entrypoint]/[health-check] prefix handling; engine ID label in startup-failure message; provider-agnostic status page guidance for non-copilot engines.
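The filtering and startup-failure detection described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual buildEngineFailureContext code: the helper name selectFallbackTail, its return shape, and the skipping of blank lines are assumptions. The pattern is the one quoted in the review thread of this PR, and the 10-line tail comes from the commit message. Note that the pattern expects a leading space before Container/Network/Volume, so the sample line carries one.

```javascript
// Pattern quoted in the review thread of this PR (there named INFRA_LINE_RE).
const AWF_INFRA_LINE_RE =
  /^\[(INFO|WARN|SUCCESS|ERROR|entrypoint|health-check)\]|^ (?:Container|Network|Volume) |^Process exiting with code:/;

// Hypothetical helper, not the real buildEngineFailureContext:
// drops blank and infrastructure lines, then keeps the last `maxLines`.
function selectFallbackTail(logText, maxLines = 10) {
  const engineLines = logText
    .split(/\r?\n/)
    .filter(line => line.trim() !== "" && !AWF_INFRA_LINE_RE.test(line));

  return {
    // True when nothing but infrastructure output was logged.
    startupFailure: engineLines.length === 0,
    tail: engineLines.slice(-maxLines),
  };
}

// Log content modelled on the outage example in this description.
const infraOnly = [
  " Container awf-squid  Removed",
  "[SUCCESS] Containers stopped successfully",
  "[WARN] Command completed with exit code: 1",
  "Process exiting with code: 1",
].join("\n");

console.log(selectFallbackTail(infraOnly).startupFailure);
console.log(selectFallbackTail("Error: request failed\n" + infraOnly).tail);
```

With an infra-only log the flag is set and the tail is empty; once any engine line is present, only that engine output is returned.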

)

When the Copilot/Claude CLI exits with code 1 before producing any
substantive output (as observed in the Apr 8 systemic outage), the
buildEngineFailureContext fallback previously showed AWF infrastructure
shutdown messages (Container awf-squid Removed, [WARN] Command completed
with exit code: 1, Process exiting with code: 1) as "Last agent output",
which was confusing and not useful for diagnosis.

Fix:
- Add INFRA_LINE_RE pattern (consistent with parse_copilot_log.cjs) to
  filter AWF infrastructure lines from the fallback tail
- When log contains only infrastructure lines → show dedicated "engine
  terminated before producing output / possible transient issue" message
- When actual agent output exists → show filtered last 10 lines only

Adds 4 new tests covering the new behavior.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/3f85c846-cf45-418e-9ff8-200607fb878f

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix multiple workflows failing with exit code 1" to "fix: filter AWF infrastructure lines from engine failure context" Apr 8, 2026
Copilot AI requested a review from pelikhan April 8, 2026 14:07
@pelikhan pelikhan marked this pull request as ready for review April 8, 2026 14:10
Copilot AI review requested due to automatic review settings April 8, 2026 14:10
Contributor

Copilot AI left a comment

Pull request overview

Improves diagnostic output for workflow engine failures by removing AWF (container/firewall wrapper) infrastructure noise from the buildEngineFailureContext fallback and emitting a dedicated “startup failure” message when the engine produces no real output.

Changes:

  • Filter AWF infrastructure lines from agent-stdio.log before selecting the fallback “tail” context.
  • Add a dedicated startup-failure message when logs contain only infrastructure lines.
  • Add new tests covering infra-only logs, mixed logs, [entrypoint]/[health-check] prefixes, and engine ID inclusion; add a patch changeset.

Summary per file:

File | Description
actions/setup/js/handle_agent_failure.cjs | Filters infra lines from fallback tail and adds startup-failure messaging when no engine output exists.
actions/setup/js/handle_agent_failure.test.cjs | Adds test cases validating infra filtering and the new startup-failure behavior.
.changeset/patch-fix-infra-lines-in-engine-failure-context.md | Documents the patch-level change for release notes/versioning.

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 2

Comment on lines +863 to +868
// (e.g., Copilot API service unavailable, rate-limiting, token not yet provisioned).
core.info("agent-stdio.log contains only infrastructure lines — engine likely failed at startup (possible transient failure)");
let context = `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n`;
context +=
"The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
"If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.\n\n";
Copilot AI Apr 8, 2026

The startup-failure message always tells users to check the “GitHub Copilot status page”, but this code path is used for any engine ID (e.g. tests already cover claude). For non-copilot engines this guidance is inaccurate/misleading. Consider making the wording engine-agnostic or conditionally mentioning the Copilot status page only when GH_AW_ENGINE_ID === "copilot" (and using a generic provider/status message otherwise).

Suggested change
- // (e.g., Copilot API service unavailable, rate-limiting, token not yet provisioned).
- core.info("agent-stdio.log contains only infrastructure lines — engine likely failed at startup (possible transient failure)");
- let context = `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n`;
- context +=
- "The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
- "If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.\n\n";
+ // (e.g., API service unavailable, rate-limiting, token not yet provisioned).
+ core.info("agent-stdio.log contains only infrastructure lines — engine likely failed at startup (possible transient failure)");
+ const recurringFailureGuidance =
+ process.env.GH_AW_ENGINE_ID === "copilot"
+ ? "If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.\n\n"
+ : "If this failure recurs, check the provider status page (if available) and review the firewall audit logs.\n\n";
+ let context = `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n`;
+ context +=
+ "The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
+ recurringFailureGuidance;

Comment on lines +839 to +847
// AWF infrastructure lines written by the firewall/container wrapper — not produced by
// the engine itself. They must be filtered out of the fallback tail so the failure
// context surfaces actual agent output rather than container lifecycle noise
// (e.g. "Container awf-squid Removed", "[WARN] Command completed with exit code: 1",
// "Process exiting with code: 1"). Uses the same pattern as parse_copilot_log.cjs.
// Note: INFO/WARN/SUCCESS/ERROR are uppercase (AWF wrapper convention); entrypoint and
// health-check are lowercase (container script convention) — mixed casing is intentional
// and reflects the actual log output format produced by different AWF components.
const INFRA_LINE_RE = /^\[(INFO|WARN|SUCCESS|ERROR|entrypoint|health-check)\]|^ (?:Container|Network|Volume) |^Process exiting with code:/;
Copilot AI Apr 8, 2026

INFRA_LINE_RE is duplicated here and in actions/setup/js/parse_copilot_log.cjs:156. Since the comment explicitly says this must stay in sync, consider extracting it into a shared constant/module (e.g. alongside other log parsing helpers) to prevent future drift when the infrastructure log format changes.
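
One possible shape for such a shared constant is sketched below, assuming a CommonJS module that exports the pattern together with a small predicate. The isInfraLine helper and the export layout are hypothetical; only the pattern itself is taken from the diff above. The sample checks also show the casing behaviour the code comment calls intentional: bracketed levels match only in uppercase, entrypoint and health-check only in lowercase.

```javascript
// Shared pattern, copied from the diff under review; a single definition
// that both consumers would import instead of redefining it.
const AWF_INFRA_LINE_RE =
  /^\[(INFO|WARN|SUCCESS|ERROR|entrypoint|health-check)\]|^ (?:Container|Network|Volume) |^Process exiting with code:/;

// Hypothetical convenience predicate (not part of the PR).
function isInfraLine(line) {
  return AWF_INFRA_LINE_RE.test(line);
}

// Made-up sample lines illustrating what the pattern does and does not match.
const samples = [
  "[entrypoint] starting proxy",   // lowercase container-script prefix
  "[health-check] squid ready",    // lowercase container-script prefix
  "[info] lowercase level",        // not matched: levels are uppercase only
  " Network awf-net  Removed",     // leading space is part of the pattern
  "Container without leading space",
];

for (const line of samples) {
  console.log(JSON.stringify(line), isInfraLine(line));
}

// A consumer would then do something like:
//   const { AWF_INFRA_LINE_RE } = require("./log_parser_shared.cjs");
module.exports = { AWF_INFRA_LINE_RE, isInfraLine };
```

Because the pattern has no global flag, calling test repeatedly on the same RegExp object is stateless, so one shared instance is safe to reuse across callers.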

@pelikhan
Collaborator

pelikhan commented Apr 8, 2026

@copilot review comments

…tus page guidance engine-specific

- Extract INFRA_LINE_RE as AWF_INFRA_LINE_RE into log_parser_shared.cjs
  so parse_copilot_log.cjs and handle_agent_failure.cjs share the same
  constant (no more duplication risk when infra log format changes)
- Make startup-failure status page guidance engine-specific: copilot gets
  "GitHub Copilot status page", all other engines get "provider status page"
- Add test verifying provider-agnostic guidance for claude engine
- Update existing engine ID startup-failure test to also verify Copilot
  status page text

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/de02b350-d94e-4ee6-aceb-0e1f8d2b9ae8

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 8, 2026

@copilot review comments

Both review comments addressed in f40da83:

  1. Engine-specific status page guidance: buildEngineFailureContext now shows "GitHub Copilot status page" only when GH_AW_ENGINE_ID === "copilot", and "provider status page (if available)" for all other engines (claude, codex, custom, etc.). The new test "shows provider-agnostic status page guidance for non-copilot engines" covers this.

  2. INFRA_LINE_RE duplication — Extracted as AWF_INFRA_LINE_RE into log_parser_shared.cjs with full documentation. Both parse_copilot_log.cjs and handle_agent_failure.cjs now import it from the shared module — no more drift risk.
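
As a rough illustration of the engine-specific guidance in item 1, the message could be assembled along these lines. The function names are hypothetical, the engine ID is passed as a parameter rather than read from GH_AW_ENGINE_ID so the sketch stays self-contained, and the way engineLabel is derived (a leading space plus the ID) is an assumption based on the `The${engineLabel}` template in the diff. The message strings are taken from the suggested change.

```javascript
// Hypothetical helper mirroring the conditional from the suggested change.
function recurringFailureGuidance(engineId) {
  return engineId === "copilot"
    ? "If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.\n\n"
    : "If this failure recurs, check the provider status page (if available) and review the firewall audit logs.\n\n";
}

// Hypothetical helper: builds the startup-failure text for a given engine ID.
function startupFailureMessage(engineId) {
  // Assumption: the label is empty when no engine ID is known.
  const engineLabel = engineId ? ` ${engineId}` : "";
  return (
    `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n` +
    "The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
    recurringFailureGuidance(engineId)
  );
}

console.log(startupFailureMessage("copilot"));
console.log(startupFailureMessage("claude"));
```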

Copilot AI added a commit that referenced this pull request Apr 8, 2026
…25314)

Creates a new workflow that:
- Triggers on PRs labeled "implementation" or with >100 new lines in business logic dirs
- Detects missing ADRs and generates draft using AI analysis of PR diff
- Posts actionable comment guiding author to complete and commit the ADR
- Blocks merge until ADR is linked via REQUEST_CHANGES review
- Verifies implementation matches existing ADR when one is present
- ADRs stored as numbered Markdown files in /docs/adr/
- Configurable business logic paths via .design-gate.yml

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/3d100092-4ec3-4509-8739-d270f7d79996

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
@pelikhan pelikhan merged commit 58708f6 into main Apr 8, 2026
83 checks passed
@pelikhan pelikhan deleted the copilot/fix-workflow-engine-termination branch April 8, 2026 15:11
Development

Successfully merging this pull request may close these issues.

P1 Systemic: Multiple workflows failing with "engine terminated unexpectedly" (exit code 1) — Apr 8

3 participants