Add copilot driver with retry logic for partial session failures#25329
…rrors
- Create actions/setup/js/copilot_driver.cjs: Node.js driver that wraps copilot CLI subprocess, detects CAPIError 400, and retries with --resume flag using exponential backoff (3 retries, 5s initial delay)
- Add DriverProvider interface to pkg/workflow/agentic_engine.go
- Implement GetDriverScriptName() on CopilotEngine returning copilot_driver.cjs
- Modify GetExecutionSteps() to prepend driver to copilot command in both sandbox (AWF) and non-sandbox modes
- Update golden test fixtures and recompile lock files

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/7f63d96b-1922-4261-888a-e40d5845c0f8
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
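The backoff schedule named in this commit can be sketched as follows. This is a minimal illustration rather than the code in copilot_driver.cjs: `MAX_RETRIES` appears in the reviewed source, but the other constant names, the doubling multiplier, and the helper `retryDelayMs` are assumptions, and the 60s cap is taken from the PR description ("5s→10s→20s backoff, 60s cap").

```javascript
// Hypothetical constants; only MAX_RETRIES is known to exist under this name.
const MAX_RETRIES = 3; // retries after the initial run
const INITIAL_DELAY_MS = 5000; // 5s before the first retry
const BACKOFF_MULTIPLIER = 2; // assumed doubling: 5s, 10s, 20s
const MAX_DELAY_MS = 60000; // cap mentioned in the PR description

/**
 * Delay before retry number `retryIndex` (0-based), capped at MAX_DELAY_MS.
 * @param {number} retryIndex
 * @returns {number} delay in milliseconds
 */
function retryDelayMs(retryIndex) {
  const delay = INITIAL_DELAY_MS * Math.pow(BACKOFF_MULTIPLIER, retryIndex);
  return Math.min(delay, MAX_DELAY_MS);
}

// Delays for the three retries.
const schedule = Array.from({ length: MAX_RETRIES }, (_, i) => retryDelayMs(i));
console.log(schedule); // [ 5000, 10000, 20000 ]
```

With three retries the cap is never reached; it would only matter if the retry count were raised.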
@copilot add extensive logging
- Add log() helper with ISO timestamps and [copilot-driver] prefix
- Log startup config (maxRetries, delays, backoff)
- Log each attempt: pid, args (with --prompt redacted), stdout/stderr bytes, exit code, duration
- Log retry decisions with context (isCAPIError400, hasOutput, retriesRemaining)
- Log final summary with total duration
- Add formatDuration() for human-readable elapsed times
- Extend tests to cover formatDuration and log format

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/262cb7ee-9401-4cc8-bfd1-ffe0b6516d29
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
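The logging helpers listed above might look roughly like this. It is a sketch under assumptions: `formatDuration` is named in the commit, but its exact output rules are guessed from the "3m 12s" example in the PR description, and `redactArgs` and `formatLogLine` are hypothetical names. The order of prefix and timestamp in the log line is also assumed.

```javascript
/**
 * Format milliseconds as a short human-readable string.
 * Assumed formats: "450ms", "12s", "3m 12s"; the driver's rules may differ.
 * @param {number} ms
 */
function formatDuration(ms) {
  if (ms < 1000) return `${ms}ms`;
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return minutes > 0 ? `${minutes}m ${seconds}s` : `${seconds}s`;
}

/**
 * Replace the value that follows --prompt so prompts never reach the log.
 * Assumes the value is passed as a separate argument.
 * @param {string[]} args
 */
function redactArgs(args) {
  return args.map((arg, i) => (i > 0 && args[i - 1] === "--prompt" ? "<redacted>" : arg));
}

/**
 * Build a log line with the [copilot-driver] prefix and an ISO timestamp.
 * @param {string} message
 * @param {Date} [now]
 */
function formatLogLine(message, now = new Date()) {
  return `[copilot-driver] ${now.toISOString()} ${message}`;
}

console.log(formatLogLine(`args=${redactArgs(["--prompt", "secret"]).join(" ")} duration=${formatDuration(192000)}`));
```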
Done in ee0b538. Added a
@copilot attempts should resume the session if a run was already partially executed.
Previously the driver only retried when a CAPIError 400 was detected. Now it retries with --resume whenever the session produced any output (hasOutput), which covers all cases where a run was partially executed. CAPIError 400 is still detected and named in the retry log message as the well-known transient case; other failures report "partial execution" as the reason. No-output failures (process failed to start) are still not retried. Added 5 new unit tests covering the broadened policy.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/1022cc92-f38f-403a-83c5-4b5027719a76
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
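The broadened policy can be summarized as a small decision function. The sketch below is illustrative only: `decideRetry`, its return shape, and the regular expression are assumptions, while the rules themselves (non-zero exit, output present, retries left, CAPIError 400 named as the reason when found) come from the commit message above.

```javascript
// Pattern for the well-known transient failure; the exact regex is an assumption.
const CAPI_ERROR_400_PATTERN = /CAPIError: 400/;

/**
 * Decide whether a finished attempt should be retried with --resume.
 * @param {{exitCode: number, output: string, hasOutput: boolean}} result
 * @param {number} retriesRemaining
 * @returns {{retry: boolean, reason: string}}
 */
function decideRetry(result, retriesRemaining) {
  if (result.exitCode === 0) return { retry: false, reason: "success" };
  if (retriesRemaining <= 0) return { retry: false, reason: "retries exhausted" };
  // No output means the process likely never started, so there is no session to resume.
  if (!result.hasOutput) return { retry: false, reason: "no output" };
  const isCAPIError400 = CAPI_ERROR_400_PATTERN.test(result.output);
  return { retry: true, reason: isCAPIError400 ? "CAPIError 400" : "partial execution" };
}

console.log(decideRetry({ exitCode: 1, output: "tool call ok\nCAPIError: 400 Bad Request", hasOutput: true }, 3));
```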
Done in c58aa0d. The retry condition is now broadened: the driver retries with |
|
📰 BREAKING: Smoke Copilot is now investigating this pull request. Sources say the story is developing... |
|
🎬 THE END — Smoke Claude MISSION: ACCOMPLISHED! The hero saves the day! ✨ |
|
✅ All tools validated successfully! Agent Container Smoke Test confirms agent container is ready. |
|
✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟 |
Agent Container Tool Check
Result: 12/12 tools available ✅ Overall Status: PASS
|
Pull request overview
Adds a Node.js “copilot driver” wrapper to make Copilot CLI runs resilient to transient mid-session failures by retrying with --resume when a run has partially executed (produced output).
Changes:
- Introduces `actions/setup/js/copilot_driver.cjs` to wrap Copilot CLI execution with resume-based retry and structured logging.
- Adds a `DriverProvider` interface and wires `CopilotEngine` to invoke Copilot via the driver in both sandbox and non-sandbox execution paths.
- Updates compiled workflow lockfiles and WASM golden fixtures to reflect the new `node .../copilot_driver.cjs ...` invocation.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/copilot_driver.cjs | New Node wrapper that retries Copilot runs with --resume after partial execution. |
| pkg/workflow/agentic_engine.go | Adds DriverProvider optional interface for engines to provide a JS driver script name. |
| pkg/workflow/copilot_engine.go | CopilotEngine implements GetDriverScriptName() returning copilot_driver.cjs. |
| pkg/workflow/copilot_engine_execution.go | Prefixes Copilot CLI invocation with node ${RUNNER_TEMP}/gh-aw/actions/copilot_driver.cjs .... |
| pkg/workflow/copilot_engine_test.go | Adds unit test asserting driver script name and that execution steps include the driver. |
| pkg/workflow/testdata/TestWasmGolden_CompileFixtures/basic-copilot.golden | Updates expected AWF sandbox command to include the Node driver wrapper. |
| pkg/workflow/testdata/TestWasmGolden_CompileFixtures/with-imports.golden | Updates expected AWF sandbox command to include the Node driver wrapper. |
| .github/workflows/workflow-health-manager.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/workflow-generator.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/weekly-safe-outputs-spec-review.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/weekly-blog-post-writer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/update-astro.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/test-workflow.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/test-project-url-default.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/test-dispatcher.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/terminal-stylist.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/technical-doc-writer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/super-linter.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/sub-issue-closer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/smoke-update-cross-repo-pr.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/smoke-service-ports.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/smoke-create-cross-repo-pr.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/security-review.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/security-compliance.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/research.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/repository-quality-improver.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/repo-tree-map.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/repo-audit-analyzer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/refiner.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/q.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/python-data-charts.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/pr-triage-agent.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/pr-nitpick-reviewer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/portfolio-analyst.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/plan.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/pdf-summary.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/org-health-report.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/notion-issue-summary.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/metrics-collector.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/jsweep.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/issue-triage-agent.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/issue-monster.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/gpclean.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/glossary-maintainer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/github-remote-mcp-auth-test.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/functional-pragmatist.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/firewall.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/example-permissions-warning.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/docs-noob-tester.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/dictation-prompt.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/dev.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/dependabot-go-checker.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/dependabot-burner.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/dead-code-remover.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-workflow-updater.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-team-status.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-semgrep-scan.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-secrets-analysis.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-regulatory.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-performance-summary.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-malicious-code-scan.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-integrity-analysis.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-firewall-report.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-cli-tools-tester.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-cli-performance.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-assign-issue-to-user.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/daily-architecture-diagram.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/craft.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/copilot-token-optimizer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/copilot-token-audit.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/copilot-pr-merged-report.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/contribution-check.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/constraint-solving-potd.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/code-simplifier.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/code-scanning-fixer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/ci-coach.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/brave.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/bot-detection.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/artifacts-summary.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/archie.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/agentic-observability-kit.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/agent-persona-explorer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/agent-performance-analyzer.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
| .github/workflows/ace-editor.lock.yml | Updates locked workflow command to invoke Copilot via the Node driver. |
Copilot's findings
- Files reviewed: 128/128 changed files
- Comments generated: 4
child.on("exit", (code, signal) => {
  const durationMs = Date.now() - startTime;
  const exitCode = code ?? 1;
  log(`attempt ${attempt + 1}: process exited` + ` exitCode=${exitCode}` + (signal ? ` signal=${signal}` : "") + ` duration=${formatDuration(durationMs)}` + ` stdout=${stdoutBytes}B stderr=${stderrBytes}B hasOutput=${hasOutput}`);
  resolve({ exitCode, output: collectedOutput, hasOutput, durationMs });
});
In runProcess, resolving on the child process exit event can miss trailing stdout/stderr data because exit may fire before stdio streams are fully drained/closed. That can incorrectly set hasOutput=false (skipping retries) and can miss the CAPIError: 400 pattern. Prefer resolving on the close event (which waits for stdio to close) and keep exit only for logging if needed.
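A minimal sketch of the suggested change, assuming a simplified `runProcess` that takes a command and arguments (the real function's signature and logging are not shown in this thread). It resolves on `close`, which Node.js emits after the process has ended and its stdio streams have closed, and keeps `exit` for diagnostics only.

```javascript
const { spawn } = require("child_process");

/**
 * Run a command and resolve once its stdio streams have closed.
 * Simplified illustration; not the driver's actual runProcess.
 * @param {string} command
 * @param {string[]} args
 * @returns {Promise<{exitCode: number, output: string, hasOutput: boolean}>}
 */
function runProcess(command, args) {
  return new Promise(resolve => {
    const child = spawn(command, args, { stdio: ["inherit", "pipe", "pipe"] });
    let output = "";
    let hasOutput = false;

    /** @param {Buffer} data */
    const collect = data => {
      hasOutput = true;
      output += data.toString();
    };
    child.stdout.on("data", collect);
    child.stderr.on("data", collect);

    // "exit" may fire while data is still buffered; use it for diagnostics only.
    child.on("exit", (code, signal) => {
      process.stderr.write(`process exited code=${code} signal=${signal}\n`);
    });

    // "close" fires after the stdio streams are closed, so output is complete here.
    child.on("close", code => {
      resolve({ exitCode: code ?? 1, output, hasOutput });
    });

    // Spawn failures (for example, command not found) surface as "error".
    child.on("error", () => {
      resolve({ exitCode: 1, output, hasOutput });
    });
  });
}

runProcess(process.execPath, ["-e", "console.log('hello')"]).then(result => {
  console.log(result.exitCode, result.hasOutput);
});
```

Resolving a promise twice is harmless, so the `error` handler can coexist with `close` without extra bookkeeping.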
let collectedOutput = "";
let hasOutput = false;
let stdoutBytes = 0;
let stderrBytes = 0;

child.stdout.on(
  "data",
  /** @param {Buffer} data */ data => {
    hasOutput = true;
    stdoutBytes += data.length;
    collectedOutput += data.toString();
    process.stdout.write(data);
  }
);

child.stderr.on(
  "data",
  /** @param {Buffer} data */ data => {
    hasOutput = true;
    stderrBytes += data.length;
    collectedOutput += data.toString();
    process.stderr.write(data);
collectedOutput concatenates the full stdout+stderr into a single string. Copilot sessions can emit large outputs, which can cause high memory usage or GC churn in the wrapper process. Since the driver only needs to detect a small error signature, consider capping what you retain (e.g., keep a rolling tail buffer / scan incrementally) while still computing hasOutput and byte counts.
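One way to cap retention while still tracking `hasOutput` and byte counts is a rolling tail buffer. The class below is a hypothetical sketch: `TailBuffer`, its method names, and the 64 KiB default are illustrative choices, not part of the driver.

```javascript
/**
 * Keep only the last `maxChars` characters of everything appended,
 * while still counting the total number of bytes seen.
 */
class TailBuffer {
  /** @param {number} [maxChars] arbitrary illustrative default of 64 KiB */
  constructor(maxChars = 64 * 1024) {
    this.maxChars = maxChars;
    this.tail = "";
    this.totalBytes = 0;
  }

  /** @param {Buffer} data */
  append(data) {
    this.totalBytes += data.length;
    this.tail += data.toString();
    if (this.tail.length > this.maxChars) {
      // Drop the oldest characters; only the most recent output is retained.
      this.tail = this.tail.slice(-this.maxChars);
    }
  }

  /** True once any byte has been appended. */
  get hasOutput() {
    return this.totalBytes > 0;
  }

  /** @param {RegExp} pattern */
  matches(pattern) {
    return pattern.test(this.tail);
  }
}

const buffer = new TailBuffer(32);
buffer.append(Buffer.from("x".repeat(1000)));
buffer.append(Buffer.from("CAPIError: 400 Bad Request"));
console.log(buffer.hasOutput, buffer.totalBytes, buffer.matches(/CAPIError: 400/));
```

The trade-off is that an error signature followed by more than `maxChars` of further output would be missed, so the cap should comfortably exceed whatever the CLI prints after a failure.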
// GetDriverScriptName returns the filename of the JavaScript driver script that wraps
// the Copilot CLI with retry logic for transient CAPIError 400 errors.
func (e *CopilotEngine) GetDriverScriptName() string {
	return "copilot_driver.cjs"
}
The comment says the driver wraps the CLI "with retry logic for transient CAPIError 400 errors", but the driver actually retries any non-zero exit after partial execution (any output), regardless of error type. Updating the doc comment to match the broader behavior will avoid confusion for future maintainers.
//
// When a driver script is provided (GetDriverScriptName), wrap the copilot invocation with
// `node <driver> <commandName> <args>` to enable retry logic for transient CAPIError 400 errors.
driverScriptName := e.GetDriverScriptName()
This comment frames the driver wrapper as enabling retries specifically for transient CAPIError 400, but the driver retries any partially-executed run (any output + non-zero exit). Consider adjusting the wording here to reflect the actual policy (partial execution resume) so the behavior is discoverable in the Go code.
|
Smoke test run
|
|
🤖 Smoke test result for PR #25329 (
Overall: PASS ✅
|
Reviewing PR #25329 — the copilot driver with retry logic for CAPIError 400 partial session failures. The implementation is clean and well-documented. Left 2 inline comments: one suggesting making MAX_RETRIES configurable, and another about the logging verbosity. Overall, solid approach to handling transient failures.
📰 BREAKING: Report filed by Smoke Copilot · ● 1.5M
const { spawn } = require("child_process");

// Maximum number of retry attempts after the initial run
const MAX_RETRIES = 3;
The MAX_RETRIES = 3 constant is hardcoded here. Consider making this configurable via an environment variable (e.g., COPILOT_DRIVER_MAX_RETRIES) to allow tuning in CI environments without code changes.
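The suggestion could be implemented along these lines. This is a sketch, not code from the PR: `resolveMaxRetries` and `DEFAULT_MAX_RETRIES` are hypothetical names, and `COPILOT_DRIVER_MAX_RETRIES` is the example variable name proposed in the comment above.

```javascript
const DEFAULT_MAX_RETRIES = 3;

/**
 * Read the retry limit from the environment, falling back to the default
 * when the variable is unset, empty, or not a non-negative integer.
 * @param {Record<string, string | undefined>} [env]
 * @returns {number}
 */
function resolveMaxRetries(env = process.env) {
  const raw = env.COPILOT_DRIVER_MAX_RETRIES;
  if (raw === undefined || raw.trim() === "") return DEFAULT_MAX_RETRIES;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : DEFAULT_MAX_RETRIES;
}

console.log(resolveMaxRetries({ COPILOT_DRIVER_MAX_RETRIES: "5" })); // 5
```

Falling back silently on invalid values keeps the driver transparent; logging a warning in that case would be an equally reasonable choice.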
 * grep out of the combined agent-stdio.log.
 * @param {string} message
 */
function log(message) {
The log function writes directly to process.stderr. Consider also supporting a DEBUG env var check to suppress verbose driver logs unless debugging is needed — helps keep agent-stdio.log readable in normal runs.
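A debug-gated logger might be sketched as below. Everything here is an assumption for illustration: `createLogger` is a hypothetical factory, `COPILOT_DRIVER_DEBUG` is an invented variable name (the comment only says "a DEBUG env var"), and the line format mirrors the `[copilot-driver]` prefix and ISO timestamp described in the PR.

```javascript
/**
 * Create a logger that writes to `stream` only when debugging is enabled.
 * COPILOT_DRIVER_DEBUG is a hypothetical variable name used for illustration.
 * @param {Record<string, string | undefined>} [env]
 * @param {{write: (chunk: string) => unknown}} [stream]
 * @returns {(message: string) => void}
 */
function createLogger(env = process.env, stream = process.stderr) {
  const enabled = env.COPILOT_DRIVER_DEBUG === "1" || env.COPILOT_DRIVER_DEBUG === "true";
  return function log(message) {
    if (!enabled) return;
    stream.write(`[copilot-driver] ${new Date().toISOString()} ${message}\n`);
  };
}

const log = createLogger({ COPILOT_DRIVER_DEBUG: "1" });
log("starting attempt 1");
```

If this were adopted, retry decisions and the final summary would probably still need to be logged unconditionally, since they explain why a run took longer than expected.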
|
Commit pushed:
|
|
📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤 |
@copilot make sure the copilot exit code is propagated to the exit
Smoke Test Run 24146254696 — PARTIAL ✅ Core: 1✅ 2✅ 3✅ 4✅ 5✅ 6✅ 7✅ 8✅ 9✅ 10✅ 11✅ 12✅ All non-skipped tests passed. Tests 16 & 19 skipped (expected).
|
💥 Automated smoke test review - all systems nominal! (Run 24146254696)
💥 [THE END] — Illustrated by Smoke Claude · ● 289.1K
    hasOutput = true;
    stderrBytes += data.length;
    collectedOutput += data.toString();
    process.stderr.write(data);
Smoke test review comment #1 — The collectedOutput buffer captures full stdout/stderr for retry detection. Worth noting that the hasOutput check (any output → retry eligible) is correct for the broader retry policy described in this PR. Consider adding a max buffer cap to guard against extremely verbose sessions.
// the Copilot CLI with retry logic for transient CAPIError 400 errors.
func (e *CopilotEngine) GetDriverScriptName() string {
	return "copilot_driver.cjs"
}
Smoke test review comment #2 — The comment references CAPIError 400 specifically, but the actual retry policy in copilot_driver.cjs now retries on any partial execution (any output + non-zero exit). Updating this doc comment to say "partial execution retry" would keep Go and JS in sync semantically.
…ut are fully captured

The 'exit' event can fire before stdout/stderr buffers are fully drained. Resolving on 'close' instead guarantees that collectedOutput and hasOutput are complete before the retry decision is made, and that the copilot exit code is faithfully propagated through the driver. The 'exit' handler is kept for early diagnostic logging only. Both events pass (code, signal); the close handler is now the single source of truth for exitCode used in process.exit().

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/5d9da918-aa26-45c4-b173-ead01a7a9163
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Fixed in f04bbbd. The root cause was resolving the promise on the |
Copilot CLI exits with `CAPIError: 400 Bad Request` mid-session (after successful tool calls), wasting the entire premium request. The error is transient, not a malformed request, and warrants a retry with `--resume` to continue the session. More broadly, any failure that occurs after the session has partially executed (produced output) is now eligible for a resume retry.

Changes
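New:
`actions/setup/js/copilot_driver.cjs`
Transparent subprocess wrapper for the Copilot CLI:
- Retries with the `--resume` flag whenever a run was partially executed (produced output), regardless of error type (3 attempts, 5s→10s→20s backoff, 60s cap)
- Names CAPIError 400 as the retry reason when it is detected; other failures report "partial execution"
- Logs with ISO timestamps and a `[copilot-driver]` prefix (grep-friendly in `agent-stdio.log`):
  - Each attempt: pid, args (`--prompt` value redacted), stdout/stderr byte counts, exit code, signal, duration
  - Retry decisions (`isCAPIError400`, `hasOutput`, retries remaining, named reason)
  - Human-readable durations via `formatDuration()` (e.g. `3m 12s`)

New: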
`DriverProvider` interface (`pkg/workflow/agentic_engine.go`)
Optional interface engines can implement to supply a JS driver script:
`CopilotEngine` wired to driver (`copilot_engine.go`, `copilot_engine_execution.go`)
`CopilotEngine` implements `DriverProvider`. `GetExecutionSteps()` now prefixes the copilot invocation with `node ${RUNNER_TEMP}/gh-aw/actions/copilot_driver.cjs` in both AWF-sandbox and standard execution modes:
The driver file is accessible inside AWF via the existing `${RUNNER_TEMP}/gh-aw:ro` mount, and `node` is available via `--env-all` + chroot PATH passthrough.

Changeset
`--resume`, improving reliability for transient mid-session errors after output has begun.

✨ PR Review Safe Output Test - Run 24146254696