Skip to content

Detect incomplete Claude assistant runs#109

Closed
ejc3 wants to merge 2 commits intotty-interactive-modefrom
ci-log-truncation-check
Closed

Detect incomplete Claude assistant runs#109
ejc3 wants to merge 2 commits intotty-interactive-modefrom
ci-log-truncation-check

Conversation

@ejc3
Copy link
Copy Markdown
Owner

@ejc3 ejc3 commented Jan 5, 2026

Summary

Fixes a bug where Claude assistant jobs show "success" even when killed mid-execution.

Root cause: When GitHub cancels a job (e.g., via cancel-in-progress), the Claude process receives SIGTERM and exits. The async iterator in runClaude() stops yielding, the function returns normally, and the process exits with code 0. GitHub sees exit code 0 and marks the job as "success".

Evidence: Run #20726814304 shows "success" but logs show Claude was mid-execution when "Post job cleanup" started.

Changes

  • Add CLAUDE_ASSISTANT_COMPLETE marker after runClaude() finishes in index.ts
  • Verify marker exists in all workflow steps before declaring success
  • Applies to: review, manual-review, respond, ci-fix jobs

Test plan

  • Push to trigger Claude review
  • Verify completion marker appears in successful runs
  • Verify cancelled runs fail instead of showing success

Stacked on: tty-interactive-mode (PR #108)

When a Claude assistant job is killed mid-execution (e.g., by
cancel-in-progress), the process exits with code 0 and GitHub
marks the job as "success" even though it didn't complete.

Fix:
- Add completion marker in index.ts after runClaude finishes
- Verify marker exists in workflow steps before declaring success
- Applies to review, manual-review, respond, and ci-fix jobs

Tested: Verified old run 20726814304 has no completion marker
in logs, confirming it was killed mid-execution.
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Jan 5, 2026

🔍 Claude Review

SEVERITY: critical

Findings

[CRITICAL] Race condition in pipeline operation
The pattern 2>&1 | tee /tmp/claude-output.log followed by grep on the same file creates a race condition. The tee process may still be buffering output when grep executes. While unlikely in most cases, under high system load or with large output, the marker could be in tee's buffer but not yet written to /tmp/claude-output.log when grep runs.

[CRITICAL] Missing set -o pipefail
Without set -o pipefail, the pipeline pnpm exec tsx index.ts 2>&1 | tee /tmp/claude-output.log returns the exit code of the last command (tee), not the first (pnpm exec tsx). This means:

  • If index.ts throws an error (line 706: process.exit(1)), it exits with code 1
  • But tee succeeds (exit 0), so the pipeline succeeds
  • The grep check then runs and fails (marker not present)
  • The real error message from main().catch() is hidden by the generic "process may have been killed" message

[MEDIUM] No SIGTERM signal handler
The PR description identifies SIGTERM as the root cause of premature exits, but no SIGTERM handler is added. When Node.js receives SIGTERM (e.g., from cancel-in-progress):

  1. Current async operations may complete
  2. Process exits before reaching line 702 (console.log("CLAUDE_ASSISTANT_COMPLETE"))
  3. The marker is never printed, which is the intended behavior, BUT...
  4. Without a handler, you can't distinguish "killed by SIGTERM" from "hung/infinite loop"

[MEDIUM] Completion marker may not flush to stdout
console.log() writes to a buffered stdout stream. If the process receives SIGTERM (or exits quickly) after line 702 executes, Node.js may not flush the stdout buffer before terminating. The marker could be "printed" in-memory but never appear in logs.

Recommended Fixes

  1. Fix pipeline: Use a temporary exit code file instead of relying on the pipeline:

    cd scripts/claude-assistant && pnpm exec tsx index.ts 2>&1 | tee /tmp/claude-output.log; echo $? > /tmp/claude-exit-code.txt
    EXIT_CODE=$(cat /tmp/claude-exit-code.txt)
    if \! grep -q "CLAUDE_ASSISTANT_COMPLETE" /tmp/claude-output.log; then
      echo "ERROR: Claude assistant did not complete - process may have been killed"
      exit 1
    fi
    exit $EXIT_CODE
  2. Add pipefail (simpler alternative to Replace custom PassthroughFs with fuse-backend-rs #1):

    set -o pipefail
    cd scripts/claude-assistant && pnpm exec tsx index.ts 2>&1 | tee /tmp/claude-output.log
    # ... rest of check
  3. Add SIGTERM handler (in index.ts):

    let terminated = false;
    process.on('SIGTERM', () => {
      terminated = true;
      console.log("\n=== CLAUDE_ASSISTANT_TERMINATED_BY_SIGNAL ===");
      process.exit(143); // Standard SIGTERM exit code
    });
  4. Force flush stdout (replace line 702):

    process.stdout.write("\n=== CLAUDE_ASSISTANT_COMPLETE ===\n");
    // Ensure synchronous flush (process.stdout.write is sync for TTY, but be explicit)

Summary

The approach is sound, but the implementation has critical flaws that could cause false positives (reporting killed when actually errored) or false negatives (marker buffered but not flushed). The pipeline handling and stdout buffering issues need to be addressed.


Review by Claude | Run

@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Jan 5, 2026

🔧 Auto-Fix Attempted

I attempted to create fixes for the critical issues, but encountered a permissions issue:

Error: The GitHub App cannot push changes to .github/workflows/claude.yml without the workflows permission.

Changes I Attempted to Make

I created the following fixes on branch claude/fix-20727096323:

  1. Added SIGTERM handler (index.ts):

    • Distinguishes cancellation from incomplete execution
    • Outputs CLAUDE_ASSISTANT_TERMINATED_BY_SIGNAL marker
    • Exits with code 143 (standard SIGTERM exit code)
  2. Fixed stdout buffering (index.ts):

    • Changed console.log() to process.stdout.write() for completion marker
    • Ensures synchronous flush before process exits
  3. Added set -o pipefail (claude.yml, all 4 jobs):

    • Propagates exit codes from pnpm exec tsx through the pipeline
    • Prevents tee's success from masking index.ts errors

Required Actions

Option 1: Manually apply the changes from the review comments above

Option 2: Grant the workflows permission to the GitHub App and I can push the fixes

Option 3: Cherry-pick the commit from my branch:

git fetch origin claude/fix-20727096323
git cherry-pick d4fd160

View Claude Run

Addresses review feedback:

1. Add SIGTERM handler (index.ts)
   - Outputs CLAUDE_ASSISTANT_TERMINATED_BY_SIGNAL marker
   - Exits with code 143 (standard SIGTERM exit code)
   - Distinguishes "killed by signal" from "hung/crashed"

2. Fix stdout buffering (index.ts)
   - Use process.stdout.write() instead of console.log()
   - Ensures synchronous flush before process exits
   - Prevents marker from being lost in buffer

3. Add set -o pipefail (claude.yml, all 4 jobs)
   - Propagates exit codes through pipeline
   - Prevents tee's success from masking index.ts errors
   - review, manual-review, respond, ci-fix jobs updated

Tested: pnpm exec tsx --version (syntax check)
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Jan 5, 2026

🔍 Claude Review

SEVERITY: critical

Findings

[CRITICAL] Pipeline exit code not preserved
The workflow script does not check the pipeline exit code after the marker verification. If index.ts successfully writes the completion marker but then errors in main().catch() (lines 714-716), the workflow will:

  1. Execute the pipeline which exits with code 1 (due to set -o pipefail)
  2. Find the completion marker (grep succeeds)
  3. Exit the script without error (exit code 0)
  4. Mark the job as "success" despite the error

Example scenario:

await runClaude(prompt);
process.stdout.write("\n=== CLAUDE_ASSISTANT_COMPLETE ===\n");
// If an error occurs here or in cleanup code
throw new Error("Something failed");

The marker is in the logs, but the process errored. Current workflow reports success.

Fix: Capture and propagate the pipeline exit code:

set -o pipefail
cd scripts/claude-assistant && pnpm exec tsx index.ts 2>&1 | tee /tmp/claude-output.log
PIPELINE_EXIT=$?
if \! grep -q "CLAUDE_ASSISTANT_COMPLETE" /tmp/claude-output.log; then
  echo "ERROR: Claude assistant did not complete - process may have been killed"
  exit 1
fi
exit $PIPELINE_EXIT

This ensures:

  • If marker missing → exit 1 (killed/incomplete)
  • If marker present but pipeline failed → exit with pipeline code
  • If marker present and pipeline succeeded → exit 0

Summary

The completion marker detection is sound, but the workflow doesn't preserve the pipeline exit code. This can cause false positives where errored runs are marked as successful if the error occurs after the completion marker is printed. The fix is to capture and propagate the pipeline exit code.


Review by Claude | Run

@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Jan 5, 2026

🔧 Auto-Fix Attempted

I identified the critical issue and created a fix, but encountered a permissions issue:

Error: The GitHub App cannot push changes to .github/workflows/claude.yml without the workflows permission.

Fix Required

Add exit code preservation to all 4 Claude assistant jobs in .github/workflows/claude.yml:

         run: |
           set -o pipefail
           cd scripts/claude-assistant && pnpm exec tsx index.ts 2>&1 | tee /tmp/claude-output.log
+          PIPELINE_EXIT=$?
           if ! grep -q "CLAUDE_ASSISTANT_COMPLETE" /tmp/claude-output.log; then
             echo "ERROR: Claude assistant did not complete - process may have been killed"
             exit 1
           fi
+          exit $PIPELINE_EXIT

This needs to be applied to all 4 jobs: review, manual-review, respond, and ci-fix.

Why This Fix Is Needed

Without capturing the pipeline exit code, if index.ts successfully writes the completion marker but then errors (e.g., in main().catch()), the workflow will:

  1. Run the pipeline which fails (exit 1)
  2. Check for the marker (found)
  3. Exit without error (exit 0) ← BUG
  4. Report "success" despite the error

Manual Application

You can apply this fix manually by editing .github/workflows/claude.yml and adding the two lines (PIPELINE_EXIT=$? and exit $PIPELINE_EXIT) to each of the 4 jobs.

View Claude Run

@ejc3 ejc3 deleted the branch tty-interactive-mode January 6, 2026 04:11
@ejc3 ejc3 closed this Jan 6, 2026
@ejc3 ejc3 deleted the ci-log-truncation-check branch January 17, 2026 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant