feat: task execution integration tests (Run Now + auto-fire)#730
feat: task execution integration tests (Run Now + auto-fire)#730
Conversation
Three new tests that prove the scheduling engine works end-to-end: 1. RunNow_CreatesRunHistory — clicks Run Now, waits for completion, verifies run history entry with session name appears 2. ScheduledExecution_TaskFiresAutomatically — creates 1-min interval task, waits up to 120s for the timer to fire, verifies Last Run 3. RunNow_TwiceCreatesUniqueSessionNames — runs twice, verifies two distinct session names in history These prove the actual task execution pipeline works: timer fires → session created → prompt sent → run recorded → history displayed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code Review — Reviewer 1I reviewed the full 🔴 Finding 1 — Execution tests run twice in CI (trait filter overlap)
The class-level Failing scenario: The "fast" CI step silently runs the 120–180 second execution tests, causing unexpected timeouts. Then the "slow" step runs them again, doubling wall-clock time. Fix: Either (a) move execution tests to a separate class without the 🟡 Finding 2 —
|
There was a problem hiding this comment.
Code Review Summary — PR #730: Task Execution Integration Tests
Methodology: 3 independent reviewers with adversarial consensus. Findings included only when ≥2/3 reviewers agree. 1 disputed finding was escalated to follow-up review and confirmed unanimously.
Findings by Severity
| # | Severity | Finding | Reviewers |
|---|---|---|---|
| 1 | 🔴 CRITICAL | Class-level [Trait("Category", "ScheduledTasks")] causes all 3 execution tests to match both CI filter steps — they run twice, defeating the fast/slow split |
3/3 |
| 2 | 🟡 MODERATE | First batch lost ` | |
| 3 | 🟡 MODERATE | Execution tests run with ` | |
| 4 | 🟡 MODERATE | RunNow_TwiceCreatesUniqueSessionNames uses fixed Task.Delay(30_000) instead of polling — flaky on slow CI |
3/3 |
| 5 | 🟡 MODERATE | No try/finally around test bodies — orphaned 1-min interval tasks keep firing and pollute subsequent tests |
2/3 |
| 6 | 🟡 MODERATE | if (names.Length >= 2) guard silently skips the uniqueness assertion if CDP eval returns empty |
2/3 |
| 7 | 🟢 MINOR | Hardcoded "Last run" string check — inconsistent with other tests, causes misleading timeout on UI copy change |
3/3 (1 + 2 follow-up) |
| 8 | 🟢 MINOR | Comment says "90 seconds max" but loop is 90 × 2s = 180s | 2/3 |
| 9 | 🟢 MINOR | EscapeForJs duplicates EscapeJs from base class |
2/3 |
Key Recommendation
Finding #1 is the highest priority: the trait inheritance issue means execution tests run in the "fast" batch (now blocking due to finding #2) and the "slow" batch. Fix by moving execution tests to a separate class or adjusting the filter.
CI Status
- Commit status: Pending (no statuses reported)
- Check runs: 2 agent jobs in progress, activation/pre_activation completed successfully
- Test coverage: PR adds 3 new integration tests covering Run Now execution, auto-fire scheduling, and session uniqueness. No existing tests removed.
- Prior reviews: None
Generated by Expert Code Review (auto) for issue #730 · ● 5.8M
| // ─── Execution Tests ─── | ||
|
|
||
| [Fact] | ||
| [Trait("Category", "ScheduledTaskExecution")] |
There was a problem hiding this comment.
🔴 CRITICAL — Tests run in BOTH CI batches due to dual trait inheritance
The class already has [Trait("Category", "ScheduledTasks")] at the class level. Adding [Trait("Category", "ScheduledTaskExecution")] per-method means xUnit matches both traits. The first CI step (--filter "Category=ScheduledTasks") picks up all 3 execution tests (they inherit the class trait), then the second step (--filter "Category=ScheduledTaskExecution") runs them again.
Impact: Slow execution tests (up to 120s each + real scheduler waits) execute twice per CI run — ~10+ wasted minutes.
Fix: Move the 3 execution tests to a separate class without the class-level ScheduledTasks trait, or add an exclusion to the first filter: --filter "Category=ScheduledTasks&Category!=ScheduledTaskExecution".
Flagged by: 3/3 reviewers
| Output.WriteLine("Second Run Now triggered, waiting for completion..."); | ||
|
|
||
| // Wait for second run to appear in history | ||
| await Task.Delay(30_000); // Give it 30 seconds |
There was a problem hiding this comment.
🟡 MODERATE — Fixed 30s sleep instead of polling for second run completion
After triggering the second Run Now, the test sleeps a hard 30 seconds rather than polling for .run-status like the first run does (polled up to 90s). On a loaded CI machine, the agent can easily exceed 30s. The test then asserts count >= 2 on a history that may only show 1 entry, producing a spurious failure.
Fix: Replace with a polling loop:
for (var i = 0; i < 45; i++)
{
var currentCount = await CdpEvalAsync(
$"document.querySelectorAll(\"{EscapeForJs(card)} .run-entry\").length.toString()");
if (int.TryParse(currentCount, out var n) && n >= 2) break;
await Task.Delay(2000);
}Flagged by: 3/3 reviewers
| # Run execution tests (slow — waits for tasks to fire) | ||
| POLYPILOT_AGENT_PORT=$PORT dotnet test PolyPilot.IntegrationTests \ | ||
| --filter "Category=ScheduledTaskExecution" \ | ||
| --nologo --verbosity normal 2>&1 || true |
There was a problem hiding this comment.
🟡 MODERATE — Execution tests are permanently non-gating (|| true)
The PR description says these tests "close the biggest gap: proving the timer fires, prompts are dispatched, and runs are recorded." But with || true, a complete scheduler regression passes CI green. These tests can never fail the build.
Fix: Remove || true once the tests are stable, or document explicitly that these are informational-only. Consider a separate CI job with continue-on-error: true for visibility without blocking.
Flagged by: 3/3 reviewers
| Assert.False(string.IsNullOrWhiteSpace(sessionName), "Run entry should show session name"); | ||
| } | ||
|
|
||
| await DeleteTaskAsync(taskName); |
There was a problem hiding this comment.
🟡 MODERATE — No try/finally cleanup; orphaned tasks accumulate on failure
DeleteTaskAsync(taskName) is only called at the end of the test body. If any assertion throws before reaching it (e.g., the Assert.True(hasHistory, ...) at line 231), the task remains in the scheduler. All tests share a single AppFixture, so orphaned tasks persist across test runs. The 1-minute interval task from ScheduledExecution_TaskFiresAutomatically is especially problematic — it keeps firing and can pollute UI state for subsequent tests.
Fix: Wrap each test body in try/finally:
try { /* test body */ }
finally { await DeleteTaskAsync(taskName); }Flagged by: 2/3 reviewers
| # Run UI lifecycle tests first (fast) | ||
| POLYPILOT_AGENT_PORT=$PORT dotnet test PolyPilot.IntegrationTests \ | ||
| --filter "Category=ScheduledTasks" \ | ||
| --nologo --verbosity normal 2>&1 |
There was a problem hiding this comment.
🟡 MODERATE — First batch lost || true, making existing lifecycle tests a blocking gate
Before this PR, the entire ScheduledTasks run was non-blocking (|| true). This change removes that guard from the first batch. Any flaky existing test on a slow CI runner now fails the workflow step, preventing the artifact upload step (screenshots) from running.
Combined with the trait overlap issue above, the slow execution tests also run in this "fast" batch — a timing failure in the 120s scheduler wait aborts the step entirely.
Fix: Restore || true on the first batch, or deliberately ensure only stable tests run here.
Flagged by: 2/3 reviewers
| // The task creates a new session, sends the prompt, and waits up to 11 minutes. | ||
| // For a simple "echo" prompt, it should complete in ~30 seconds. | ||
| var hasHistory = false; | ||
| for (var i = 0; i < 90; i++) // 90 seconds max |
There was a problem hiding this comment.
🟢 MINOR — Comment says "90 seconds max" but actual max is 180s
for (var i = 0; i < 90; i++) with await Task.Delay(2000) = 90 × 2s = 180 seconds max wait, not 90. Either use i < 45 for a true 90s timeout, or update the comment to "180 seconds max."
Flagged by: 2/3 reviewers
| Output.WriteLine($"Session names: {sessions}"); | ||
|
|
||
| var names = sessions.Split('|', StringSplitOptions.RemoveEmptyEntries); | ||
| if (names.Length >= 2) |
There was a problem hiding this comment.
🟡 MODERATE — Conditional guard silently skips the uniqueness assertion
If CdpEvalAsync returns an empty string (e.g., on CDP parse error), names.Length < 2 and the if guard swallows the Assert.NotEqual entirely. The test passes green having never verified uniqueness — which is the entire point of this test. The preceding count >= 2 assertion passes via the DOM count, but the session-name extraction can independently fail.
Fix: Assert unconditionally:
Assert.True(names.Length >= 2,
$"Expected 2+ session names but got '{sessions}'");
Assert.NotEqual(names[0], names[1]);Flagged by: 2/3 reviewers
| if (i % 10 == 0) | ||
| Output.WriteLine($"Poll {i * 2}s: lastRun='{lastRunText}', status={statusExists}"); | ||
|
|
||
| if (statusExists && !string.IsNullOrWhiteSpace(lastRunText) && lastRunText.Contains("Last run")) |
There was a problem hiding this comment.
🟢 MINOR — Hardcoded "Last run" string couples test to UI copy
The other two tests check statusExists && !string.IsNullOrWhiteSpace(lastRunText) without matching specific text. This test adds .Contains("Last run") which means a UI wording change (e.g., "Ran at") causes a misleading 120s timeout failure instead of a clear assertion.
Fix: Drop the .Contains("Last run") clause — statusExists && !string.IsNullOrWhiteSpace(lastRunText) is sufficient and consistent with the other tests.
Flagged by: 3/3 reviewers (1 initial + 2 confirmed in follow-up)
|
|
||
| // ─── Helpers ─── | ||
|
|
||
| private static string EscapeForJs(string value) => |
There was a problem hiding this comment.
🟢 MINOR — EscapeForJs duplicates EscapeJs from base class
IntegrationTestBase already has an EscapeJs method that escapes \, ", and '. This local duplicate omits single-quote escaping — technically correct here but creates a maintenance trap. Consider promoting the base class method to protected and reusing it.
Flagged by: 2/3 reviewers
Adds 3 tests that prove the scheduling engine actually works:
These close the biggest gap: proving the timer fires, prompts are dispatched, and runs are recorded.