Batch/online unification — Phases 0–9 + bug fixes F1–F15#578
Merged
Conversation
* feat(vscode): add zoom and pan to Workflow DAG view (#545) Large workflows render too small to read because the mermaid SVG was constrained to max-width: 100% with no way to zoom in. Add interactive pan/zoom: mouse wheel to zoom, click-drag to pan, and control buttons (zoom in, zoom out, fit-to-view, reset). Auto-fits on first render. * test: add failing Phase 1 observability tests (red) TDD Commit A — these tests prove three observability gaps exist: - U-2.G: _mark_prep_failed silently returns when target_id is missing - U-2.H: passthrough_on_error swallows exceptions without mentioning it - U-2.E: prefilter_by_guard accepts mismatched original_data length All 3 must fail before Commit B (green) adds the fixes. * fix: add Phase 1 observability guards and warnings (green) TDD Commit B — all Phase 1 tests now pass: - U-2.G: _mark_prep_failed logs WARNING when target_id is missing, making prep failures on anonymous records visible to operators - U-2.H: passthrough_on_error exception handler logs WARNING with clause and error context (behavior unchanged — still passes) - U-2.E: prefilter_by_guard raises RuntimeError when original_data length mismatches data length (prevents silent record loss) - U-2.I: FILE vs RECORD merge-order divergence documented with rationale (sequential accumulation vs independent processing) - U-2.J: Schema compilation already per-action — no change needed No behavioral change except U-2.E (converts silent data loss into a loud RuntimeError). * chore: add changie entry for Phase 1 observability * refactor: simplify Phase 1 observability after review - evaluator.py: eliminate double WARNING on passthrough path — move generic warning to non-passthrough branch so each exception produces exactly one log line - test file: use _PREPARATOR_LOGGER/_EVALUATOR_LOGGER constants instead of hardcoded strings, tighten assertion from >= 1 to == 1, update module docstring (remove stale commit A/B narrative)
* feat(vscode): add zoom and pan to Workflow DAG view (#545) Large workflows render too small to read because the mermaid SVG was constrained to max-width: 100% with no way to zoom in. Add interactive pan/zoom: mouse wheel to zoom, click-drag to pan, and control buttons (zoom in, zoom out, fit-to-view, reset). Auto-fits on first render. * test(phase4): add failing tests for tombstone marker field parity build_tombstone() and build_exhausted_tombstone() do not produce _tombstone or _tombstone_reason fields. Downstream code must guess tombstone identity from _state values. These tests assert that all tombstone types carry consistent marker fields. All 13 tests fail — proving the gap exists. * feat(phase4): add _tombstone and _tombstone_reason to tombstone builders build_tombstone() and build_exhausted_tombstone() now set _tombstone=True and _tombstone_reason=<reason> on every tombstone record. This gives downstream code (result collector, preview CLI, FILE tool) a reliable structural marker instead of inferring tombstone identity from _state values. Both online and batch paths produce identical marker fields since they share the same builder functions. * fix(phase4): add tombstone markers to sibling builders, fix test gaps Pattern completeness (Quality Gate 5): ExhaustedRecordBuilder and PassthroughItemBuilder both construct tombstone-like records with agent_type="tombstone" but lacked _tombstone/_tombstone_reason markers. Fixed both. Test fixes: - Remove tautological set equality assertion - Add extra_metadata clobber test for build_exhausted_tombstone - Add sibling builder parity tests for both builders
* test: add TDD baseline test infrastructure for batch/online unification (Phase 0) Create tests/unit/unification/ with conftest.py and Phase 0 baseline tests verifying U-0.1 through U-0.4 are implemented: - U-0.1: Batch preflight calls prepare with skip_guard=False - U-0.2: Online process_record defaults skip_guard=False - U-0.3: Prefilter passes context=eval_item to guard evaluator - U-0.4: Both paths use shared _build_pipeline_context Includes paired_execution fixture for Phase 5+ parity tests. No source changes — tests only. * refactor: simplify test fixtures and fix misleading test name - Move imports to module level in conftest (no circular import risk) - Remove redundant inline comments in paired_execution fixture - Rename test_batch_path_calls_build_pipeline_context to accurately reflect what it verifies (type compatibility, not call tracing) * fix: address PR review findings — eliminate vacuous assertions and dead code - Replace vacuous assertion (result.skip_reason is not None always True via MagicMock) with precise checks: assert build_tombstone called and result.executed is False - Add assertion to guard-skipped-rows test (was no-assertion, would pass if method were a no-op) - Remove dead get_task_preparer patch in paired_execution (mock_preparer is passed directly to _process_single_item, patch was never exercised) - Fix paired_execution docstring to accurately describe what it tests (mock plumbing, not real evaluator parity)
* test(unification): add Phase 2 batch prep integrity tests U-2.B red test: _process_single_item sets INCLUDED before prepare() runs, so when prepare() throws, context_map is left with a lie. U-1.3a/U-2.C regression tests: confirm Phase 1 fixes hold — _mark_prep_failed uses transition() API, no raw _state bypass. test_prepare_failure_not_marked_included: FAILS (bug exists) All other tests: PASS (regression) * fix(batch): move INCLUDED status after prepare() succeeds (U-2.B) context_map previously marked records as INCLUDED before prepare() ran. If prepare() threw, the record was left with INCLUDED status at the _process_single_item level — a lie that downstream phases (3, 6, 7) would trust. Now INCLUDED is set only after prepare() succeeds and guard checks pass. If prepare() throws, the entry exists in context_map (so _mark_prep_failed can find it) but has no filter status, which _mark_prep_failed correctly sets to FAILED. * refactor(tests): deduplicate _enable_log_propagation fixture + extract prep_harness Move _enable_log_propagation to shared conftest (was duplicated in Phase 1 and Phase 2 test files). Extract repeated test setup in TestContextMapIntegrity into a prep_harness fixture and _run helper. * test(unification): cover all guard paths + strengthen transition assertion Address PR review findings: - Add FILTERED and UPSTREAM_UNPROCESSED guard path tests (was WARNING) - Assert stats side effects (skipped_items, filtered_items) in guard tests - Strengthen transition API test: assert on history entry content (action, to, reason) instead of just _state_history existence
* test(phase5): add failing TDD tests for guard context parity Proves the context divergence bug: prefilter_by_guard passes context=eval_item (record content only), while TaskPreparer.prepare builds full field_context with source, version, workflow namespaces and promoted output_fields. Guards referencing source.name, version.i, or promoted output_fields produce different decisions in online vs batch — 5 tests fail. * refactor(phase5): extract shared build_guard_context helper Extract guard context construction into processing/guard_context.py as the single source of truth for both batch and online paths. TaskPreparer._load_full_context now delegates to build_guard_context, ensuring the output_field promotion and source resolution logic lives in one place. This prevents the two paths from diverging. Update test that patched the old logger location. * feat(phase5): wire prefilter to shared guard context builder prefilter_by_guard now accepts optional pipeline context parameters (agent_indices, source_data, version_context, workflow_metadata, dependency_configs). When provided, it builds full field_context via build_guard_context — identical to what TaskPreparer.prepare() uses. Both _guard_filter and _guard_filter_file_mode in unified.py pass the ProcessingContext fields through, so online guards now see the same source, version, workflow namespaces and promoted output_fields that batch guards see. When no pipeline context is provided, prefilter falls back to eval_item (backward-compatible for callers outside unified.py). All 9 Phase 5 parity tests pass. 6847 existing tests pass. * chore: add changie entry for Phase 5 guard context alignment
* test(phase3): add failing tests for batch retrieve correctness
U-2.A: passthrough key collides with LLM field — assert LLM value kept
U-2.D: FAILED records in context_map — assert they appear in output
All 5 tests FAIL against current code, proving both bugs exist:
- Merge order: passthrough overwrites LLM output on key collision
- FAILED records: silently dropped during reconciliation
* fix(phase3): batch retrieve correctness — merge order and FAILED reconciliation
U-2.A: Reverse passthrough merge order so LLM output wins on key collision.
Previously `{**item, **stored_passthrough}` let passthrough overwrite LLM
output. Now `{**stored_passthrough, **item}` ensures LLM values take
precedence. Collisions on user-content fields are logged at INFO level;
framework fields (target_id, _state, etc.) are silently excluded.
U-2.D: Include FilterStatus.FAILED records in passthrough reconciliation.
Previously reconciler.get_passthrough_records() only emitted SKIPPED,
INCLUDED, and None — FAILED records silently vanished. Now they flow
through a dedicated _build_failed_passthrough() that emits a
ProcessingResult.failed() with the prep failure reason, ensuring all
records appear in output.
…rity (#552) * fix: remove _is_cascade_blocked fallback — single cascade authority (Phase 6) Remove duplicate CASCADE check in batch reconciliation. Prep is now the single authority for cascade decisions via get_skip_reason(). Changes: - Delete _is_cascade_blocked() function (zero remaining callers) - Remove _is_cascade_blocked fallback from _build_unprocessed_passthrough - Remove CASCADE_BLOCKING_VALUES and UPSTREAM_UNPROCESSED imports - Update 2 existing tests that relied on the old fallback path - Add 5 Phase 6 regression tests verifying single-authority behavior Without skip_reason from prep, cascade-blocked records now get BATCH_NOT_RETURNED (not UPSTREAM_UNPROCESSED from re-derived _state). * refactor: use UPSTREAM_UNPROCESSED constant and clarify docstring - Replace raw string "upstream_unprocessed" with UPSTREAM_UNPROCESSED constant in test_upstream_unprocessed_filter.py - Remove phase reference from docstring — describe the invariant, not the implementation history
* feat(vscode): add zoom and pan to Workflow DAG view (#545) Large workflows render too small to read because the mermaid SVG was constrained to max-width: 100% with no way to zoom in. Add interactive pan/zoom: mouse wheel to zoom, click-drag to pan, and control buttons (zoom in, zoom out, fit-to-view, reset). Auto-fits on first render. * test(phase7a): add failing tests for DEFERRED disposition on batch submit U-3.3: After batch submission, DISPOSITION_DEFERRED must be stamped for every record sent to the provider. Currently no dispositions are written at submit time — only on retrieve. All 3 tests fail — proving the gap exists. * feat(phase7a): stamp DISPOSITION_DEFERRED on batch submit U-3.3: After successful batch submission, every INCLUDED record now gets DISPOSITION_DEFERRED written to the storage backend. Records that were filtered/skipped during preparation do not receive DEFERRED. Implementation: - Add storage_backend parameter to BatchSubmissionService - _stamp_deferred() iterates context_map for INCLUDED records - Both callers (initial_pipeline, workflow/pipeline) pass backend - Reason includes batch_id for traceability This enables operators to query the disposition table for in-flight batch records and lets Phase 7b clear DEFERRED on retrieve. * fix(phase7a): use _safe_set_disposition, remove unused test param _stamp_deferred now uses _safe_set_disposition (matching all other callers) so a storage failure won't crash batch submission. Removed unused prepare_result parameter from test helper. * test(phase7a): add negative test for skipped/filtered records Verify that SKIPPED, FILTERED, and FAILED records do NOT receive DISPOSITION_DEFERRED — only INCLUDED records are stamped. Tests _stamp_deferred directly with a hand-crafted context_map containing all four FilterStatus values.
…554) * test: add failing Phase 9a observe null-safety tests (red) TDD Commit A — these tests prove the observe/guard null asymmetry: - observe raises RecordContextError for missing field from present namespace - guards resolve the same field to None (correct behavior) 3 tests fail, 2 pass (existing null-namespace and undeclared-namespace behaviors). * fix: observe resolves missing fields to None matching guard semantics (U-4.2) _resolve_missing_field now has three cases: 1. Namespace is None (guard-skipped/filtered) → return None (unchanged) 2. Namespace exists as dict but field missing → return None with WARNING 3. Namespace not in context at all → raise RecordContextError (unchanged) This eliminates the asymmetry where guards return None for missing fields but observe throws RecordContextError. Guards and observe now share the same null-field contract. Preflight still catches config typos (case 3). Drop+observe on the same field now resolves to None (not the original value) — the security invariant is preserved: dropped data is not leaked. 10 existing tests updated to match new behavior. 5 new Phase 9a tests. 6870 passed, ruff clean. * chore: add changie entry for Phase 9a observe null safety * test: tighten FILE mode missing-field enrichment assertion The test claimed to verify missing-field-with-None enrichment but only checked a present field survived. Now also asserts: second present field preserved, and missing field not injected as flat key (flat key injection only covers fields present in the namespace dict).
#557) * test(phase-7b): add failing tests for disposition retrieve parity Tests verify batch retrieve dispositions match online ResultCollector rules (U-3.2a, U-3.5a): - SUCCESS records must get DISPOSITION_SUCCESS - FILTERED records must get DISPOSITION_FILTERED - EXHAUSTED must include input_snapshot - FAILED must include input_snapshot and detail - Prompt traces must only be written for SUCCESS records All 9 tests fail against current code, proving the parity gaps exist. * fix(phase-7b): batch retrieve disposition parity with online collector Closes U-3.2a and U-3.5a — batch retrieve dispositions now match online ResultCollector.collect_results() rules: - SUCCESS records get DISPOSITION_SUCCESS (previously only DEFERRED cleared) - FILTERED records get DISPOSITION_FILTERED via new _write_filtered_dispositions - EXHAUSTED dispositions include input_snapshot for forensics - FAILED dispositions include input_snapshot and detail - Prompt trace updates restricted to SUCCESS records only (tombstones excluded)
* test(phase9b): add failing tests for NullNamespace observe resolution (red) Tests assert that NullNamespace(reason='skipped') in field_context resolves observe/passthrough fields to None instead of crashing. Currently fails because _resolve_missing_field only recognizes None, not NullNamespace instances. * feat(phase9b): NullNamespace sentinel for skipped namespace observe Introduce NullNamespace(reason="skipped") sentinel that replaces plain None when DependencyNamespaceBuilder encounters a null/absent upstream namespace. This lets downstream code introspect WHY a namespace is null (skipped vs filtered — Phase 9c will add the filtered reason). Changes: - _resolve_missing_field() uses is_null_namespace() to handle both NullNamespace instances and legacy None - DependencyNamespaceBuilder.build() wraps None deps in NullNamespace - Guard evaluator ast_nodes.py recognizes NullNamespace via helper - Existing tests updated to use is_null_namespace() assertions * chore(phase9b): add changie entry and update manifest for NullNamespace * refactor(phase9b): add reason constants and singleton for NullNamespace Replace bare "skipped" string with REASON_SKIPPED constant and use pre-built SKIPPED_NAMESPACE singleton to avoid repeated allocation. * fix(phase9b): use is_null_namespace in template error hint detection prompt/service.py:442 used `v is None` to identify null namespaces for Jinja error enrichment. With NullNamespace sentinel, this check missed skipped namespaces, producing unhelpful error messages.
* test: add Phase 9c filter/skip/observe consistency tests (U-4.3) TDD contract class TestFilteredNamespaceObserve plus filter/skip symmetry and fan-in interaction tests. 11 new tests verify that filtered and skipped namespaces produce identical null-safe observe behavior — all fields resolve to None instead of crashing. The behavioral implementation was completed in earlier work (PR #538 fan-in guard-filter null namespace + Phase 9a PR #554). These tests formalize the contract and lock the invariant: filter = skip for observe/passthrough resolution. * fix: tighten Phase 9c test assertions and fix changie timestamp - Replace tautological test_filtered_and_skipped_resolve_identically (identical inputs proving f(x)==f(x)) with test_all_directives_null_safe that exercises observe+passthrough+drop on a filtered namespace - Replace tautological isinstance(dict) assertion in drop test with meaningful checks on prompt_context content - Fix invalid changie timestamp (09:c0 → 09:30) - Remove comments that restate assertions * test: remove redundant fan-in observe+passthrough test Already covered by test_all_directives_null_safe_on_filtered (observe+passthrough+drop) and test_three_way_fan_in_one_filtered (fan-in layout). 10 tests remain.
* refactor(phase8a): extract collect_results_from_processing_results shared helper
Extract the core collect logic from ResultCollector.collect_results() into a
standalone module-level function collect_results_from_processing_results().
ResultCollector.collect_results() now delegates to it.
The shared helper has a simpler API (action_name + optional storage_backend
and agent_config) designed for both online and batch retrieve callers.
This is a pure extraction — no behavior change, no callers rewired.
10 new tests verify per-status parity (SUCCESS, FAILED, EXHAUSTED, SKIPPED,
FILTERED, UNPROCESSED, DEFERRED), mixed batch scenario, no-backend mode,
and equivalence with ResultCollector.collect_results() output.
6966 passed, ruff clean.
* fix(phase8a): review fixes — guard precision, test gaps, cleanup
- Fix `if effective_config:` → `if agent_config is not None:` so an
explicit empty config {} still runs exhausted-raise check (was
falsely skipped due to empty-dict truthiness)
- Add 3 tests for exhausted-raise guard: raises with config, skips
without config (batch path), runs with empty config
- Simplify _mock_storage_backend() — MagicMock auto-creates attrs
- Remove WHAT comment on arithmetic assertion
* test(phase8b): add failing TDD tests for UnifiedProcessor retrieve parity Red-phase tests for Phase 8b: - BatchResultStrategy must satisfy ProcessingStrategy protocol (invoke method) - UnifiedProcessor.enrich_and_collect for batch retrieve - FAILED results with pre-existing data preserved by collector - Per-result processing_context used for enrichment - CollectionStats parity between online and batch paths 9 tests fail, 3 pass (testing existing behavior only). * feat(phase8b): BatchResultStrategy implements ProcessingStrategy Add invoke() and prepare_invoke() methods to BatchResultStrategy. invoke() satisfies ProcessingStrategy protocol via structural typing. prepare_invoke() pre-loads batch data (batch_results, context_map, etc.) before the processor calls invoke(). Batch data is consumed and cleared after invoke() returns to prevent stale state across calls. * refactor(phase8b): wire batch retrieve through UnifiedProcessor Route batch results through UnifiedProcessor.enrich_and_collect() — the shared enrich→collect pipeline that online already uses. Key changes: - UnifiedProcessor.enrich_and_collect(): new entry point for pre-computed results (batch retrieve, where guard filtering happened at submit time) - UnifiedProcessor._enrich(): uses per-result processing_context when set (batch results carry their own). Falls back to shared context (online). Batch error results (no processing_context) skip enrichment entirely. - Collector FAILED handler: preserves pre-existing result.data (batch error items, tombstones) instead of always building a fresh tombstone. Online FAILED results (data=[]) still build tombstones as before. - _convert_batch_results_to_workflow_format: now returns (output, stats) via UnifiedProcessor instead of inline enrich loop + manual flatten. - finalize_batch_output: disposition writing and state stamping now handled by the shared collector. Only DEFERRED clearing and prompt trace updates remain batch-specific. - Test mocks updated: _convert returns ([], None) instead of []. * refactor(phase8b): delete duplicate batch code paths Remove dead code after routing batch retrieve through UnifiedProcessor: - Remove _stamp_batch_records (collector stamps _state via result.status) - Remove _write_record_dispositions from BatchProcessingService (collector writes dispositions inline during collection) - Remove write_record_dispositions import from processing_recovery - Remove RecordEnvelope import from processing_recovery (only used by _stamp_batch_records) - Update integration test to use UnifiedProcessor.enrich_and_collect() instead of inline loop + _stamp_batch_records - Remove TestStampBatchRecords tests (function deleted) - Update test mocks: remove write_record_dispositions patches, _convert returns (output, stats) tuple * chore: lint/format fixes and changie entry for Phase 8b * refactor(phase8b): extract _try_clear_deferred, cache UnifiedProcessor - Extract shared _try_clear_deferred() helper from duplicate clear_disposition try/except blocks in _clear_deferred_dispositions and _write_filtered_dispositions - Cache UnifiedProcessor on BatchProcessingService.__init__ instead of creating a new instance per _convert_batch_results_to_workflow_format call
…#560) Phase 8c policy decision: preflight validation stays batch-only. Online processes one record at a time — template errors surface immediately on the first row, making preflight redundant overhead. Batch submits thousands of rows in one API call — a template error discovered mid-batch wastes the entire submission. Preflight samples up to 5 rows before submission to catch these early. No dead preflight code exists in the online/realtime path (verified via grep). No code to clean up, no feature flag needed.
…imestamps, prefilter warning) (#561) * fix(changie): repair 4 invalid timestamp minutes Four entries used out-of-range minute values (70, 80, 90) that fail Go's strict time.Parse. On the next release run changie would either crash or silently drop these entries from CHANGELOG.md, hiding the Phase 7a / 9a / 9b / 8b user-visible work. Map: 00:70:00 → 00:07:00 Enhancement or New Feature-20260517-007000.yaml 00:90:00 → 00:09:00 Enhancement or New Feature-20260517-009000.yaml 00:90:02 → 00:09:02 Under the Hood-20260517-009002.yaml 00:80:02 → 00:08:02 Under the Hood-20260518-008002.yaml Verified: all 4 now parse via datetime.fromisoformat. No changie entry of its own — this fix repairs the changie infrastructure; adding another entry would be circular if the parser were still broken. * fix(batch): write FILTERED dispositions on production retrieve path BatchProcessingService has two retrieve entry points that diverged after Phase 7b (#557) introduced _write_filtered_dispositions: process_batch_results (single-batch, mostly used in tests/CLI): ✓ calls _write_filtered_dispositions process_all_batch_results → _process_single_batch_file → _process_original_batch → _finalize_batch_output → processing_recovery.finalize_batch_output (PRODUCTION path, via BatchLifecycleManager): ✗ did NOT call _write_filtered_dispositions The reconciler strips FILTERED rows from processed_data before the shared collector runs, so the collector never sees them. Without an explicit call in finalize_batch_output, FILTERED records stayed at DISPOSITION_DEFERRED (stamped at submit by Phase 7a) and never transitioned to DISPOSITION_FILTERED — silently breaking Phase 7b parity for every real batch run. Fix: add service._write_filtered_dispositions(context_map, effective_action_name) alongside the existing _clear_deferred_dispositions and _update_prompt_trace_responses calls in finalize_batch_output. Also updated the adjacent comment that incorrectly claimed the collector handles all disposition writes. Regression tests (red→green confirmed): - test_finalize_writes_filtered_dispositions - test_finalize_uses_service_action_name_when_action_name_none (covers the action_name=None fallback to service._action_name — mirrors how _clear_deferred and _update_prompt_trace behave) * chore(prefilter): warn when fallback guard-context path runs Phase 5 (#551) introduced build_guard_context as the single authority for guard context construction, used by both batch TaskPreparer.prepare and online prefilter_by_guard. The prefilter still retained an eval_item-only fallback for backward compatibility with 54 test call sites that don't pass pipeline context kwargs. All current production callers — UnifiedProcessor._guard_filter and _guard_filter_file_mode (unified.py:129, :187) — pass the full context. The fallback is dormant in production, but a future caller that omits the kwargs would silently reintroduce the pre-Phase-5 online/batch guard divergence the unification was meant to eliminate. That's exactly the "leave the bad branch as a fallback" anti-pattern from lessons/engineering-patterns.md. This commit adds a single WARNING log line (once per call, not per record) when the fallback path runs with a guard configured. The warning names the agent and explains the divergence so operators can trace any regression in ops logs. Not changed in this PR: making the kwargs required, which would force migration of 54 tests. That's a follow-up cleanup once the warning has shown the fallback truly never fires in production. Regression tests: - test_fallback_path_emits_warning: 3 records → exactly 1 warning, naming the agent - test_pipeline_context_path_silent: production path emits nothing - test_no_guard_configured_silent: early-return path emits nothing
Replace ns.update() with dict spread to create a new dict per record. Prevents cross-record data contamination when multiple records share the same namespace dict object through shallow copies. Two regression tests verify isolation between records sharing a namespace dict, both across separate results and within the same result.
write_record_dispositions() previously called clear_disposition(DEFERRED) before set_disposition(terminal). A crash between the two left the record with no disposition — invisible to queries. Reorder: write the terminal disposition first, then clear DEFERRED. The UNIQUE(action_name, record_id, disposition) constraint allows both rows to coexist since disposition values differ. If the process crashes after the terminal write but before the DEFERRED clear, the terminal state is already committed and queryable. 6 regression tests verify the ordering invariant and crash-safety.
…565) When a FILE-granularity tool returned empty/None, a single ProcessingResult.failed(source_guid=None) was created for all N input records. The collector's `if result.source_guid:` guard skipped the disposition write, leaving all N records invisible in the database. Now creates one FAILED result per input record, each carrying the record's source_guid and a deep copy of the record as source_snapshot. The collector writes N DISPOSITION_FAILED rows instead of zero.
…ws (#566) * fix(batch): write FAILED dispositions when batch file processing throws When _process_single_batch_file raises a non-RuntimeError exception, the except/continue block in process_all_batch_results silently abandoned records — they stayed DEFERRED forever, invisible to retry and rerun. Now loads the context_map for the failed batch file, iterates INCLUDED records, clears DEFERRED, and writes DISPOSITION_FAILED with the exception message as reason. Records in other batch files are unaffected. 7 regression tests covering: INCLUDED vs SKIPPED filtering, reason propagation, multi-file isolation, no-backend safety, context_map load failure, RuntimeError propagation, and missing source_guid. * refactor: reuse _safe_set_disposition, reduce test boilerplate - Replace inline try/except around set_disposition with existing _safe_set_disposition from result_collector (code reuse) - Extract _setup_single_file_failure and _run_with_failure test helpers to eliminate repeated manager/registry setup boilerplate - Use consistent _failed_disposition_record_ids helper for assertion filtering across all tests - Remove unused side_effect parameter (B008 lint fix)
Actions with partial failures (COMPLETED_WITH_FAILURES) were preserved on rerun, causing the entire action to be skipped. Failed records were never reprocessed unless the user explicitly ran the retry CLI command. Add COMPLETED_WITH_FAILURES to the retryable set in reset_retryable() so plain reruns reset these actions to PENDING and reprocess them. Extract the inline retryable set to a module-level RETRYABLE_STATUSES constant, consistent with COMPLETED_STATUSES and TERMINAL_STATUSES.
…#567) prefilter_by_guard() dropped filtered records entirely — they appeared in neither the passing nor skipped return lists. The callers in _guard_filter() and _guard_filter_file_mode() could only count them by subtraction and created ProcessingResult.filtered(source_guid=None). result_collector skips disposition writes when source_guid is None, so filtered records vanished from the DB. Fix: prefilter_by_guard() now returns filtered records as a 4th return value. Both callers iterate them to create ProcessingResult.filtered() with each record's actual source_guid, enabling proper disposition writes.
The _warn_orphaned_deferred() query reported all records with DEFERRED disposition as orphans, even when those records had already received a terminal disposition (SUCCESS, FAILED, etc.) — the stale DEFERRED row just hadn't been cleared yet. Changed to a two-step approach: first query DEFERRED records (indexed, fast — returns 0 in the happy path). Only if deferred records exist, query terminal dispositions to filter out false positives. Genuine orphans = deferred IDs with no terminal sibling. Observed: product_listing_enrichment had 71 false orphan warnings that are now suppressed.
* fix: stale recovery state from crashed run poisons reprompt on rerun The original batch path loaded recovery_state from disk and passed it to check_and_submit_reprompt. If a previous run crashed after writing state, the stale reprompt_attempt counter caused the next fresh run to skip reprompt (believing attempts were exhausted). Two fixes: 1. _process_original_batch no longer loads recovery state — in the original batch path, any existing file is stale by definition. 2. _finalize_batch_output now deletes any leftover recovery state file before finalizing, matching the cleanup already done in the recovery path's _finalize_and_cleanup. * simplify: remove redundant test_stale_state_does_not_skip_reprompt test_process_original_batch_ignores_stale_recovery_state already verifies that recovery_state=None is passed and RecoveryStateManager.load is not called. The removed test duplicated this assertion with a fragile side_effect capture pattern.
* fix: CLI batch path delegates to production retry/reprompt logic process_batch_results previously called retrieve_and_reconcile directly with no retry loop. Missing records were permanently tombstoned as BATCH_NOT_RETURNED instead of being retried. Now delegates to _process_single_batch_file, which has the full retry/reprompt machinery. If a recovery batch is submitted, raises ProcessingError to signal the caller to re-invoke after recovery completes. * simplify: remove dead params and redundant warning from process_batch_results - Remove unused base_directory and file_path parameters (output path is now determined by _determine_output_path via the production path) - Remove redundant logger.warning before ProcessingError raise - Improve docstring to describe behavior, not implementation
* fix: isolate enrichment failures per-record and enrich batch errors F9: Wrap per-record enrichment in try/except so a single enrichment failure produces a FAILED ProcessingResult instead of aborting the entire action. Previously, an exception in enrichment for record N killed all records 0 through end. F6: Remove the RunMode.ONLINE guard in _enrich() so batch error results (FAILED/EXHAUSTED without processing_context) go through the enrichment pipeline. Previously they were appended as-is, producing tombstones missing lineage, version IDs, and passthrough fields that online tombstones have. * refactor: hoist test imports to module level Move Enricher, EnrichmentPipeline, RunMode, and replace imports from inline test-method scope to module-level imports. Reduces duplication across 6 test methods.
) * fix: empty LLM response records no longer vanish from output (F10) When on_empty=warn (the default), an empty LLM response fell through to _transform_response producing data=[], which result_collector wrote as DISPOSITION_SUCCESS with zero output records — the record vanished. Now: on_empty=warn returns ProcessingResult.failed() so the record gets DISPOSITION_FAILED and is visible in error reporting. on_empty=skip returns ProcessingResult.skipped() with a tombstone so the record appears in output with a skip reason. * refactor: use EMPTY_OUTPUT constant, remove redundant warning log - Add EMPTY_OUTPUT to record/reasons.py (bare string violated module contract) - Remove logger.warning for on_empty=warn (RecordEmptyOutputEvent already fires) - Fix misleading "silently skip" comment — tombstone IS visible in output
clear_prompt_traces() existed on the storage backend but was never called from production code, causing unbounded growth of the prompt_trace table across runs. Add calls in both _clear_for_fresh_run and _reset_retryable_actions alongside the existing delete_target and clear_disposition calls.
The has_pipeline_context check in prefilter_by_guard silently fell back to context=eval_item when no pipeline kwargs were passed. All production callers already pass full context, making this a dormant anti-pattern that could reintroduce online/batch guard divergence if a future caller omitted the kwargs. Now prefilter_by_guard always calls build_guard_context unconditionally. Deleted 8 tests that asserted the old fallback behavior (context==eval_item, fallback warning). Updated the paired_execution conftest fixture to handle the richer context shape from build_guard_context.
…og (#576) Replaces 30 individual changie YAML files (one per PR) with a single consolidated entry covering the entire batch/online unification initiative (Phases 0–9) and bug fix wave (F1–F15). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: consolidate 30 changie entries into single unification changelog Replaces 30 individual changie YAML files (one per PR) with a single consolidated entry covering the entire batch/online unification initiative (Phases 0–9) and bug fix wave (F1–F15). * fix: tighten prefilter guard tests and log from PR #567 review - Assert len(filtered) in test_all_filtered and test_filter_removes_failing_records (would have caught the original pre-fix regression) - Log line uses len(filtered) instead of subtraction - Record-mode disposition test asserts args[2] + source_guid structurally (was string-matching with `in str(c)`) - Negative test fixes arg index: disposition is args[2], not args[3]
…#579) - batch.py:206 — guard None rid before adding to set[str] - submission.py:295 — early return when storage_backend is None
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two schedulers, one framework. Batch and online now share the same guard context, collect path, disposition lifecycle, tombstone contract, and observe/null semantics.
31 PRs merged to the integration branch over 3 days:
What changed
Architecture:
guard_context.py— single source of truth for guard context construction, eliminating "works online, breaks in batch" divergenceUnifiedProcessor.enrich_and_collect()— batch retrieve flows through the shared enrichment/collection pipelineNullNamespacesentinel — skipped/filtered namespaces resolve to None in observe, not crashPhases:
Bug fixes (discovered during smoke validation):
Smoke validation
Test plan
tests/unit/unification/)ruff check+ruff format --checkclean