Skip to content

Batch/online unification — Phases 0–9 + bug fixes F1–F15#578

Merged
Muizzkolapo merged 32 commits into
mainfrom
integration/batch-online-unification
May 19, 2026
Merged

Batch/online unification — Phases 0–9 + bug fixes F1–F15#578
Muizzkolapo merged 32 commits into
mainfrom
integration/batch-online-unification

Conversation

@Muizzkolapo
Copy link
Copy Markdown
Owner

@Muizzkolapo Muizzkolapo commented May 19, 2026

Summary

Two schedulers, one framework. Batch and online now share the same guard context, collect path, disposition lifecycle, tombstone contract, and observe/null semantics.

31 PRs merged to the integration branch over 3 days:

What changed

Architecture:

  • guard_context.py — single source of truth for guard context construction, eliminating "works online, breaks in batch" divergence
  • UnifiedProcessor.enrich_and_collect() — batch retrieve flows through the shared enrichment/collection pipeline
  • NullNamespace sentinel — skipped/filtered namespaces resolve to None in observe, not crash
  • Disposition write ordering — terminal disposition committed before DEFERRED clear (crash-safe)

Phases:

  • Phase 0: TDD baseline tests and shared fixtures
  • Phase 1: Observability — prep-fail warnings, prefilter length guard, schema compile cache
  • Phase 2: Batch prep integrity — INCLUDED only after prepare() succeeds
  • Phase 3: Batch retrieve correctness — LLM output wins merge collisions, FAILED records reconciled
  • Phase 4: Tombstone parity — shared build_tombstone() across online and batch
  • Phase 5: Guard context alignment — prefilter and prepare see identical bus-shaped context
  • Phase 6: Reconcile authority — single CASCADE source via get_skip_reason()
  • Phase 7: Disposition lifecycle — DEFERRED stamped on submit, parity with online collector on retrieve
  • Phase 8: Collector convergence — shared collect helper, batch retrieve through UnifiedProcessor
  • Phase 9: Observe/null semantics — guards and observe share null-field contract, NullNamespace

Bug fixes (discovered during smoke validation):

  • F1: Guard-filtered records now get DISPOSITION_FILTERED in the DB
  • F2: Rerun retries COMPLETED_WITH_FAILURES actions
  • F3: Orphan DEFERRED warning only fires for genuine orphans
  • F4: Batch file exceptions write FAILED dispositions instead of abandoning records
  • F5: Prompt traces cleared on --fresh and rerun reset
  • F6+F9: Enrichment failures isolated per-record; batch errors enriched same as online
  • F8: Prefilter eval_item fallback removed
  • F10: Empty LLM responses produce explicit failure instead of silent SUCCESS
  • F11: PassthroughEnricher copies namespace dict before mutation
  • F12: FILE tool empty response writes per-record FAILED dispositions
  • F13: Stale recovery state cleaned up on successful completion
  • F14: Terminal disposition written before DEFERRED clear (crash-safe)
  • F15: CLI batch path now uses production retry/reprompt logic

Smoke validation

Workflow Mode Records Result
product_listing_enrichment batch 74 DEFERRED→SUCCESS lifecycle correct, guard skip writes disposition
review_analyzer online 72 13 guard-filtered records get FILTERED disposition, 3 parallel scorers work
book_catalog_enrichment online 3 Upstream failure cascade + tombstone propagation correct

Test plan

  • 135 unification tests pass (tests/unit/unification/)
  • Full test suite passes (6974+ tests)
  • ruff check + ruff format --check clean
  • Three production workflows validated (batch + online)
  • Merge and monitor: no "works online, breaks in batch" regressions

Muizzkolapo and others added 30 commits May 17, 2026 21:20
* feat(vscode): add zoom and pan to Workflow DAG view (#545)

Large workflows render too small to read because the mermaid SVG was
constrained to max-width: 100% with no way to zoom in. Add interactive
pan/zoom: mouse wheel to zoom, click-drag to pan, and control buttons
(zoom in, zoom out, fit-to-view, reset). Auto-fits on first render.

* test: add failing Phase 1 observability tests (red)

TDD Commit A — these tests prove three observability gaps exist:

- U-2.G: _mark_prep_failed silently returns when target_id is missing
- U-2.H: passthrough_on_error swallows exceptions without mentioning it
- U-2.E: prefilter_by_guard accepts mismatched original_data length

All 3 must fail before Commit B (green) adds the fixes.

* fix: add Phase 1 observability guards and warnings (green)

TDD Commit B — all Phase 1 tests now pass:

- U-2.G: _mark_prep_failed logs WARNING when target_id is missing,
  making prep failures on anonymous records visible to operators
- U-2.H: passthrough_on_error exception handler logs WARNING with
  clause and error context (behavior unchanged — still passes)
- U-2.E: prefilter_by_guard raises RuntimeError when original_data
  length mismatches data length (prevents silent record loss)
- U-2.I: FILE vs RECORD merge-order divergence documented with
  rationale (sequential accumulation vs independent processing)
- U-2.J: Schema compilation already per-action — no change needed

No behavioral change except U-2.E (converts silent data loss into
a loud RuntimeError).

* chore: add changie entry for Phase 1 observability

* refactor: simplify Phase 1 observability after review

- evaluator.py: eliminate double WARNING on passthrough path — move
  generic warning to non-passthrough branch so each exception produces
  exactly one log line
- test file: use _PREPARATOR_LOGGER/_EVALUATOR_LOGGER constants instead
  of hardcoded strings, tighten assertion from >= 1 to == 1, update
  module docstring (remove stale commit A/B narrative)
* feat(vscode): add zoom and pan to Workflow DAG view (#545)

Large workflows render too small to read because the mermaid SVG was
constrained to max-width: 100% with no way to zoom in. Add interactive
pan/zoom: mouse wheel to zoom, click-drag to pan, and control buttons
(zoom in, zoom out, fit-to-view, reset). Auto-fits on first render.

* test(phase4): add failing tests for tombstone marker field parity

build_tombstone() and build_exhausted_tombstone() do not produce
_tombstone or _tombstone_reason fields. Downstream code must guess
tombstone identity from _state values. These tests assert that all
tombstone types carry consistent marker fields.

All 13 tests fail — proving the gap exists.

* feat(phase4): add _tombstone and _tombstone_reason to tombstone builders

build_tombstone() and build_exhausted_tombstone() now set
_tombstone=True and _tombstone_reason=<reason> on every tombstone
record. This gives downstream code (result collector, preview CLI,
FILE tool) a reliable structural marker instead of inferring
tombstone identity from _state values.

Both online and batch paths produce identical marker fields since
they share the same builder functions.

* fix(phase4): add tombstone markers to sibling builders, fix test gaps

Pattern completeness (Quality Gate 5): ExhaustedRecordBuilder and
PassthroughItemBuilder both construct tombstone-like records with
agent_type="tombstone" but lacked _tombstone/_tombstone_reason
markers. Fixed both.

Test fixes:
- Remove tautological set equality assertion
- Add extra_metadata clobber test for build_exhausted_tombstone
- Add sibling builder parity tests for both builders
* test: add TDD baseline test infrastructure for batch/online unification (Phase 0)

Create tests/unit/unification/ with conftest.py and Phase 0 baseline tests
verifying U-0.1 through U-0.4 are implemented:

- U-0.1: Batch preflight calls prepare with skip_guard=False
- U-0.2: Online process_record defaults skip_guard=False
- U-0.3: Prefilter passes context=eval_item to guard evaluator
- U-0.4: Both paths use shared _build_pipeline_context

Includes paired_execution fixture for Phase 5+ parity tests.
No source changes — tests only.

* refactor: simplify test fixtures and fix misleading test name

- Move imports to module level in conftest (no circular import risk)
- Remove redundant inline comments in paired_execution fixture
- Rename test_batch_path_calls_build_pipeline_context to accurately
  reflect what it verifies (type compatibility, not call tracing)

* fix: address PR review findings — eliminate vacuous assertions and dead code

- Replace vacuous assertion (result.skip_reason is not None always True
  via MagicMock) with precise checks: assert build_tombstone called and
  result.executed is False
- Add assertion to guard-skipped-rows test (was no-assertion, would pass
  if method were a no-op)
- Remove dead get_task_preparer patch in paired_execution (mock_preparer
  is passed directly to _process_single_item, patch was never exercised)
- Fix paired_execution docstring to accurately describe what it tests
  (mock plumbing, not real evaluator parity)
* test(unification): add Phase 2 batch prep integrity tests

U-2.B red test: _process_single_item sets INCLUDED before prepare()
runs, so when prepare() throws, context_map is left with a lie.

U-1.3a/U-2.C regression tests: confirm Phase 1 fixes hold —
_mark_prep_failed uses transition() API, no raw _state bypass.

test_prepare_failure_not_marked_included: FAILS (bug exists)
All other tests: PASS (regression)

* fix(batch): move INCLUDED status after prepare() succeeds (U-2.B)

context_map previously marked records as INCLUDED before prepare()
ran. If prepare() threw, the record was left with INCLUDED status
at the _process_single_item level — a lie that downstream phases
(3, 6, 7) would trust.

Now INCLUDED is set only after prepare() succeeds and guard checks
pass. If prepare() throws, the entry exists in context_map (so
_mark_prep_failed can find it) but has no filter status, which
_mark_prep_failed correctly sets to FAILED.

* refactor(tests): deduplicate _enable_log_propagation fixture + extract prep_harness

Move _enable_log_propagation to shared conftest (was duplicated in
Phase 1 and Phase 2 test files). Extract repeated test setup in
TestContextMapIntegrity into a prep_harness fixture and _run helper.

* test(unification): cover all guard paths + strengthen transition assertion

Address PR review findings:
- Add FILTERED and UPSTREAM_UNPROCESSED guard path tests (was WARNING)
- Assert stats side effects (skipped_items, filtered_items) in guard tests
- Strengthen transition API test: assert on history entry content
  (action, to, reason) instead of just _state_history existence
* test(phase5): add failing TDD tests for guard context parity

Proves the context divergence bug: prefilter_by_guard passes
context=eval_item (record content only), while TaskPreparer.prepare
builds full field_context with source, version, workflow namespaces
and promoted output_fields.

Guards referencing source.name, version.i, or promoted output_fields
produce different decisions in online vs batch — 5 tests fail.

* refactor(phase5): extract shared build_guard_context helper

Extract guard context construction into processing/guard_context.py
as the single source of truth for both batch and online paths.

TaskPreparer._load_full_context now delegates to build_guard_context,
ensuring the output_field promotion and source resolution logic lives
in one place. This prevents the two paths from diverging.

Update test that patched the old logger location.

* feat(phase5): wire prefilter to shared guard context builder

prefilter_by_guard now accepts optional pipeline context parameters
(agent_indices, source_data, version_context, workflow_metadata,
dependency_configs). When provided, it builds full field_context via
build_guard_context — identical to what TaskPreparer.prepare() uses.

Both _guard_filter and _guard_filter_file_mode in unified.py pass
the ProcessingContext fields through, so online guards now see the
same source, version, workflow namespaces and promoted output_fields
that batch guards see.

When no pipeline context is provided, prefilter falls back to
eval_item (backward-compatible for callers outside unified.py).

All 9 Phase 5 parity tests pass. 6847 existing tests pass.

* chore: add changie entry for Phase 5 guard context alignment
* test(phase3): add failing tests for batch retrieve correctness

U-2.A: passthrough key collides with LLM field — assert LLM value kept
U-2.D: FAILED records in context_map — assert they appear in output

All 5 tests FAIL against current code, proving both bugs exist:
- Merge order: passthrough overwrites LLM output on key collision
- FAILED records: silently dropped during reconciliation

* fix(phase3): batch retrieve correctness — merge order and FAILED reconciliation

U-2.A: Reverse passthrough merge order so LLM output wins on key collision.
Previously `{**item, **stored_passthrough}` let passthrough overwrite LLM
output. Now `{**stored_passthrough, **item}` ensures LLM values take
precedence. Collisions on user-content fields are logged at INFO level;
framework fields (target_id, _state, etc.) are silently excluded.

U-2.D: Include FilterStatus.FAILED records in passthrough reconciliation.
Previously reconciler.get_passthrough_records() only emitted SKIPPED,
INCLUDED, and None — FAILED records silently vanished. Now they flow
through a dedicated _build_failed_passthrough() that emits a
ProcessingResult.failed() with the prep failure reason, ensuring all
records appear in output.
…rity (#552)

* fix: remove _is_cascade_blocked fallback — single cascade authority (Phase 6)

Remove duplicate CASCADE check in batch reconciliation. Prep is now the
single authority for cascade decisions via get_skip_reason().

Changes:
- Delete _is_cascade_blocked() function (zero remaining callers)
- Remove _is_cascade_blocked fallback from _build_unprocessed_passthrough
- Remove CASCADE_BLOCKING_VALUES and UPSTREAM_UNPROCESSED imports
- Update 2 existing tests that relied on the old fallback path
- Add 5 Phase 6 regression tests verifying single-authority behavior

Without skip_reason from prep, cascade-blocked records now get
BATCH_NOT_RETURNED (not UPSTREAM_UNPROCESSED from re-derived _state).

* refactor: use UPSTREAM_UNPROCESSED constant and clarify docstring

- Replace raw string "upstream_unprocessed" with UPSTREAM_UNPROCESSED
  constant in test_upstream_unprocessed_filter.py
- Remove phase reference from docstring — describe the invariant, not
  the implementation history
* feat(vscode): add zoom and pan to Workflow DAG view (#545)

Large workflows render too small to read because the mermaid SVG was
constrained to max-width: 100% with no way to zoom in. Add interactive
pan/zoom: mouse wheel to zoom, click-drag to pan, and control buttons
(zoom in, zoom out, fit-to-view, reset). Auto-fits on first render.

* test(phase7a): add failing tests for DEFERRED disposition on batch submit

U-3.3: After batch submission, DISPOSITION_DEFERRED must be stamped
for every record sent to the provider. Currently no dispositions are
written at submit time — only on retrieve.

All 3 tests fail — proving the gap exists.

* feat(phase7a): stamp DISPOSITION_DEFERRED on batch submit

U-3.3: After successful batch submission, every INCLUDED record now
gets DISPOSITION_DEFERRED written to the storage backend. Records
that were filtered/skipped during preparation do not receive DEFERRED.

Implementation:
- Add storage_backend parameter to BatchSubmissionService
- _stamp_deferred() iterates context_map for INCLUDED records
- Both callers (initial_pipeline, workflow/pipeline) pass backend
- Reason includes batch_id for traceability

This enables operators to query the disposition table for in-flight
batch records and lets Phase 7b clear DEFERRED on retrieve.

* fix(phase7a): use _safe_set_disposition, remove unused test param

_stamp_deferred now uses _safe_set_disposition (matching all other
callers) so a storage failure won't crash batch submission. Removed
unused prepare_result parameter from test helper.

* test(phase7a): add negative test for skipped/filtered records

Verify that SKIPPED, FILTERED, and FAILED records do NOT receive
DISPOSITION_DEFERRED — only INCLUDED records are stamped. Tests
_stamp_deferred directly with a hand-crafted context_map containing
all four FilterStatus values.
…554)

* test: add failing Phase 9a observe null-safety tests (red)

TDD Commit A — these tests prove the observe/guard null asymmetry:

- observe raises RecordContextError for missing field from present namespace
- guards resolve the same field to None (correct behavior)

3 tests fail, 2 pass (existing null-namespace and undeclared-namespace behaviors).

* fix: observe resolves missing fields to None matching guard semantics (U-4.2)

_resolve_missing_field now has three cases:
1. Namespace is None (guard-skipped/filtered) → return None (unchanged)
2. Namespace exists as dict but field missing → return None with WARNING
3. Namespace not in context at all → raise RecordContextError (unchanged)

This eliminates the asymmetry where guards return None for missing fields
but observe throws RecordContextError. Guards and observe now share the
same null-field contract. Preflight still catches config typos (case 3).

Drop+observe on the same field now resolves to None (not the original
value) — the security invariant is preserved: dropped data is not leaked.

10 existing tests updated to match new behavior. 5 new Phase 9a tests.
6870 passed, ruff clean.

* chore: add changie entry for Phase 9a observe null safety

* test: tighten FILE mode missing-field enrichment assertion

The test claimed to verify missing-field-with-None enrichment but only
checked a present field survived. Now also asserts: second present field
preserved, and missing field not injected as flat key (flat key injection
only covers fields present in the namespace dict).
#557)

* test(phase-7b): add failing tests for disposition retrieve parity

Tests verify batch retrieve dispositions match online ResultCollector
rules (U-3.2a, U-3.5a):
- SUCCESS records must get DISPOSITION_SUCCESS
- FILTERED records must get DISPOSITION_FILTERED
- EXHAUSTED must include input_snapshot
- FAILED must include input_snapshot and detail
- Prompt traces must only be written for SUCCESS records

All 9 tests fail against current code, proving the parity gaps exist.

* fix(phase-7b): batch retrieve disposition parity with online collector

Closes U-3.2a and U-3.5a — batch retrieve dispositions now match
online ResultCollector.collect_results() rules:

- SUCCESS records get DISPOSITION_SUCCESS (previously only DEFERRED cleared)
- FILTERED records get DISPOSITION_FILTERED via new _write_filtered_dispositions
- EXHAUSTED dispositions include input_snapshot for forensics
- FAILED dispositions include input_snapshot and detail
- Prompt trace updates restricted to SUCCESS records only (tombstones excluded)
* test(phase9b): add failing tests for NullNamespace observe resolution (red)

Tests assert that NullNamespace(reason='skipped') in field_context
resolves observe/passthrough fields to None instead of crashing.
Currently fails because _resolve_missing_field only recognizes None,
not NullNamespace instances.

* feat(phase9b): NullNamespace sentinel for skipped namespace observe

Introduce NullNamespace(reason="skipped") sentinel that replaces plain
None when DependencyNamespaceBuilder encounters a null/absent upstream
namespace.  This lets downstream code introspect WHY a namespace is
null (skipped vs filtered — Phase 9c will add the filtered reason).

Changes:
- _resolve_missing_field() uses is_null_namespace() to handle both
  NullNamespace instances and legacy None
- DependencyNamespaceBuilder.build() wraps None deps in NullNamespace
- Guard evaluator ast_nodes.py recognizes NullNamespace via helper
- Existing tests updated to use is_null_namespace() assertions

* chore(phase9b): add changie entry and update manifest for NullNamespace

* refactor(phase9b): add reason constants and singleton for NullNamespace

Replace bare "skipped" string with REASON_SKIPPED constant and use
pre-built SKIPPED_NAMESPACE singleton to avoid repeated allocation.

* fix(phase9b): use is_null_namespace in template error hint detection

prompt/service.py:442 used `v is None` to identify null namespaces
for Jinja error enrichment. With NullNamespace sentinel, this check
missed skipped namespaces, producing unhelpful error messages.
* test: add Phase 9c filter/skip/observe consistency tests (U-4.3)

TDD contract class TestFilteredNamespaceObserve plus filter/skip
symmetry and fan-in interaction tests.  11 new tests verify that
filtered and skipped namespaces produce identical null-safe observe
behavior — all fields resolve to None instead of crashing.

The behavioral implementation was completed in earlier work (PR #538
fan-in guard-filter null namespace + Phase 9a PR #554).  These tests
formalize the contract and lock the invariant: filter = skip for
observe/passthrough resolution.

* fix: tighten Phase 9c test assertions and fix changie timestamp

- Replace tautological test_filtered_and_skipped_resolve_identically
  (identical inputs proving f(x)==f(x)) with test_all_directives_null_safe
  that exercises observe+passthrough+drop on a filtered namespace
- Replace tautological isinstance(dict) assertion in drop test with
  meaningful checks on prompt_context content
- Fix invalid changie timestamp (09:c0 → 09:30)
- Remove comments that restate assertions

* test: remove redundant fan-in observe+passthrough test

Already covered by test_all_directives_null_safe_on_filtered
(observe+passthrough+drop) and test_three_way_fan_in_one_filtered
(fan-in layout). 10 tests remain.
* refactor(phase8a): extract collect_results_from_processing_results shared helper

Extract the core collect logic from ResultCollector.collect_results() into a
standalone module-level function collect_results_from_processing_results().
ResultCollector.collect_results() now delegates to it.

The shared helper has a simpler API (action_name + optional storage_backend
and agent_config) designed for both online and batch retrieve callers.
This is a pure extraction — no behavior change, no callers rewired.

10 new tests verify per-status parity (SUCCESS, FAILED, EXHAUSTED, SKIPPED,
FILTERED, UNPROCESSED, DEFERRED), mixed batch scenario, no-backend mode,
and equivalence with ResultCollector.collect_results() output.

6966 passed, ruff clean.

* fix(phase8a): review fixes — guard precision, test gaps, cleanup

- Fix `if effective_config:` → `if agent_config is not None:` so an
  explicit empty config {} still runs exhausted-raise check (was
  falsely skipped due to empty-dict truthiness)
- Add 3 tests for exhausted-raise guard: raises with config, skips
  without config (batch path), runs with empty config
- Simplify _mock_storage_backend() — MagicMock auto-creates attrs
- Remove WHAT comment on arithmetic assertion
* test(phase8b): add failing TDD tests for UnifiedProcessor retrieve parity

Red-phase tests for Phase 8b:
- BatchResultStrategy must satisfy ProcessingStrategy protocol (invoke method)
- UnifiedProcessor.enrich_and_collect for batch retrieve
- FAILED results with pre-existing data preserved by collector
- Per-result processing_context used for enrichment
- CollectionStats parity between online and batch paths

9 tests fail, 3 pass (testing existing behavior only).

* feat(phase8b): BatchResultStrategy implements ProcessingStrategy

Add invoke() and prepare_invoke() methods to BatchResultStrategy.
invoke() satisfies ProcessingStrategy protocol via structural typing.
prepare_invoke() pre-loads batch data (batch_results, context_map, etc.)
before the processor calls invoke().

Batch data is consumed and cleared after invoke() returns to prevent
stale state across calls.

* refactor(phase8b): wire batch retrieve through UnifiedProcessor

Route batch results through UnifiedProcessor.enrich_and_collect() —
the shared enrich→collect pipeline that online already uses.

Key changes:
- UnifiedProcessor.enrich_and_collect(): new entry point for pre-computed
  results (batch retrieve, where guard filtering happened at submit time)
- UnifiedProcessor._enrich(): uses per-result processing_context when set
  (batch results carry their own). Falls back to shared context (online).
  Batch error results (no processing_context) skip enrichment entirely.
- Collector FAILED handler: preserves pre-existing result.data (batch error
  items, tombstones) instead of always building a fresh tombstone. Online
  FAILED results (data=[]) still build tombstones as before.
- _convert_batch_results_to_workflow_format: now returns (output, stats)
  via UnifiedProcessor instead of inline enrich loop + manual flatten.
- finalize_batch_output: disposition writing and state stamping now handled
  by the shared collector. Only DEFERRED clearing and prompt trace updates
  remain batch-specific.
- Test mocks updated: _convert returns ([], None) instead of [].

* refactor(phase8b): delete duplicate batch code paths

Remove dead code after routing batch retrieve through UnifiedProcessor:

- Remove _stamp_batch_records (collector stamps _state via result.status)
- Remove _write_record_dispositions from BatchProcessingService
  (collector writes dispositions inline during collection)
- Remove write_record_dispositions import from processing_recovery
- Remove RecordEnvelope import from processing_recovery (only used by
  _stamp_batch_records)
- Update integration test to use UnifiedProcessor.enrich_and_collect()
  instead of inline loop + _stamp_batch_records
- Remove TestStampBatchRecords tests (function deleted)
- Update test mocks: remove write_record_dispositions patches,
  _convert returns (output, stats) tuple

* chore: lint/format fixes and changie entry for Phase 8b

* refactor(phase8b): extract _try_clear_deferred, cache UnifiedProcessor

- Extract shared _try_clear_deferred() helper from duplicate
  clear_disposition try/except blocks in _clear_deferred_dispositions
  and _write_filtered_dispositions
- Cache UnifiedProcessor on BatchProcessingService.__init__ instead
  of creating a new instance per _convert_batch_results_to_workflow_format
  call
…#560)

Phase 8c policy decision: preflight validation stays batch-only.

Online processes one record at a time — template errors surface
immediately on the first row, making preflight redundant overhead.
Batch submits thousands of rows in one API call — a template error
discovered mid-batch wastes the entire submission. Preflight samples
up to 5 rows before submission to catch these early.

No dead preflight code exists in the online/realtime path (verified
via grep). No code to clean up, no feature flag needed.
…imestamps, prefilter warning) (#561)

* fix(changie): repair 4 invalid timestamp minutes

Four entries used out-of-range minute values (70, 80, 90) that fail
Go's strict time.Parse. On the next release run changie would either
crash or silently drop these entries from CHANGELOG.md, hiding the
Phase 7a / 9a / 9b / 8b user-visible work.

Map:
  00:70:00 → 00:07:00  Enhancement or New Feature-20260517-007000.yaml
  00:90:00 → 00:09:00  Enhancement or New Feature-20260517-009000.yaml
  00:90:02 → 00:09:02  Under the Hood-20260517-009002.yaml
  00:80:02 → 00:08:02  Under the Hood-20260518-008002.yaml

Verified: all 4 now parse via datetime.fromisoformat.

No changie entry of its own — this fix repairs the changie infrastructure;
adding another entry would be circular if the parser were still broken.

* fix(batch): write FILTERED dispositions on production retrieve path

BatchProcessingService has two retrieve entry points that diverged after
Phase 7b (#557) introduced _write_filtered_dispositions:

  process_batch_results (single-batch, mostly used in tests/CLI):
    ✓ calls _write_filtered_dispositions

  process_all_batch_results → _process_single_batch_file →
  _process_original_batch → _finalize_batch_output →
  processing_recovery.finalize_batch_output (PRODUCTION path, via
  BatchLifecycleManager):
    ✗ did NOT call _write_filtered_dispositions

The reconciler strips FILTERED rows from processed_data before the
shared collector runs, so the collector never sees them. Without an
explicit call in finalize_batch_output, FILTERED records stayed at
DISPOSITION_DEFERRED (stamped at submit by Phase 7a) and never
transitioned to DISPOSITION_FILTERED — silently breaking Phase 7b
parity for every real batch run.

Fix: add service._write_filtered_dispositions(context_map, effective_action_name)
alongside the existing _clear_deferred_dispositions and
_update_prompt_trace_responses calls in finalize_batch_output. Also
updated the adjacent comment that incorrectly claimed the collector
handles all disposition writes.

Regression tests (red→green confirmed):
  - test_finalize_writes_filtered_dispositions
  - test_finalize_uses_service_action_name_when_action_name_none
    (covers the action_name=None fallback to service._action_name —
    mirrors how _clear_deferred and _update_prompt_trace behave)

* chore(prefilter): warn when fallback guard-context path runs

Phase 5 (#551) introduced build_guard_context as the single authority
for guard context construction, used by both batch TaskPreparer.prepare
and online prefilter_by_guard. The prefilter still retained an
eval_item-only fallback for backward compatibility with 54 test call
sites that don't pass pipeline context kwargs.

All current production callers — UnifiedProcessor._guard_filter and
_guard_filter_file_mode (unified.py:129, :187) — pass the full context.
The fallback is dormant in production, but a future caller that omits
the kwargs would silently reintroduce the pre-Phase-5 online/batch
guard divergence the unification was meant to eliminate. That's exactly
the "leave the bad branch as a fallback" anti-pattern from
lessons/engineering-patterns.md.

This commit adds a single WARNING log line (once per call, not per
record) when the fallback path runs with a guard configured. The
warning names the agent and explains the divergence so operators can
trace any regression in ops logs.

Not changed in this PR: making the kwargs required, which would force
migration of 54 tests. That's a follow-up cleanup once the warning has
shown the fallback truly never fires in production.

Regression tests:
  - test_fallback_path_emits_warning: 3 records → exactly 1 warning,
    naming the agent
  - test_pipeline_context_path_silent: production path emits nothing
  - test_no_guard_configured_silent: early-return path emits nothing
Replace ns.update() with dict spread to create a new dict per record.
Prevents cross-record data contamination when multiple records share
the same namespace dict object through shallow copies.

Two regression tests verify isolation between records sharing a
namespace dict, both across separate results and within the same result.
write_record_dispositions() previously called clear_disposition(DEFERRED)
before set_disposition(terminal). A crash between the two left the record
with no disposition — invisible to queries.

Reorder: write the terminal disposition first, then clear DEFERRED. The
UNIQUE(action_name, record_id, disposition) constraint allows both rows
to coexist since disposition values differ. If the process crashes after
the terminal write but before the DEFERRED clear, the terminal state is
already committed and queryable.

6 regression tests verify the ordering invariant and crash-safety.
…565)

When a FILE-granularity tool returned empty/None, a single
ProcessingResult.failed(source_guid=None) was created for all N
input records. The collector's `if result.source_guid:` guard
skipped the disposition write, leaving all N records invisible
in the database.

Now creates one FAILED result per input record, each carrying
the record's source_guid and a deep copy of the record as
source_snapshot. The collector writes N DISPOSITION_FAILED rows
instead of zero.
…ws (#566)

* fix(batch): write FAILED dispositions when batch file processing throws

When _process_single_batch_file raises a non-RuntimeError exception,
the except/continue block in process_all_batch_results silently
abandoned records — they stayed DEFERRED forever, invisible to retry
and rerun.

Now loads the context_map for the failed batch file, iterates INCLUDED
records, clears DEFERRED, and writes DISPOSITION_FAILED with the
exception message as reason. Records in other batch files are unaffected.

7 regression tests covering: INCLUDED vs SKIPPED filtering, reason
propagation, multi-file isolation, no-backend safety, context_map load
failure, RuntimeError propagation, and missing source_guid.

* refactor: reuse _safe_set_disposition, reduce test boilerplate

- Replace inline try/except around set_disposition with existing
  _safe_set_disposition from result_collector (code reuse)
- Extract _setup_single_file_failure and _run_with_failure test helpers
  to eliminate repeated manager/registry setup boilerplate
- Use consistent _failed_disposition_record_ids helper for assertion
  filtering across all tests
- Remove unused side_effect parameter (B008 lint fix)
Actions with partial failures (COMPLETED_WITH_FAILURES) were preserved
on rerun, causing the entire action to be skipped. Failed records were
never reprocessed unless the user explicitly ran the retry CLI command.

Add COMPLETED_WITH_FAILURES to the retryable set in reset_retryable()
so plain reruns reset these actions to PENDING and reprocess them.

Extract the inline retryable set to a module-level RETRYABLE_STATUSES
constant, consistent with COMPLETED_STATUSES and TERMINAL_STATUSES.
…#567)

prefilter_by_guard() dropped filtered records entirely — they appeared
in neither the passing nor skipped return lists. The callers in
_guard_filter() and _guard_filter_file_mode() could only count them
by subtraction and created ProcessingResult.filtered(source_guid=None).
result_collector skips disposition writes when source_guid is None,
so filtered records vanished from the DB.

Fix: prefilter_by_guard() now returns filtered records as a 4th return
value. Both callers iterate them to create ProcessingResult.filtered()
with each record's actual source_guid, enabling proper disposition writes.
The _warn_orphaned_deferred() query reported all records with DEFERRED
disposition as orphans, even when those records had already received a
terminal disposition (SUCCESS, FAILED, etc.) — the stale DEFERRED row
just hadn't been cleared yet.

Changed to a two-step approach: first query DEFERRED records (indexed,
fast — returns 0 in the happy path). Only if deferred records exist,
query terminal dispositions to filter out false positives. Genuine
orphans = deferred IDs with no terminal sibling.

Observed: product_listing_enrichment had 71 false orphan warnings that
are now suppressed.
* fix: stale recovery state from crashed run poisons reprompt on rerun

The original batch path loaded recovery_state from disk and passed it to
check_and_submit_reprompt. If a previous run crashed after writing state,
the stale reprompt_attempt counter caused the next fresh run to skip
reprompt (believing attempts were exhausted).

Two fixes:
1. _process_original_batch no longer loads recovery state — in the
   original batch path, any existing file is stale by definition.
2. _finalize_batch_output now deletes any leftover recovery state file
   before finalizing, matching the cleanup already done in the recovery
   path's _finalize_and_cleanup.

* simplify: remove redundant test_stale_state_does_not_skip_reprompt

test_process_original_batch_ignores_stale_recovery_state already verifies
that recovery_state=None is passed and RecoveryStateManager.load is not
called. The removed test duplicated this assertion with a fragile
side_effect capture pattern.
* fix: CLI batch path delegates to production retry/reprompt logic

process_batch_results previously called retrieve_and_reconcile directly
with no retry loop. Missing records were permanently tombstoned as
BATCH_NOT_RETURNED instead of being retried.

Now delegates to _process_single_batch_file, which has the full
retry/reprompt machinery. If a recovery batch is submitted, raises
ProcessingError to signal the caller to re-invoke after recovery
completes.

* simplify: remove dead params and redundant warning from process_batch_results

- Remove unused base_directory and file_path parameters (output path is
  now determined by _determine_output_path via the production path)
- Remove redundant logger.warning before ProcessingError raise
- Improve docstring to describe behavior, not implementation
* fix: isolate enrichment failures per-record and enrich batch errors

F9: Wrap per-record enrichment in try/except so a single enrichment
failure produces a FAILED ProcessingResult instead of aborting the
entire action. Previously, an exception in enrichment for record N
killed all records 0 through end.

F6: Remove the RunMode.ONLINE guard in _enrich() so batch error
results (FAILED/EXHAUSTED without processing_context) go through
the enrichment pipeline. Previously they were appended as-is,
producing tombstones missing lineage, version IDs, and passthrough
fields that online tombstones have.

* refactor: hoist test imports to module level

Move Enricher, EnrichmentPipeline, RunMode, and replace imports from
inline test-method scope to module-level imports. Reduces duplication
across 6 test methods.
)

* fix: empty LLM response records no longer vanish from output (F10)

When on_empty=warn (the default), an empty LLM response fell through to
_transform_response producing data=[], which result_collector wrote as
DISPOSITION_SUCCESS with zero output records — the record vanished.

Now: on_empty=warn returns ProcessingResult.failed() so the record gets
DISPOSITION_FAILED and is visible in error reporting. on_empty=skip
returns ProcessingResult.skipped() with a tombstone so the record
appears in output with a skip reason.

* refactor: use EMPTY_OUTPUT constant, remove redundant warning log

- Add EMPTY_OUTPUT to record/reasons.py (bare string violated module contract)
- Remove logger.warning for on_empty=warn (RecordEmptyOutputEvent already fires)
- Fix misleading "silently skip" comment — tombstone IS visible in output
clear_prompt_traces() existed on the storage backend but was never
called from production code, causing unbounded growth of the
prompt_trace table across runs. Add calls in both _clear_for_fresh_run
and _reset_retryable_actions alongside the existing delete_target and
clear_disposition calls.
The has_pipeline_context check in prefilter_by_guard silently fell back
to context=eval_item when no pipeline kwargs were passed. All production
callers already pass full context, making this a dormant anti-pattern
that could reintroduce online/batch guard divergence if a future caller
omitted the kwargs.

Now prefilter_by_guard always calls build_guard_context unconditionally.
Deleted 8 tests that asserted the old fallback behavior (context==eval_item,
fallback warning). Updated the paired_execution conftest fixture to handle
the richer context shape from build_guard_context.
…og (#576)

Replaces 30 individual changie YAML files (one per PR) with a single
consolidated entry covering the entire batch/online unification initiative
(Phases 0–9) and bug fix wave (F1–F15).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: consolidate 30 changie entries into single unification changelog

Replaces 30 individual changie YAML files (one per PR) with a single
consolidated entry covering the entire batch/online unification initiative
(Phases 0–9) and bug fix wave (F1–F15).

* fix: tighten prefilter guard tests and log from PR #567 review

- Assert len(filtered) in test_all_filtered and test_filter_removes_failing_records
  (would have caught the original pre-fix regression)
- Log line uses len(filtered) instead of subtraction
- Record-mode disposition test asserts args[2] + source_guid structurally
  (was string-matching with `in str(c)`)
- Negative test fixes arg index: disposition is args[2], not args[3]
…#579)

- batch.py:206 — guard None rid before adding to set[str]
- submission.py:295 — early return when storage_backend is None
@Muizzkolapo Muizzkolapo merged commit 46d17b4 into main May 19, 2026
5 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 19, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant