Skip to content

feat: add Dispatcher.new(parallelism:) for bounded intra-tick dispatch#182

Merged
duncanita merged 3 commits into
mainfrom
feat/v1.3-dispatcher-parallelism
May 8, 2026
Merged

feat: add Dispatcher.new(parallelism:) for bounded intra-tick dispatch#182
duncanita merged 3 commits into
mainfrom
feat/v1.3-dispatcher-parallelism

Conversation

@duncanita
Copy link
Copy Markdown
Owner

Summary

V1.3 introduces bounded intra-tick parallel dispatch. Consumers can fan out N co-eligible effects in one tick instead of paying N × per_record wall time. The opt-in is a single additive kwarg; the default is identical to V1.2.

This implements the feature half of the V1.3 split. The governance carve-out shipped first as #180 (cop allow-list) and #181 (cop tightening).

What's new

  • DAG::Effects::Dispatcher.new(parallelism: 1) (default) preserves the V1.2 serial contract bit-identical.
  • parallelism > 1 claims records as before but dispatches them through a bounded Thread.new pool of at most parallelism workers. Pool reads from a Queue of [record, slot_index] pairs, writes outcomes into a pre-allocated Array slot. succeeded.map(&:id) is a subsequence of claimed.map(&:id) in original order.
  • Unexpected exceptions raised inside a worker thread propagate out of #tick via Thread#value after join — V1.2 serial-map exception semantics preserved (dispatch_record still catches only StaleLeaseError).
  • parallelism > 1 requires the storage adapter to declare #thread_safe_for_dispatch? truthy. Memory::Storage does not declare it (single-process per Roadmap §2.4); constructing Dispatcher.new(..., parallelism: > 1) with it raises ArgumentError. Durable adapters (typically SQLite-backed in consumer hosts) declare it explicitly.

Tests (spec/r2/effects_dispatcher_test.rb)

7 new tests:

  1. parallelism: 1 default → max_in_flight == 1 over a 4-record batch
  2. parallelism: 3 over 6 records → max_in_flight <= 3 and >= 2
  3. parallelism: 4 with mixed success/failure → collection order preserved
  4. Non-StandardError worker exceptions propagate out of tick
  5. parallelism > 1 with Memory::StorageArgumentError naming thread_safe_for_dispatch?
  6. parallelism validation: 0, -1, 1.5 all raise
  7. (Existing 11 V1.2 tests still pass invariati under parallelism: 1 default.)

The IntervalRecordingHandler test helper writes per-record [entered_at_ns, exited_at_ns] into disjoint Hash slots and sweeps the merged interval set ex post to compute max_concurrent. No thread-sync primitives needed in the test (writes target disjoint keys; CRuby's GVL keeps Hash#[]= atomic per key). End-events sort before start-events at equal timestamps to avoid false-positive overlap on abutting intervals.

Release artifacts

  • lib/dag/version.rb: 1.2.0 → 1.3.0
  • Gemfile.lock: version pin updated
  • CHANGELOG.md: V1.3 release section with feature notes + already-merged governance carve-out
  • ROADMAP.md: V1.3 dispatcher parallelism + Release v1.3 marked Done
  • CONTRACT.md: "Dispatcher Concurrency Contract" subsection now names thread_safe_for_dispatch? explicitly
  • spec/r0/v1_2_release_gate_test.rb: drops the assert_equal \"1.2.0\", DAG::VERSION check (release-gate version pin moves to V1.3 gate, matching the V1.1 → V1.2 pattern)
  • spec/r0/v1_3_release_gate_test.rb: new release gate covering version, changelog, roadmap, dispatcher kwarg, and contract docs

Test plan

  • bundle exec rake — 619 runs, 40676 assertions, 0 failures
  • bundle exec rubocop — clean (Thread.new, Queue, <<, pop, []= in dispatcher.rb pass the V1.3 carve-out; Mutex/Monitor/etc. still banned)
  • bundle exec standardrb — clean
  • bundle exec yard doc — 99.15% documented

What's intentionally NOT in this PR

  • Per-handler concurrency caps (per_type_parallelism: {}). V1.4+.
  • workflow_id: filter on claim_ready_effects. V1.4+.
  • Parallel node execution in Runner. Separate design pass; would move §2.1 Determinism pillar.

🤖 Generated with Claude Code

V1.3 introduces bounded intra-tick parallel dispatch so consumers can
fan out N co-eligible effects (e.g. tool calls and LLM calls in the
same workflow) instead of paying `N × per_record` wall time. The
opt-in is a single additive kwarg; the default is identical to V1.2.

## Dispatcher

- New `parallelism: 1` kwarg on `DAG::Effects::Dispatcher.new`,
  validated as a positive Integer via `DAG::Validation`.
- New private `parallel_map(items)` replaces the V1.2
  `claimed.map { dispatch_record(record) }`. With `parallelism <= 1`
  or `items.length <= 1` it reduces to `items.map`, preserving the
  V1.2 serial contract bit-identical. Otherwise it pre-allocates a
  result Array, feeds a `Queue` of `[record, slot_index]` pairs, and
  spawns at most `parallelism` `Thread.new` workers. Workers `pop`
  pairs from the queue, invoke the block, and write the outcome into
  their slot. `workers.each(&:value)` propagates any exception that
  escaped the worker via `Thread#value` after `join` — preserving
  the V1.2 serial-map exception semantics where any exception not
  caught by `dispatch_record` (still only `StaleLeaseError`)
  re-emerges from `tick`.
- Order preservation is per collection: `succeeded.map(&:id)` is a
  subsequence of `claimed.map(&:id)` in original order. Slot
  alignment between the two arrays is not promised (and was never
  promised even at V1.2 with mixed outcomes).

## Storage thread-safety contract

`parallelism > 1` requires the storage adapter to declare
thread-safety by implementing `#thread_safe_for_dispatch?` and
answering truthy. Adapters that do not declare it cause
`Dispatcher.new(..., parallelism: > 1)` to raise `ArgumentError`,
surfacing the contract mismatch at construction. `Memory::Storage`
is single-process by Roadmap §2.4 and intentionally does not
declare it; durable adapters (typically SQLite-backed in consumer
hosts) bind every dispatcher-touched method to a transaction and
declare it explicitly.

## Tests

`spec/r2/effects_dispatcher_test.rb` adds 7 V1.3 tests:

- Default `parallelism: 1` keeps `max_in_flight == 1` over a 4-record
  batch — V1.2 serial behavior bit-identical.
- `parallelism: 3` over a 6-record batch caps `max_in_flight <= 3`
  and observes `>= 2` (so we know we actually parallelized).
- `parallelism: 4` with mixed success/failure outcomes preserves
  collection order: `succeeded.map(&:id)` and `failed.map(&:id)` are
  both subsequences of `claimed.map(&:id)`.
- Non-`StandardError` exceptions raised inside a worker thread
  propagate out of `tick` (verified with a custom `Exception`
  subclass to bypass `invoke_handler`'s `rescue => caught`).
- `parallelism > 1` with `Memory::Storage` raises `ArgumentError`
  with a message naming `thread_safe_for_dispatch?`.
- `parallelism` validates as positive Integer (`0`, `-1`, `1.5`
  raise).

A test-scope `TestThreadSafeMemoryStorage < Memory::Storage` opts
into the contract by declaring `thread_safe_for_dispatch? = true`,
since the parallel tests rely on per-record-isolated state and
CRuby's GVL — production parallel dispatch uses durable adapters
that satisfy the contract by transaction binding.

The `IntervalRecordingHandler` test helper writes per-record
[entered_at_ns, exited_at_ns] timestamps into disjoint slots and
sweeps them ex post to compute `max_concurrent`. This avoids needing
thread synchronization primitives in the test (writes target
disjoint Hash keys; CRuby's GVL keeps `Hash#[]=` atomic).

## Release artifacts

- `lib/dag/version.rb` bumped 1.2.0 → 1.3.0; `Gemfile.lock` updated.
- `CHANGELOG.md`: V1.3 release notes plus governance carve-out
  entry that already shipped in #180/#181.
- `ROADMAP.md`: V1.3 dispatcher parallelism + Release v1.3 marked
  Done.
- `CONTRACT.md`: "Dispatcher Concurrency Contract" subsection now
  names `thread_safe_for_dispatch?` explicitly.
- `spec/r0/v1_2_release_gate_test.rb`: removes the
  `assert_equal "1.2.0", DAG::VERSION` check (release-gate version
  pin moves to the V1.3 gate, matching the pattern from V1.1 → V1.2).
- `spec/r0/v1_3_release_gate_test.rb`: new release gate covering
  version, changelog, roadmap, dispatcher kwarg, and contract docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f3b5a9c42

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/dag/effects/dispatcher.rb Outdated
end
end
end
workers.each(&:value)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Drain all workers before propagating thread exceptions

When parallelism > 1, workers.each(&:value) raises as soon as the first worker thread has an unhandled exception, which exits parallel_map without waiting for the remaining workers. In that case, other worker threads can continue dispatching and mutating storage after tick has already failed, so callers observe an exception while additional effects are still being marked/released in the background. Add an ensure path that joins/terminates all workers before re-raising to keep failure behavior bounded and deterministic.

Useful? React with 👍 / 👎.

The first iteration used `workers.each(&:value)` to propagate the
first worker exception. That re-raises immediately when it hits the
dead worker, leaving peer workers still in flight — they keep calling
`mark_effect_*`, `release_nodes_satisfied_by_effect`, and
`append_event` on storage *after* `#tick` has already returned to
its caller. The caller sees an exception with an inconsistent
post-tick state.

`Thread#join` has the same shortcoming: per Ruby semantics, joining a
thread that terminated abnormally re-raises that thread's exception,
so a `workers.each(&:join)` followed by `workers.each(&:value)` short-
circuits identically.

Fix: capture exceptions inside each worker, park them in a shared
`errors` list, set a shared abort flag (so peer workers stop pulling
new records after their current item completes), and exit the worker
loop normally so no worker terminates with an exception. Now
`workers.each(&:join)` actually waits for every worker — including
the peers that are still finishing their current record — and only
then does `#tick` raise the first captured exception.

`# standard:disable Lint/RescueException` is required on the rescue:
we deliberately catch every exception class (not just StandardError)
so non-StandardError throws also propagate through the same path,
preserving the V1.2 serial-map exception contract.

A new test in `spec/r2/effects_dispatcher_test.rb` pins the
invariant: with `parallelism: 2` and 4 records where `k0` raises
after a 20ms delay (giving the peer worker time to grab `k1` and
enter its 100ms-sleep handler), the test asserts that the peer
worker's completion timestamp is strictly less than the timestamp
captured immediately after `#tick` raised. Without the fix the peer
hadn't yet returned from its 100ms sleep when `#tick` re-raised,
which the test would observe as `finished_at == {}`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@duncanita
Copy link
Copy Markdown
Owner Author

Pushed a follow-up commit d51cd9c fixing a correctness issue surfaced by Codex stop-time review: workers.each(&:value) (and Thread#join on a thread terminated by exception) re-raises immediately, short-circuiting the join loop and letting peer workers keep mutating storage past #tick's return.

The fix captures exceptions inside each worker into a shared errors list, sets a shared abort flag so peers stop pulling new records (peers still finish the record they were already processing), and exits each worker's loop normally. After workers.each(&:join) returns, #tick raises the first captured exception. Now no worker is still mutating storage when #tick raises.

A new test test_parallelism_above_one_joins_all_workers_before_raising pins the invariant: with parallelism: 2 and 4 records where k0 raises after a 20ms delay, the peer worker's completion timestamp must be strictly less than the moment #tick raised. Without the fix the peer hadn't yet returned from its 100ms sleep when #tick re-raised.

bundle exec rake — 620 runs, 40680 assertions, 0 failures.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d51cd9c924

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/dag/effects/dispatcher.rb Outdated
Comment on lines +201 to +203
break unless abort_flag.empty?
pair = begin
queue.pop(true)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent workers from dequeuing new items after abort

The abort check and dequeue are split (break unless abort_flag.empty? then queue.pop(true)), so they are not atomic across threads. When one worker captures an exception and sets abort_flag, a peer can still pass the empty-check and pop a new record before seeing the flag, which means additional effects may be marked/released even though tick is already on an exception path. This violates the intended “finish only already-running work, then raise” behavior and can make failure-side effects larger than expected for parallelism > 1.

Useful? React with 👍 / 👎.

The previous iteration used a shared `abort_flag` Array that the
failing worker pushed to and that peer workers checked at the top of
each loop iteration. This had two weaknesses Codex stop-time review
flagged as "abort flag does not actually stop post-failure work":

1. The check sat *before* `queue.pop(true)`, so a peer that had
   already passed the check could still pop a new record after the
   failing worker raised — the flag arrived too late for that
   iteration.
2. Cross-thread visibility of mutations to a plain `Array` is not
   guaranteed independently of `Queue`'s built-in synchronization.
   On CRuby with the GVL it usually works, but the contract should
   not depend on that.

Fix: when a worker rescues an exception, it drains the work queue
itself (popping non-blockingly until `ThreadError`). Peer workers
that try to pop after the drain see `ThreadError` immediately and
exit. The signal rides on `Queue`'s thread-safety guarantees, with
no auxiliary state. Records that were drained but never processed
remain in `:dispatching` state in storage; their leases will expire
and a future `tick` can re-claim them — same semantics as if a
worker process had crashed mid-batch.

Peer workers still complete whatever record they were already
processing when the failure happened (that's the join-before-raise
invariant from the previous commit). The drain only stops them from
*starting* new records.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@duncanita
Copy link
Copy Markdown
Owner Author

Pushed b1df2d7 to address a follow-up Codex stop-time review flag ("abort flag does not actually stop post-failure work").

The previous iteration used a shared abort_flag Array checked at the top of the worker loop. Two weaknesses:

  1. The check sat before queue.pop(true), so a peer worker that had already passed the check could still pop a new record after the failing worker raised — the flag arrived too late for that iteration.
  2. Cross-thread visibility of mutations to a plain Ruby Array is not part of the language's contract; it usually works under CRuby's GVL but the implementation shouldn't depend on it.

Fix: when a worker rescues an exception, it drains the work queue itself (popping non-blockingly until ThreadError). Peer workers that try to pop after the drain see ThreadError immediately and exit. The signal rides on Queue's built-in thread-safety, with no auxiliary state.

Records that were drained but never processed remain in :dispatching state in storage; their leases expire and a future tick can re-claim them — same semantics as a mid-batch worker crash.

The join-before-raise invariant from d51cd9c is unchanged: peers still finish the record they were already processing.

bundle exec rake — 620 runs, 40680 assertions, 0 failures.

@duncanita duncanita merged commit 046346b into main May 8, 2026
6 checks passed
@duncanita duncanita deleted the feat/v1.3-dispatcher-parallelism branch May 8, 2026 10:07
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1df2d7a3a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +213 to +215
errors << exception
drain_queue(queue)
break
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Serialize abort so peers cannot dequeue after worker failure

After a worker hits an exception, drain_queue(queue) is intended to stop further dispatch, but draining is not exclusive: peer workers can still win a concurrent queue.pop(true) before the queue is fully drained and start processing new records that were not in flight when the failure occurred. That means tick can still execute additional handlers/storage mutations on its exception path, which breaks the bounded-failure behavior documented in this change. Fresh evidence relative to the prior comment: the old abort-flag check is gone, but the new drain-only approach at this site still has the same dequeue race because there is no synchronized abort gate around pop.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant