
spec: history sync completion signal (EOSE-equivalent) #214

Merged
intendednull merged 5 commits into main from claude/spec-history-sync-eose
Apr 26, 2026

Conversation

@intendednull
Owner

Part of a set of 8 specs drawing lessons from Nostr's protocol and ecosystem. Use this PR to discuss the design — I am not proposing to merge implementation here, only the spec.

What & why

Nostr's EOSE (end-of-stored-events) marker is one of its cleanest design wins: a zero-ambiguity boundary between history replay and live tail, so clients know when to stop showing a loading indicator. Willow's gossip + replay/storage worker split currently has no such signal — clients are "probably caught up" but can't be sure.

This spec proposes a new MessageType::HistorySyncComplete carrying {topic_id, provider_peer, last_event_hash, epoch}, scoped to (topic_id, provider_peer, epoch) rather than Nostr's per-subscription id (Willow uses topics as implicit subscriptions). It also specifies the multi-provider rule (first-trusted-wins by default, optional strict-majority mode) and a new ClientEvent::HistorySynced { topic, provider, still_pending } for the UI.
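A minimal Rust sketch of the two proposed types, with field names taken from this description; `TopicId`, `EndpointId`, and `EventHash` are placeholder aliases here, not the real definitions in `crates/network` or `crates/identity`:

```rust
// Hypothetical sketch only: field names follow the PR description, but
// the concrete types are placeholder aliases, not the real crate types.
type TopicId = [u8; 32];
type EndpointId = [u8; 32];
type EventHash = [u8; 32];

/// Proposed payload for `MessageType::HistorySyncComplete`.
#[derive(Debug, Clone, PartialEq)]
struct HistorySyncComplete {
    topic_id: TopicId,
    provider_peer: EndpointId,
    /// Hash of the last historical event streamed (truncation tripwire).
    last_event_hash: Option<EventHash>,
    /// Per-(topic, provider) stream-lifecycle counter, bumped on provider restart.
    epoch: u64,
}

/// Proposed UI-facing event.
#[derive(Debug, Clone, PartialEq)]
enum ClientEvent {
    HistorySynced {
        topic: TopicId,
        provider: EndpointId,
        /// Trusted providers that have not yet emitted a marker for this topic.
        still_pending: usize,
    },
}
```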

Markers never flow through apply_event, so malicious markers are UX-only bugs.

Spec file: docs/specs/2026-04-24-history-sync-eose.md

Open questions for review

Copied from the spec — looking for reviewer input on:

  1. Mandatory vs optional last_event_hash in the marker
  2. Should failure markers exist (e.g., HistorySyncFailed) or just absence-as-error
  3. Piggybacking state hash onto the marker for cross-verification
  4. Server-wide "caught up" signal vs per-topic only
  5. Nonce vs epoch in the marker
  6. Relay-side rate-limiting rules

Composition with sibling specs

  • Negentropy sync (separate PR): the EOSE signal is the natural termination for a NegClose
  • Relay capability doc (separate PR): should relays advertise supports_history_sync_eose: bool?
  • Error prefixes (separate PR): HistorySyncFailed reason variants

Note: the commit is unsigned because the harness's signing backend is returning HTTP 400 "missing source" on every git-invoked signing request during this session. Consistent with the rest of this spec set.


Generated by Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

@intendednull intendednull left a comment


Solid spec overall — the topic-scoped framing, multi-provider trust policy, and last_event_hash truncation guard are all well-motivated and improve on what NIP-01 gives you. Posting as COMMENT (GitHub blocks REQUEST_CHANGES on one's own PR), but flagging two issues that I'd treat as blocking before merge:

  1. Internal contradiction in the wire path. The spec specifies broadcast_neighbors and relay forwarding, but broadcast_neighbors is by-construction not forwarded by gossip relays in the current TopicHandle trait, and there is no per-peer direct-send primitive — calling it on every NeighborUp would spam every existing neighbor with a marker meant for one new joiner. Pick mesh-broadcast or add a direct-send primitive, then make "Provider-side emission" follow.
  2. epoch field will collide with PR #220 (epoch-key-rotation) and the existing crypto epoch terminology in the test-architecture spec. A rename (e.g. provider_run_id, stream_generation) avoids a year of confused review comments.

Other inline comments cover NIP-01 over-claims, provider_peer being unsigned (and thus relay-mutable), the missing "provider drops mid-sync" edge case, and a stray line-number citation. None are showstoppers individually.



redundant traffic. The relay forwards these frames verbatim (it
already forwards opaque bytes by topic; the new `MessageType` tag is
invisible to relay-side logic, which is deliberately stateless per
`crates/relay/src/lib.rs:13`).

Internal contradiction: broadcast_neighbors is incompatible with relay forwarding.

The spec says the marker is sent as "a direct broadcast_neighbors-style frame from a provider to a newly-joined peer" and that "the relay forwards these frames verbatim." These are mutually exclusive in the current trait surface:

  • TopicHandle::broadcast_neighbors (crates/network/src/traits.rs:72) is documented as "Broadcast data only to direct neighbors (not forwarded further)" — by construction it will not traverse the relay if the relay is not the provider's direct neighbor of the joining peer.
  • The relay today only re-broadcasts via TopicHandle::broadcast (see crates/relay/src/lib.rs ~L450/676); there is no code path that forwards neighbor-only frames.
  • There is also no per-peer direct send in the Network/TopicHandle trait — broadcast_neighbors hits all direct neighbors, not the one newly-joined peer, so calling it on every NeighborUp will spam every existing neighbor with a marker meant for one new joiner.

Suggest picking one model and being explicit: either (a) the marker is regular gossip (uses broadcast, costs one mesh-wide message per NeighborUp, relays forward as normal), or (b) introduce a new send_to(peer, bytes) direct frame in the Network trait and document how the relay bridges it. The "Provider-side emission" section needs to follow whichever choice you make.
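Option (b) above can be sketched as a hypothetical per-peer primitive added next to the existing broadcast methods. The method names and the mock below are illustrative, not the actual `TopicHandle` trait surface in `crates/network`:

```rust
// Sketch of option (b): a hypothetical per-peer direct-send primitive.
// Names are illustrative, not the actual TopicHandle trait.
type PeerId = [u8; 32];

trait TopicHandle {
    /// Mesh-wide broadcast; relays forward as normal (option (a)).
    fn broadcast(&self, bytes: &[u8]);
    /// Neighbor-only broadcast; NOT forwarded further.
    fn broadcast_neighbors(&self, bytes: &[u8]);
    /// Proposed: deliver to exactly one peer, so a marker meant for a
    /// single new joiner never spams the rest of the mesh.
    fn send_to(&self, peer: PeerId, bytes: &[u8]);
}

/// Minimal mock showing the intended delivery shape.
struct MockTopic {
    sent: std::cell::RefCell<Vec<(PeerId, Vec<u8>)>>,
}

impl TopicHandle for MockTopic {
    fn broadcast(&self, _bytes: &[u8]) {}
    fn broadcast_neighbors(&self, _bytes: &[u8]) {}
    fn send_to(&self, peer: PeerId, bytes: &[u8]) {
        self.sent.borrow_mut().push((peer, bytes.to_vec()));
    }
}
```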





pub last_event_hash: Option<EventHash>,
/// Monotonically-increasing cursor scoped to (topic_id, provider_peer).
/// Incremented when the provider restarts and re-streams from zero.
pub epoch: u64,

Naming collision with PR #220 (epoch-key-rotation). The word "epoch" is already heavily overloaded in this codebase — docs/specs/2026-04-13-test-architecture.md references "epoch key" and "epoch isolation" for the forward-secrecy ratchet, and #220 (currently in flight) leans on "epoch" for the key-rotation generation counter. A reader skimming the wire format will plausibly conflate HistorySyncComplete::epoch with the crypto epoch, especially since both are u64 generation numbers tied to "restart / rotation."

Concrete fixes (any of):

  • Rename to stream_generation, provider_session_id, or provider_run_id.
  • If you keep epoch, add a doc-comment explicitly disambiguating: "Unrelated to the encryption epoch in the key-rotation spec; this is per-(topic, provider) stream lifecycle only."

Open question 5 also asks whether epoch should be a nonce instead — worth deciding alongside the rename, since "nonce" makes the non-crypto-epoch meaning self-evident.




rename please

treat the sync as incomplete (see "Sharp edges" below). `epoch` exists
so a provider that crashes and restarts mid-stream cannot confuse a
client still holding a stale marker from the previous run — mirroring
Nostr's one-EOSE-per-REQ guarantee without needing a subscription id.

Two NIP-01 claims here are stronger than what NIP-01 actually says.

  1. NIP-01 does not in fact carry last_event_hash or any equivalent — and crucially, it explicitly does not guarantee completeness ("EOSE is purely a delimiter signal"). The motivation paragraph above (lines 20-23) frames Nostr as having "solved this cleanly," but Nostr's solution is just the delimiter; the truncation/trust handling here is a Willow invention. That's fine, but the prose should own it rather than imply the design is paralleling Nostr.
  2. "Mirroring Nostr's one-EOSE-per-REQ guarantee" — NIP-01 doesn't formalize this. In practice relays send one EOSE per REQ because there is one REQ, but there is no spec-level guarantee about restart/replay behavior (it is undefined in NIP-01 — the failure-mode equivalent is CLOSED, not a fresh EOSE).

Recommend softening both passages to "inspired by" rather than "mirroring," and adding a note in "Sharp edges" that the truncation-detection contract is stronger than NIP-01's, deliberately.




yes

single audit log can correlate markers across topics. `provider_peer`
is the provider's `EndpointId`, **not** a signed field — the signature
lives on the Ed25519-wrapped envelope the way ordinary chat events are
wrapped in `crates/identity/src/lib.rs`. `last_event_hash` lets the

provider_peer being unsigned is subtle and worth justifying.

The spec says provider_peer is "not a signed field — the signature lives on the Ed25519-wrapped envelope." If the envelope is signed by the provider's identity key, then the signer's identity is implicitly authenticated. But provider_peer is a separate field inside the payload that a malicious relay could in principle rewrite to point at another trusted SyncProvider, causing the client to credit the wrong provider with completion.

That probably doesn't break safety (the client still only trusts SyncProviders), but it can:

  • Spoof the "still_pending" accounting in ClientEvent::HistorySynced.
  • Let a marker from provider A satisfy a strict policy that was waiting on provider B specifically.

Either bind provider_peer to be == envelope.signer and reject mismatches, or drop the field from the payload entirely and read it from the envelope at unpack time. The latter is also more bytes-efficient.
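The first of those two fixes can be sketched as an unpack-time check. `VerifiedEnvelope` is a hypothetical stand-in for a marker whose Ed25519 signature has already been verified; it is not the `crates/identity` API:

```rust
// Sketch of the "read provider from the envelope" rule suggested above.
// VerifiedEnvelope is hypothetical, not the crates/identity API.
type EndpointId = [u8; 32];

struct VerifiedEnvelope {
    /// Signer identity, already authenticated by the envelope signature.
    signer: EndpointId,
    /// Payload-level provider_peer, if the field is kept at all.
    payload_provider: Option<EndpointId>,
}

/// Authoritative provider id, or None when the payload contradicts the
/// signature (signature wins, marker rejected).
fn provider_of(env: &VerifiedEnvelope) -> Option<EndpointId> {
    match env.payload_provider {
        Some(p) if p != env.signer => None, // relay-rewritten field: reject
        _ => Some(env.signer),              // field absent or agrees
    }
}
```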




Let's do the latter if possible.

reconnects requests history again; providers emit a new marker with
a new `epoch`. The client SHOULD discard markers whose
`(provider_peer, epoch)` is older than one it has already seen
since the last `NeighborUp` for that provider.
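The SHOULD-discard rule in the excerpt above can be sketched as a small receiver-side cache: keep only the newest `(provider, epoch)` seen since the last `NeighborUp` for that provider. Std-only, names illustrative:

```rust
// Receiver-side staleness cache for markers; illustrative sketch.
use std::collections::HashMap;

type ProviderId = [u8; 32];

#[derive(Default)]
struct MarkerCache {
    newest_epoch: HashMap<ProviderId, u64>,
}

impl MarkerCache {
    /// True if the marker should be applied, false if stale or duplicate.
    fn accept(&mut self, provider: ProviderId, epoch: u64) -> bool {
        match self.newest_epoch.get(&provider) {
            Some(&seen) if epoch <= seen => false, // older run or replay: discard
            _ => {
                self.newest_epoch.insert(provider, epoch);
                true
            }
        }
    }

    /// Called on NeighborUp for this provider: forget previous runs.
    fn reset(&mut self, provider: &ProviderId) {
        self.newest_epoch.remove(provider);
    }
}
```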

Missing open question: what happens mid-sync if the provider goes offline before emitting the marker?

The "Sharp edges" section covers a compromised provider lying with a marker, and resumption with a new epoch, but not the much more common case: a SyncProvider drops the QUIC connection (or a NeighborDown arrives) after streaming 400 of its 1000 historical events and before the marker is sent. Under the default policy:

  • If another trusted SyncProvider eventually emits a marker, the client flips to "synced" — but it has only the union of received events, with no signal that some events the dead provider was about to send are also missing from the surviving provider. (Storage worker covers replay's gaps in practice, but only if the storage worker is connected.)
  • If no other SyncProvider is connected, still_pending decrements but never reaches a marker — the UI is stuck loading forever.

Worth either an open question or an explicit handling rule: e.g. on NeighborDown(provider) while sync is in flight, emit a HistorySynced { still_pending, .. }-equivalent that decrements still_pending and records that this provider's contribution is incomplete; if still_pending would hit zero with at least one incomplete provider and no completed marker, fall through to a timeout-based HistorySyncFailed (open question 2 above).
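One possible shape for that handling rule, as a sketch: `NeighborDown` mid-sync decrements `still_pending` and records the provider as incomplete; reaching zero with no completed marker degrades to a failure. `SyncOutcome` and the field names are hypothetical (open question 2 covers whether a failure signal exists at all):

```rust
// Hypothetical accounting for providers that drop mid-sync.
#[derive(Debug, PartialEq)]
enum SyncOutcome {
    StillWaiting { still_pending: usize },
    Complete,
    /// No provider completed and at least one dropped mid-stream.
    Failed,
}

struct TopicSync {
    still_pending: usize,
    completed: usize,
    incomplete: usize,
}

impl TopicSync {
    fn on_marker(&mut self) -> SyncOutcome {
        self.still_pending -= 1;
        self.completed += 1;
        if self.still_pending == 0 {
            SyncOutcome::Complete
        } else {
            SyncOutcome::StillWaiting { still_pending: self.still_pending }
        }
    }

    fn on_neighbor_down(&mut self) -> SyncOutcome {
        self.still_pending -= 1;
        self.incomplete += 1;
        if self.still_pending > 0 {
            SyncOutcome::StillWaiting { still_pending: self.still_pending }
        } else if self.completed > 0 {
            SyncOutcome::Complete // someone finished; incompleteness recorded above
        } else {
            SyncOutcome::Failed // a timeout-based HistorySyncFailed would fire here
        }
    }
}
```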



redundant traffic. The relay forwards these frames verbatim (it
already forwards opaque bytes by topic; the new `MessageType` tag is
invisible to relay-side logic, which is deliberately stateless per
`crates/relay/src/lib.rs:13`).

Minor citation nit: crates/relay/src/lib.rs:13 lands on //! All routines in this crate operate at the **transport layer**. — the trust-model statement you're alluding to is in the same module-doc block but a few lines down (around L23-L43, "Trust model"). Either bump the line number or just cite the file ("see the Trust model section in crates/relay/src/lib.rs's module docs"). Same drive-by check on the other citations: crates/transport/src/lib.rs:64, :36, crates/network/src/traits.rs:132, and crates/client/src/events.rs:48 all match the current code on main.




cite


@intendednull intendednull left a comment


Strengths

  • Correctly identifies that Willow has no NIP-01-style subscription id and adapts the EOSE concept to (topic_id, provider_peer, epoch) — that is the right key, and the rationale for each component is well argued.
  • The epoch + last_event_hash pair gives both restart-safety and a tripwire for silent truncation; folding both into a single ~80B frame is a sensible engineering trade.
  • Explicitly carving the marker out of apply_event keeps the authority-bearing surface unchanged (per docs/specs/2026-04-12-state-authority-and-mutations.md). This is the right side of the line to land on.

Concerns

  1. First-trusted-wins is wrong for the case the feature exists to solve. The replay worker holds only the last 1000 events (crates/relay/src/lib.rs worker docs; see also CLAUDE.md "Local Development Stack"). On a server with >1000 historical events the replay worker will always return its marker first, with last_event_hash pointing at an event the storage worker can predate by months. The default policy as written will flip the spinner off while the storage worker is still streaming the long tail, which is exactly the user-visible bug the spec claims to fix. The default should be "wait for storage if a storage worker is in the trust list and currently connected; otherwise accept any SyncProvider."

  2. No timeout for a withholding provider. "Sharp edges" handles a provider that emits a marker too early but not one that never emits at all. A trusted provider can pin a client in "loading" forever by never sending the frame — same UX failure mode the spec exists to eliminate. There needs to be a per-provider timeout (e.g. 5 s wall-clock, or "no new events for N ms") that escalates to the next provider in the trust list and surfaces a HistorySyncDegraded event so the UI can degrade gracefully rather than spin forever.

  3. Per-channel scope under-specifies the join flow. Open question 4 hand-waves the most important consumer. A user joining a 20-channel server today subscribes to SERVER_OPS_TOPIC plus N channel topics (crates/network/src/topics.rs:42). With per-topic markers the joining UI receives 20+1 HistorySynced events with no spec for "the server is ready" — and the "join is complete" question is what the user actually asks. Either define a server-wide aggregate (SERVER_OPS_TOPIC marker + an explicit list of channel topics covered) or specify how the client computes "all channels caught up" deterministically. This is not a follow-up; it's load-bearing for the join-flow UX.

  4. Negentropy interaction (PR #219) is not addressed and the two specs conflict. The negentropy spec replaces the bulk-replay path with logarithmic range reconciliation; "completion" in negentropy is the natural termination of the recursion (when both sides converge to Skip ranges everywhere), not a separate frame. Once #219 lands, HistorySyncComplete is either redundant with the negentropy terminator or it has to be redefined as "negentropy session converged for this (topic, peer)." This spec needs a section that picks one of: (a) supersede this spec when #219 lands, (b) keep this as the user-facing signal and have the negentropy session emit it on terminate, or (c) keep both and define the relationship. As-is the two PRs would produce overlapping-and-inconsistent wire frames.

  5. The state-hash open question (#3) should be resolved as YES. The CLAUDE.md trust model already says clients "get state hash from multiple peers, use the hash agreed upon by the most trusted sources." Today there is no transport-level moment that naturally carries that hash; a sync-completion frame is exactly that moment. Adding state_hash: StateHash to the marker means a client's "I'm caught up" decision and its "do my providers agree on state?" decision converge on the same evidence, and a divergent provider becomes detectable for free — a cross-check the spec otherwise leaves to a future mechanism. This also strengthens concern #1: with state hashes you can prefer the marker that matches the majority-agreed hash rather than the first-arrived one.
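The deterministic join-flow aggregate asked for in concern 3 can be sketched client-side: the server is "caught up" only once `SERVER_OPS_TOPIC` and every channel topic in an explicitly advertised list have each seen a marker. The covered-topics list is assumed to arrive with the server-ops marker; all names are illustrative:

```rust
// Hypothetical client-side aggregate over per-topic HistorySynced events.
use std::collections::HashSet;

type TopicId = [u8; 32];

struct JoinProgress {
    /// SERVER_OPS_TOPIC plus the advertised channel topics.
    required: HashSet<TopicId>,
    synced: HashSet<TopicId>,
}

impl JoinProgress {
    /// Record a per-topic HistorySynced; true once the whole server is
    /// deterministically caught up.
    fn on_topic_synced(&mut self, topic: TopicId) -> bool {
        if self.required.contains(&topic) {
            self.synced.insert(topic);
        }
        self.synced.len() == self.required.len()
    }
}
```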

Suggestions

  • Add a "Resumption" section. "I've been offline 3 days, when am I caught up?" is not the same problem as "I just joined." A returning client has a known last-applied event hash and timestamp, so the marker semantics it needs are "everything since X." Either point at how the DAG-diff protocol from 2026-04-01-per-author-merkle-dag-state-design.md handles this (and how the marker is reused), or specify a since: Option<EventHash> field on the request side. Right now the spec only addresses cold joins.

  • Specify a deduplication / dedup window for the "no markers from untrusted peers" rule. A peer can be granted SyncProvider mid-session via a GrantPermission event (crates/state/src/event.rs). What happens to a marker received before the grant was observed? Define: (a) drop it, (b) buffer it for one apply_event tick, or (c) accept retroactively. This matters because the grant order across topics is not deterministic on the wire.

  • Make last_event_hash mandatory and add a count: u64. Open question 1 should resolve to "mandatory" — the cost is one bool in the provider; the value is unambiguous truncation detection. While you're at it, include the count of events streamed in this session so the client can sanity-check against its received-from-this-provider counter (catches reorderings and partial drops that share a final hash by coincidence).

  • Replace epoch: u64 with a 16-byte random session id. Open question 5 is right to flag this. A monotonic counter requires the provider to persist epoch across crashes; a random session id ([u8; 16]) is stateless, collision-resistant for the lifetime of any client cache, and removes the "did I bump it?" failure mode entirely. The cost over u64 is 8 bytes per marker — irrelevant at this scale.
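The mandatory-`last_event_hash`-plus-`count` suggestion above reduces to a receiver-side predicate: the marker is consistent only if the provider's claimed count and final hash both match what this client actually received from that provider. The function name and signature are hypothetical:

```rust
// Hypothetical truncation/reorder cross-check against a mandatory
// last_event_hash and a claimed per-session event count.
type EventHash = [u8; 32];

fn marker_consistent(
    claimed_count: u64,
    received_count: u64,
    claimed_last: EventHash,
    observed_last: Option<EventHash>, // None if no events arrived at all
) -> bool {
    received_count == claimed_count && observed_last == Some(claimed_last)
}
```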




@intendednull intendednull left a comment


Inline comments on four spots where the spec is concretely wrong or incomplete: (1) the first-trusted-wins default that loses history past 1000 events, (2) the missing withholding-provider timeout, (3) the broadcast_neighbors semantics that don't actually do per-joiner delivery, and (4) provider_peer being unsigned and redundant with the envelope signer. See the summary review for the structural concerns (negentropy interaction, per-channel join scope, state-hash piggybacking).



for a topic as soon as it has received a valid marker from **any
peer granted `SyncProvider` permission** (the relay-worker role in
the state machine). This minimises perceived load time and matches
the UX the feature exists to deliver.

This default is wrong for the most common deployment. The replay worker streams from a 1000-event ring buffer; on any server with more than 1000 events of history it will always finish first, and last_event_hash will point at the 1000th-most-recent event — not the actual tail. Accepting its marker as "caught up" silently drops everything older.

Suggested change:

1. **Default policy (correctness-first):** if a peer with the
   storage-worker role is in the trust list and currently connected,
   wait for *its* marker. Otherwise accept the first marker from
   any peer granted `SyncProvider`. The replay worker's marker by
   itself is sufficient only if no storage worker is reachable.

This trades ~one network RTT in the storage-present case for not lying to the user.
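The correctness-first default above, as a policy function: a connected storage worker's marker is authoritative, and the replay worker's marker completes the sync only when no storage worker is reachable. Role names are illustrative:

```rust
// Sketch of the proposed marker-acceptance policy.
#[derive(Clone, Copy, PartialEq)]
enum Role {
    StorageWorker,
    ReplayWorker,
    Peer,
}

/// Should this trusted provider's marker flip the topic to "synced"?
fn marker_completes_sync(marker_role: Role, storage_worker_connected: bool) -> bool {
    if storage_worker_connected {
        // Replay-worker markers only bound the 1000-event ring buffer;
        // wait for the storage worker's marker instead.
        marker_role == Role::StorageWorker
    } else {
        // Best available evidence: accept any trusted SyncProvider.
        true
    }
}
```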



before actually flushing its history. The worst-case effect is a
stale UI; it cannot forge events (signatures still verify). Clients
MAY set a floor (e.g. "wait at least 200 ms and see at least one
event *or* one marker") to cap the damage.

"Provider lies" only covers premature emission. The dual failure — a trusted provider that never emits — is the same UX bug the spec exists to fix, just with a different attacker. Add an explicit timeout bullet:

- **Provider withholds.** A trusted `SyncProvider` that never emits a
  marker would pin the UI in "loading" indefinitely. Clients MUST
  apply a per-(topic, provider) deadline (suggested: 5 s wall-clock
  *or* 500 ms of stream silence after at least one event has been
  received, whichever is later). On expiry the client emits
  `HistorySyncDegraded { topic, provider, reason: Timeout }` and
  proceeds as if that provider had not been in the trust list for
  this topic.

Without this, "first-trusted-wins" is also "first-trusted-can-deadlock-the-UI."
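One reading of the suggested deadline, as a pure predicate over elapsed times so the policy is testable: 5 s of wall-clock must have passed, and, once at least one event has been received, the stream must also have gone quiet for 500 ms. The durations are the suggested values above, not normative:

```rust
// Hypothetical per-(topic, provider) timeout predicate.
use std::time::Duration;

fn deadline_expired(
    since_sync_start: Duration,
    since_last_event: Option<Duration>, // None if no event received yet
) -> bool {
    let wall = since_sync_start >= Duration::from_secs(5);
    match since_last_event {
        // Events are still flowing: require quiet before declaring timeout.
        Some(silence) => wall && silence >= Duration::from_millis(500),
        // Provider sent nothing at all: wall-clock alone decides.
        None => wall,
    }
}
```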



redundant traffic. The relay forwards these frames verbatim (it
already forwards opaque bytes by topic; the new `MessageType` tag is
invisible to relay-side logic, which is deliberately stateless per
`crates/relay/src/lib.rs:13`).

Two issues here:

  1. broadcast_neighbors (crates/network/src/traits.rs:72) hits all current neighbors of the topic mesh, not just the new joiner. On every NeighborUp for a popular topic this re-spams every peer with a marker that is meaningless to anyone but the joiner. You either need a real per-peer direct-send primitive added to TopicHandle, or the marker has to go via mesh broadcast and rely on (provider_peer, epoch) dedup at the receiver.

  2. "The relay forwards these frames verbatim" — true for mesh-broadcast, but if you switch to a per-peer direct send the relay is not on that path at all (crates/relay/src/lib.rs:13 is gossip-only). Worth being explicit about which transport carries the marker before claiming relay-bridge transparency.



/// The TopicId (blake3 of the canonical topic string) this marker applies to.
pub topic_id: [u8; 32],
/// The Ed25519 PeerId of the provider emitting the marker.
pub provider_peer: EndpointId,

provider_peer is a redundant-and-dangerous field as written. The signed envelope (crates/identity/src/lib.rs) already carries the signer's PeerId; including a separate provider_peer inside the unsigned struct lets a relay or MITM rewrite it to attribute the marker to a different peer. At best this is duplicated info; at worst it splits the trust check (signature says X, payload says Y — which wins?).

Suggested fix: drop the field and have the receiver use the verified envelope signer as the provider identity. If you keep it for audit-log convenience, the spec must state that on mismatch the signature wins and the marker is rejected.




do it


@intendednull intendednull left a comment


Round 2: comparative research

Looking outside Nostr, the "am I caught up?" problem has been solved at least six different ways, and EOSE is on the simpler/weaker end of that spectrum:

  • Matrix's /sync uses an opaque next_batch token the client persists and replays — the marker is client-held state, not a one-shot frame — and gaps are signalled with a separate limited: true flag plus prev_batch for backfill (spec, tutorial).
  • XMPP MAM uses an inline <fin complete='true'/> element wrapping an RSM <set> that carries <first>, <last>, and <count> — completeness is bound to a paginated query, not to a stream (XEP-0313).
  • IRCv3 CHATHISTORY wraps results in an explicit BATCH +id chathistory / BATCH -id pair, with empty batches required so clients don't hang (spec).
  • Kafka exposes the high-water mark / log-end offset on every fetch response, so consumers compute "lag = LEO − offset == 0" client-side — no separate signal at all (2minutestreaming).
  • Automerge's sync protocol has no explicit "done" marker — both peers exchange the heads (hashes) of their commit graph, and termination is inferred from generate_sync_message() → None on both sides (docs, Rust).
  • JMAP defines an opaque state string with a mandatory cannotCalculateChanges error path when the server can't bridge from the client's old token, plus a 30-day retention SHOULD (RFC 8620).

Critically, only Automerge and Subduction tie the sync boundary to a cryptographic object — heads-as-hashes — and Subduction extends that with explicitly signed payloads (Subduction).

New insights / recommendations

  1. Prefer a Matrix-style opaque resume token over a provider-emitted marker, or combine them. The current spec ships a one-shot HistorySyncComplete frame. If the frame is dropped (relay reconnect mid-stream, gossip mesh churn, client tab restart between marker arrival and persistence), the client has no second chance — it's back to the spinner heuristic the spec set out to kill. Matrix's next_batch is durable: lose the network and the next call resumes from the last-known token. A Willow analogue would be (provider_peer, epoch, last_event_hash) persisted to the client's event store, replayable on reconnect, with the marker frame just being the transport that hands the token over. This also subsumes Open Question #4 (per-channel vs per-server) for free — the client just stores a token per topic. JMAP's cannotCalculateChanges error is the matching design lesson: define what happens when a provider can no longer serve from the client's token, because it will happen with a 1000-event ring buffer.

  2. Automerge's symmetric heads-exchange maps the Willow DAG cleanly; EOSE does not. Willow already has a per-author Merkle DAG (2026-04-01-per-author-merkle-dag-state-design.md) and the spec hand-waves "after the DAG-diff exchange completes" for the peer-to-peer case. Nostr's EOSE is fundamentally an asymmetric server-to-client signal — the relay decides when the client is done. Automerge's protocol is symmetric: both peers send Have(heads) summaries and both decide independently when they have nothing new to send. For peer-to-peer Willow this is a strictly better fit, because there is no asymmetry to begin with — every peer is potentially a provider. The replay/storage workers are the only providers where EOSE-style asymmetry actually maps, and even there the heads-exchange is what produces the natural completion moment. Recommendation: spec the marker as the wire encoding of "my frontier matches yours" (i.e. include the provider's set of DAG heads, not just last_event_hash), and the asymmetric "done now" interpretation falls out as a special case for clients that don't carry full DAG state yet.

  3. Sign the boundary, or at least bind it to a state hash. Open Question #3 already gestures at this; the comparative survey strengthens the case. MAM, IRCv3, Matrix, Kafka, and Nostr all have zero cryptographic protection on the completion signal — clients trust the server's word. Subduction is the explicit counterexample: its sync extension signs payloads precisely so a malicious provider can't lie about what it has. Willow already signs every event via Ed25519 in crates/identity/; the marker should ride the same envelope and additionally include the provider's StateHash (from willow-state's SHA-256 of the materialized state). That converts the "Provider lies" sharp edge from "stale UI" into "client can detect divergence and refuse to flip the flag," at the cost of one hash field. The peer-to-peer case in particular is undefendable without this — there is no SyncProvider permission to filter on for arbitrary peers.

  4. Cite Matrix's documented sync-storm class of bug as a concrete failure mode to avoid. synapse#8518 describes a long-poll /sync loop spontaneously turning into a storm where every request returned in 30–40 ms with empty rooms and a fresh next_batch — the client kept replaying a token the server immediately invalidated. Related: #8486 (incorrect prev_batch in incremental sync), #10535, #6030 (empty token → 500). The mechanism is token/epoch desynchronisation between client and server under restart conditions — exactly the case the spec's epoch: u64 field is meant to handle. The lesson is to test the boundary explicitly: a provider that restarts mid-stream, a marker that arrives before the events it claims to bound, a client that holds a marker across a process restart, a provider whose epoch counter resets to zero (the "did I remember to bump it?" failure named in Open Question #5 — Matrix shipped this exact bug class for years). The current crates/relay/tests/ plan in the spec doesn't cover any of these; it should.

  5. The "completeness" question is doing two jobs and should be split. MAM's complete='true' answers "did this query return everything matching the filter?" Kafka's high-watermark answers "is my offset equal to the latest produced offset?" Automerge's heads-equality answers "do our DAGs converge?" These are three different questions and Willow's spec conflates them: HistorySynced is presented as one boolean but actually encodes "the provider has flushed its buffer" (MAM-style), which says nothing about whether the client's DAG converges with anyone else's (Automerge-style) or whether the topic gossip mesh has quiesced (Kafka-style). With three concurrent providers (replay/storage/peer) per topic, the first-trusted-wins default policy is going to flip the UI flag long before convergence holds — by design. That may be the right UX trade-off, but it should be named in the spec as "buffer-flushed completeness, not state convergence," with a separate signal (or a flag on the same signal) reserved for the convergence case. Otherwise consumers further down the stack will assume the stronger property and write subtly wrong code against it.

If we were starting from these references instead of Nostr

The wire frame would carry an opaque, client-persisted resume token (Matrix next_batch semantics) rather than a transient one-shot marker. The token would be bound to a cryptographic checkpoint — the provider's set of DAG heads plus the materialized StateHash — making "have I caught up?" answerable by the client comparing hashes rather than trusting a flag (Automerge / Subduction lesson). The provider would expose a Kafka-style watermark continuously (last-known head per topic) so a client could compute its own lag without waiting for a discrete event, and the discrete HistorySynced event would be derivable client-side as "watermark observed, my frontier ⊇ that watermark." The "buffer truncated, please reconcile from a deeper source" path would be a first-class error (JMAP cannotCalculateChanges) rather than the silent-truncation footnote currently in Sharp Edges. And the test matrix would explicitly cover the Synapse #8518 class — provider restart, epoch desync, marker-before-events, marker-after-events, marker-without-events — because every protocol surveyed here that did not test those shipped a sync-storm bug eventually.


Generated by Claude Code

- Drop `provider_peer` from payload; receiver derives from verified
  envelope signer (forecloses MITM/relay-rewrite class).
- Rename `epoch` -> `stream_generation` to disambiguate from the
  crypto-epoch in #220.
- Switch from `broadcast_neighbors` to `broadcast` with receiver-
  side dedup on `(provider, stream_generation)`; no new transport
  primitives needed.
- Soften NIP-01 over-claims; explicitly note the truncation-detection
  contract is deliberately stronger than NIP-01.
- Fix `crates/relay/src/lib.rs:13` line citation; reference the
  `Trust model` section in module docs instead.
intendednull pushed a commit that referenced this pull request Apr 25, 2026
Apply review decisions to the relay capability document spec:

- Promote signing to v1 MUST (inline signature, RFC 8785 JCS canonical
  bytes, signature field excluded from canonicalisation).
- Specify dispatch surgery: explicit branch in dispatch_connection for
  /.well-known/willow plus OPTIONS preflight; reuse BOOTSTRAP_IO_TIMEOUT
  and MAX_CONCURRENT_BOOTSTRAP_CONNECTIONS; extend (not mirror) the
  handle_bootstrap_connection pattern.
- Drop event_schema_range (no EVENT_SCHEMA_VERSION exists in
  willow-state); list as future work.
- Resolve multi-tenant question: one shared doc per host, relay is
  topic-agnostic.
- Soften operator-metadata leakage: version is coarse semver, software
  is project name, both MAY be omitted.
- Two-tier caching by status: ok=300s, degraded/read_only=5s with
  must-revalidate.
- Recommend WS clients also send Sec-WebSocket-Protocol; JSON is
  advisory pre-connect.
- Fix port framing: relay binds one port multiplexing TCP+WS, not two.
- Drop sync_provider_only (operator vibes without a concrete
  pre-handshake check).
- Add Cross-spec coordination table pinning feature tags for #214,
  #216, #217, #218, #219, #220, #221.
- Rewrite Open Questions to keep only genuinely-open items (paid-relay
  semantics, utilisation telemetry, relay discovery, feature registry).

https://claude.ai/code/session_01XmbVXWnKTRVjPp9kmKRSBn
intendednull and others added 2 commits April 25, 2026 02:01
- move HistorySyncComplete from a new MessageType variant in
  willow-transport to a new WireMessage variant in willow-common,
  matching the production convention that all gossip traffic is
  routed through MessageType::Channel and avoids a transport→state
  dependency cycle
- correct motivation to describe the actual two sync paths
  (SyncRequest/SyncBatch on _willow_server_ops + WorkerRequest::Sync/
  History on _willow_workers) instead of a fictional "events
  replayed on the gossip topic" path
- add "Which gossip topic carries the marker" subsection clarifying
  that workers only subscribe to WORKERS_TOPIC and SERVER_OPS_TOPIC
  in current code, so the marker travels on _willow_server_ops for
  worker-served history (not per-channel topics)
- describe replay worker as per-author DAG with
  max_events_per_author cap (default 1000) plus LRU server cap, not
  a flat 1000-event ring buffer
- clarify SyncProvider is a permission, not a role
- name the storage path the marker draws from (sync_since /
  ORDER BY seq ASC, not history() / DESC) and explain why
- replace coined "DAG-diff protocol" with the spec's actual
  SyncMessage::Advertise/Request/Response terminology
- reconcile motivation's "two distinct sync providers" with the
  three provider classes listed below
- fix line citations: events.rs:19 (was :10), events.rs:57 (was
  :48), traits.rs:56 / :84 (was :52), relay/src/lib.rs:9-22
  Scope-section (was Trust-model section)
- drop duplicated "verified envelope signer" parenthetical in
  Sharp edges section

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…quisite - round 2

- Motivation: drop the false claim that clients consume WorkerRequest/Response today
- No re-broadcast on _willow_server_ops; corrected
- Peer-to-peer provider class explicitly conditional on per-author DAG sync spec
- Prerequisite: replay/storage workers must add SERVER_OPS broadcast handle
- Storage worker marker transport pinned: (a) broadcast on SERVER_OPS, not (b) WorkerResponse
- Minor line cites and wording (MessageType:64->66, wire.rs:105-120, relay scope:8-22)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
intendednull added a commit that referenced this pull request Apr 25, 2026
… - round 4

- PR #214 reframed as co-proposed (not landed precedent); merge-order
  is independent, conflicts are trivial.
- PermissionDenied: explicitly part of the work to thread typed
  Permission through check_permission and InsertError, instead of
  parsing it back out of a formatted string.
- OQ1 dropped: the envelope is already authenticated via pack_wire,
  so the body is correct and the question was incoherent.
- target_peer: EndpointId added to RejectPayload; clients filter by
  self.endpoint_id() (mirrors VoiceSignal/JoinResponse/JoinDenied).
- Wording: "dropped silently" -> "logged-only and not signaled".
- Dep graph: state -> identity -> transport (transitive, not direct).
- TopicId is not actually re-exported by willow-network; spec now
  says so and uses [u8; 32] on the wire.
- PartialEq, Eq added to RejectPayload + RejectContext.
- Relay connection-cap mapped to Capacity (logged-only; semaphore
  has no retry_after_ms basis); RateLimited moved to future producers.
- Future-producer rows split into their own subsection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…round 4

- Blocks-on attribution: per-author DAG spec, not this spec, lands SyncMessage wire variant
- Dep graph edge: state → identity → transport (transitive)
- Storage worker: one SyncBatch reply on _willow_workers (not a stream)
- Motivation: clarify SERVER_OPS carries state-sync, not live chat
- TopicId: shared type, not "per-channel"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
intendednull added a commit that referenced this pull request Apr 25, 2026
…lation - round 4

- Per-envelope budget: 64 KiB minus small constant (~200B); dropped wrong "~57-60 KiB usable"
- Migration: don't bump PROTOCOL_VERSION; use additive SyncRequestV2/SyncBatchV2 variants for soft rollout
- SyncRequest gains request_id (matched to worker path's String for consolidation)
- Worker path: only `more` added; outer WorkerWireMessage::Response.request_id reused
- Index claim downgraded: NOT IN disjunct still requires server-scan; recommend restructuring sync_since to use explicit per-author predicates
- HistorySyncComplete framed as defined by spec #214 (unmerged); SyncCompleted relationship spelled out
- Author-count threshold corrected to ~900
- listeners.rs MAX_SYNC_BATCH_SIZE = 10_000 acknowledged; defense-in-depth retained
- Line cites tightened across multiple sections

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intendednull intendednull merged commit e092e9b into main Apr 26, 2026
5 checks passed
@intendednull intendednull deleted the claude/spec-history-sync-eose branch April 26, 2026 07:24