spec: history sync completion signal (EOSE-equivalent) #214
Conversation
intendednull
left a comment
Solid spec overall — the topic-scoped framing, multi-provider trust policy, and last_event_hash truncation guard are all well-motivated and improve on what NIP-01 gives you. Posting as COMMENT (GitHub blocks REQUEST_CHANGES on one's own PR), but flagging two issues that I'd treat as blocking before merge:
- **Internal contradiction in the wire path.** The spec specifies `broadcast_neighbors` and relay forwarding, but `broadcast_neighbors` is by construction not forwarded by gossip relays in the current `TopicHandle` trait, and there is no per-peer direct-send primitive — calling it on every `NeighborUp` would spam every existing neighbor with a marker meant for one new joiner. Pick mesh-broadcast or add a direct-send primitive, then make "Provider-side emission" follow.
- **The `epoch` field will collide with PR #220 (epoch-key-rotation)** and the existing crypto-epoch terminology in the test-architecture spec. A rename (e.g. `provider_run_id`, `stream_generation`) avoids a year of confused review comments.
Other inline comments cover NIP-01 over-claims, provider_peer being unsigned (and thus relay-mutable), the missing "provider drops mid-sync" edge case, and a stray line-number citation. None are showstoppers individually.
Generated by Claude Code
> redundant traffic. The relay forwards these frames verbatim (it
> already forwards opaque bytes by topic; the new `MessageType` tag is
> invisible to relay-side logic, which is deliberately stateless per
> `crates/relay/src/lib.rs:13`).
Internal contradiction: broadcast_neighbors is incompatible with relay forwarding.
The spec says the marker is sent as "a direct broadcast_neighbors-style frame from a provider to a newly-joined peer" and that "the relay forwards these frames verbatim." These are mutually exclusive in the current trait surface:
- `TopicHandle::broadcast_neighbors` (`crates/network/src/traits.rs:72`) is documented as "Broadcast data only to direct neighbors (not forwarded further)" — by construction it stops at direct neighbors and will not be forwarded by the relay to the joining peer.
- The relay today only re-broadcasts via `TopicHandle::broadcast` (see `crates/relay/src/lib.rs` ~L450/676); there is no code path that forwards neighbor-only frames.
- There is also no per-peer direct send in the `Network`/`TopicHandle` trait — `broadcast_neighbors` hits all direct neighbors, not the one newly-joined peer, so calling it on every `NeighborUp` will spam every existing neighbor with a marker meant for one new joiner.
Suggest picking one model and being explicit: either (a) the marker is regular gossip (uses broadcast, costs one mesh-wide message per NeighborUp, relays forward as normal), or (b) introduce a new send_to(peer, bytes) direct frame in the Network trait and document how the relay bridges it. The "Provider-side emission" section needs to follow whichever choice you make.
> pub last_event_hash: Option<EventHash>,
> /// Monotonically-increasing cursor scoped to (topic_id, provider_peer).
> /// Incremented when the provider restarts and re-streams from zero.
> pub epoch: u64,
Naming collision with PR #220 (epoch-key-rotation). The word "epoch" is already heavily overloaded in this codebase — docs/specs/2026-04-13-test-architecture.md references "epoch key" and "epoch isolation" for the forward-secrecy ratchet, and #220 (currently in flight) leans on "epoch" for the key-rotation generation counter. A reader skimming the wire format will plausibly conflate HistorySyncComplete::epoch with the crypto epoch, especially since both are u64 generation numbers tied to "restart / rotation."
Concrete fixes (any of):
- Rename to `stream_generation`, `provider_session_id`, or `provider_run_id`.
- If you keep `epoch`, add a doc-comment explicitly disambiguating: "Unrelated to the encryption epoch in the key-rotation spec; this is per-(topic, provider) stream lifecycle only."
Open question 5 also asks whether epoch should be a nonce instead — worth deciding alongside the rename, since "nonce" makes the non-crypto-epoch meaning self-evident.
> treat the sync as incomplete (see "Sharp edges" below). `epoch` exists
> so a provider that crashes and restarts mid-stream cannot confuse a
> client still holding a stale marker from the previous run — mirroring
> Nostr's one-EOSE-per-REQ guarantee without needing a subscription id.
Two NIP-01 claims here are stronger than what NIP-01 actually says.
- NIP-01 does not in fact carry `last_event_hash` or any equivalent — and crucially, it explicitly does not guarantee completeness ("EOSE is purely a delimiter signal"). The motivation paragraph above (lines 20-23) frames Nostr as having "solved this cleanly," but Nostr's solution is just the delimiter; the truncation/trust handling here is a Willow invention. That's fine, but the prose should own it rather than imply the design is paralleling Nostr.
- "Mirroring Nostr's one-EOSE-per-REQ guarantee" — NIP-01 doesn't formalize this. In practice relays send one EOSE per `REQ` because there is one `REQ`, but there is no spec-level guarantee about restart/replay behavior (it is undefined in NIP-01 — the failure-mode equivalent is `CLOSED`, not a fresh EOSE).
Recommend softening both passages to "inspired by" rather than "mirroring," and adding a note in "Sharp edges" that the truncation-detection contract is stronger than NIP-01's, deliberately.
> single audit log can correlate markers across topics. `provider_peer`
> is the provider's `EndpointId`, **not** a signed field — the signature
> lives on the Ed25519-wrapped envelope the way ordinary chat events are
> wrapped in `crates/identity/src/lib.rs`. `last_event_hash` lets the
provider_peer being unsigned is subtle and worth justifying.
The spec says provider_peer is "not a signed field — the signature lives on the Ed25519-wrapped envelope." If the envelope is signed by the provider's identity key, then the signer's identity is implicitly authenticated. But provider_peer is a separate field inside the payload that a malicious relay could in principle rewrite to point at another trusted SyncProvider, causing the client to credit the wrong provider with completion.
That probably doesn't break safety (the client still only trusts SyncProviders), but it can:
- Spoof the "still_pending" accounting in `ClientEvent::HistorySynced`.
- Let a marker from provider A satisfy a strict policy that was waiting on provider B specifically.
Either bind provider_peer to be == envelope.signer and reject mismatches, or drop the field from the payload entirely and read it from the envelope at unpack time. The latter is also more bytes-efficient.
let's do the latter if possible
> reconnects requests history again; providers emit a new marker with
> a new `epoch`. The client SHOULD discard markers whose
> `(provider_peer, epoch)` is older than one it has already seen
> since the last `NeighborUp` for that provider.
Missing open question: what happens mid-sync if the provider goes offline before emitting the marker?
The "Sharp edges" section covers a compromised provider lying with a marker, and resumption with a new epoch, but not the much more common case: a SyncProvider drops the QUIC connection (or a NeighborDown arrives) after streaming 400 of its 1000 historical events and before the marker is sent. Under the default policy:
- If another trusted SyncProvider eventually emits a marker, the client flips to "synced" — but it has only the union of received events, with no signal that some events the dead provider was about to send are also missing from the surviving provider. (Storage worker covers replay's gaps in practice, but only if the storage worker is connected.)
- If no other SyncProvider is connected, `still_pending` decrements but never reaches a marker — the UI is stuck loading forever.
Worth either an open question or an explicit handling rule: e.g. on NeighborDown(provider) while sync is in flight, emit a HistorySynced { still_pending, .. }-equivalent that decrements still_pending and records that this provider's contribution is incomplete; if still_pending would hit zero with at least one incomplete provider and no completed marker, fall through to a timeout-based HistorySyncFailed (open question 2 above).
> redundant traffic. The relay forwards these frames verbatim (it
> already forwards opaque bytes by topic; the new `MessageType` tag is
> invisible to relay-side logic, which is deliberately stateless per
> `crates/relay/src/lib.rs:13`).
Minor citation nit: crates/relay/src/lib.rs:13 lands on //! All routines in this crate operate at the **transport layer**. — the trust-model statement you're alluding to is in the same module-doc block but a few lines down (around L23-L43, "Trust model"). Either bump the line number or just cite the file ("see the Trust model section in crates/relay/src/lib.rs's module docs"). Same drive-by check on the other citations: crates/transport/src/lib.rs:64, :36, crates/network/src/traits.rs:132, and crates/client/src/events.rs:48 all match the current code on main.
intendednull
left a comment
Strengths
- Correctly identifies that Willow has no NIP-01-style subscription id and adapts the EOSE concept to `(topic_id, provider_peer, epoch)` — that is the right key, and the rationale for each component is well argued.
- The `epoch` + `last_event_hash` pair gives both restart-safety and a tripwire for silent truncation; folding both into a single ~80B frame is a sensible engineering trade.
- Explicitly carving the marker out of `apply_event` keeps the authority-bearing surface unchanged (per `docs/specs/2026-04-12-state-authority-and-mutations.md`). This is the right side of the line to land on.
Concerns
- **First-trusted-wins is wrong for the case the feature exists to solve.** The replay worker holds only the last 1000 events (`crates/relay/src/lib.rs` worker docs; see also `CLAUDE.md` "Local Development Stack"). On a server with >1000 historical events the replay worker will always return its marker first, with `last_event_hash` pointing at an event the storage worker can predate by months. The default policy as written will flip the spinner off while the storage worker is still streaming the long tail, which is exactly the user-visible bug the spec claims to fix. The default should be "wait for storage if a storage worker is in the trust list and currently connected; otherwise accept any SyncProvider."
- **No timeout for a withholding provider.** "Sharp edges" handles a provider that emits a marker too early but not one that never emits at all. A trusted provider can pin a client in "loading" forever by never sending the frame — same UX failure mode the spec exists to eliminate. There needs to be a per-provider timeout (e.g. 5 s wall-clock, or "no new events for N ms") that escalates to the next provider in the trust list and surfaces a `HistorySyncDegraded` event so the UI can degrade gracefully rather than spin forever.
- **Per-channel scope under-specifies the join flow.** Open question 4 hand-waves the most important consumer. A user joining a 20-channel server today subscribes to `SERVER_OPS_TOPIC` plus N channel topics (`crates/network/src/topics.rs:42`). With per-topic markers the joining UI receives 20+1 `HistorySynced` events with no spec for "the server is ready" — and "is the join complete?" is what the user actually asks. Either define a server-wide aggregate (`SERVER_OPS_TOPIC` marker + an explicit list of channel topics covered) or specify how the client computes "all channels caught up" deterministically. This is not a follow-up; it's load-bearing for the join-flow UX.
- **Negentropy interaction (PR #219) is not addressed and the two specs conflict.** The negentropy spec replaces the bulk-replay path with logarithmic range reconciliation; "completion" in negentropy is the natural termination of the recursion (when both sides converge to Skip ranges everywhere), not a separate frame. Once #219 lands, `HistorySyncComplete` is either redundant with the negentropy terminator or it has to be redefined as "negentropy session converged for this `(topic, peer)`." This spec needs a section that picks one of: (a) supersede this spec when #219 lands, (b) keep this as the user-facing signal and have the negentropy session emit it on terminate, or (c) keep both and define the relationship. As-is the two PRs would produce overlapping-and-inconsistent wire frames.
- **The state-hash open question (#3) should be resolved as YES.** The `CLAUDE.md` trust model already says clients "get state hash from multiple peers, use the hash agreed upon by the most trusted sources." Today there is no transport-level moment that naturally carries that hash; a sync-completion frame is exactly that moment. Adding `state_hash: StateHash` to the marker means a client's "I'm caught up" decision and its "do my providers agree on state?" decision converge on the same evidence, and a divergent provider becomes detectable for free — a cross-check the spec otherwise leaves to a future mechanism. This also strengthens concern #1: with state hashes you can prefer the marker that matches the majority-agreed hash rather than the first-arrived one.
Suggestions
- **Add a "Resumption" section.** "I've been offline 3 days, when am I caught up?" is not the same problem as "I just joined." A returning client has a known last-applied event hash and timestamp, so the marker semantics it needs are "everything since X." Either point at how the DAG-diff protocol from `2026-04-01-per-author-merkle-dag-state-design.md` handles this (and how the marker is reused), or specify a `since: Option<EventHash>` field on the request side. Right now the spec only addresses cold joins.
- **Specify a deduplication / dedup window for the "no markers from untrusted peers" rule.** A peer can be granted SyncProvider mid-session via a `GrantPermission` event (`crates/state/src/event.rs`). What happens to a marker received before the grant was observed? Define: (a) drop it, (b) buffer it for one `apply_event` tick, or (c) accept retroactively. This matters because the grant order across topics is not deterministic on the wire.
- **Make `last_event_hash` mandatory and add a `count: u64`.** Open question 1 should resolve to "mandatory" — the cost is one bool in the provider; the value is unambiguous truncation detection. While you're at it, include the count of events streamed in this session so the client can sanity-check against its received-from-this-provider counter (catches reorderings and partial drops that share a final hash by coincidence).
- **Replace `epoch: u64` with a 16-byte random session id.** Open question 5 is right to flag this. A monotonic counter requires the provider to persist epoch across crashes; a random session id (`[u8; 16]`) is stateless, collision-resistant for the lifetime of any client cache, and removes the "did I bump it?" failure mode entirely. The cost over `u64` is 8 bytes per marker — irrelevant at this scale.
intendednull
left a comment
Inline comments on four spots where the spec is concretely wrong or incomplete: (1) the first-trusted-wins default that loses history past 1000 events, (2) the missing withholding-provider timeout, (3) the broadcast_neighbors semantics that don't actually do per-joiner delivery, and (4) provider_peer being unsigned and redundant with the envelope signer. See the summary review for the structural concerns (negentropy interaction, per-channel join scope, state-hash piggybacking).
> for a topic as soon as it has received a valid marker from **any
> peer granted `SyncProvider` permission** (the relay-worker role in
> the state machine). This minimises perceived load time and matches
> the UX the feature exists to deliver.
This default is wrong for the most common deployment. The replay worker streams from a 1000-event ring buffer; on any server with more than 1000 events of history it will always finish first, and last_event_hash will point at the 1000th-most-recent event — not the actual tail. Accepting its marker as "caught up" silently drops everything older.
Suggested change:
1. **Default policy (correctness-first):** if a peer with the storage-worker role is in the trust list and currently connected, wait for *its* marker. Otherwise accept the first marker from any peer granted `SyncProvider`. The replay worker's marker by itself is sufficient only if no storage worker is reachable.
This trades ~one network RTT in the storage-present case for not lying to the user.
> before actually flushing its history. The worst-case effect is a
> stale UI; it cannot forge events (signatures still verify). Clients
> MAY set a floor (e.g. "wait at least 200 ms and see at least one
> event *or* one marker") to cap the damage.
"Provider lies" only covers premature emission. The dual failure — a trusted provider that never emits — is the same UX bug the spec exists to fix, just with a different attacker. Add an explicit timeout bullet:
- **Provider withholds.** A trusted `SyncProvider` that never emits a marker would pin the UI in "loading" indefinitely. Clients MUST apply a per-(topic, provider) deadline (suggested: 5 s wall-clock *or* 500 ms of stream silence after at least one event has been received, whichever is later). On expiry the client emits `HistorySyncDegraded { topic, provider, reason: Timeout }` and proceeds as if that provider had not been in the trust list for this topic.
Without this, "first-trusted-wins" is also "first-trusted-can-deadlock-the-UI."
> redundant traffic. The relay forwards these frames verbatim (it
> already forwards opaque bytes by topic; the new `MessageType` tag is
> invisible to relay-side logic, which is deliberately stateless per
> `crates/relay/src/lib.rs:13`).
Two issues here:
- `broadcast_neighbors` (`crates/network/src/traits.rs:72`) hits all current neighbors of the topic mesh, not just the new joiner. On every `NeighborUp` for a popular topic this re-spams every peer with a marker that is meaningless to anyone but the joiner. You either need a real per-peer direct-send primitive added to `TopicHandle`, or the marker has to go via mesh broadcast and rely on `(provider_peer, epoch)` dedup at the receiver.
- "The relay forwards these frames verbatim" — true for mesh-broadcast, but if you switch to a per-peer direct send the relay is not on that path at all (`crates/relay/src/lib.rs:13` is gossip-only). Worth being explicit about which transport carries the marker before claiming relay-bridge transparency.
> /// The TopicId (blake3 of the canonical topic string) this marker applies to.
> pub topic_id: [u8; 32],
> /// The Ed25519 PeerId of the provider emitting the marker.
> pub provider_peer: EndpointId,
provider_peer is a redundant-and-dangerous field as written. The signed envelope (crates/identity/src/lib.rs) already carries the signer's PeerId; including a separate provider_peer inside the unsigned struct lets a relay or MITM rewrite it to attribute the marker to a different peer. At best this is duplicated info; at worst it splits the trust check (signature says X, payload says Y — which wins?).
Suggested fix: drop the field and have the receiver use the verified envelope signer as the provider identity. If you keep it for audit-log convenience, the spec must state that on mismatch the signature wins and the marker is rejected.
intendednull
left a comment
Round 2: comparative research
Looking outside Nostr, the "am I caught up?" problem has been solved at least six different ways, and EOSE is on the simpler/weaker end of that spectrum. Matrix's /sync uses an opaque next_batch token the client persists and replays — the marker is client-held state, not a one-shot frame, and gaps are signalled with a separate limited: true flag plus prev_batch for backfill (spec, tutorial). XMPP MAM uses an inline <fin complete='true'/> element wrapping an RSM <set> that carries <first>, <last> and <count> — completeness is bound to a paginated query, not to a stream (XEP-0313). IRCv3 CHATHISTORY wraps results in an explicit BATCH +id chathistory / BATCH -id pair, with empty batches required so clients don't hang (spec). Kafka exposes the high-water-mark / log-end-offset on every fetch response so consumers compute "lag = LEO − offset == 0" client-side, no separate signal at all (2minutestreaming). Automerge's sync protocol has no explicit "done" marker — both peers exchange the heads (hashes) of their commit graph, and termination is inferred from generate_sync_message() → None on both sides (docs, Rust). JMAP defines an opaque state string with a mandatory cannotCalculateChanges error path when the server can't bridge from the client's old token, plus a 30-day retention SHOULD (RFC 8620). Critically, only Automerge and Subduction tie the sync boundary to a cryptographic object — heads-as-hashes — and Subduction extends that with signed payloads explicitly (Subduction).
New insights / recommendations
- **Prefer a Matrix-style opaque resume token over a provider-emitted marker, or combine them.** The current spec ships a one-shot `HistorySyncComplete` frame. If the frame is dropped (relay reconnect mid-stream, gossip mesh churn, client tab restart between marker arrival and persistence), the client has no second chance — it's back to the spinner heuristic the spec set out to kill. Matrix's `next_batch` is durable: lose the network and the next call resumes from the last-known token. A Willow analogue would be `(provider_peer, epoch, last_event_hash)` persisted to the client's event store, replayable on reconnect, with the marker frame just being the transport that hands the token over. This also subsumes Open Question #4 (per-channel vs per-server) for free — the client just stores a token per topic. JMAP's `cannotCalculateChanges` error is the matching design lesson: define what happens when a provider can no longer serve from the client's token, because it will happen with a 1000-event ring buffer.
- **Automerge's symmetric heads-exchange maps the Willow DAG cleanly; EOSE does not.** Willow already has a per-author Merkle DAG (`2026-04-01-per-author-merkle-dag-state-design.md`) and the spec hand-waves "after the DAG-diff exchange completes" for the peer-to-peer case. Nostr's EOSE is fundamentally an asymmetric server-to-client signal — the relay decides when the client is done. Automerge's protocol is symmetric: both peers send `Have(heads)` summaries and both decide independently when they have nothing new to send. For peer-to-peer Willow this is a strictly better fit, because there is no asymmetry to begin with — every peer is potentially a provider. The replay/storage workers are the only providers where EOSE-style asymmetry actually maps, and even there the heads-exchange is what produces the natural completion moment. Recommendation: spec the marker as the wire encoding of "my frontier matches yours" (i.e. include the provider's set of DAG heads, not just `last_event_hash`), and the asymmetric "done now" interpretation falls out as a special case for clients that don't carry full DAG state yet.
- **Sign the boundary, or at least bind it to a state hash.** Open Question #3 already gestures at this; the comparative survey strengthens the case. MAM, IRCv3, Matrix, Kafka, and Nostr all have zero cryptographic protection on the completion signal — clients trust the server's word. Subduction is the explicit counterexample: its sync extension signs payloads precisely so a malicious provider can't lie about what it has. Willow already signs every event via Ed25519 in `crates/identity/`; the marker should ride the same envelope and additionally include the provider's `StateHash` (from `willow-state`'s SHA-256 of the materialized state). That converts the "Provider lies" sharp edge from "stale UI" into "client can detect divergence and refuse to flip the flag," at the cost of one hash field. The peer-to-peer case in particular is undefendable without this — there is no SyncProvider permission to filter on for arbitrary peers.
- **Cite Matrix's documented sync-storm class of bug as a concrete failure mode to avoid.** synapse#8518 describes a long-poll `/sync` loop spontaneously turning into a storm where every request returned in 30–40 ms with empty rooms and a fresh `next_batch` — the client kept replaying a token the server immediately invalidated. Related: #8486 (incorrect `prev_batch` in incremental sync), #10535, #6030 (empty token → 500). The mechanism is token/epoch desynchronisation between client and server under restart conditions — exactly the case the spec's `epoch: u64` field is meant to handle. The lesson is to test the boundary explicitly: a provider that restarts mid-stream, a marker that arrives before the events it claims to bound, a client that holds a marker across a process restart, a provider whose epoch counter resets to zero (the "did I remember to bump it?" failure named in Open Question #5 — Matrix shipped this exact bug class for years). The current `crates/relay/tests/` plan in the spec doesn't cover any of these; it should.
- **The "completeness" question is doing two jobs and should be split.** MAM's `complete='true'` answers "did this query return everything matching the filter?" Kafka's high-watermark answers "is my offset equal to the latest produced offset?" Automerge's heads-equality answers "do our DAGs converge?" These are three different questions and Willow's spec conflates them: `HistorySynced` is presented as one boolean but actually encodes "the provider has flushed its buffer" (MAM-style), which says nothing about whether the client's DAG converges with anyone else's (Automerge-style) or whether the topic gossip mesh has quiesced (Kafka-style). With three concurrent providers (replay/storage/peer) per topic, the first-trusted-wins default policy is going to flip the UI flag long before convergence holds — by design. That may be the right UX trade-off, but it should be named in the spec as "buffer-flushed completeness, not state convergence," with a separate signal (or a flag on the same signal) reserved for the convergence case. Otherwise consumers further down the stack will assume the stronger property and write subtly wrong code against it.
If we were starting from these references instead of Nostr
The wire frame would carry an opaque, client-persisted resume token (Matrix next_batch semantics) rather than a transient one-shot marker. The token would be bound to a cryptographic checkpoint — the provider's set of DAG heads plus the materialized StateHash — making "have I caught up?" answerable by the client comparing hashes rather than trusting a flag (Automerge / Subduction lesson). The provider would expose a Kafka-style watermark continuously (last-known head per topic) so a client could compute its own lag without waiting for a discrete event, and the discrete HistorySynced event would be derivable client-side as "watermark observed, my frontier ⊇ that watermark." The "buffer truncated, please reconcile from a deeper source" path would be a first-class error (JMAP cannotCalculateChanges) rather than the silent-truncation footnote currently in Sharp Edges. And the test matrix would explicitly cover the Synapse #8518 class — provider restart, epoch desync, marker-before-events, marker-after-events, marker-without-events — because every protocol surveyed here that did not test those shipped a sync-storm bug eventually.
- Drop `provider_peer` from payload; receiver derives it from the verified envelope signer (forecloses the MITM/relay-rewrite class).
- Rename `epoch` -> `stream_generation` to disambiguate from the crypto-epoch in #220.
- Switch from `broadcast_neighbors` to `broadcast` with receiver-side dedup on `(provider, stream_generation)`; no new transport primitives needed.
- Soften NIP-01 over-claims; explicitly note the truncation-detection contract is deliberately stronger than NIP-01.
- Fix `crates/relay/src/lib.rs:13` line citation; reference the `Trust model` section in module docs instead.
Apply review decisions to the relay capability document spec:

- Promote signing to v1 MUST (inline signature, RFC 8785 JCS canonical bytes, signature field excluded from canonicalisation).
- Specify dispatch surgery: explicit branch in dispatch_connection for /.well-known/willow plus OPTIONS preflight; reuse BOOTSTRAP_IO_TIMEOUT and MAX_CONCURRENT_BOOTSTRAP_CONNECTIONS; extend (not mirror) the handle_bootstrap_connection pattern.
- Drop event_schema_range (no EVENT_SCHEMA_VERSION exists in willow-state); list as future work.
- Resolve multi-tenant question: one shared doc per host, relay is topic-agnostic.
- Soften operator-metadata leakage: version is coarse semver, software is project name, both MAY be omitted.
- Two-tier caching by status: ok=300s, degraded/read_only=5s with must-revalidate.
- Recommend WS clients also send Sec-WebSocket-Protocol; JSON is advisory pre-connect.
- Fix port framing: relay binds one port multiplexing TCP+WS, not two.
- Drop sync_provider_only (operator vibes without a concrete pre-handshake check).
- Add Cross-spec coordination table pinning feature tags for #214, #216, #217, #218, #219, #220, #221.
- Rewrite Open Questions to keep only genuinely-open items (paid-relay semantics, utilisation telemetry, relay discovery, feature registry).

https://claude.ai/code/session_01XmbVXWnKTRVjPp9kmKRSBn
- move HistorySyncComplete from a new MessageType variant in willow-transport to a new WireMessage variant in willow-common, matching the production convention that all gossip traffic is routed through MessageType::Channel and avoiding a transport→state dependency cycle
- correct motivation to describe the actual two sync paths (SyncRequest/SyncBatch on _willow_server_ops + WorkerRequest::Sync/History on _willow_workers) instead of a fictional "events replayed on the gossip topic" path
- add "Which gossip topic carries the marker" subsection clarifying that workers only subscribe to WORKERS_TOPIC and SERVER_OPS_TOPIC in current code, so the marker travels on _willow_server_ops for worker-served history (not per-channel topics)
- describe replay worker as per-author DAG with max_events_per_author cap (default 1000) plus LRU server cap, not a flat 1000-event ring buffer
- clarify SyncProvider is a permission, not a role
- name the storage path the marker draws from (sync_since / ORDER BY seq ASC, not history() / DESC) and explain why
- replace coined "DAG-diff protocol" with the spec's actual SyncMessage::Advertise/Request/Response terminology
- reconcile motivation's "two distinct sync providers" with the three provider classes listed below
- fix line citations: events.rs:19 (was :10), events.rs:57 (was :48), traits.rs:56 / :84 (was :52), relay/src/lib.rs:9-22 Scope section (was Trust-model section)
- drop duplicated "verified envelope signer" parenthetical in Sharp edges section

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…quisite - round 2

- Motivation: drop the false claim that clients consume WorkerRequest/Response today
- No re-broadcast on _willow_server_ops; corrected
- Peer-to-peer provider class explicitly conditional on per-author DAG sync spec
- Prerequisite: replay/storage workers must add SERVER_OPS broadcast handle
- Storage worker marker transport pinned: (a) broadcast on SERVER_OPS, not (b) WorkerResponse
- Minor line cites and wording (MessageType:64->66, wire.rs:105-120, relay scope:8-22)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… - round 4

- PR #214 reframed as co-proposed (not landed precedent); merge-order is independent, conflicts are trivial.
- PermissionDenied: explicitly part of the work to thread typed `Permission` through `check_permission` and `InsertError`, instead of parsing it back out of a formatted string.
- OQ1 dropped: the envelope is already authenticated via `pack_wire`, so the body is correct and the question was incoherent.
- `target_peer`: `EndpointId` added to `RejectPayload`; clients filter by `self.endpoint_id()` (mirrors `VoiceSignal`/`JoinResponse`/`JoinDenied`).
- Wording: "dropped silently" -> "logged-only and not signaled".
- Dep graph: state -> identity -> transport (transitive, not direct).
- `TopicId` is not actually re-exported by willow-network; spec now says so and uses `[u8; 32]` on the wire.
- `PartialEq`, `Eq` added to `RejectPayload` + `RejectContext`.
- Relay connection-cap mapped to `Capacity` (logged-only; semaphore has no `retry_after_ms` basis); `RateLimited` moved to future producers.
- Future-producer rows split into their own subsection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…round 4

- Blocks-on attribution: per-author DAG spec, not this spec, lands `SyncMessage` wire variant
- Dep graph edge: state → identity → transport (transitive)
- Storage worker: one `SyncBatch` reply on `_willow_workers` (not a stream)
- Motivation: clarify SERVER_OPS carries state-sync, not live chat
- `TopicId`: shared type, not "per-channel"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lation - round 4

- Per-envelope budget: 64 KiB minus small constant (~200 B); dropped wrong "~57-60 KiB usable"
- Migration: don't bump `PROTOCOL_VERSION`; use additive `SyncRequestV2`/`SyncBatchV2` variants for soft rollout
- `SyncRequest` gains `request_id` (matched to worker path's `String` for consolidation)
- Worker path: only `more` added; outer `WorkerWireMessage::Response.request_id` reused
- Index claim downgraded: `NOT IN` disjunct still requires server-scan; recommend restructuring `sync_since` to use explicit per-author predicates
- `HistorySyncComplete` framed as defined by spec #214 (unmerged); `SyncCompleted` relationship spelled out
- Author-count threshold corrected to ~900
- listeners.rs `MAX_SYNC_BATCH_SIZE = 10_000` acknowledged; defense-in-depth retained
- Line cites tightened across multiple sections

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Part of a set of 8 specs drawing lessons from Nostr's protocol and ecosystem. Use this PR to discuss the design — I am not proposing to merge implementation here, only the spec.
What & why
Nostr's `EOSE` (end-of-stored-events) marker is one of its cleanest design wins: a zero-ambiguity boundary between history replay and live tail, so clients know when to stop showing a loading indicator. Willow's gossip + replay/storage worker split currently has no such signal — clients are "probably caught up" but can't be sure.

This spec proposes a new `MessageType::HistorySyncComplete` carrying `{topic_id, provider_peer, last_event_hash, epoch}`, scoped to `(topic_id, provider_peer, epoch)` rather than Nostr's per-subscription id (Willow uses topics as implicit subscriptions). It also specifies the multi-provider rule (first-trusted-wins by default, optional strict-majority mode) and a new `ClientEvent::HistorySynced { topic, provider, still_pending }` for the UI.

Markers never flow through `apply_event`, so malicious markers are UX-only bugs.

Spec file: `docs/specs/2026-04-24-history-sync-eose.md`

Open questions for review
Copied from the spec — looking for reviewer input on:
- `last_event_hash` in the marker
- an explicit failure signal (`HistorySyncFailed`) or just absence-as-error
- Composition with sibling specs:
  - `NegClose`
  - `supports_history_sync_eose: bool`?
  - `HistorySyncFailed` reason variants
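To make the marker shape and the two trust modes concrete for reviewers, here is a rough sketch. It is illustrative only, not the spec's definitions: the field types (`[u8; 32]` ids, `u64` epoch), the `TrustPolicy` enum, and the `is_synced` helper are all assumptions of this example.

```rust
// Hypothetical sketch of the HistorySyncComplete payload, its scope key,
// and the first-trusted-wins / strict-majority completion check.
// Names follow the PR text; concrete types are assumptions.

use std::collections::HashSet;

type PeerId = [u8; 32];

#[derive(Debug, Clone)]
struct HistorySyncComplete {
    topic_id: [u8; 32],        // topic acts as the implicit subscription
    provider_peer: PeerId,     // which provider finished replaying
    last_event_hash: [u8; 32], // truncation guard: hash of the final replayed event
    epoch: u64,                // provider run / stream generation
}

enum TrustPolicy {
    FirstTrustedWins, // default
    StrictMajority,   // optional mode
}

/// Decide whether a topic counts as "history synced" once `completed`
/// trusted providers (out of `trusted_total`) have emitted markers.
fn is_synced(policy: &TrustPolicy, completed: &HashSet<PeerId>, trusted_total: usize) -> bool {
    match policy {
        TrustPolicy::FirstTrustedWins => !completed.is_empty(),
        TrustPolicy::StrictMajority => completed.len() * 2 > trusted_total,
    }
}

fn main() {
    let marker = HistorySyncComplete {
        topic_id: [1; 32],
        provider_peer: [2; 32],
        last_event_hash: [3; 32],
        epoch: 0,
    };
    // Markers are scoped per (topic_id, provider_peer, epoch), not per subscription.
    let _scope = (marker.topic_id, marker.provider_peer, marker.epoch);

    let mut completed = HashSet::new();
    completed.insert(marker.provider_peer);

    // One trusted provider done out of three:
    assert!(is_synced(&TrustPolicy::FirstTrustedWins, &completed, 3));
    assert!(!is_synced(&TrustPolicy::StrictMajority, &completed, 3));
    println!("ok");
}
```

Under this sketch, `ClientEvent::HistorySynced { still_pending }` would flip `still_pending` to `false` exactly when `is_synced` first returns `true` for the configured policy.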
"missing source"on every git-invoked signing request during this session. Consistent with the rest of this spec set.Generated by Claude Code