spec: negentropy-based range reconciliation for history sync #219
Conversation
intendednull left a comment
Spec review. Solid framing of the problem, but several concrete issues block adoption as written.
Blocking
- **Fingerprint formula is wrong.** §Fingerprint specifies `truncate16(sha256(xor_sum(ids) || count_le))`. NIP-77 and `negentropy-protocol-v1.md` both specify (a) addition mod 2^256 of IDs interpreted as 32-byte little-endian integers — not XOR — and (b) the element count encoded as a varint — not a fixed little-endian `u64`. As written we will not interop with `rust-nostr/negentropy` or generate matching reference vectors, defeating the "reuse the crate with minimal glue" rationale. The protocol byte `0x61` is correct; please fix the digest construction and add Hoyte's published test vectors to the unit tier.
- **`MessageType = 7` collides with the EOSE spec.** PR #214 (spec-history-sync-eose, branch `claude/spec-history-sync-eose`) reserves `MessageType::HistorySyncComplete = 7` at the same `crates/transport/src/lib.rs:64` slot. Pick non-overlapping tags or coordinate the numbering in one of the two specs before either lands.
- **`NegClose` ≠ `HistorySyncComplete`.** §Completion signalling claims a Negentropy session naturally satisfies the EOSE contract. It does not: PR #214's `HistorySyncComplete` carries `(topic_id, provider_peer, last_event_hash, epoch)` precisely so the client can detect silent truncation and so a restarted provider doesn't reuse a stale marker. `NegClose { session_id }` carries none of that. Either (a) augment `NegClose` with the EOSE anchor fields and explicitly subsume #214, or (b) keep the two distinct and document the ordering (`NegClose` then `HistorySyncComplete`). The spec must pick one — silently dropping #214's invariants is the worst outcome.
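For concreteness, the digest construction that NIP-77 / `negentropy-protocol-v1.md` specify (as described in the first bullet) can be sketched in Python — an illustrative sketch, not the `rust-nostr` implementation:

```python
import hashlib

def varint(n: int) -> bytes:
    """Negentropy varint: base-128 digits, most significant first,
    high bit set on every byte except the last."""
    digits = []
    while True:
        digits.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    digits.reverse()
    return bytes(d | 0x80 for d in digits[:-1]) + bytes([digits[-1]])

def fingerprint(ids: list[bytes]) -> bytes:
    """Add the 32-byte IDs as little-endian integers mod 2^256 (NOT xor),
    append the element count as a varint (NOT a fixed little-endian u64),
    sha256, truncate to 16 bytes."""
    acc = sum(int.from_bytes(i, "little") for i in ids) % (1 << 256)
    h = hashlib.sha256(acc.to_bytes(32, "little") + varint(len(ids)))
    return h.digest()[:16]
```

Addition mod 2^256 is commutative and invertible, so range fingerprints still compose cheaply when ranges split or merge — the point of fixing the formula is matching the reference vectors byte-for-byte, not gaining a property XOR lacks.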
Significant
- **Reconsider Negentropy at all for the client↔replay-worker path.** Willow's `Event` (`crates/state/src/event.rs:184-204`) has `author: EndpointId`, strictly-monotonic `seq: u64`, and signed `prev: EventHash`. Per-author chains are append-only and authoritative — not the unstructured ID set Negentropy is designed for. Open Question 3 (per-author fast-path) is the actually-important question this spec should answer, not punt: a one-round-trip `max_seq_per_author` vector exchange will resolve the overwhelmingly common reconnect-after-downtime case in 1 RTT and O(authors) bytes, with zero range index. Negentropy's log-diff property only pays off when the symmetric difference is large and spans many authors with no shared structure — rare here. Recommend: spec the vector-clock fast-path as the primary protocol and Negentropy as a fallback for cross-worker / disjoint-author cases, not the other way round. This also removes the entire sort-key debate.
- **`hlc_timestamp` sort key is not currently available.** §Sort key lists `(hlc_timestamp, hash)` as an option, but `Event` does not carry an HLC — only `timestamp_hint_ms`. HLCs live in `willow-messaging::hlc` and are scoped to chat content. Adopting that key requires a wider state-machine change than the spec acknowledges; either remove the option or scope the migration.
- **Adversarial-timestamps mitigation is hand-wavy.** Epoch-day bucketing helps but does not bound recursion within a poisoned bucket — a peer who lands 100k events at `t=0` makes the first bucket's reconciliation worst-case. `SyncProvider`-gated serving doesn't help because the malicious data is already in the DAG and being reconciled. If we keep `(timestamp_hint_ms, hash)`, the spec needs a hard recursion-depth or per-bucket size cap and a defined behaviour on hit (fall back to IdList? abort with `NegErr(Blocked)`?). This further argues for #4.
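The kind of guard being asked for can be tiny. A sketch — all names and thresholds here (`MAX_DEPTH`, `MAX_BUCKET`, the mode strings) are hypothetical, not from the spec:

```python
# Hypothetical hard caps for a single reconciliation session.
MAX_DEPTH = 12     # recursion-depth cap before we give up on fingerprints
MAX_BUCKET = 4096  # largest range we'll keep splitting instead of enumerating

def choose_mode(depth: int, range_len: int) -> str:
    """Decide how to answer a range whose fingerprint mismatched."""
    if depth >= MAX_DEPTH:
        return "abort"    # e.g. NegErr(Blocked): poisoned bucket, stop burning RTTs
    if range_len <= MAX_BUCKET:
        return "id_list"  # enumerate IDs instead of recursing further
    return "split"        # normal fingerprint recursion
```

The spec just needs to pick the two constants and the on-hit behaviour; without them, a single skewed bucket makes session cost attacker-controlled.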
Minor
- **`rust-nostr/negentropy` API fit.** Crate is real (MIT, v0.4–0.5, MSRV 1.51, published as `negentropy` on crates.io), but pinned to NIP-77's `(uint64, 32-byte id)` item shape and an in-memory `Storage` model. The "minimal glue" claim should be qualified — we'll need a bridge type to back it with the SQLite range-scan iterator on storage workers, and we cannot use it directly with a custom sort key.
- **Frame budget.** 256 KB `MAX_DESER_SIZE` minus envelope overhead caps a single `NegMsg` at ~16k fingerprints. Sessions over a large symmetric difference will span many envelopes — the spec needs to say (a) is concurrent send allowed or strict ping-pong, (b) what happens if the responder's reply doesn't fit one envelope (the current text says "split into multiple `NegMsg`" but doesn't define ordering / interleaving with the next request frame), (c) backpressure when the gossip buffer is full.
- **§Storage requirements:** the `range_scan` signature returns `Box<dyn Iterator>` but `EventStore` impls cross WASM/native — confirm the trait stays object-safe and `Send`-bounded per the dual-target rule in CLAUDE.md.
- **§Wire protocol table** puts `initial_msg: Vec<u8>` inside `NegOpen`. NIP-77 mirrors this, but it's worth noting this means the first fingerprint already commits the initiator to a sort key the responder hasn't acked — define the failure mode if the responder disagrees on sort key (`NegErr(Unsupported)`?).
- **Open Question 5** (require a permission to initiate): yes — at minimum require server membership, otherwise any peer that learns a `ServerId` can probe existence and event-set fingerprints. Easy to add now; awkward to add later.
Overall: the direction is reasonable, but I think the right v1 here is the per-author seq vector exchange, with Negentropy reserved for worker↔worker and disjoint-history cases. If we're going full Negentropy, the fingerprint and EOSE-overlap issues must be resolved before implementation starts.
Generated by Claude Code
intendednull left a comment
Thorough spec with a clear algorithmic story and good cross-references into the codebase — the algorithm summary, fingerprint definition, and storage-index sketch are all concrete enough to implement against. That said, I have substantive concerns about the primary design decision and the interaction with existing Willow invariants. I don't think this should land as-is.
Strengths
- The motivation is well-grounded: the current "dump 1000 events from the replay ring buffer" path is genuinely wasteful, and `O(log(|A ⊕ B|))` reconciliation is the right shape of answer for worker↔worker replication.
- Mirroring NIP-77's fingerprint (`truncate16(sha256(xor_sum || count_le))`) byte-for-byte is the right call if we stay with Negentropy — XOR homomorphism is exactly what lets ranges split cheaply, and deviating would forfeit reference vectors.
- Filter design is sensible: structural events (`GrantPermission`, `CreateChannel`) bypass the channel filter so server structure always converges. This is important and easy to get wrong.
- `NegClose` satisfying the `SyncComplete` contract from the EOSE spec (#214) is a nice composition — one less redundant signal to carry.
- Relay stays a stateless bridge (`crates/relay/src/lib.rs:16-41`), session state lives in participants. Keeps the trust model unchanged.
- Test matrix names the right scenarios (three-peer, edge cases, byte-count assertion on reconnect).
Concerns
- **Sort key is the wrong call and is the most important decision in the spec.** `timestamp_hint_ms` is documented as "Display only — never used for ordering" (`crates/state/src/event.rs:202-203`). Making it the canonical sort key of the reconciliation index is a direct inversion of that invariant, and since the field is signed and author-controlled, it puts adversarial input on the hot path. The epoch-day bucketing + SyncProvider gating is weaker than it reads — any member with `SendMessages` can produce skewed timestamps, and bucketing just concentrates the pathology into a single bucket. See my inline comment on line 57.
- **Negentropy may be overkill for Willow's DAG.** Willow has monotonic per-author `seq` + `prev` chains that Nostr doesn't. A "max seq per author" vector exchange is O(authors) state, one round trip, exact — and strictly dominates Negentropy for the common reconnect/warm-start cases. Open Question #3 should not be an open question; it should drive the framing. See my inline comment on line 248.
- **Encrypted channel-key distribution is waved away.** "Sealed key shares stay on a parallel unicast flow" is one line describing a hard problem. If a peer comes online after missing `RotateChannelKey` and the sender is offline, there's no recovery story. Either design it here or explicitly defer and flag that encrypted-channel sync is incomplete. See inline comment on line 225.
- **Envelope budget and frame sizing are underspecified.** "16 000 fingerprints per NegMsg" assumes the full 256 KB `MAX_DESER_SIZE`, but the gossip transport has a 64 KB `max_message_size` and NIP-77 frames carry more than just the 16-byte digest. `frameSizeLimit` needs to be a protocol parameter and receivers need to know continuation semantics. See inline comment on line 201.
- **`rust-nostr/negentropy` dependency risk unexamined.** Open Question #1 flags it but doesn't answer it. License? Audit state? WASM support (we compile library crates to `wasm32-unknown-unknown` per CLAUDE.md)? Maintenance cadence? If the answer is "fork and port," that's a materially different cost than "add a dep." This needs to be resolved before merging the spec, not after.
- **SyncProvider-only serving conflicts with peer-to-peer sync.** The spec gates `NegOpen` responders to SyncProvider, but the Integration Points table lists "client ↔ replay worker" and "client ↔ storage worker" as the primary paths — fine if workers hold SyncProvider, but then regular peers can never directly reconcile with each other over gossip. Is that intentional? It means the system degrades to "you must have a worker online" for any history recovery, which is an availability regression from today's gossip replay. Open Question #5 circles this but doesn't resolve it.
- **Worker architecture fit.** `docs/specs/2026-03-27-worker-nodes-design.md` describes `SyncRequest`/`SyncBatch` as the current interface and lists a `max_events` config for replay workers. This spec replaces that path but doesn't discuss migration: do both protocols coexist during rollout, or is there a flag day? What happens when a new client talks to an old worker or vice versa? The four new `MessageType` variants don't imply version negotiation beyond the existing envelope version bump.
Suggestions
- **Reframe around per-author seq as the primary sync path.** Keep Negentropy as a fallback for detecting DAG-level divergence (via `StateHash`) and for cross-author reconciliation of unusual histories. Default sync = seq-vector exchange, and that sidesteps the sort-key problem entirely.
- **If Negentropy stays primary, switch the sort key to `(HlcTimestamp, EventHash)`.** HLCs (`crates/messaging/src/hlc.rs`) exist, are monotonic, have bounded skew, and aren't attacker-chosen. Extending HLC to every `EventKind` (not just `Message`) is a cheaper change than eternally carrying attacker-controllable state in the reconciliation index, and it has standalone value for `willow-state` merge ordering.
- **Add a worst-case adversarial analysis.** Given N events with arbitrary attacker-chosen timestamps within a bucket, what is the bound on round trips and total bytes? If it's unbounded, add a hard session-byte cap in §Bandwidth.
- **Add fingerprint reference vectors from hoytech's test suite to the testing table.** "Matches Hoyte reference vectors" should cite specific vectors so the implementation can't silently diverge.
- **Specify version negotiation.** What does a v2-capable peer send to a v1 worker that doesn't know `NegOpen`? Today's `Envelope::validate_version` rejects the whole envelope; the rollout plan needs to account for that.
- **Resolve `rust-nostr/negentropy` vs. port before merging.** A spec that depends on an unaudited external crate for a security-relevant (permission-gated, adversarial-timestamp-exposed) protocol shouldn't punt that question.
Happy to re-review once the sort-key decision is settled — that's the one that governs everything else.
> | `(hlc_timestamp, hash)` | HLCs (see [`crates/messaging/src/hlc.rs`](../../crates/messaging/src/hlc.rs)) give monotonic causal order across authors; resilient to clock skew | HLCs only stamp `Message` events today; non-message `EventKind` variants would need HLC adoption first |
> | `(author_pubkey, seq)` primary with `(ts, hash)` fallback | Cheap fast-path for peers that share most chains | Two protocols to implement and reason about |
>
> **Recommendation: `(timestamp_hint_ms, hash)`** for the initial …
This is the load-bearing decision of the whole spec, and I think it's the wrong call as written.
`crates/state/src/event.rs:202-203` explicitly says `timestamp_hint_ms` is "Display only — never used for ordering." Elevating it to the primary sort key of the canonical range index is a direct semantic reversal of that invariant. It also puts adversarial input on the hot path: since the field is in the signed payload, a malicious author can legitimately produce events claiming `t = 0` or `t = u64::MAX` and the signature still verifies. "Bucket by epoch-day + gate SyncProvider" doesn't really defuse this: (a) the attacker only needs write access to the server (`SendMessages`), not `SyncProvider`; (b) epoch-day bucketing just moves the recursion cost into one bucket rather than eliminating it, and a single bucket of 10k skewed events will still blow past the 256 KB envelope budget discussed in §Bandwidth; (c) "kickable via governance" is an after-the-fact remedy for a protocol the attacker already made expensive.
Alternatives that the table undersells:
- `(HlcTimestamp, EventHash)` — HLCs already exist (`crates/messaging/src/hlc.rs`) and give a monotonic, bounded-skew total order that is not client-controllable in the same way (each node advances its own clock on receive, bounding drift). The spec dismisses this because "HLCs only stamp `Message` events today" — but that's an implementation gap, not a design constraint. Adopting HLC on every `Event` (not just `Message`) is a much smaller change than baking an attacker-controllable field into a Merkle reconciliation structure for the lifetime of the protocol. It's also independently valuable for merge ordering in `willow-state`.
- `(author, seq)` — the table says this "breaks the logarithmic property." That's only true if you insist on a single cross-author range. If you reconcile one chain at a time, you get exact sync in one round per divergent author with a trivial "max seq per author" vector (see my other comment on §Open Questions #3). For Willow's per-author Merkle-DAG that is strictly cheaper than Negentropy in the common case.
Concretely, I'd like to see the spec either (1) switch the recommendation to HLC + hash and describe the HLC-on-every-event rollout, or (2) make a much stronger case that NIP-77 wire compatibility is worth giving up a signed invariant for — who are we actually interoperating with? This is a private DAG; we're not federating with Nostr relays.
Please don't treat "we can reuse rust-nostr/negentropy with minimal glue" as the tiebreaker. The cost of getting the sort key wrong is much larger than the cost of a 200-line port.
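For readers who haven't worked with HLCs: a minimal sketch of the textbook algorithm (Kulkarni et al.) — explicitly NOT `willow-messaging`'s actual implementation, just the shape of the property being argued for:

```python
import time

class Hlc:
    """Minimal hybrid logical clock: (l, c) pairs that are monotonic per
    node and merge with remote stamps without regressing."""

    def __init__(self) -> None:
        self.l = 0  # max physical time observed, in ms
        self.c = 0  # logical counter breaking ties within one ms

    def _now_ms(self) -> int:
        return int(time.time() * 1000)

    def send(self) -> tuple[int, int]:
        """Stamp a locally-created event."""
        pt = self._now_ms()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def recv(self, remote: tuple[int, int]) -> tuple[int, int]:
        """Merge a remote stamp: advance to the max, never regress."""
        pt = self._now_ms()
        rl, rc = remote
        new_l = max(self.l, rl, pt)
        if new_l == self.l and new_l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```

Stamps compare as tuples. Real deployments additionally reject remote stamps too far ahead of local physical time; that rejection rule, plus the monotone merge, is what bounds attacker influence in a way `timestamp_hint_ms` never can.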
> … client↔replay-worker path given per-author monotonicity?
> 3. **Per-author fast path.** Can we short-circuit with a single `max_seq_per_author` vector exchange *before* opening a negentropy session, falling back to negentropy only when seq gaps exist?
This question is buried in Open Questions but I think it deserves to be promoted into the core of the spec — possibly displacing Negentropy as the default path.
Willow's Event has something Nostr doesn't: a strictly monotonic, gap-free per-author seq with a prev hash chain (crates/state/src/event.rs:190-194). Given that, the natural sync primitive is:
    A → B: HashMap<EndpointId, u64>   // my max seq per author
    B → A: Vec<Event>                 // events where B.seq > A.seq for that author
That's O(authors) state, one round trip, and the "what's missing" answer is exact — no fingerprints, no recursion, no frame-splitting. For the dominant real-world cases (client reconnects after being offline an hour; replay worker warm-starts from storage) this strictly dominates Negentropy on bandwidth, code complexity, and latency.
Negentropy earns its keep only when peers have the same authors' events but at different seqs and there are large holes in the middle of chains — which in Willow shouldn't happen during normal operation because events arrive in order via gossip. The case where it's actually needed is cross-author ordering for UI display, which is a rendering concern, not a sync concern.
My suggestion: flip the framing of this spec. Default sync = per-author seq vector exchange. Negentropy is a fallback for (a) detecting DAG-level divergence via StateHash, (b) reconciling unusual histories produced by offline peers or deliberate forks. That also cleanly sidesteps the sort-key problem (you never need a canonical cross-author order on the wire).
If after analysis Negentropy really is the better primary, I'd want to see bandwidth/latency numbers for the common case comparing the two, not just asymptotic arguments.
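The two-message exchange above can be sketched end-to-end — authors stubbed as strings and events as `(author, seq)` pairs, purely illustrative:

```python
def sync_request(my_events: dict[str, list[int]]) -> dict[str, int]:
    """A -> B: the max seq I hold per author (the vector clock)."""
    return {author: max(seqs) for author, seqs in my_events.items()}

def sync_response(their_max: dict[str, int],
                  my_events: dict[str, list[int]]) -> list[tuple[str, int]]:
    """B -> A: every event past the requester's head for that author,
    streamed in seq order so the prev-hash chain verifies as it arrives."""
    out = []
    for author, seqs in my_events.items():
        known = their_max.get(author, 0)
        out.extend((author, s) for s in sorted(seqs) if s > known)
    return out
```

One round trip, O(authors) request bytes, and the response is exact — no fingerprints or recursion, because per-author seq is already a total order.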
> ## Bandwidth and safety
>
> - Each `NegMsg` is capped by `MAX_DESER_SIZE` (256 KB); a single round trip carries at most ~16 000 fingerprints or ~8 000 IDs.
The "256 KB carries ~16 000 fingerprints" arithmetic is optimistic once you account for the envelope overhead and the fact that NIP-77 frames encode `(upper_bound_timestamp, id_prefix, mode, payload)` per range, not just raw 16-byte fingerprints. The hoytech reference's `frameSizeLimit` default is 4096 bytes for a reason — it's tuned for Nostr relay WebSocket frames. We inherit 256 KB from `MAX_DESER_SIZE` (see `crates/transport/src/lib.rs:36`), but the underlying iroh gossip `max_message_size` is 64 KB (that comment exists in the same file). Which applies to `NegMsg`?
Concrete asks:
- State which transport path `NegMsg` travels (gossip vs. direct QUIC stream) and which size limit therefore binds.
- Specify `frameSizeLimit` as a protocol parameter, not just "split as needed" — receivers need to know when to expect a continuation vs. a terminal frame, especially since NIP-77 reconciliation is stateful across frames.
- Show a worst-case bound: given 1M events with adversarial timestamp bucketing (see my sort-key comment), what is the maximum number of `NegMsg` round trips before convergence? If it's unbounded in practice, we need a hard session-byte cap, not just the 10s / 30s timers.
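Back-of-envelope on why the 16k figure is optimistic — every per-range byte count below is an assumption for illustration, not a measured figure:

```python
# Hypothetical encoded cost of one fingerprint range in a NegMsg.
VARINT_TIMESTAMP = 6   # upper-bound timestamp, varint
ID_PREFIX = 2          # prefix-length byte + short id prefix
MODE = 1               # range mode varint
FINGERPRINT = 16       # truncated digest
SLACK = 1              # per-range framing slack
PER_RANGE = VARINT_TIMESTAMP + ID_PREFIX + MODE + FINGERPRINT + SLACK  # 26 B

ENVELOPE_OVERHEAD = 512  # assumed header + signature cost per envelope

def ranges_per_frame(limit_bytes: int) -> int:
    return (limit_bytes - ENVELOPE_OVERHEAD) // PER_RANGE

print(ranges_per_frame(64 * 1024))   # 64 KiB gossip cap      -> 2500
print(ranges_per_frame(256 * 1024))  # 256 KB MAX_DESER_SIZE  -> 10062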
> Channel-key events (`RotateChannelKey`) live in the same DAG as every other event and therefore ride along inside negentropy sessions automatically, subject to the filter. Per-recipient sealed key shares are NOT part of the DAG and remain on their own point-to-point path.
"Per-recipient sealed key shares are NOT part of the DAG and remain on their own point-to-point path" — this needs to be a separate sub-spec or at minimum an explicit design sketch, not a one-liner dismissal. The consequences are:
- A peer that comes online after missing a `RotateChannelKey` will receive the event via Negentropy but can't decrypt messages sealed with the new key because the sealed share was unicast and lost.
- The unicast path has no obvious recovery mechanism if the sender is offline — who re-sends the share? The relay? Any peer with `SyncProvider`? That re-introduces exactly the sync problem Negentropy was meant to solve, but for key material.
- If the answer is "any peer with the plaintext key re-seals for the joining peer," we need to specify where the authorization for that lives in the state machine.
Either expand this section with a real design or explicitly punt to a follow-up spec and acknowledge that without that follow-up, Negentropy sync is incomplete for the encrypted-channel use case. Right now the spec reads as if it's solved, and it isn't.
intendednull left a comment
Round 2 review: comparative survey of sync protocols
Round 1 challenged Negentropy on its own terms (fingerprint correctness, sort-key adversarial behaviour, bucketing). Round 2 zooms out: is Negentropy even the right shape of protocol for Willow's data model? I read across the broader sync-protocol literature and the answer is "probably not, and there are at least two alternatives that are strictly better fits."
The headline: Willow already depends on iroh, iroh ships iroh-docs which already implements range-based set reconciliation (the same family of algorithm Negentropy belongs to, descended from Aljoscha Meyer's paper), and Willow's per-author monotonic-seq chains map exactly onto Scuttlebutt EBT — a simpler protocol that converges in O(authors) state and 1 RTT. The spec's open questions (#2 "rust-nostr crate maturity?", #3 "can per-author seq let us skip Negentropy entirely?") are gesturing at exactly this. The honest answers are "we don't need a new crate" and "yes, and we should."
1. Scuttlebutt EBT is the closest match to Willow's actual data model
Willow's Event carries (author_pubkey, seq, prev_hash) — a per-author append-only chain with content-addressed continuity. This is structurally identical to a Scuttlebutt feed. EBT is the protocol Scuttlebutt evolved specifically for that shape:
- Each peer sends a vector clock `{author -> last_seq_seen}`. That's it. One round trip, O(authors) bytes — no recursion.
- "Request skipping": peers cache the last vector clock the remote sent, and on reconnect omit any author whose seq hasn't advanced. A client that's been offline an hour and shares 99% of state ships ~zero bytes of metadata before useful payload starts flowing.
- Bandwidth is "linear with messages to be sent" — the symmetric-difference property the spec wants from Negentropy, but achieved without a tree-recursion protocol because per-author chains are already monotonic.
- After the clock exchange, the protocol degenerates to ordered streaming of `events[seq > known_seq]` per author. There is no fingerprint to compute, no sort key to design, no timestamp adversary because seq is authoritative and monotonic per author.
The spec acknowledges this in passing ("(author_pubkey, seq) ... per-author chains are monotonic and authoritative; enables trivial vector-clock sync") and then dismisses it because "it breaks the logarithmic property — we'd reconcile one chain at a time, not one mixed stream." That objection is wrong about the costs. Negentropy's "log" is O(log |A ⊕ B|) round trips on a flat unsorted set. EBT's cost is O(authors) state, 1 round trip, and O(events_to_send) bandwidth. For Willow's workload — modest author counts (server members), high event counts per author, and the diff being "the seq tail of a few authors" — EBT wins on every dimension that matters except one: it doesn't reconcile cross-author time ranges well. But the spec doesn't actually need that; the integration-points table is all "since_ms = client.last_seen" type queries that are equivalently expressed as "authors with seq > known_seq".
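The request-skipping optimisation mentioned above is tiny to implement. A sketch, with peer and author IDs stubbed as strings (illustrative, not Scuttlebutt's wire format):

```python
class ClockCache:
    """Cache the last vector clock we sent each peer; on reconnect, only
    re-announce authors whose head moved since then."""

    def __init__(self) -> None:
        self.last_sent_to: dict[str, dict[str, int]] = {}

    def delta_clock(self, peer: str, current: dict[str, int]) -> dict[str, int]:
        prev = self.last_sent_to.get(peer, {})
        delta = {a: s for a, s in current.items() if prev.get(a) != s}
        self.last_sent_to[peer] = dict(current)
        return delta
```

On a warm reconnect where almost nothing moved, the announced clock shrinks to the handful of authors with new events — which is where the "~zero bytes of metadata" claim comes from.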
References: ssbc/epidemic-broadcast-trees, Planetary developer portal: EBT replication.
2. iroh-docs already ships range-based set reconciliation
This is the most surprising finding. Quoting iroh-docs' own README:
"Range-based set reconciliation is a simple approach to efficiently compute the union of two sets over a network, based on recursively partitioning the sets and comparing fingerprints of the partitions to probabilistically detect whether a partition requires further work."
That is the same algorithm family as Negentropy, descended from Aljoscha Meyer's 2022 paper (Negentropy/NIP-77 is essentially an instantiation of this paper with specific encoding choices). iroh-docs is a "meta-protocol" built on top of iroh-gossip and iroh-blobs — both of which Willow already depends on per CLAUDE.md's dependency graph.
The spec should either (a) explain why iroh-docs is unsuitable and Negentropy is, or (b) reuse iroh-docs. Reasons it might be unsuitable that the spec should address explicitly:
- iroh-docs is keyed `(author, key)` with last-write-wins semantics, not append-only event DAGs — adapting Willow's `EventKind` graph to fit may be awkward.
- iroh-docs replicas have schema constraints that don't match Willow's heterogeneous `EventKind` enum.
- License / maturity / API stability of `iroh-docs` vs `rust-nostr/negentropy`.
But none of those are in the spec. Open question #2 ("rust-nostr/negentropy mature enough?") should be reframed as "why are we adopting a Nostr-flavoured implementation when iroh's own sync stack is already in our dependency tree?" — that is the single highest-leverage question for this PR.
3. Automerge sync protocol — bloom filter approach
Automerge's sync protocol is a useful contrast. It exchanges:
- Each peer's heads (DAG frontier hashes) — a few bytes.
- A bloom filter of all known change hashes — probabilistic membership.
- The other peer responds with changes the bloom says it doesn't have.
Tradeoffs vs Negentropy:
| | Automerge | Negentropy | EBT |
|---|---|---|---|
| Round trips | 1–2 typically | O(log diff) | 1 |
| State exchanged | O(changes) bloom | O(diff) range tree | O(authors) clock |
| Exact? | False-positive rate (resends a few extras) | Exact | Exact |
| DAG-aware? | Yes (heads are graph frontiers) | No (flat sorted set) | No (per-author chains) |
Negentropy's exactness is overrated for Willow's case: the spec already says "after reconciliation, missing events are fetched via the existing event-fetch path, not inline." A handful of bloom false-positives = a handful of extra fetches = no observable difference. Automerge's approach is also a serious option, particularly because Willow's events form a DAG with parent hashes (prev_hash) — a heads-based protocol is a more natural fit than treating events as a flat sorted set.
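A toy version of the Automerge-style exchange, to make the tradeoff concrete — the parameters and hash scheme here are illustrative, not Automerge's actual encoding:

```python
import hashlib

M, K = 8192, 4  # bloom size in bits, number of hash functions (assumed)

def _bits(change_hash: bytes) -> list[int]:
    """Derive K bit positions from a change hash."""
    h = hashlib.sha256(change_hash).digest()
    return [int.from_bytes(h[4 * i:4 * i + 4], "big") % M for i in range(K)]

def make_bloom(my_changes: list[bytes]) -> set[int]:
    """Sender side: bloom filter over every change hash I know."""
    bloom: set[int] = set()
    for c in my_changes:
        bloom.update(_bits(c))
    return bloom

def missing_changes(remote_bloom: set[int], my_changes: list[bytes]) -> list[bytes]:
    """Responder side: changes the remote *probably* lacks. A false positive
    just withholds a change until a follow-up round — never wrong data."""
    return [c for c in my_changes
            if not all(b in remote_bloom for b in _bits(c))]
```

The false-positive direction matters: a FP means a needed change is withheld for one extra round, resolved by the heads exchange — consistent with the point above that probabilistic membership costs a few extra fetches, not correctness.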
4. Git's pack protocol — the "have/want" baseline
Git's smart-HTTP pack-protocol is the maturity benchmark for this problem space: 20 years of production deployment. It uses negotiation rather than fingerprints — client says "want X", server says "have Y", they walk back through history until they find common ancestors, server builds a packfile.
Properties: not logarithmic (linear in commits walked during negotiation), but cheap in practice because it walks from the tip and stops at the first common ancestor. It's DAG-aware (uses parent edges, like Willow's prev_hash), it's exact, and it's trivially adversary-resistant because it's bounded by what each side actually claims to have.
Worth a paragraph in the spec explaining why Willow rejects this approach. The natural answer is "doesn't compose with iroh gossip routing" but that's not in the spec.
5. Delta-state CRDTs — formal bandwidth bounds
The Almeida/Shoker/Baquero delta-CRDT line gives formal bandwidth-bounded sync for causally-ordered updates: bandwidth proportional to delta, not state, with provable convergence and proven anti-entropy semantics. Willow's events are causally ordered (parent-hash DAG), so this framework applies directly. Negentropy gives empirical bandwidth properties; delta-CRDT theory gives guarantees.
This is more "pick up techniques from" than "adopt wholesale" — but the spec frames Negentropy as if there's no formal alternative. There is.
6. Matrix federation — the cautionary tale
For completeness: Matrix's federation backfill + state-resolution v2 reconciles diverged DAGs. It is infamously difficult to implement correctly (multiple homeserver implementations have produced divergent room states from the same inputs). The lesson for Willow: protocols that try to do exact reconciliation of authority-bearing DAGs (state events, in Matrix; permission/role events, in Willow) accumulate edge cases. Authority events should sync via a different, simpler path than chat events. The spec collapses both into one Negentropy session via SyncFilter. That is dangerous — and the spec's note that "structural events ignore the channel filter so structure is always fully reconciled" is hinting at this without addressing it.
Bottom-line recommendation
Replace this spec with a comparison memo, then pick one of two paths:
Path A (recommended): EBT-shaped sync over iroh. Per-author vector-clock exchange + ordered seq streaming. Solves the actual workload (clients reconnecting after offline periods, worker↔worker replication of per-author chains) in 1 RTT and O(authors) state. No timestamp adversary, no sort-key design, no bucketing, no fingerprints. ~30% the implementation surface of Negentropy and a much smaller test matrix. Use a separate, simpler path for the small set of authority events (or fold them into the same per-author streams since they are per-author by construction).
Path B: reuse iroh-docs. If the team wants range-based reconciliation specifically, the algorithm is already in the dependency tree. Negotiate the schema impedance mismatch with the iroh team or adapt Willow's EventKind to fit. Avoids reimplementing a known-tricky protocol.
What this PR currently is: adopting a third protocol from the Nostr ecosystem when neither (a) the data model (per-author chains, not flat event sets) nor (b) the dependency graph (iroh's own range-based sync is already present) suggests Nostr's choice is the right one for Willow.
Open question #3 ("can we short-circuit with a single max_seq_per_author vector exchange before opening a negentropy session, falling back to negentropy only when seq gaps exist?") is the spec quietly noticing this itself. The honest read: if the per-author fast path handles the common case, what cases does Negentropy actually still serve? If the answer is "none in production," the fallback isn't a fallback — it's the whole protocol, and it's EBT.
References:
- NIP-77 / Negentropy
- Aljoscha Meyer, range-based set reconciliation
- iroh-docs README
- ssbc/epidemic-broadcast-trees
- Planetary EBT docs
- Automerge sync (concepts)
- Git pack-protocol
- Delta State Replicated Data Types (Almeida, Shoker, Baquero, 2016)
- Matrix State Resolution v2 analysis
Apply review decisions to the relay capability document spec:
- Promote signing to v1 MUST (inline signature, RFC 8785 JCS canonical bytes, signature field excluded from canonicalisation).
- Specify dispatch surgery: explicit branch in dispatch_connection for /.well-known/willow plus OPTIONS preflight; reuse BOOTSTRAP_IO_TIMEOUT and MAX_CONCURRENT_BOOTSTRAP_CONNECTIONS; extend (not mirror) the handle_bootstrap_connection pattern.
- Drop event_schema_range (no EVENT_SCHEMA_VERSION exists in willow-state); list as future work.
- Resolve multi-tenant question: one shared doc per host, relay is topic-agnostic.
- Soften operator-metadata leakage: version is coarse semver, software is project name, both MAY be omitted.
- Two-tier caching by status: ok=300s, degraded/read_only=5s with must-revalidate.
- Recommend WS clients also send Sec-WebSocket-Protocol; JSON is advisory pre-connect.
- Fix port framing: relay binds one port multiplexing TCP+WS, not two.
- Drop sync_provider_only (operator vibes without a concrete pre-handshake check).
- Add Cross-spec coordination table pinning feature tags for #214, #216, #217, #218, #219, #220, #221.
- Rewrite Open Questions to keep only genuinely-open items (paid-relay semantics, utilisation telemetry, relay discovery, feature registry).

https://claude.ai/code/session_01XmbVXWnKTRVjPp9kmKRSBn
- Reframe as consolidation of existing HeadsSummary-based worker sync
with the legacy gossip-level state-hash dump (cite the in-tree TODO
at listeners.rs:292-297) instead of "introduce a new per-author
vector exchange". The worker path already does this; the novelty is
hoisting the same protocol to the gossip path.
- Replace the proposed HashMap<EndpointId, u64> request shape with
reuse of the existing HeadsSummary { heads: BTreeMap<EndpointId,
AuthorHead { seq, hash }> } so we keep the head hash for free fork
detection via compare_chains.
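  A minimal sketch of the fork detection that reusing `HeadsSummary` buys for free, using stand-in types (the real `EndpointId`, `EventHash`, and `compare_chains` in willow-state differ; everything below is illustrative):

  ```rust
  use std::collections::BTreeMap;

  // Stand-ins: willow-state defines its own EndpointId / EventHash types.
  type EndpointId = u64;
  type EventHash = [u8; 32];

  #[derive(Clone, Copy, PartialEq, Debug)]
  struct AuthorHead {
      seq: u64,
      hash: EventHash,
  }

  #[derive(Default)]
  struct HeadsSummary {
      heads: BTreeMap<EndpointId, AuthorHead>,
  }

  /// Responder's per-author conclusion after comparing its own summary
  /// (`ours`) against the requester's (`theirs`).
  #[derive(PartialEq, Debug)]
  enum Delta {
      InSync,
      SendFrom(u64), // requester is behind: stream events with seq > this
      Fork,          // same seq but different head hash: chains diverged
      UnknownAuthor, // requester names an author we have no events for
  }

  fn compare(ours: &HeadsSummary, theirs: &HeadsSummary, author: EndpointId) -> Delta {
      match (ours.heads.get(&author), theirs.heads.get(&author)) {
          (None, _) => Delta::UnknownAuthor,
          (Some(_), None) => Delta::SendFrom(0),
          (Some(o), Some(t)) if o.seq > t.seq => Delta::SendFrom(t.seq),
          (Some(o), Some(t)) if o.seq == t.seq && o.hash != t.hash => Delta::Fork,
          _ => Delta::InSync,
      }
  }
  ```

  Carrying the `(seq, hash)` pair rather than a bare seq is what makes fork detection free: equal seq with unequal head hash is immediately visible without fetching any events.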
- Drop the bogus "EventStore trait gains methods" framing — no such
trait exists in willow-state. Describe the change as adding a
small known_authors helper to the existing EventDag and
StorageEventStore concrete types; defer trait extraction.
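  A sketch of the small `known_authors` helper described above, over a toy stand-in for the concrete store (the real `EventDag` / `StorageEventStore` keep richer per-author chain state; only the intended surface is shown):

  ```rust
  use std::collections::{BTreeMap, BTreeSet};

  // Toy stand-in for a concrete store's per-author index.
  struct EventDag {
      latest_seq: BTreeMap<u64, u64>, // author id -> highest seq seen
  }

  impl EventDag {
      /// Proposed helper: every author this store holds at least one event
      /// for. Added directly on the concrete type; trait extraction deferred.
      fn known_authors(&self) -> BTreeSet<u64> {
          self.latest_seq.keys().copied().collect()
      }
  }
  ```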
- Use String for server_id and channel IDs (matching EventKind) and
call out that ServerId / messaging::ChannelId newtypes are NOT the
types in use.
- Fix line citations: per-author seq check at dag.rs:146-160 (not
event.rs:190-194); timestamp_hint_ms doc at event.rs:216-217 (not
202-203); SyncProvider at event.rs:23.
- Reference apply_incremental (public) and EventDag::insert as the
apply path; note apply_event is private.
- Mark SyncProvider gating as PROPOSED (not current) — neither worker
role nor gossip path checks it today.
- Acknowledge the existing idx_events_author_seq index and propose
adding a server-prefixed variant via a new migration rather than
pretending the index is new.
- Clarify that WireMessage::SyncRequest/SyncBatch (gossip) and
WorkerRequest::Sync/WorkerResponse::SyncBatch (worker) are TWO
separate code paths both touched by this spec; the gossip payload
shape changes, the worker payload doesn't.
- Note current MessageType only allocates slots 0-6; defer adding a
dedicated Sync slot.
- Fix test-tier locations: state tests in sync.rs (not the
nonexistent store.rs); wire round-trip tests inline in wire.rs (not
the nonexistent transport/src/tests.rs); multi-peer convergence as
client crate test against MemNetwork per CLAUDE.md test-tier rule.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Per-envelope budget now sized to 64 KiB gossip cap (not 256 KB MAX_DESER_SIZE)
- events_since / heads_summary() correctly attributed to dag.rs (not sync.rs)
- Storage shape claim corrected to Vec<EventHash> with skip-based scan
- sync_since query plan description updated for OR-fanout / unknown-authors branch
- Asymmetry note: requester-known authors we don't have are ignored
- try_insert_event referenced as actual client entry point

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lation - round 4

- Per-envelope budget: 64 KiB minus small constant (~200B); dropped wrong "~57-60 KiB usable"
- Migration: don't bump PROTOCOL_VERSION; use additive SyncRequestV2/SyncBatchV2 variants for soft rollout
- SyncRequest gains request_id (matched to worker path's String for consolidation)
- Worker path: only `more` added; outer WorkerWireMessage::Response.request_id reused
- Index claim downgraded: NOT IN disjunct still requires server-scan; recommend restructuring sync_since to use explicit per-author predicates
- HistorySyncComplete framed as defined by spec #214 (unmerged); SyncCompleted relationship spelled out
- Author-count threshold corrected to ~900
- listeners.rs MAX_SYNC_BATCH_SIZE = 10_000 acknowledged; defense-in-depth retained
- Line cites tightened across multiple sections

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
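The per-envelope budget in the first bullet can be sketched as a greedy packer. The constants, names, and `more` continuation flag below are illustrative stand-ins, not the actual wire types:

```rust
/// Illustrative constants: 64 KiB gossip cap minus ~200 B of envelope framing.
const GOSSIP_CAP: usize = 64 * 1024;
const ENVELOPE_OVERHEAD: usize = 200;
const BUDGET: usize = GOSSIP_CAP - ENVELOPE_OVERHEAD;

/// Pack already-serialized events into envelopes no larger than BUDGET,
/// returning (batch, more) pairs where `more` mirrors the proposed
/// SyncBatchV2 continuation flag on every non-final envelope.
fn pack(events: &[Vec<u8>]) -> Vec<(Vec<Vec<u8>>, bool)> {
    let mut batches: Vec<Vec<Vec<u8>>> = vec![Vec::new()];
    let mut used = 0usize;
    for ev in events {
        // Start a new envelope once the budget would be exceeded. An event
        // larger than BUDGET on its own still ships alone (caller's problem).
        if used + ev.len() > BUDGET && !batches.last().unwrap().is_empty() {
            batches.push(Vec::new());
            used = 0;
        }
        used += ev.len();
        batches.last_mut().unwrap().push(ev.clone());
    }
    let n = batches.len();
    batches
        .into_iter()
        .enumerate()
        .map(|(i, b)| (b, i + 1 < n))
        .collect()
}
```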
Part of a set of 8 specs drawing lessons from Nostr's protocol and ecosystem. Use this PR to discuss the design — not proposing implementation, only the spec.
What & why
Willow's current history sync is "replay the last 1000 events from a worker's ring buffer, or dump all archival from storage." For partially-overlapping peers this is wasteful — a client that's been offline for an hour re-downloads events it already has.
Nostr's NIP-77 Negentropy (Doug Hoyte) solves this with range-based set reconciliation: both sides sort events by a common key, exchange 16-byte fingerprints over ranges, recurse on mismatches. Bandwidth scales with symmetric difference, not total set size. strfry uses it for relay-to-relay replication.
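A sketch of the NIP-77 fingerprint pre-image construction discussed in review: IDs summed as 32-byte little-endian integers mod 2^256, element count appended as a varint, then sha256 truncated to 16 bytes. The hashing step is omitted here to stay dependency-free, and the varint's most-significant-first encoding is an assumption that should be checked against Hoyte's published test vectors:

```rust
/// Add `id`, read as a 32-byte little-endian integer, into `acc` mod 2^256.
fn add_le_mod_2_256(acc: &mut [u8; 32], id: &[u8; 32]) {
    let mut carry = 0u16;
    for i in 0..32 {
        let sum = acc[i] as u16 + id[i] as u16 + carry;
        acc[i] = sum as u8;
        carry = sum >> 8;
    }
    // Final carry out of the top byte is discarded: arithmetic is mod 2^256.
}

/// Base-128 varint, most significant group first, high bit set on all but
/// the last byte (assumed to match Negentropy's encoding).
fn varint(mut n: u64) -> Vec<u8> {
    if n == 0 {
        return vec![0x00];
    }
    let mut out = Vec::new();
    while n > 0 {
        out.push((n & 0x7f) as u8);
        n >>= 7;
    }
    out.reverse();
    let last = out.len() - 1;
    for b in &mut out[..last] {
        *b |= 0x80;
    }
    out
}

/// Pre-image of the range fingerprint: sum-of-ids || varint(count).
/// The real fingerprint is the first 16 bytes of sha256 over this.
fn fingerprint_preimage(ids: &[[u8; 32]]) -> Vec<u8> {
    let mut acc = [0u8; 32];
    for id in ids {
        add_le_mod_2_256(&mut acc, id);
    }
    let mut out = acc.to_vec();
    out.extend(varint(ids.len() as u64));
    out
}
```

Because the sum is commutative, both sides get identical fingerprints regardless of insertion order, which is what lets ranges be compared without exchanging their contents.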
This spec proposes adopting NIP-77-style Negentropy for Willow:
- Sort key `(timestamp_hint_ms, EventHash)` — matches NIP-77's shape so we can reuse `rust-nostr/negentropy`. Epoch-day bucketing plus `SyncProvider` gating mitigates adversarial timestamps.
- Range fingerprints per NIP-77: `truncate16(sha256(sum_mod_2^256(ids) || count_varint))`, i.e. IDs summed as 32-byte little-endian integers mod 2^256, element count appended as a varint.
- Four new `MessageType` variants — `NegOpen`, `NegMsg`, `NegClose`, `NegErr` — fitting Willow's 256 KB envelope.
- Filtering via `SyncFilter` (authors, time range, channel, `EventKind`).

Spec file: `docs/specs/2026-04-24-negentropy-sync.md`

Open questions for review
- Sort key: `(timestamp_hint_ms, hash)` vs `(author, seq)` vs HLC. The per-author seq path might let us simplify to a vector-style "last seq per author" sync instead of full Negentropy.
- Is `rust-nostr/negentropy` mature enough to depend on, or do we port?
- `SyncProvider` permission — only providers serve Neg sessions?

Composition with sibling specs
- `NegClose`: completion signalling, and its relationship to #214's `HistorySyncComplete` (see review discussion).
- `supports_negentropy`: advertised as a feature tag in the relay capability document spec's cross-spec coordination table.

Commit is unsigned due to harness signing backend failure (same as sibling PRs in this set).