
spec: negentropy-based range reconciliation for history sync #219

Merged
intendednull merged 5 commits into main from claude/spec-negentropy-sync on Apr 26, 2026

Conversation

@intendednull
Owner

Part of a set of 8 specs drawing lessons from Nostr's protocol and ecosystem. Use this PR to discuss the design — not proposing implementation, only the spec.

What & why

Willow's current history sync is "replay the last 1000 events from a worker's ring buffer, or dump all archival from storage." For partially-overlapping peers this is wasteful — a client that's been offline for an hour re-downloads events it already has.

Nostr's NIP-77 Negentropy (Doug Hoyte) solves this with range-based set reconciliation: both sides sort events by a common key, exchange 16-byte fingerprints over ranges, recurse on mismatches. Bandwidth scales with symmetric difference, not total set size. strfry uses it for relay-to-relay replication.

This spec proposes adopting NIP-77-style Negentropy for Willow:

  • Sort key: (timestamp_hint_ms, EventHash) — matches NIP-77's shape so we can reuse rust-nostr/negentropy. Epoch-day bucketing plus SyncProvider gating mitigates adversarial timestamps.
  • Fingerprint: verbatim from NIP-77 (truncate16(sha256(xor_sum(ids) || count_le)))
  • Wire: 4 new MessageType variants — NegOpen, NegMsg, NegClose, NegErr — fitting Willow's 256 KB envelope.
  • Filter: SyncFilter (authors, time range, channel, EventKind)
  • Integration points (table in spec): client↔replay, client↔storage, replay↔storage, storage↔storage replication
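Sketched as hypothetical Rust, the new wire variants and filter might look like this — the variant and field names follow the bullets above, but the concrete layouts are my assumption, not the spec's definitions:

```rust
// Hypothetical shapes for the proposed wire messages. Only the variant
// names come from the spec; field layouts are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum NegMessage {
    /// Initiator opens a session: filter plus the first fingerprint frame.
    NegOpen { session_id: u64, filter: SyncFilter, initial_msg: Vec<u8> },
    /// Alternating reconciliation frames until ranges converge.
    NegMsg { session_id: u64, payload: Vec<u8> },
    /// Either side terminates a converged session.
    NegClose { session_id: u64 },
    /// Protocol-level failure (unsupported sort key, blocked, etc.).
    NegErr { session_id: u64, reason: String },
}

// Assumed filter shape mirroring the (authors, time range, channel,
// EventKind) list above; placeholder types throughout.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct SyncFilter {
    pub authors: Option<Vec<[u8; 32]>>, // author public keys
    pub since_ms: Option<u64>,          // time-range lower bound
    pub until_ms: Option<u64>,          // time-range upper bound
    pub channel: Option<u64>,           // channel id (placeholder type)
    pub kinds: Option<Vec<u16>>,        // EventKind discriminants
}
```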

Spec file: docs/specs/2026-04-24-negentropy-sync.md

Open questions for review

  1. Sort key choice (biggest design decision) — (timestamp_hint_ms, hash) vs (author, seq) vs HLC. The per-author seq path might let us simplify to a vector-style "last seq per author" sync instead of full Negentropy.
  2. Is rust-nostr/negentropy mature enough to depend on, or do we port?
  3. Does per-author monotonic seq actually let us skip Negentropy entirely for the common case?
  4. Encrypted channel keys — reconciled via the same flow, or separately?
  5. Interaction with SyncProvider permission — only providers serve Neg sessions?
  6. Bandwidth / frame budget — how does this interact with relay caps?

Composition with sibling specs

  • History sync EOSE (separate PR): natural termination signal for NegClose
  • Relay capability doc (separate PR): advertise supports_negentropy
  • Outbox relay discovery: discovery of Neg-capable providers

Commit is unsigned due to harness signing backend failure (same as sibling PRs in this set).


Generated by Claude Code

Owner Author

@intendednull intendednull left a comment


Spec review. Solid framing of the problem, but several concrete issues block adoption as written.

Blocking

  1. Fingerprint formula is wrong. §Fingerprint specifies truncate16(sha256(xor_sum(ids) || count_le)). NIP-77 and negentropy-protocol-v1.md both specify (a) addition mod 2^256 of IDs interpreted as 32-byte little-endian integers — not XOR — and (b) the element count encoded as a varint — not a fixed little-endian u64. As written we will not interop with rust-nostr/negentropy or generate matching reference vectors, defeating the "reuse the crate with minimal glue" rationale. The protocol byte 0x61 is correct; please fix the digest construction and add Hoyte's published test vectors to the unit tier.

  2. MessageType = 7 collides with the EOSE spec. PR #214 (spec-history-sync-eose, branch claude/spec-history-sync-eose) reserves MessageType::HistorySyncComplete = 7 at the same crates/transport/src/lib.rs:64 slot. Pick non-overlapping tags or coordinate the numbering in one of the two specs before either lands.

  3. NegClose ≠ HistorySyncComplete. §Completion signalling claims a Negentropy session naturally satisfies the EOSE contract. It does not: PR #214's HistorySyncComplete carries (topic_id, provider_peer, last_event_hash, epoch) precisely so the client can detect silent truncation and so a restarted provider doesn't reuse a stale marker. NegClose { session_id } carries none of that. Either (a) augment NegClose with the EOSE anchor fields and explicitly subsume #214, or (b) keep the two distinct and document the ordering (NegClose then HistorySyncComplete). The spec must pick one — silently dropping #214's invariants is the worst outcome.
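For reference against blocking issue 1, here is a sketch of the corrected digest input: addition mod 2^256 over IDs as little-endian integers, plus the big-endian base-128 varint the protocol uses for the count. The final sha256/truncate16 step is elided since it needs a hash dependency; this only demonstrates the accumulator and encoding that the spec currently gets wrong.

```rust
/// Accumulates 32-byte IDs as little-endian integers mod 2^256,
/// per NIP-77: addition with carry, NOT xor.
pub struct FingerprintAccumulator {
    sum: [u8; 32], // little-endian 256-bit accumulator
    count: u64,
}

impl FingerprintAccumulator {
    pub fn new() -> Self {
        Self { sum: [0u8; 32], count: 0 }
    }

    pub fn add(&mut self, id: &[u8; 32]) {
        let mut carry = 0u16;
        for i in 0..32 {
            let v = self.sum[i] as u16 + id[i] as u16 + carry;
            self.sum[i] = v as u8; // wraps mod 2^256 when carry falls off the top
            carry = v >> 8;
        }
        self.count += 1;
    }

    /// Bytes fed to sha256 (then truncated to 16): sum || varint(count).
    /// The hash step itself is elided here; use a real sha256 in practice.
    pub fn digest_input(&self) -> Vec<u8> {
        let mut out = self.sum.to_vec();
        out.extend(varint(self.count));
        out
    }
}

/// Negentropy varint: base-128, big-endian, high bit set on every byte
/// except the last -- not a fixed little-endian u64.
pub fn varint(mut n: u64) -> Vec<u8> {
    let mut groups = vec![(n & 0x7f) as u8];
    n >>= 7;
    while n > 0 {
        groups.push(0x80 | (n & 0x7f) as u8);
        n >>= 7;
    }
    groups.reverse();
    groups
}
```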

Significant

  1. Reconsider Negentropy at all for the client↔replay-worker path. Willow's Event (crates/state/src/event.rs:184-204) has author: EndpointId, strictly-monotonic seq: u64, and signed prev: EventHash. Per-author chains are append-only and authoritative — not the unstructured ID set Negentropy is designed for. Open Question 3 (per-author fast-path) is the actually-important question this spec should answer, not punt: a one-round-trip max_seq_per_author vector exchange will resolve the overwhelmingly common reconnect-after-downtime case in 1 RTT and O(authors) bytes, with zero range index. Negentropy's log-diff property only pays off when symmetric difference is large and spans many authors with no shared structure — rare here. Recommend: spec the vector-clock fast-path as the primary protocol and Negentropy as a fallback for cross-worker / disjoint-author cases, not the other way round. This also removes the entire sort-key debate.

  2. hlc_timestamp sort key is not currently available. §Sort key lists (hlc_timestamp, hash) as an option, but Event does not carry an HLC — only timestamp_hint_ms. HLCs live in willow-messaging::hlc and are scoped to chat content. Adopting that key requires a wider state-machine change than the spec acknowledges; either remove the option or scope the migration.

  3. Adversarial timestamps mitigation is hand-wavy. Epoch-day bucketing helps but does not bound recursion within a poisoned bucket — a peer who lands 100k events at t=0 makes the first bucket's reconciliation worst-case. SyncProvider-gated serving doesn't help because the malicious data is already in the DAG and being reconciled. If we keep (timestamp_hint_ms, hash), the spec needs a hard recursion-depth or per-bucket size cap and a defined behaviour on hit (fall back to IdList? abort with NegErr(Blocked)?). This further argues for #4.

Minor

  1. rust-nostr/negentropy API fit. Crate is real (MIT, v0.4–0.5, MSRV 1.51, published as negentropy on crates.io), but pinned to NIP-77's (uint64, 32-byte id) item shape and an in-memory Storage model. The "minimal glue" claim should be qualified — we'll need a bridge type to back it with the SQLite range-scan iterator on storage workers, and we cannot use it directly with a custom sort key.

  2. Frame budget. 256 KB MAX_DESER_SIZE minus envelope overhead caps a single NegMsg at ~16k fingerprints. Sessions over a large symmetric difference will span many envelopes — the spec needs to say (a) is concurrent send allowed or strict ping-pong, (b) what happens if the responder's reply doesn't fit one envelope (the current text says "split into multiple NegMsg" but doesn't define ordering / interleaving with the next request frame), (c) backpressure when the gossip buffer is full.

  3. §Storage requirements: the range_scan signature returns Box<dyn Iterator> but EventStore impls cross WASM/native — confirm the trait stays object-safe and Send-bounded per the dual-target rule in CLAUDE.md.

  4. §Wire protocol table puts initial_msg: Vec<u8> inside NegOpen. NIP-77 mirrors this, but it's worth noting this means the first fingerprint already commits the initiator to a sort key the responder hasn't acked — define the failure mode if the responder disagrees on sort key (NegErr(Unsupported)?).

  5. Open Question 5 (require a permission to initiate): yes — at minimum require server membership, otherwise any peer that learns a ServerId can probe existence and event-set fingerprints. Easy to add now; awkward to add later.

Overall: the direction is reasonable, but I think the right v1 here is the per-author seq vector exchange, with Negentropy reserved for worker↔worker and disjoint-history cases. If we're going full Negentropy, the fingerprint and EOSE-overlap issues must be resolved before implementation starts.


Owner Author

@intendednull intendednull left a comment


Thorough spec with a clear algorithmic story and good cross-references into the codebase — the algorithm summary, fingerprint definition, and storage-index sketch are all concrete enough to implement against. That said, I have substantive concerns about the primary design decision and the interaction with existing Willow invariants. I don't think this should land as-is.

Strengths

  • The motivation is well-grounded: the current "dump 1000 events from the replay ring buffer" path is genuinely wasteful, and O(log(|A ⊕ B|)) reconciliation is the right shape of answer for worker↔worker replication.
  • Mirroring NIP-77's fingerprint construction byte-for-byte is the right call if we stay with Negentropy — the additive homomorphism over IDs (sum mod 2^256) is exactly what lets ranges split cheaply, and deviating would forfeit Hoyte's reference vectors.
  • Filter design is sensible: structural events (GrantPermission, CreateChannel) bypass the channel filter so server structure always converges. This is important and easy to get wrong.
  • NegClose satisfying the SyncComplete contract from the EOSE spec (#214) is a nice composition — one less redundant signal to carry.
  • Relay stays a stateless bridge (crates/relay/src/lib.rs:16-41), session state lives in participants. Keeps the trust model unchanged.
  • Test matrix names the right scenarios (three-peer, edge cases, byte-count assertion on reconnect).

Concerns

  1. Sort key is the wrong call and is the most important decision in the spec. timestamp_hint_ms is documented as "Display only — never used for ordering" (crates/state/src/event.rs:202-203). Making it the canonical sort key of the reconciliation index is a direct inversion of that invariant, and since the field is signed and author-controlled, it puts adversarial input on the hot path. The epoch-day bucketing + SyncProvider gating is weaker than it reads — any member with SendMessages can produce skewed timestamps, and bucketing just concentrates the pathology into a single bucket. See my inline comment on line 57.

  2. Negentropy may be overkill for Willow's DAG. Willow has monotonic per-author seq + prev chains that Nostr doesn't. A "max seq per author" vector exchange is O(authors) state, one round trip, exact — and strictly dominates Negentropy for the common reconnect/warm-start cases. Open Question #3 should not be an open question; it should drive the framing. See my inline comment on line 248.

  3. Encrypted channel-key distribution is waved away. "Sealed key shares stay on a parallel unicast flow" is one line describing a hard problem. If a peer comes online after missing RotateChannelKey and the sender is offline, there's no recovery story. Either design it here or explicitly defer and flag that encrypted-channel sync is incomplete. See inline comment on line 225.

  4. Envelope budget and frame sizing are underspecified. "16 000 fingerprints per NegMsg" assumes the full 256 KB MAX_DESER_SIZE, but the gossip transport has a 64 KB max_message_size and NIP-77 frames carry more than just the 16-byte digest. frameSizeLimit needs to be a protocol parameter and receivers need to know continuation semantics. See inline comment on line 201.

  5. rust-nostr/negentropy dependency risk unexamined. Open Question #1 flags it but doesn't answer it. License? Audit state? WASM support (we compile library crates to wasm32-unknown-unknown per CLAUDE.md)? Maintenance cadence? If the answer is "fork and port," that's a materially different cost than "add a dep." This needs to be resolved before merging the spec, not after.

  6. SyncProvider-only serving conflicts with peer-to-peer sync. The spec gates NegOpen responders to SyncProvider, but the Integration Points table lists "client ↔ replay worker" and "client ↔ storage worker" as the primary paths — fine if workers hold SyncProvider, but then regular peers can never directly reconcile with each other over gossip. Is that intentional? It means the system degrades to "you must have a worker online" for any history recovery, which is an availability regression from today's gossip replay. Open Question #5 circles this but doesn't resolve it.

  7. Worker architecture fit. docs/specs/2026-03-27-worker-nodes-design.md describes SyncRequest/SyncBatch as the current interface and lists a max_events config for replay workers. This spec replaces that path but doesn't discuss migration: do both protocols coexist during rollout, or is there a flag day? What happens when a new client talks to an old worker or vice versa? The four new MessageType variants don't imply version negotiation beyond the existing envelope version bump.

Suggestions

  • Reframe around per-author seq as the primary sync path. Keep Negentropy as a fallback for detecting DAG-level divergence (via StateHash) and for cross-author reconciliation of unusual histories. Default sync = seq-vector exchange, and that sidesteps the sort-key problem entirely.
  • If Negentropy stays primary, switch the sort key to (HlcTimestamp, EventHash). HLCs (crates/messaging/src/hlc.rs) exist, are monotonic, have bounded skew, and aren't attacker-chosen. Extending HLC to every EventKind (not just Message) is a cheaper change than eternally carrying attacker-controllable state in the reconciliation index, and it has standalone value for willow-state merge ordering.
  • Add a worst-case adversarial analysis. Given N events with arbitrary attacker-chosen timestamps within a bucket, what is the bound on round trips and total bytes? If it's unbounded, add a hard session-byte cap in §Bandwidth.
  • Add fingerprint reference vectors from hoytech's test suite to the testing table. "Matches Hoyte reference vectors" should cite specific vectors so the implementation can't silently diverge.
  • Specify version negotiation. What does a v2-capable peer send to a v1 worker that doesn't know NegOpen? Today's Envelope::validate_version rejects the whole envelope; the rollout plan needs to account for that.
  • Resolve rust-nostr/negentropy vs. port before merging. A spec that depends on an unaudited external crate for a security-relevant (permission-gated, adversarial-timestamp-exposed) protocol shouldn't punt that question.
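For context on the HLC suggestion above, the standard hybrid-logical-clock update rules look roughly like this — a sketch only, not the willow-messaging implementation; field widths and tie-breaking are illustrative:

```rust
/// Minimal hybrid logical clock (sketch; not crates/messaging/src/hlc.rs).
/// Orders lexicographically by (physical_ms, logical), so derived Ord
/// gives the sort-key comparison directly.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub physical_ms: u64, // max observed wall-clock time
    pub logical: u32,     // tie-breaker when physical time stalls
}

impl Hlc {
    /// Local event: advance past both the wall clock and the current HLC.
    pub fn tick(&mut self, now_ms: u64) -> Hlc {
        if now_ms > self.physical_ms {
            self.physical_ms = now_ms;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Receive: merge the remote stamp so causality is preserved and the
    /// stamp can never run ahead of max(local clock, remote stamp).
    pub fn receive(&mut self, remote: Hlc, now_ms: u64) -> Hlc {
        let prev = *self;
        let max_phys = now_ms.max(prev.physical_ms).max(remote.physical_ms);
        self.logical = if max_phys == prev.physical_ms && max_phys == remote.physical_ms {
            prev.logical.max(remote.logical) + 1
        } else if max_phys == prev.physical_ms {
            prev.logical + 1
        } else if max_phys == remote.physical_ms {
            remote.logical + 1
        } else {
            0
        };
        self.physical_ms = max_phys;
        *self
    }
}
```

The point for the sort-key debate: each node only ever advances its own clock on receive, so a peer stamping events at `u64::MAX` cannot retroactively reorder honest events — unlike `timestamp_hint_ms`.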

Happy to re-review once the sort-key decision is settled — that's the one that governs everything else.



| `(hlc_timestamp, hash)` | HLCs (see [`crates/messaging/src/hlc.rs`](../../crates/messaging/src/hlc.rs)) give monotonic causal order across authors; resilient to clock skew | HLCs only stamp `Message` events today; non-message `EventKind` variants would need HLC adoption first |
| `(author_pubkey, seq)` primary with `(ts, hash)` fallback | Cheap fast-path for peers that share most chains | Two protocols to implement and reason about |

**Recommendation: `(timestamp_hint_ms, hash)`** for the initial
Owner Author


This is the load-bearing decision of the whole spec, and I think it's the wrong call as written.

crates/state/src/event.rs:202-203 explicitly says timestamp_hint_ms is "Display only — never used for ordering." Elevating it to the primary sort key of the canonical range index is a direct semantic reversal of that invariant. It also puts adversarial input on the hot path: since the field is in the signed payload, a malicious author can legitimately produce events claiming t = 0 or t = u64::MAX and the signature still verifies. "Bucket by epoch-day + gate SyncProvider" doesn't really defuse this: (a) the attacker only needs write access to the server (SendMessages), not SyncProvider; (b) epoch-day bucketing just moves the recursion cost into one bucket rather than eliminating it, and a single bucket of 10k skewed events will still blow past the 256 KB envelope budget discussed in §Bandwidth; (c) "kickable via governance" is an after-the-fact remedy for a protocol the attacker already made expensive.

Alternatives that the table undersells:

  • (HlcTimestamp, EventHash) — HLCs already exist (crates/messaging/src/hlc.rs) and give a monotonic, bounded-skew total order that is not client-controllable in the same way (each node advances its own clock on receive, bounding drift). The spec dismisses this because "HLCs only stamp Message events today" — but that's an implementation gap, not a design constraint. Adopting HLC on every Event (not just Message) is a much smaller change than baking an attacker-controllable field into a Merkle reconciliation structure for the lifetime of the protocol. It's also independently valuable for merge ordering in willow-state.
  • (author, seq) — the table says this "breaks the logarithmic property." That's only true if you insist on a single cross-author range. If you reconcile one chain at a time, you get exact sync in one round per divergent author with a trivial "max seq per author" vector (see my other comment on §Open Questions #3). For Willow's per-author Merkle-DAG that is strictly cheaper than Negentropy in the common case.

Concretely, I'd like to see the spec either (1) switch the recommendation to HLC + hash and describe the HLC-on-every-event rollout, or (2) make a much stronger case that NIP-77 wire compatibility is worth giving up a signed invariant for — who are we actually interoperating with? This is a private DAG; we're not federating with Nostr relays.

Please don't treat "we can reuse rust-nostr/negentropy with minimal glue" as the tiebreaker. The cost of getting the sort key wrong is much larger than the cost of a 200-line port.



client↔replay-worker path given per-author monotonicity?
3. **Per-author fast path.** Can we short-circuit with a single
`max_seq_per_author` vector exchange *before* opening a negentropy
session, falling back to negentropy only when seq gaps exist?
Owner Author


This question is buried in Open Questions but I think it deserves to be promoted into the core of the spec — possibly displacing Negentropy as the default path.

Willow's Event has something Nostr doesn't: a strictly monotonic, gap-free per-author seq with a prev hash chain (crates/state/src/event.rs:190-194). Given that, the natural sync primitive is:

```
A → B: HashMap<EndpointId, u64>   // my max seq per author
B → A: Vec<Event>                 // events where B.seq > A.seq for that author
```

That's O(authors) state, one round trip, and the "what's missing" answer is exact — no fingerprints, no recursion, no frame-splitting. For the dominant real-world cases (client reconnects after being offline an hour; replay worker warm-starts from storage) this strictly dominates Negentropy on bandwidth, code complexity, and latency.

Negentropy earns its keep only when peers have the same authors' events but at different seqs and there are large holes in the middle of chains — which in Willow shouldn't happen during normal operation because events arrive in order via gossip. The case where it's actually needed is cross-author ordering for UI display, which is a rendering concern, not a sync concern.

My suggestion: flip the framing of this spec. Default sync = per-author seq vector exchange. Negentropy is a fallback for (a) detecting DAG-level divergence via StateHash, (b) reconciling unusual histories produced by offline peers or deliberate forks. That also cleanly sidesteps the sort-key problem (you never need a canonical cross-author order on the wire).

If after analysis Negentropy really is the better primary, I'd want to see bandwidth/latency numbers for the common case comparing the two, not just asymptotic arguments.
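The vector-exchange fast path under discussion reduces to a few lines; type names here are placeholders, not Willow's actual types:

```rust
use std::collections::HashMap;

type AuthorId = u64; // placeholder for EndpointId
type Seq = u64;

/// One side's view: highest contiguous seq held per author.
pub type SeqVector = HashMap<AuthorId, Seq>;

/// Given the remote's vector, compute which (author, from_seq, to_seq)
/// chain tails we should stream back -- exact, one round trip, no
/// fingerprints, no recursion.
pub fn missing_tails(local: &SeqVector, remote: &SeqVector) -> Vec<(AuthorId, Seq, Seq)> {
    let mut out = Vec::new();
    for (&author, &local_max) in local {
        let remote_max = remote.get(&author).copied().unwrap_or(0);
        if local_max > remote_max {
            // remote needs (remote_max, local_max] from this author's chain
            out.push((author, remote_max + 1, local_max));
        }
    }
    out.sort_unstable();
    out
}
```

Gap-free per-author seq is what makes the "highest contiguous seq" summary lossless; without it, the vector alone can't express holes mid-chain, which is exactly where a Negentropy fallback would kick in.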



## Bandwidth and safety

- Each `NegMsg` is capped by `MAX_DESER_SIZE` (256 KB); a single round
trip carries at most ~16 000 fingerprints or ~8 000 IDs.
Owner Author


The "256 KB carries ~16 000 fingerprints" arithmetic is optimistic once you account for the envelope overhead and the fact that NIP-77 frames encode (upper_bound_timestamp, id_prefix, mode, payload) per range, not just raw 16-byte fingerprints. The hoytech reference's frameSizeLimit default is 4096 bytes for a reason — it's tuned for Nostr relay WebSocket frames. We inherit 256 KB from MAX_DESER_SIZE (see crates/transport/src/lib.rs:36), but the underlying iroh gossip max_message_size is 64 KB (noted in a comment in the same file). Which limit applies to NegMsg?

Concrete asks:

  1. State which transport path NegMsg travels (gossip vs. direct QUIC stream) and which size limit therefore binds.
  2. Specify frameSizeLimit as a protocol parameter, not just "split as needed" — receivers need to know when to expect a continuation vs. a terminal frame, especially since NIP-77 reconciliation is stateful across frames.
  3. Show a worst-case bound: given 1M events with adversarial timestamp bucketing (see my sort-key comment), what is the maximum number of NegMsg round trips before convergence? If it's unbounded in practice, we need a hard session-byte cap, not just the 10s / 30s timers.
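Concrete ask 2 could be satisfied with an explicit continuation flag. A hypothetical framing sketch — nothing here is in the spec, and the field names are illustrative:

```rust
/// Hypothetical continuation framing: one logical reconciliation reply
/// split under a negotiated frame_size_limit, with an explicit `more`
/// flag so the receiver can tell a continuation from a terminal frame
/// without guessing from envelope boundaries.
#[derive(Debug, PartialEq)]
pub struct NegFrame {
    pub session_id: u64,
    pub more: bool, // true => another frame for this reply follows
    pub chunk: Vec<u8>,
}

pub fn split_reply(session_id: u64, payload: &[u8], frame_size_limit: usize) -> Vec<NegFrame> {
    assert!(frame_size_limit > 0);
    let mut frames: Vec<NegFrame> = payload
        .chunks(frame_size_limit)
        .map(|c| NegFrame { session_id, more: true, chunk: c.to_vec() })
        .collect();
    if frames.is_empty() {
        // an empty reply still needs a terminal frame
        frames.push(NegFrame { session_id, more: true, chunk: Vec::new() });
    }
    frames.last_mut().unwrap().more = false;
    frames
}
```

With this shape, "strict ping-pong" means the peer may not send its next request until it has seen a frame with `more == false`.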


Channel-key events (`RotateChannelKey`) live in the same DAG as every
other event and therefore ride along inside negentropy sessions
automatically, subject to the filter. Per-recipient sealed key shares
are NOT part of the DAG and remain on their own point-to-point path.
Owner Author


"Per-recipient sealed key shares are NOT part of the DAG and remain on their own point-to-point path" — this needs to be a separate sub-spec or at minimum an explicit design sketch, not a one-liner dismissal. The consequences are:

  • A peer that comes online after missing a RotateChannelKey will receive the event via Negentropy but can't decrypt messages sealed with the new key because the sealed share was unicast and lost.
  • The unicast path has no obvious recovery mechanism if the sender is offline — who re-sends the share? The relay? Any peer with SyncProvider? That re-introduces exactly the sync problem Negentropy was meant to solve, but for key material.
  • If the answer is "any peer with the plaintext key re-seals for the joining peer," we need to specify where the authorization for that lives in the state machine.

Either expand this section with a real design or explicitly punt to a follow-up spec and acknowledge that without that follow-up, Negentropy sync is incomplete for the encrypted-channel use case. Right now the spec reads as if it's solved, and it isn't.



Owner Author

@intendednull intendednull left a comment


Round 2 review: comparative survey of sync protocols

Round 1 challenged Negentropy on its own terms (fingerprint correctness, sort-key adversarial behaviour, bucketing). Round 2 zooms out: is Negentropy even the right shape of protocol for Willow's data model? I read across the broader sync-protocol literature and the answer is "probably not, and there are at least two alternatives that are strictly better fits."

The headline: Willow already depends on iroh, iroh ships iroh-docs which already implements range-based set reconciliation (the same family of algorithm Negentropy belongs to, descended from Aljoscha Meyer's paper), and Willow's per-author monotonic-seq chains map exactly onto Scuttlebutt EBT — a simpler protocol that converges in O(authors) state and 1 RTT. The spec's open questions (#2 "rust-nostr crate maturity?", #3 "can per-author seq let us skip Negentropy entirely?") are gesturing at exactly this. The honest answers are "we don't need a new crate" and "yes, and we should."

1. Scuttlebutt EBT is the closest match to Willow's actual data model

Willow's Event carries (author_pubkey, seq, prev_hash) — a per-author append-only chain with content-addressed continuity. This is structurally identical to a Scuttlebutt feed. EBT is the protocol Scuttlebutt evolved specifically for that shape:

  • Each peer sends a vector clock {author -> last_seq_seen}. That's it. One round trip, O(authors) bytes — no recursion.
  • "Request skipping": peers cache the last vector clock the remote sent, and on reconnect omit any author whose seq hasn't advanced. A client that's been offline an hour and shares 99% of state ships ~zero bytes of metadata before useful payload starts flowing.
  • Bandwidth is "linear with messages to be sent" — the symmetric-difference property the spec wants from Negentropy, but achieved without a tree-recursion protocol because per-author chains are already monotonic.
  • After the clock exchange, the protocol degenerates to ordered streaming of events[seq > known_seq] per author. There is no fingerprint to compute, no sort key to design, no timestamp adversary because seq is authoritative and monotonic per author.
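The request-skipping behaviour reduces to a clock diff (illustrative only, not ssb-ebt's wire format):

```rust
use std::collections::HashMap;

type AuthorId = u64;
type Seq = u64;

/// EBT-style "request skipping": only re-send clock entries that have
/// advanced since the last vector clock we showed this peer. On a
/// reconnect with 99% shared state, the delta is near-empty.
pub fn clock_delta(
    current: &HashMap<AuthorId, Seq>,
    last_sent: &HashMap<AuthorId, Seq>,
) -> HashMap<AuthorId, Seq> {
    current
        .iter()
        .filter(|(author, seq)| last_sent.get(*author) != Some(*seq))
        .map(|(&a, &s)| (a, s))
        .collect()
}
```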

The spec acknowledges this in passing ("(author_pubkey, seq) ... per-author chains are monotonic and authoritative; enables trivial vector-clock sync") and then dismisses it because "it breaks the logarithmic property — we'd reconcile one chain at a time, not one mixed stream." That objection is wrong about the costs. Negentropy's "log" is O(log |A ⊕ B|) round trips on a flat unsorted set. EBT's cost is O(authors) state, 1 round trip, and O(events_to_send) bandwidth. For Willow's workload — modest author counts (server members), high event counts per author, and the diff being "the seq tail of a few authors" — EBT wins on every dimension that matters except one: it doesn't reconcile cross-author time ranges well. But the spec doesn't actually need that; the integration-points table is all "since_ms = client.last_seen" type queries that are equivalently expressed as "authors with seq > known_seq".

References: ssbc/epidemic-broadcast-trees, Planetary developer portal: EBT replication.

2. iroh-docs already ships range-based set reconciliation

This is the most surprising finding. Quoting iroh-docs' own README:

"Range-based set reconciliation is a simple approach to efficiently compute the union of two sets over a network, based on recursively partitioning the sets and comparing fingerprints of the partitions to probabilistically detect whether a partition requires further work."

That is the same algorithm family as Negentropy, descended from Aljoscha Meyer's 2022 paper (Negentropy/NIP-77 is essentially an instantiation of this paper with specific encoding choices). iroh-docs is a "meta-protocol" built on top of iroh-gossip and iroh-blobs — both of which Willow already depends on per CLAUDE.md's dependency graph.

The spec should either (a) explain why iroh-docs is unsuitable and Negentropy is, or (b) reuse iroh-docs. Reasons it might be unsuitable that the spec should address explicitly:

  • iroh-docs is keyed (author, key) with last-write-wins semantics, not append-only event DAGs — adapting Willow's EventKind graph to fit may be awkward.
  • iroh-docs replicas have schema constraints that don't match Willow's heterogeneous EventKind enum.
  • License/maturity/API stability of iroh-docs vs rust-nostr/negentropy.

But none of those are in the spec. Open question #2 ("rust-nostr/negentropy mature enough?") should be reframed as "why are we adopting a Nostr-flavoured implementation when iroh's own sync stack is already in our dependency tree?" — that is the single highest-leverage question for this PR.

3. Automerge sync protocol — bloom filter approach

Automerge's sync protocol is a useful contrast. It exchanges:

  1. Each peer's heads (DAG frontier hashes) — a few bytes.
  2. A bloom filter of all known change hashes — probabilistic membership.
  3. The other peer responds with changes the bloom says it doesn't have.

Tradeoffs vs Negentropy:

|  | Automerge | Negentropy | EBT |
| --- | --- | --- | --- |
| Round trips | 1–2 typically | O(log diff) | 1 |
| State exchanged | O(changes) bloom | O(diff) range tree | O(authors) clock |
| Exact? | False-positive rate (resends a few extras) | Exact | Exact |
| DAG-aware? | Yes (heads are graph frontiers) | No (flat sorted set) | No (per-author chains) |
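To make the bloom-filter row concrete, here is a toy version of the exchange — sizes and the hash scheme are illustrative, not Automerge's actual sync-message format:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter over change hashes, in the spirit of Automerge's
/// sync messages. m bits, k probe positions per item.
pub struct Bloom {
    bits: Vec<bool>,
    k: u64,
}

impl Bloom {
    pub fn new(m: usize, k: u64) -> Self {
        Self { bits: vec![false; m], k }
    }

    fn index(&self, item: &[u8], i: u64) -> usize {
        // Derive k probe positions by salting one hasher per probe.
        let mut h = DefaultHasher::new();
        i.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    pub fn insert(&mut self, item: &[u8]) {
        for i in 0..self.k {
            let idx = self.index(item, i);
            self.bits[idx] = true;
        }
    }

    /// May return a false positive, never a false negative -- so the
    /// responder sends every change the filter does NOT contain, plus
    /// at worst a few redundant ones.
    pub fn maybe_contains(&self, item: &[u8]) -> bool {
        (0..self.k).all(|i| self.bits[self.index(item, i)])
    }
}
```

The no-false-negative property is what makes the extra round unnecessary: anything the filter misses is definitely needed, and the occasional false positive just means one redundant fetch.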

Negentropy's exactness is overrated for Willow's case: the spec already says "after reconciliation, missing events are fetched via the existing event-fetch path, not inline." A handful of bloom false-positives = a handful of extra fetches = no observable difference. Automerge's approach is also a serious option, particularly because Willow's events form a DAG with parent hashes (prev_hash) — a heads-based protocol is a more natural fit than treating events as a flat sorted set.

4. Git's pack protocol — the "have/want" baseline

Git's smart-HTTP pack-protocol is the maturity benchmark for this problem space: 20 years of production deployment. It uses negotiation rather than fingerprints — client says "want X", server says "have Y", they walk back through history until they find common ancestors, server builds a packfile.

Properties: not logarithmic (linear in commits walked during negotiation), but cheap in practice because it walks from the tip and stops at the first common ancestor. It's DAG-aware (uses parent edges, like Willow's prev_hash), it's exact, and it's trivially adversary-resistant because it's bounded by what each side actually claims to have.

Worth a paragraph in the spec explaining why Willow rejects this approach. The natural answer is "doesn't compose with iroh gossip routing" but that's not in the spec.

5. Delta-state CRDTs — formal bandwidth bounds

The Almeida/Shoker/Baquero delta-CRDT line gives formal bandwidth-bounded sync for causally-ordered updates: bandwidth proportional to delta, not state, with provable convergence and proven anti-entropy semantics. Willow's events are causally ordered (parent-hash DAG), so this framework applies directly. Negentropy gives empirical bandwidth properties; delta-CRDT theory gives guarantees.

This is more "pick up techniques from" than "adopt wholesale" — but the spec frames Negentropy as if there's no formal alternative. There is.

6. Matrix federation — the cautionary tale

For completeness: Matrix's federation backfill + state-resolution v2 reconciles diverged DAGs. It is infamously difficult to implement correctly (multiple homeserver implementations have produced divergent room states from the same inputs). The lesson for Willow: protocols that try to do exact reconciliation of authority-bearing DAGs (state events, in Matrix; permission/role events, in Willow) accumulate edge cases. Authority events should sync via a different, simpler path than chat events. The spec collapses both into one Negentropy session via SyncFilter. That is dangerous — and the spec's note that "structural events ignore the channel filter so structure is always fully reconciled" is hinting at this without addressing it.

Bottom-line recommendation

Replace this spec with a comparison memo, then pick one of two paths:

Path A (recommended): EBT-shaped sync over iroh. Per-author vector-clock exchange + ordered seq streaming. Solves the actual workload (clients reconnecting after offline periods, worker↔worker replication of per-author chains) in 1 RTT and O(authors) state. No timestamp adversary, no sort-key design, no bucketing, no fingerprints. Roughly 30% of the implementation surface of Negentropy and a much smaller test matrix. Use a separate, simpler path for the small set of authority events (or fold them into the same per-author streams, since they are per-author by construction).
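The 1-RTT shape of Path A can be sketched as follows. All names here (`Event`, `AuthorId`, `events_above_frontier`, the `(author, seq)`-keyed store) are hypothetical illustrations, not Willow's actual `HeadsSummary`/`EventDag` types:

```rust
use std::collections::{BTreeMap, HashMap};

type AuthorId = String;

#[derive(Clone, Debug, PartialEq)]
struct Event {
    author: AuthorId,
    seq: u64,
}

/// One-RTT per-author vector exchange: the requester sends its frontier
/// (max seq it holds per author); the responder streams back every event
/// above that frontier, in (author, seq) order thanks to the BTreeMap key.
fn events_above_frontier(
    store: &BTreeMap<(AuthorId, u64), Event>,
    frontier: &HashMap<AuthorId, u64>,
) -> Vec<Event> {
    store
        .values()
        // Authors absent from the frontier are unknown to the requester,
        // so their whole chain is sent.
        .filter(|e| frontier.get(&e.author).map_or(true, |&have| e.seq > have))
        .cloned()
        .collect()
}
```

The O(authors) state claim is visible in the request shape: the frontier is one `u64` per author, regardless of how many events each chain holds.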

Path B: reuse iroh-docs. If the team wants range-based reconciliation specifically, the algorithm is already in the dependency tree. Negotiate the schema impedance mismatch with the iroh team or adapt Willow's EventKind to fit. Avoids reimplementing a known-tricky protocol.

What this PR currently is: adopting a third protocol from the Nostr ecosystem when neither (a) the data model (per-author chains, not flat event sets) nor (b) the dependency graph (iroh's own range-based sync is already present) suggests Nostr's choice is the right one for Willow.

Open question #3 ("can we short-circuit with a single max_seq_per_author vector exchange before opening a negentropy session, falling back to negentropy only when seq gaps exist?") is the spec quietly noticing this itself. The honest read: if the per-author fast path handles the common case, what cases does Negentropy actually still serve? If the answer is "none in production," the fallback isn't a fallback — it's the whole protocol, and it's EBT.


Generated by Claude Code

intendednull pushed a commit that referenced this pull request Apr 25, 2026
Apply review decisions to the relay capability document spec:

- Promote signing to v1 MUST (inline signature, RFC 8785 JCS canonical
  bytes, signature field excluded from canonicalisation).
- Specify dispatch surgery: explicit branch in dispatch_connection for
  /.well-known/willow plus OPTIONS preflight; reuse BOOTSTRAP_IO_TIMEOUT
  and MAX_CONCURRENT_BOOTSTRAP_CONNECTIONS; extend (not mirror) the
  handle_bootstrap_connection pattern.
- Drop event_schema_range (no EVENT_SCHEMA_VERSION exists in
  willow-state); list as future work.
- Resolve multi-tenant question: one shared doc per host, relay is
  topic-agnostic.
- Soften operator-metadata leakage: version is coarse semver, software
  is project name, both MAY be omitted.
- Two-tier caching by status: ok=300s, degraded/read_only=5s with
  must-revalidate.
- Recommend WS clients also send Sec-WebSocket-Protocol; JSON is
  advisory pre-connect.
- Fix port framing: relay binds one port multiplexing TCP+WS, not two.
- Drop sync_provider_only (operator vibes without a concrete
  pre-handshake check).
- Add Cross-spec coordination table pinning feature tags for #214,
  #216, #217, #218, #219, #220, #221.
- Rewrite Open Questions to keep only genuinely-open items (paid-relay
  semantics, utilisation telemetry, relay discovery, feature registry).

https://claude.ai/code/session_01XmbVXWnKTRVjPp9kmKRSBn
claude and others added 4 commits April 25, 2026 08:16
- Reframe as consolidation of existing HeadsSummary-based worker sync
  with the legacy gossip-level state-hash dump (cite the in-tree TODO
  at listeners.rs:292-297) instead of "introduce a new per-author
  vector exchange". The worker path already does this; the novelty is
  hoisting the same protocol to the gossip path.
- Replace the proposed HashMap<EndpointId, u64> request shape with
  reuse of the existing HeadsSummary { heads: BTreeMap<EndpointId,
  AuthorHead { seq, hash }> } so we keep the head hash for free fork
  detection via compare_chains.
- Drop the bogus "EventStore trait gains methods" framing — no such
  trait exists in willow-state. Describe the change as adding a
  small known_authors helper to the existing EventDag and
  StorageEventStore concrete types; defer trait extraction.
- Use String for server_id and channel IDs (matching EventKind) and
  call out that ServerId / messaging::ChannelId newtypes are NOT the
  types in use.
- Fix line citations: per-author seq check at dag.rs:146-160 (not
  event.rs:190-194); timestamp_hint_ms doc at event.rs:216-217 (not
  202-203); SyncProvider at event.rs:23.
- Reference apply_incremental (public) and EventDag::insert as the
  apply path; note apply_event is private.
- Mark SyncProvider gating as PROPOSED (not current) — neither worker
  role nor gossip path checks it today.
- Acknowledge the existing idx_events_author_seq index and propose
  adding a server-prefixed variant via a new migration rather than
  pretending the index is new.
- Clarify that WireMessage::SyncRequest/SyncBatch (gossip) and
  WorkerRequest::Sync/WorkerResponse::SyncBatch (worker) are TWO
  separate code paths both touched by this spec; the gossip payload
  shape changes, the worker payload doesn't.
- Note current MessageType only allocates slots 0-6; defer adding a
  dedicated Sync slot.
- Fix test-tier locations: state tests in sync.rs (not the
  nonexistent store.rs); wire round-trip tests inline in wire.rs (not
  the nonexistent transport/src/tests.rs); multi-peer convergence as
  client crate test against MemNetwork per CLAUDE.md test-tier rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Per-envelope budget now sized to 64 KiB gossip cap (not 256 KB MAX_DESER_SIZE)
- events_since / heads_summary() correctly attributed to dag.rs (not sync.rs)
- Storage shape claim corrected to Vec<EventHash> with skip-based scan
- sync_since query plan description updated for OR-fanout / unknown-authors branch
- Asymmetry note: requester-known authors we don't have are ignored
- try_insert_event referenced as actual client entry point

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lation - round 4

- Per-envelope budget: 64 KiB minus small constant (~200B); dropped wrong "~57-60 KiB usable"
- Migration: don't bump PROTOCOL_VERSION; use additive SyncRequestV2/SyncBatchV2 variants for soft rollout
- SyncRequest gains request_id (matched to worker path's String for consolidation)
- Worker path: only `more` added; outer WorkerWireMessage::Response.request_id reused
- Index claim downgraded: NOT IN disjunct still requires server-scan; recommend restructuring sync_since to use explicit per-author predicates
- HistorySyncComplete framed as defined by spec #214 (unmerged); SyncCompleted relationship spelled out
- Author-count threshold corrected to ~900
- listeners.rs MAX_SYNC_BATCH_SIZE = 10_000 acknowledged; defense-in-depth retained
- Line cites tightened across multiple sections

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intendednull intendednull merged commit 56880b2 into main Apr 26, 2026
5 checks passed
@intendednull intendednull deleted the claude/spec-negentropy-sync branch April 26, 2026 07:26