
spec: negentropy-based range reconciliation for history sync #219

Merged
intendednull merged 5 commits into main from claude/spec-negentropy-sync on Apr 26, 2026

Conversation

@intendednull
Owner

Part of a set of 8 specs drawing lessons from Nostr's protocol and ecosystem. Use this PR to discuss the design — not proposing implementation, only the spec.

What & why

Willow's current history sync is "replay the last 1000 events from a worker's ring buffer, or dump all archival from storage." For partially-overlapping peers this is wasteful — a client that's been offline for an hour re-downloads events it already has.

Nostr's NIP-77 Negentropy (Doug Hoyte) solves this with range-based set reconciliation: both sides sort events by a common key, exchange 16-byte fingerprints over ranges, recurse on mismatches. Bandwidth scales with symmetric difference, not total set size. strfry uses it for relay-to-relay replication.

This spec proposes adopting NIP-77-style Negentropy for Willow:

  • Sort key: (timestamp_hint_ms, EventHash) — matches NIP-77's shape so we can reuse rust-nostr/negentropy. Epoch-day bucketing plus SyncProvider gating mitigates adversarial timestamps.
  • Fingerprint: verbatim from NIP-77 (truncate16(sha256(xor_sum(ids) || count_le)))
  • Wire: 4 new MessageType variants — NegOpen, NegMsg, NegClose, NegErr — fitting Willow's 256 KB envelope.
  • Filter: SyncFilter (authors, time range, channel, EventKind)
  • Integration points (table in spec): client↔replay, client↔storage, replay↔storage, storage↔storage replication
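Sketched as hypothetical Rust, the new wire variants and filter might look like this — the variant and field names follow the bullets above, but the concrete layouts are my assumption, not the spec's definitions:

```rust
// Hypothetical shapes for the proposed wire messages. Only the variant
// names come from the spec; field layouts are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum NegMessage {
    /// Initiator opens a session: filter plus the first fingerprint frame.
    NegOpen { session_id: u64, filter: SyncFilter, initial_msg: Vec<u8> },
    /// Alternating reconciliation frames until ranges converge.
    NegMsg { session_id: u64, payload: Vec<u8> },
    /// Either side terminates a converged session.
    NegClose { session_id: u64 },
    /// Protocol-level failure (unsupported sort key, blocked, etc.).
    NegErr { session_id: u64, reason: String },
}

// Assumed filter shape mirroring the (authors, time range, channel,
// EventKind) list above; placeholder types throughout.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct SyncFilter {
    pub authors: Option<Vec<[u8; 32]>>, // author public keys
    pub since_ms: Option<u64>,          // time-range lower bound
    pub until_ms: Option<u64>,          // time-range upper bound
    pub channel: Option<u64>,           // channel id (placeholder type)
    pub kinds: Option<Vec<u16>>,        // EventKind discriminants
}
```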

Spec file: docs/specs/2026-04-24-negentropy-sync.md

Open questions for review

  1. Sort key choice (biggest design decision) — (timestamp_hint_ms, hash) vs (author, seq) vs HLC. The per-author seq path might let us simplify to a vector-style "last seq per author" sync instead of full Negentropy.
  2. Is rust-nostr/negentropy mature enough to depend on, or do we port?
  3. Does per-author monotonic seq actually let us skip Negentropy entirely for the common case?
  4. Encrypted channel keys — reconciled via the same flow, or separately?
  5. Interaction with SyncProvider permission — only providers serve Neg sessions?
  6. Bandwidth / frame budget — how does this interact with relay caps?

Composition with sibling specs

  • History sync EOSE (separate PR): natural termination signal for NegClose
  • Relay capability doc (separate PR): advertise supports_negentropy
  • Outbox relay discovery: discovery of Neg-capable providers

Commit is unsigned due to harness signing backend failure (same as sibling PRs in this set).


Generated by Claude Code

Owner Author

@intendednull intendednull left a comment


Spec review. Solid framing of the problem, but several concrete issues block adoption as written.

Blocking

  1. Fingerprint formula is wrong. §Fingerprint specifies truncate16(sha256(xor_sum(ids) || count_le)). NIP-77 and negentropy-protocol-v1.md both specify (a) addition mod 2^256 of IDs interpreted as 32-byte little-endian integers — not XOR — and (b) the element count encoded as a varint — not a fixed little-endian u64. As written we will not interop with rust-nostr/negentropy or generate matching reference vectors, defeating the "reuse the crate with minimal glue" rationale. The protocol byte 0x61 is correct; please fix the digest construction and add Hoyte's published test vectors to the unit tier.

  2. MessageType = 7 collides with the EOSE spec. PR #214 (spec-history-sync-eose, branch claude/spec-history-sync-eose) reserves MessageType::HistorySyncComplete = 7 at the same crates/transport/src/lib.rs:64 slot. Pick non-overlapping tags or coordinate the numbering in one of the two specs before either lands.

  3. NegClose ≠ HistorySyncComplete. §Completion signalling claims a Negentropy session naturally satisfies the EOSE contract. It does not: PR #214's HistorySyncComplete carries (topic_id, provider_peer, last_event_hash, epoch) precisely so the client can detect silent truncation and so a restarted provider doesn't reuse a stale marker. NegClose { session_id } carries none of that. Either (a) augment NegClose with the EOSE anchor fields and explicitly subsume #214, or (b) keep the two distinct and document the ordering (NegClose then HistorySyncComplete). The spec must pick one — silently dropping #214's invariants is the worst outcome.
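For reference against blocking issue 1, here is a sketch of the corrected digest input: addition mod 2^256 over IDs as little-endian integers, plus the big-endian base-128 varint the protocol uses for the count. The final sha256/truncate16 step is elided since it needs a hash dependency; this only demonstrates the accumulator and encoding that the spec currently gets wrong.

```rust
/// Accumulates 32-byte IDs as little-endian integers mod 2^256,
/// per NIP-77: addition with carry, NOT xor.
pub struct FingerprintAccumulator {
    sum: [u8; 32], // little-endian 256-bit accumulator
    count: u64,
}

impl FingerprintAccumulator {
    pub fn new() -> Self {
        Self { sum: [0u8; 32], count: 0 }
    }

    pub fn add(&mut self, id: &[u8; 32]) {
        let mut carry = 0u16;
        for i in 0..32 {
            let v = self.sum[i] as u16 + id[i] as u16 + carry;
            self.sum[i] = v as u8; // wraps mod 2^256 when carry falls off the top
            carry = v >> 8;
        }
        self.count += 1;
    }

    /// Bytes fed to sha256 (then truncated to 16): sum || varint(count).
    /// The hash step itself is elided here; use a real sha256 in practice.
    pub fn digest_input(&self) -> Vec<u8> {
        let mut out = self.sum.to_vec();
        out.extend(varint(self.count));
        out
    }
}

/// Negentropy varint: base-128, big-endian, high bit set on every byte
/// except the last -- not a fixed little-endian u64.
pub fn varint(mut n: u64) -> Vec<u8> {
    let mut groups = vec![(n & 0x7f) as u8];
    n >>= 7;
    while n > 0 {
        groups.push(0x80 | (n & 0x7f) as u8);
        n >>= 7;
    }
    groups.reverse();
    groups
}
```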

Significant

  1. Reconsider Negentropy at all for the client↔replay-worker path. Willow's Event (crates/state/src/event.rs:184-204) has author: EndpointId, strictly-monotonic seq: u64, and signed prev: EventHash. Per-author chains are append-only and authoritative — not the unstructured ID set Negentropy is designed for. Open Question 3 (per-author fast-path) is the actually-important question this spec should answer, not punt: a one-round-trip max_seq_per_author vector exchange will resolve the overwhelmingly common reconnect-after-downtime case in 1 RTT and O(authors) bytes, with zero range index. Negentropy's log-diff property only pays off when symmetric difference is large and spans many authors with no shared structure — rare here. Recommend: spec the vector-clock fast-path as the primary protocol and Negentropy as a fallback for cross-worker / disjoint-author cases, not the other way round. This also removes the entire sort-key debate.

  2. hlc_timestamp sort key is not currently available. §Sort key lists (hlc_timestamp, hash) as an option, but Event does not carry an HLC — only timestamp_hint_ms. HLCs live in willow-messaging::hlc and are scoped to chat content. Adopting that key requires a wider state-machine change than the spec acknowledges; either remove the option or scope the migration.

  3. Adversarial timestamps mitigation is hand-wavy. Epoch-day bucketing helps but does not bound recursion within a poisoned bucket — a peer who lands 100k events at t=0 makes the first bucket's reconciliation worst-case. SyncProvider-gated serving doesn't help because the malicious data is already in the DAG and being reconciled. If we keep (timestamp_hint_ms, hash), the spec needs a hard recursion-depth or per-bucket size cap and a defined behaviour on hit (fall back to IdList? abort with NegErr(Blocked)?). This further argues for #4.

Minor

  1. rust-nostr/negentropy API fit. Crate is real (MIT, v0.4–0.5, MSRV 1.51, published as negentropy on crates.io), but pinned to NIP-77's (uint64, 32-byte id) item shape and an in-memory Storage model. The "minimal glue" claim should be qualified — we'll need a bridge type to back it with the SQLite range-scan iterator on storage workers, and we cannot use it directly with a custom sort key.

  2. Frame budget. 256 KB MAX_DESER_SIZE minus envelope overhead caps a single NegMsg at ~16k fingerprints. Sessions over a large symmetric difference will span many envelopes — the spec needs to say (a) is concurrent send allowed or strict ping-pong, (b) what happens if the responder's reply doesn't fit one envelope (the current text says "split into multiple NegMsg" but doesn't define ordering / interleaving with the next request frame), (c) backpressure when the gossip buffer is full.

  3. §Storage requirements: the range_scan signature returns Box<dyn Iterator> but EventStore impls cross WASM/native — confirm the trait stays object-safe and Send-bounded per the dual-target rule in CLAUDE.md.

  4. §Wire protocol table puts initial_msg: Vec<u8> inside NegOpen. NIP-77 mirrors this, but it's worth noting this means the first fingerprint already commits the initiator to a sort key the responder hasn't acked — define the failure mode if the responder disagrees on sort key (NegErr(Unsupported)?).

  5. Open Question 5 (require a permission to initiate): yes — at minimum require server membership, otherwise any peer that learns a ServerId can probe existence and event-set fingerprints. Easy to add now; awkward to add later.

Overall: the direction is reasonable, but I think the right v1 here is the per-author seq vector exchange, with Negentropy reserved for worker↔worker and disjoint-history cases. If we're going full Negentropy, the fingerprint and EOSE-overlap issues must be resolved before implementation starts.


Owner Author

@intendednull intendednull left a comment


Thorough spec with a clear algorithmic story and good cross-references into the codebase — the algorithm summary, fingerprint definition, and storage-index sketch are all concrete enough to implement against. That said, I have substantive concerns about the primary design decision and the interaction with existing Willow invariants. I don't think this should land as-is.

Strengths

  • The motivation is well-grounded: the current "dump 1000 events from the replay ring buffer" path is genuinely wasteful, and O(log(|A ⊕ B|)) reconciliation is the right shape of answer for worker↔worker replication.
  • Mirroring NIP-77's fingerprint construction byte-for-byte is the right call if we stay with Negentropy — the additive homomorphism over IDs (sum mod 2^256) is exactly what lets ranges split cheaply, and deviating would forfeit Hoyte's reference vectors.
  • Filter design is sensible: structural events (GrantPermission, CreateChannel) bypass the channel filter so server structure always converges. This is important and easy to get wrong.
  • NegClose satisfying the SyncComplete contract from the EOSE spec (#214) is a nice composition — one less redundant signal to carry.
  • Relay stays a stateless bridge (crates/relay/src/lib.rs:16-41), session state lives in participants. Keeps the trust model unchanged.
  • Test matrix names the right scenarios (three-peer, edge cases, byte-count assertion on reconnect).

Concerns

  1. Sort key is the wrong call and is the most important decision in the spec. timestamp_hint_ms is documented as "Display only — never used for ordering" (crates/state/src/event.rs:202-203). Making it the canonical sort key of the reconciliation index is a direct inversion of that invariant, and since the field is signed and author-controlled, it puts adversarial input on the hot path. The epoch-day bucketing + SyncProvider gating is weaker than it reads — any member with SendMessages can produce skewed timestamps, and bucketing just concentrates the pathology into a single bucket. See my inline comment on line 57.

  2. Negentropy may be overkill for Willow's DAG. Willow has monotonic per-author seq + prev chains that Nostr doesn't. A "max seq per author" vector exchange is O(authors) state, one round trip, exact — and strictly dominates Negentropy for the common reconnect/warm-start cases. Open Question #3 should not be an open question; it should drive the framing. See my inline comment on line 248.

  3. Encrypted channel-key distribution is waved away. "Sealed key shares stay on a parallel unicast flow" is one line describing a hard problem. If a peer comes online after missing RotateChannelKey and the sender is offline, there's no recovery story. Either design it here or explicitly defer and flag that encrypted-channel sync is incomplete. See inline comment on line 225.

  4. Envelope budget and frame sizing are underspecified. "16 000 fingerprints per NegMsg" assumes the full 256 KB MAX_DESER_SIZE, but the gossip transport has a 64 KB max_message_size and NIP-77 frames carry more than just the 16-byte digest. frameSizeLimit needs to be a protocol parameter and receivers need to know continuation semantics. See inline comment on line 201.

  5. rust-nostr/negentropy dependency risk unexamined. Open Question #1 flags it but doesn't answer it. License? Audit state? WASM support (we compile library crates to wasm32-unknown-unknown per CLAUDE.md)? Maintenance cadence? If the answer is "fork and port," that's a materially different cost than "add a dep." This needs to be resolved before merging the spec, not after.

  6. SyncProvider-only serving conflicts with peer-to-peer sync. The spec gates NegOpen responders to SyncProvider, but the Integration Points table lists "client ↔ replay worker" and "client ↔ storage worker" as the primary paths — fine if workers hold SyncProvider, but then regular peers can never directly reconcile with each other over gossip. Is that intentional? It means the system degrades to "you must have a worker online" for any history recovery, which is an availability regression from today's gossip replay. Open Question #5 circles this but doesn't resolve it.

  7. Worker architecture fit. docs/specs/2026-03-27-worker-nodes-design.md describes SyncRequest/SyncBatch as the current interface and lists a max_events config for replay workers. This spec replaces that path but doesn't discuss migration: do both protocols coexist during rollout, or is there a flag day? What happens when a new client talks to an old worker or vice versa? The four new MessageType variants don't imply version negotiation beyond the existing envelope version bump.

Suggestions

  • Reframe around per-author seq as the primary sync path. Keep Negentropy as a fallback for detecting DAG-level divergence (via StateHash) and for cross-author reconciliation of unusual histories. Default sync = seq-vector exchange, and that sidesteps the sort-key problem entirely.
  • If Negentropy stays primary, switch the sort key to (HlcTimestamp, EventHash). HLCs (crates/messaging/src/hlc.rs) exist, are monotonic, have bounded skew, and aren't attacker-chosen. Extending HLC to every EventKind (not just Message) is a cheaper change than eternally carrying attacker-controllable state in the reconciliation index, and it has standalone value for willow-state merge ordering.
  • Add a worst-case adversarial analysis. Given N events with arbitrary attacker-chosen timestamps within a bucket, what is the bound on round trips and total bytes? If it's unbounded, add a hard session-byte cap in §Bandwidth.
  • Add fingerprint reference vectors from hoytech's test suite to the testing table. "Matches Hoyte reference vectors" should cite specific vectors so the implementation can't silently diverge.
  • Specify version negotiation. What does a v2-capable peer send to a v1 worker that doesn't know NegOpen? Today's Envelope::validate_version rejects the whole envelope; the rollout plan needs to account for that.
  • Resolve rust-nostr/negentropy vs. port before merging. A spec that depends on an unaudited external crate for a security-relevant (permission-gated, adversarial-timestamp-exposed) protocol shouldn't punt that question.
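For context on the HLC suggestion above, the standard hybrid-logical-clock update rules look roughly like this — a sketch only, not the willow-messaging implementation; field widths and tie-breaking are illustrative:

```rust
/// Minimal hybrid logical clock (sketch; not crates/messaging/src/hlc.rs).
/// Orders lexicographically by (physical_ms, logical), so derived Ord
/// gives the sort-key comparison directly.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub physical_ms: u64, // max observed wall-clock time
    pub logical: u32,     // tie-breaker when physical time stalls
}

impl Hlc {
    /// Local event: advance past both the wall clock and the current HLC.
    pub fn tick(&mut self, now_ms: u64) -> Hlc {
        if now_ms > self.physical_ms {
            self.physical_ms = now_ms;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Receive: merge the remote stamp so causality is preserved and the
    /// stamp can never run ahead of max(local clock, remote stamp).
    pub fn receive(&mut self, remote: Hlc, now_ms: u64) -> Hlc {
        let prev = *self;
        let max_phys = now_ms.max(prev.physical_ms).max(remote.physical_ms);
        self.logical = if max_phys == prev.physical_ms && max_phys == remote.physical_ms {
            prev.logical.max(remote.logical) + 1
        } else if max_phys == prev.physical_ms {
            prev.logical + 1
        } else if max_phys == remote.physical_ms {
            remote.logical + 1
        } else {
            0
        };
        self.physical_ms = max_phys;
        *self
    }
}
```

The point for the sort-key debate: each node only ever advances its own clock on receive, so a peer stamping events at `u64::MAX` cannot retroactively reorder honest events — unlike `timestamp_hint_ms`.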

Happy to re-review once the sort-key decision is settled — that's the one that governs everything else.



| `(hlc_timestamp, hash)` | HLCs (see [`crates/messaging/src/hlc.rs`](../../crates/messaging/src/hlc.rs)) give monotonic causal order across authors; resilient to clock skew | HLCs only stamp `Message` events today; non-message `EventKind` variants would need HLC adoption first |
| `(author_pubkey, seq)` primary with `(ts, hash)` fallback | Cheap fast-path for peers that share most chains | Two protocols to implement and reason about |

**Recommendation: `(timestamp_hint_ms, hash)`** for the initial
Owner Author


This is the load-bearing decision of the whole spec, and I think it's the wrong call as written.

crates/state/src/event.rs:202-203 explicitly says timestamp_hint_ms is "Display only — never used for ordering." Elevating it to the primary sort key of the canonical range index is a direct semantic reversal of that invariant. It also puts adversarial input on the hot path: since the field is in the signed payload, a malicious author can legitimately produce events claiming t = 0 or t = u64::MAX and the signature still verifies. "Bucket by epoch-day + gate SyncProvider" doesn't really defuse this: (a) the attacker only needs write access to the server (SendMessages), not SyncProvider; (b) epoch-day bucketing just moves the recursion cost into one bucket rather than eliminating it, and a single bucket of 10k skewed events will still blow past the 256 KB envelope budget discussed in §Bandwidth; (c) "kickable via governance" is an after-the-fact remedy for a protocol the attacker already made expensive.

Alternatives that the table undersells:

  • (HlcTimestamp, EventHash) — HLCs already exist (crates/messaging/src/hlc.rs) and give a monotonic, bounded-skew total order that is not client-controllable in the same way (each node advances its own clock on receive, bounding drift). The spec dismisses this because "HLCs only stamp Message events today" — but that's an implementation gap, not a design constraint. Adopting HLC on every Event (not just Message) is a much smaller change than baking an attacker-controllable field into a Merkle reconciliation structure for the lifetime of the protocol. It's also independently valuable for merge ordering in willow-state.
  • (author, seq) — the table says this "breaks the logarithmic property." That's only true if you insist on a single cross-author range. If you reconcile one chain at a time, you get exact sync in one round per divergent author with a trivial "max seq per author" vector (see my other comment on §Open Questions #3). For Willow's per-author Merkle-DAG that is strictly cheaper than Negentropy in the common case.

Concretely, I'd like to see the spec either (1) switch the recommendation to HLC + hash and describe the HLC-on-every-event rollout, or (2) make a much stronger case that NIP-77 wire compatibility is worth giving up a signed invariant for — who are we actually interoperating with? This is a private DAG; we're not federating with Nostr relays.

Please don't treat "we can reuse rust-nostr/negentropy with minimal glue" as the tiebreaker. The cost of getting the sort key wrong is much larger than the cost of a 200-line port.



client↔replay-worker path given per-author monotonicity?
3. **Per-author fast path.** Can we short-circuit with a single
`max_seq_per_author` vector exchange *before* opening a negentropy
session, falling back to negentropy only when seq gaps exist?
Owner Author


This question is buried in Open Questions but I think it deserves to be promoted into the core of the spec — possibly displacing Negentropy as the default path.

Willow's Event has something Nostr doesn't: a strictly monotonic, gap-free per-author seq with a prev hash chain (crates/state/src/event.rs:190-194). Given that, the natural sync primitive is:

```
A → B: HashMap<EndpointId, u64>   // my max seq per author
B → A: Vec<Event>                 // events where B.seq > A.seq for that author
```

That's O(authors) state, one round trip, and the "what's missing" answer is exact — no fingerprints, no recursion, no frame-splitting. For the dominant real-world cases (client reconnects after being offline an hour; replay worker warm-starts from storage) this strictly dominates Negentropy on bandwidth, code complexity, and latency.

Negentropy earns its keep only when peers have the same authors' events but at different seqs and there are large holes in the middle of chains — which in Willow shouldn't happen during normal operation because events arrive in order via gossip. The case where it's actually needed is cross-author ordering for UI display, which is a rendering concern, not a sync concern.

My suggestion: flip the framing of this spec. Default sync = per-author seq vector exchange. Negentropy is a fallback for (a) detecting DAG-level divergence via StateHash, (b) reconciling unusual histories produced by offline peers or deliberate forks. That also cleanly sidesteps the sort-key problem (you never need a canonical cross-author order on the wire).

If after analysis Negentropy really is the better primary, I'd want to see bandwidth/latency numbers for the common case comparing the two, not just asymptotic arguments.
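The vector-exchange fast path under discussion reduces to a few lines; type names here are placeholders, not Willow's actual types:

```rust
use std::collections::HashMap;

type AuthorId = u64; // placeholder for EndpointId
type Seq = u64;

/// One side's view: highest contiguous seq held per author.
pub type SeqVector = HashMap<AuthorId, Seq>;

/// Given the remote's vector, compute which (author, from_seq, to_seq)
/// chain tails we should stream back -- exact, one round trip, no
/// fingerprints, no recursion.
pub fn missing_tails(local: &SeqVector, remote: &SeqVector) -> Vec<(AuthorId, Seq, Seq)> {
    let mut out = Vec::new();
    for (&author, &local_max) in local {
        let remote_max = remote.get(&author).copied().unwrap_or(0);
        if local_max > remote_max {
            // remote needs (remote_max, local_max] from this author's chain
            out.push((author, remote_max + 1, local_max));
        }
    }
    out.sort_unstable();
    out
}
```

Gap-free per-author seq is what makes the "highest contiguous seq" summary lossless; without it, the vector alone can't express holes mid-chain, which is exactly where a Negentropy fallback would kick in.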



## Bandwidth and safety

- Each `NegMsg` is capped by `MAX_DESER_SIZE` (256 KB); a single round
trip carries at most ~16 000 fingerprints or ~8 000 IDs.
Owner Author


The "256 KB carries ~16 000 fingerprints" arithmetic is optimistic once you account for the envelope overhead and the fact that NIP-77 frames encode (upper_bound_timestamp, id_prefix, mode, payload) per range, not just raw 16-byte fingerprints. The hoytech reference's frameSizeLimit default is 4096 bytes for a reason — it's tuned for Nostr relay WebSocket frames. We inherit 256 KB from MAX_DESER_SIZE (see crates/transport/src/lib.rs:36), but the underlying iroh gossip max_message_size is 64 KB (noted in a comment in the same file). Which limit applies to NegMsg?

Concrete asks:

  1. State which transport path NegMsg travels (gossip vs. direct QUIC stream) and which size limit therefore binds.
  2. Specify frameSizeLimit as a protocol parameter, not just "split as needed" — receivers need to know when to expect a continuation vs. a terminal frame, especially since NIP-77 reconciliation is stateful across frames.
  3. Show a worst-case bound: given 1M events with adversarial timestamp bucketing (see my sort-key comment), what is the maximum number of NegMsg round trips before convergence? If it's unbounded in practice, we need a hard session-byte cap, not just the 10s / 30s timers.
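Concrete ask 2 could be satisfied with an explicit continuation flag. A hypothetical framing sketch — nothing here is in the spec, and the field names are illustrative:

```rust
/// Hypothetical continuation framing: one logical reconciliation reply
/// split under a negotiated frame_size_limit, with an explicit `more`
/// flag so the receiver can tell a continuation from a terminal frame
/// without guessing from envelope boundaries.
#[derive(Debug, PartialEq)]
pub struct NegFrame {
    pub session_id: u64,
    pub more: bool, // true => another frame for this reply follows
    pub chunk: Vec<u8>,
}

pub fn split_reply(session_id: u64, payload: &[u8], frame_size_limit: usize) -> Vec<NegFrame> {
    assert!(frame_size_limit > 0);
    let mut frames: Vec<NegFrame> = payload
        .chunks(frame_size_limit)
        .map(|c| NegFrame { session_id, more: true, chunk: c.to_vec() })
        .collect();
    if frames.is_empty() {
        // an empty reply still needs a terminal frame
        frames.push(NegFrame { session_id, more: true, chunk: Vec::new() });
    }
    frames.last_mut().unwrap().more = false;
    frames
}
```

With this shape, "strict ping-pong" means the peer may not send its next request until it has seen a frame with `more == false`.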


Channel-key events (`RotateChannelKey`) live in the same DAG as every
other event and therefore ride along inside negentropy sessions
automatically, subject to the filter. Per-recipient sealed key shares
are NOT part of the DAG and remain on their own point-to-point path.
Owner Author


"Per-recipient sealed key shares are NOT part of the DAG and remain on their own point-to-point path" — this needs to be a separate sub-spec or at minimum an explicit design sketch, not a one-liner dismissal. The consequences are:

  • A peer that comes online after missing a RotateChannelKey will receive the event via Negentropy but can't decrypt messages sealed with the new key because the sealed share was unicast and lost.
  • The unicast path has no obvious recovery mechanism if the sender is offline — who re-sends the share? The relay? Any peer with SyncProvider? That re-introduces exactly the sync problem Negentropy was meant to solve, but for key material.
  • If the answer is "any peer with the plaintext key re-seals for the joining peer," we need to specify where the authorization for that lives in the state machine.

Either expand this section with a real design or explicitly punt to a follow-up spec and acknowledge that without that follow-up, Negentropy sync is incomplete for the encrypted-channel use case. Right now the spec reads as if it's solved, and it isn't.



Owner Author

@intendednull intendednull left a comment


Round 2 review: comparative survey of sync protocols

Round 1 challenged Negentropy on its own terms (fingerprint correctness, sort-key adversarial behaviour, bucketing). Round 2 zooms out: is Negentropy even the right shape of protocol for Willow's data model? I read across the broader sync-protocol literature and the answer is "probably not, and there are at least two alternatives that are strictly better fits."

The headline: Willow already depends on iroh, iroh ships iroh-docs which already implements range-based set reconciliation (the same family of algorithm Negentropy belongs to, descended from Aljoscha Meyer's paper), and Willow's per-author monotonic-seq chains map exactly onto Scuttlebutt EBT — a simpler protocol that converges in O(authors) state and 1 RTT. The spec's open questions (#2 "rust-nostr crate maturity?", #3 "can per-author seq let us skip Negentropy entirely?") are gesturing at exactly this. The honest answers are "we don't need a new crate" and "yes, and we should."

1. Scuttlebutt EBT is the closest match to Willow's actual data model

Willow's Event carries (author_pubkey, seq, prev_hash) — a per-author append-only chain with content-addressed continuity. This is structurally identical to a Scuttlebutt feed. EBT is the protocol Scuttlebutt evolved specifically for that shape:

  • Each peer sends a vector clock {author -> last_seq_seen}. That's it. One round trip, O(authors) bytes — no recursion.
  • "Request skipping": peers cache the last vector clock the remote sent, and on reconnect omit any author whose seq hasn't advanced. A client that's been offline an hour and shares 99% of state ships ~zero bytes of metadata before useful payload starts flowing.
  • Bandwidth is "linear with messages to be sent" — the symmetric-difference property the spec wants from Negentropy, but achieved without a tree-recursion protocol because per-author chains are already monotonic.
  • After the clock exchange, the protocol degenerates to ordered streaming of events[seq > known_seq] per author. There is no fingerprint to compute, no sort key to design, no timestamp adversary because seq is authoritative and monotonic per author.
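The request-skipping behaviour reduces to a clock diff (illustrative only, not ssb-ebt's wire format):

```rust
use std::collections::HashMap;

type AuthorId = u64;
type Seq = u64;

/// EBT-style "request skipping": only re-send clock entries that have
/// advanced since the last vector clock we showed this peer. On a
/// reconnect with 99% shared state, the delta is near-empty.
pub fn clock_delta(
    current: &HashMap<AuthorId, Seq>,
    last_sent: &HashMap<AuthorId, Seq>,
) -> HashMap<AuthorId, Seq> {
    current
        .iter()
        .filter(|(author, seq)| last_sent.get(*author) != Some(*seq))
        .map(|(&a, &s)| (a, s))
        .collect()
}
```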

The spec acknowledges this in passing ("(author_pubkey, seq) ... per-author chains are monotonic and authoritative; enables trivial vector-clock sync") and then dismisses it because "it breaks the logarithmic property — we'd reconcile one chain at a time, not one mixed stream." That objection is wrong about the costs. Negentropy's "log" is O(log |A ⊕ B|) round trips on a flat unsorted set. EBT's cost is O(authors) state, 1 round trip, and O(events_to_send) bandwidth. For Willow's workload — modest author counts (server members), high event counts per author, and the diff being "the seq tail of a few authors" — EBT wins on every dimension that matters except one: it doesn't reconcile cross-author time ranges well. But the spec doesn't actually need that; the integration-points table is all "since_ms = client.last_seen" type queries that are equivalently expressed as "authors with seq > known_seq".

References: ssbc/epidemic-broadcast-trees, Planetary developer portal: EBT replication.

2. iroh-docs already ships range-based set reconciliation

This is the most surprising finding. Quoting iroh-docs' own README:

"Range-based set reconciliation is a simple approach to efficiently compute the union of two sets over a network, based on recursively partitioning the sets and comparing fingerprints of the partitions to probabilistically detect whether a partition requires further work."

That is the same algorithm family as Negentropy, descended from Aljoscha Meyer's 2022 paper (Negentropy/NIP-77 is essentially an instantiation of this paper with specific encoding choices). iroh-docs is a "meta-protocol" built on top of iroh-gossip and iroh-blobs — both of which Willow already depends on per CLAUDE.md's dependency graph.

The spec should either (a) explain why iroh-docs is unsuitable and Negentropy is, or (b) reuse iroh-docs. Reasons it might be unsuitable that the spec should address explicitly:

  • iroh-docs is keyed (author, key) with last-write-wins semantics, not append-only event DAGs — adapting Willow's EventKind graph to fit may be awkward.
  • iroh-docs replicas have schema constraints that don't match Willow's heterogeneous EventKind enum.
  • License/maturity/API stability of iroh-docs vs rust-nostr/negentropy.

But none of those are in the spec. Open question #2 ("rust-nostr/negentropy mature enough?") should be reframed as "why are we adopting a Nostr-flavoured implementation when iroh's own sync stack is already in our dependency tree?" — that is the single highest-leverage question for this PR.

3. Automerge sync protocol — bloom filter approach

Automerge's sync protocol is a useful contrast. It exchanges:

  1. Each peer's heads (DAG frontier hashes) — a few bytes.
  2. A bloom filter of all known change hashes — probabilistic membership.
  3. The other peer responds with changes the bloom says it doesn't have.

Tradeoffs vs Negentropy:

|  | Automerge | Negentropy | EBT |
| --- | --- | --- | --- |
| Round trips | 1–2 typically | O(log diff) | 1 |
| State exchanged | O(changes) bloom | O(diff) range tree | O(authors) clock |
| Exact? | False-positive rate (resends a few extras) | Exact | Exact |
| DAG-aware? | Yes (heads are graph frontiers) | No (flat sorted set) | No (per-author chains) |
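To make the bloom-filter row concrete, here is a toy version of the exchange — sizes and the hash scheme are illustrative, not Automerge's actual sync-message format:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter over change hashes, in the spirit of Automerge's
/// sync messages. m bits, k probe positions per item.
pub struct Bloom {
    bits: Vec<bool>,
    k: u64,
}

impl Bloom {
    pub fn new(m: usize, k: u64) -> Self {
        Self { bits: vec![false; m], k }
    }

    fn index(&self, item: &[u8], i: u64) -> usize {
        // Derive k probe positions by salting one hasher per probe.
        let mut h = DefaultHasher::new();
        i.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    pub fn insert(&mut self, item: &[u8]) {
        for i in 0..self.k {
            let idx = self.index(item, i);
            self.bits[idx] = true;
        }
    }

    /// May return a false positive, never a false negative -- so the
    /// responder sends every change the filter does NOT contain, plus
    /// at worst a few redundant ones.
    pub fn maybe_contains(&self, item: &[u8]) -> bool {
        (0..self.k).all(|i| self.bits[self.index(item, i)])
    }
}
```

The no-false-negative property is what makes the extra round unnecessary: anything the filter misses is definitely needed, and the occasional false positive just means one redundant fetch.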

Negentropy's exactness is overrated for Willow's case: the spec already says "after reconciliation, missing events are fetched via the existing event-fetch path, not inline." A handful of bloom false-positives = a handful of extra fetches = no observable difference. Automerge's approach is also a serious option, particularly because Willow's events form a DAG with parent hashes (prev_hash) — a heads-based protocol is a more natural fit than treating events as a flat sorted set.

4. Git's pack protocol — the "have/want" baseline

Git's smart-HTTP pack-protocol is the maturity benchmark for this problem space: 20 years of production deployment. It uses negotiation rather than fingerprints — client says "want X", server says "have Y", they walk back through history until they find common ancestors, server builds a packfile.

Properties: not logarithmic (linear in commits walked during negotiation), but cheap in practice because it walks from the tip and stops at the first common ancestor. It's DAG-aware (uses parent edges, like Willow's prev_hash), it's exact, and it's trivially adversary-resistant because it's bounded by what each side actually claims to have.

Worth a paragraph in the spec explaining why Willow rejects this approach. The natural answer is "doesn't compose with iroh gossip routing" but that's not in the spec.

5. Delta-state CRDTs — formal bandwidth bounds

The Almeida/Shoker/Baquero delta-CRDT line gives formal bandwidth-bounded sync for causally-ordered updates: bandwidth proportional to delta, not state, with provable convergence and proven anti-entropy semantics. Willow's events are causally ordered (parent-hash DAG), so this framework applies directly. Negentropy gives empirical bandwidth properties; delta-CRDT theory gives guarantees.

This is more "pick up techniques from" than "adopt wholesale" — but the spec frames Negentropy as if there's no formal alternative. There is.

6. Matrix federation — the cautionary tale

For completeness: Matrix's federation backfill + state-resolution v2 reconciles diverged DAGs. It is infamously difficult to implement correctly (multiple homeserver implementations have produced divergent room states from the same inputs). The lesson for Willow: protocols that try to do exact reconciliation of authority-bearing DAGs (state events, in Matrix; permission/role events, in Willow) accumulate edge cases. Authority events should sync via a different, simpler path than chat events. The spec collapses both into one Negentropy session via SyncFilter. That is dangerous — and the spec's note that "structural events ignore the channel filter so structure is always fully reconciled" is hinting at this without addressing it.

Bottom-line recommendation

Replace this spec with a comparison memo, then pick one of two paths:

Path A (recommended): EBT-shaped sync over iroh. Per-author vector-clock exchange + ordered seq streaming. Solves the actual workload (clients reconnecting after offline periods, worker↔worker replication of per-author chains) in 1 RTT and O(authors) state. No timestamp adversary, no sort-key design, no bucketing, no fingerprints. Roughly 30% of the implementation surface of Negentropy and a much smaller test matrix. Use a separate, simpler path for the small set of authority events (or fold them into the same per-author streams, since they are per-author by construction).
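The 1-RTT shape of Path A can be sketched as follows. All names here (`Event`, `AuthorId`, `events_above_frontier`, the `(author, seq)`-keyed store) are hypothetical illustrations, not Willow's actual `HeadsSummary`/`EventDag` types:

```rust
use std::collections::{BTreeMap, HashMap};

type AuthorId = String;

#[derive(Clone, Debug, PartialEq)]
struct Event {
    author: AuthorId,
    seq: u64,
}

/// One-RTT per-author vector exchange: the requester sends its frontier
/// (max seq it holds per author); the responder streams back every event
/// above that frontier, in (author, seq) order thanks to the BTreeMap key.
fn events_above_frontier(
    store: &BTreeMap<(AuthorId, u64), Event>,
    frontier: &HashMap<AuthorId, u64>,
) -> Vec<Event> {
    store
        .values()
        // Authors absent from the frontier are unknown to the requester,
        // so their whole chain is sent.
        .filter(|e| frontier.get(&e.author).map_or(true, |&have| e.seq > have))
        .cloned()
        .collect()
}
```

The O(authors) state claim is visible in the request shape: the frontier is one `u64` per author, regardless of how many events each chain holds.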

Path B: reuse iroh-docs. If the team wants range-based reconciliation specifically, the algorithm is already in the dependency tree. Negotiate the schema impedance mismatch with the iroh team or adapt Willow's EventKind to fit. Avoids reimplementing a known-tricky protocol.

What this PR currently is: adopting a third protocol from the Nostr ecosystem when neither (a) the data model (per-author chains, not flat event sets) nor (b) the dependency graph (iroh's own range-based sync is already present) suggests Nostr's choice is the right one for Willow.

Open question #3 ("can we short-circuit with a single max_seq_per_author vector exchange before opening a negentropy session, falling back to negentropy only when seq gaps exist?") is the spec quietly noticing this itself. The honest read: if the per-author fast path handles the common case, what cases does Negentropy actually still serve? If the answer is "none in production," the fallback isn't a fallback — it's the whole protocol, and it's EBT.


Generated by Claude Code

intendednull pushed a commit that referenced this pull request Apr 25, 2026
Apply review decisions to the relay capability document spec:

- Promote signing to v1 MUST (inline signature, RFC 8785 JCS canonical
  bytes, signature field excluded from canonicalisation).
- Specify dispatch surgery: explicit branch in dispatch_connection for
  /.well-known/willow plus OPTIONS preflight; reuse BOOTSTRAP_IO_TIMEOUT
  and MAX_CONCURRENT_BOOTSTRAP_CONNECTIONS; extend (not mirror) the
  handle_bootstrap_connection pattern.
- Drop event_schema_range (no EVENT_SCHEMA_VERSION exists in
  willow-state); list as future work.
- Resolve multi-tenant question: one shared doc per host, relay is
  topic-agnostic.
- Soften operator-metadata leakage: version is coarse semver, software
  is project name, both MAY be omitted.
- Two-tier caching by status: ok=300s, degraded/read_only=5s with
  must-revalidate.
- Recommend WS clients also send Sec-WebSocket-Protocol; JSON is
  advisory pre-connect.
- Fix port framing: relay binds one port multiplexing TCP+WS, not two.
- Drop sync_provider_only (operator vibes without a concrete
  pre-handshake check).
- Add Cross-spec coordination table pinning feature tags for #214,
  #216, #217, #218, #219, #220, #221.
- Rewrite Open Questions to keep only genuinely-open items (paid-relay
  semantics, utilisation telemetry, relay discovery, feature registry).

https://claude.ai/code/session_01XmbVXWnKTRVjPp9kmKRSBn
claude and others added 4 commits April 25, 2026 08:16
- Reframe as consolidation of existing HeadsSummary-based worker sync
  with the legacy gossip-level state-hash dump (cite the in-tree TODO
  at listeners.rs:292-297) instead of "introduce a new per-author
  vector exchange". The worker path already does this; the novelty is
  hoisting the same protocol to the gossip path.
- Replace the proposed HashMap<EndpointId, u64> request shape with
  reuse of the existing HeadsSummary { heads: BTreeMap<EndpointId,
  AuthorHead { seq, hash }> } so we keep the head hash for free fork
  detection via compare_chains.
- Drop the bogus "EventStore trait gains methods" framing — no such
  trait exists in willow-state. Describe the change as adding a
  small known_authors helper to the existing EventDag and
  StorageEventStore concrete types; defer trait extraction.
- Use String for server_id and channel IDs (matching EventKind) and
  call out that ServerId / messaging::ChannelId newtypes are NOT the
  types in use.
- Fix line citations: per-author seq check at dag.rs:146-160 (not
  event.rs:190-194); timestamp_hint_ms doc at event.rs:216-217 (not
  202-203); SyncProvider at event.rs:23.
- Reference apply_incremental (public) and EventDag::insert as the
  apply path; note apply_event is private.
- Mark SyncProvider gating as PROPOSED (not current) — neither worker
  role nor gossip path checks it today.
- Acknowledge the existing idx_events_author_seq index and propose
  adding a server-prefixed variant via a new migration rather than
  pretending the index is new.
- Clarify that WireMessage::SyncRequest/SyncBatch (gossip) and
  WorkerRequest::Sync/WorkerResponse::SyncBatch (worker) are TWO
  separate code paths both touched by this spec; the gossip payload
  shape changes, the worker payload doesn't.
- Note current MessageType only allocates slots 0-6; defer adding a
  dedicated Sync slot.
- Fix test-tier locations: state tests in sync.rs (not the
  nonexistent store.rs); wire round-trip tests inline in wire.rs (not
  the nonexistent transport/src/tests.rs); multi-peer convergence as
  client crate test against MemNetwork per CLAUDE.md test-tier rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Per-envelope budget now sized to 64 KiB gossip cap (not 256 KB MAX_DESER_SIZE)
- events_since / heads_summary() correctly attributed to dag.rs (not sync.rs)
- Storage shape claim corrected to Vec<EventHash> with skip-based scan
- sync_since query plan description updated for OR-fanout / unknown-authors branch
- Asymmetry note: requester-known authors we don't have are ignored
- try_insert_event referenced as actual client entry point

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lation - round 4

- Per-envelope budget: 64 KiB minus small constant (~200B); dropped wrong "~57-60 KiB usable"
- Migration: don't bump PROTOCOL_VERSION; use additive SyncRequestV2/SyncBatchV2 variants for soft rollout
- SyncRequest gains request_id (matched to worker path's String for consolidation)
- Worker path: only `more` added; outer WorkerWireMessage::Response.request_id reused
- Index claim downgraded: NOT IN disjunct still requires server-scan; recommend restructuring sync_since to use explicit per-author predicates
- HistorySyncComplete framed as defined by spec #214 (unmerged); SyncCompleted relationship spelled out
- Author-count threshold corrected to ~900
- listeners.rs MAX_SYNC_BATCH_SIZE = 10_000 acknowledged; defense-in-depth retained
- Line cites tightened across multiple sections

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intendednull intendednull merged commit 56880b2 into main Apr 26, 2026
5 checks passed
@intendednull intendednull deleted the claude/spec-negentropy-sync branch April 26, 2026 07:26