Skip to content

spec: machine-readable rejection reasons on wire protocol#216

Merged
intendednull merged 4 commits into
mainfrom
claude/spec-error-prefixes
Apr 26, 2026
Merged

spec: machine-readable rejection reasons on wire protocol#216
intendednull merged 4 commits into
mainfrom
claude/spec-error-prefixes

Conversation

@intendednull
Copy link
Copy Markdown
Owner

Part of a set of 8 specs drawing lessons from Nostr's protocol and ecosystem. Use this PR to discuss the design — not proposing implementation, only the spec.

What & why

Nostr's NIP-01 defines machine-readable prefixes on OK/CLOSED messages (duplicate:, rate-limited:, invalid:, auth-required:, restricted:, ...) so clients can react programmatically — retry on rate-limited:, prompt for auth on auth-required:, silently drop on duplicate:. Willow currently has no equivalent; rejection handling is ad-hoc string-matching or silent drops.

Since Willow's wire format is bincode rather than JSON, this spec proposes a typed enum (WireRejectReason) instead of string prefixes — safer, still extensible via #[non_exhaustive]. New MessageType::Reject { reason, context } variant, context carrying event hash / topic id / envelope seq. 16 variants justified by concrete rejection sites in willow-state (InsertError, check_permission), willow-transport (UnsupportedVersion, size cap), willow-identity (InvalidSignature, PeerMismatch), and willow-relay (topic validation, MAX_TOPICS, connection cap).

Spec file: docs/specs/2026-04-24-error-prefixes.md

Open questions for review

  1. Signing: does the Reject envelope need a signature, or is the raw-transport-level delivery trust enough?
  2. Correlation id / send-id to link a Reject to the specific send that triggered it
  3. PermissionDenied(Permission) — does naming the specific permission leak information?
  4. Who enforces rate-limit? Relay only, or also worker nodes?
  5. ACK symmetry — do successful inserts also emit an explicit MessageType::Ack?
  6. Should rejected events feed a structured audit log exposed in the capability doc?

Composition with sibling specs

  • Relay capability doc: advertise supported reject variants
  • History sync EOSE: HistorySyncFailed could be a Reject variant
  • Epoch rotation: rejection of bad-epoch RotateChannelKey

Commit is unsigned due to harness signing backend failure (same as sibling PRs in this set).


Generated by Claude Code

Adds a design spec for a typed WireRejectReason enum carried in a new
MessageType::Reject envelope, inspired by Nostr NIP-01's OK/CLOSED
prefix convention. Maps every current rejection site in willow-state,
willow-transport, willow-identity, and willow-relay to a canonical
reason variant so clients can programmatically retry, re-auth, chunk,
or silence rejections instead of matching on free-form strings.

Co-authored-by: Claude <noreply@anthropic.com>

NIP-01 solves this for Nostr by prefixing OK/CLOSED messages with a
single-word machine-readable tag (`duplicate:`, `pow:`, `blocked:`,
`rate-limited:`, `invalid:`, `restricted:`, `mute:`, `auth-required:`,
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auth-required: is from NIP-42 (auth), not NIP-01. The NIP-01 OK/CLOSED list is 8 prefixes: duplicate, pow, blocked, rate-limited, invalid, restricted, mute, error. Either drop auth-required: from this sentence or footnote that it comes from NIP-42 — otherwise readers cross-checking will be confused.


Generated by Claude Code

produce the type without depending on `willow-state`:

```rust
#[non_exhaustive]
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two concerns about this enum living in willow-transport:

  1. The comment on PermissionDenied(Permission) says "re-exported from willow-state", but that inverts the dependency graph from CLAUDE.md: willow-transport is at the bottom, with willow-state and willow-identity above it. Importing Permission from willow-state into willow-transport would create a cycle (state already depends transitively on transport via identity). Options: (a) move Permission into a new bottom-of-stack crate (willow-types?), (b) keep WireRejectReason in willow-state and accept that willow-transport/willow-identity callers can only produce a string-typed reason, or (c) carry permission: u8 here with the mapping table living in willow-state.

  2. Invalid(String) and Restricted(String) undermine the "machine-readable, never matched on" discipline. The mapping table at line 122–137 already carries five distinct sub-cases as Invalid("…") strings (duplicate genesis, first-event-not-CreateServer, missing governance dep, peer_id mismatch, serde failure). Either promote the load-bearing ones to typed variants (DuplicateGenesis, NotGenesis, MissingGovernanceDep, PeerIdMismatch) or be explicit that those strings are advisory and clients must not match on them — otherwise a relay-forwarded "duplicate genesis" string becomes a de-facto API.


Generated by Claude Code

Capacity, // relay MAX_TOPICS or similar cap hit
ServerError, // generic last-resort
}
```
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No Unauthenticated / BadEnvelope distinct from SignatureInvalid? Today IdentityError has InvalidSignature, PeerMismatch, Serde, and PublicKeyDecode as separate variants — collapsing all of them into SignatureInvalid + Invalid("serde: …") loses the PublicKeyDecode and PeerMismatch distinctions that the existing code carefully maintains. PeerMismatch in particular is a spoofing signal (different code path from a corrupted-bytes InvalidSignature) and the receiver may want to treat it differently (e.g. score the sender for abuse).

Also: there's no variant for "I don't know who you are" / connection-level reject. Useful for the future case where the relay grows its own auth.


Generated by Claude Code

Topic(TopicId),
Envelope, // predates any event hash
}
```
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RejectContext::Envelope without a hash makes correlation impossible — open question 2 already flags this, but it's worth deciding here rather than punting. Without a send_id the Envelope arm is essentially "something you sent recently failed", which the client can't act on (which event do I retry? which UI bubble do I flip?). Suggest either (a) require all outbound envelopes to carry a u64 send_id from the start so this variant is well-defined, or (b) drop Envelope and accept that pre-event-hash failures are local-only (the spec already does this for the iroh receive error).


Generated by Claude Code

| Relay `MAX_TOPICS` cap reached | [`relay/lib.rs:398`](../../crates/relay/src/lib.rs) | `Capacity` |
| Relay connection cap reached | [`relay/lib.rs:155`](../../crates/relay/src/lib.rs) | `RateLimited { retry_after_ms }` |
| Relay not granted `SyncProvider` | (future history-serve guard) | `NotSyncProvider` |
| iroh gossip receive error | [`network/iroh.rs:164`](../../crates/network/src/iroh.rs) | `ServerError` (local-only; not sent) |
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few mapping-table accuracy nits:

  • dag.rs:19 is the doc comment for InvalidSignature, not NotGenesis (which is at line 20 of the variant list). Same for dag.rs:36/37DuplicateGenesis is line 36, MissingGovernanceDep line 38. These are off-by-one.
  • The "Relay connection cap reached" → RateLimited mapping at relay/lib.rs:155 (actually line 156–164 for the bootstrap cap and 337–347 for the proxy cap) is currently a drop(stream) with no response sent at all — there's no transport channel for sending a Reject back at this layer because the connection itself is being refused. Worth noting this is "future work" or moving the entry to the local-only column.
  • materialize.rs:24 is the Rejected(String) variant of ApplyResult, but ApplyResult is internal to the state crate — it never crosses the wire today. The mapping is correct in spirit (this is where the strings come from), but worth saying explicitly that this requires plumbing ApplyResult::Rejected outward through the client/network layer, which is a real (non-trivial) change.

Generated by Claude Code

`human = Some("unknown reason variant")`, logs the raw bytes at
`debug!`, and leaves the original outbound event in its retry
queue — safe default, because a newer peer that accepts is still
reachable elsewhere in the mesh.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"bincode encodes enums with a u32 discriminant. A receiver that hits an unknown discriminant fails unpack_envelope cleanly" — this is the load-bearing claim of the section, and it's actually the opposite of forward compatibility. A v1 receiver that sees a v2 variant cannot read any of the surrounding fields (context, human) because the enum decode failure aborts the whole RejectPayload decode at crates/transport/src/lib.rs:198. The "leaves the original outbound event in its retry queue" mitigation is fine, but the information in the new variant — most critically a RateLimited.retry_after_ms from a newer peer — is lost.

Strongly consider: WireRejectReason::Unknown { tag: u16, payload: Vec<u8> } or a top-level RejectPayload { tag: u16, … } where tag is parsed first and reason is Option<WireRejectReason> (decoded best-effort). That gives you the actual NIP-01 property — old peers can render the prefix even if they don't understand the variant.

Also worth noting: the spec also says PROTOCOL_VERSION is reserved for "breaking" changes, but adding a variant that older peers can't decode is a breaking wire change under the current bincode setup. The two bullets contradict each other.


Generated by Claude Code

subscriber`. Metrics counters key on the enum's `discriminant()` for
a flat histogram of rejection causes over time. Operators get
"how many `RateLimited` per minute?" for free, with no string
parsing.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::mem::discriminant() does not implement Hash directly across serde boundaries — it's not stable across compilations and not a counter key. For metrics you want either an explicit fn name(&self) -> &'static str on WireRejectReason (cheap, stable, log-friendly) or #[repr(u8)] with explicit discriminants. As written this won't compile against typical metrics / prometheus counter APIs.


Generated by Claude Code

Comment thread docs/specs/2026-04-24-error-prefixes.md Outdated
3. Is `PermissionDenied` leaking too much to an untrusted relay?
Telling the rejector which `Permission` they lack is fine;
telling a third party could help an attacker enumerate roles.
Route: relays forward rejections verbatim, clients filter.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Route: relays forward rejections verbatim, clients filter" is the wrong default. A misbehaving relay or client peer can spam fabricated PermissionDenied(KickMembers) / PermissionDenied(ManageRoles) rejections at unrelated peers to enumerate who holds which permission against state they haven't seen yet. Better default: a Reject payload sent to a peer who is not the original event author MUST coarsen to Restricted("permission denied"). The original author's client can be told the typed Permission because they already know what they tried to do; nobody else needs that detail.


Generated by Claude Code

Copy link
Copy Markdown
Owner Author

@intendednull intendednull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid framing — the typed-enum-over-string-prefix justification is well argued and the mapping table is the right shape. A few load-bearing issues to resolve before this becomes implementable, plus several gaps in the rejection-site survey.

Strengths

  • Picking a typed enum over Nostr's strings is the correct call given Willow's bincode wire format. Compile-time exhaustive matching at the client is a real win the Nostr ecosystem cannot get.
  • The human: Option<String> escape hatch for free-form Display output is a nice middle ground — it preserves operator/log fidelity without re-introducing string-matching semantics.
  • Splitting RejectContext::{Event, Topic, Envelope} correctly recognises that some rejections fire before any event hash exists. Good instinct.
  • Tests section names the right three layers (produced / serialised / consumed) and the forward-compat synthetic-discriminant test is exactly the kind of thing usually forgotten.

Concerns

1. The crate placement creates a dependency cycle

The spec puts WireRejectReason in willow-transport but has the PermissionDenied(Permission) variant import Permission from willow-state. Per crates/state/Cargo.toml and the dep graph in CLAUDE.md, willow-state already depends on willow-transport — so re-exporting Permission from transport is a cycle.

Three options: (a) move Permission down into willow-transport (cleanest, but Permission is genuinely a state-machine concept); (b) keep WireRejectReason in willow-transport but make the variant PermissionDenied(u8) with a discriminant table maintained in willow-state; (c) put the enum in a new willow-wire-types leaf crate that both state and transport can depend on. I'd push for (c) — it also gives you a natural home for RejectContext which already wants EventHash (state) + TopicId (network).

2. Variant coverage misses real rejection sites

Grepping Err( across the workspace turns up several peer-visible rejection paths the 16-variant table doesn't account for:

  • CryptoError::DecryptionFailed (crates/crypto/src/lib.rs:401, also returned from open_content at line 296). When a client receives a Content::Encrypted it can't decrypt — wrong epoch, missing channel key, tampered ciphertext — this is recoverable by requesting the key, not by retrying the send. Needs its own variant or it gets bucketed into the useless Invalid("…").
  • CryptoError::RatchetCounterOutOfRange { claimed, max } (crates/crypto/src/lib.rs:279). Specifically interesting because PR #220 (epoch rotation) makes this a first-class case the client must distinguish from DecryptionFailed.
  • StoreError::DuplicateId (crates/messaging/src/store.rs:97). Different semantic from InsertError::Duplicate — this is the message store, not the DAG. Today both collapse to Duplicate in the table; that's probably fine, but call it out.
  • WorkerResponse::Denied { reason } (crates/storage/src/role.rs:74,82, crates/replay/src/role.rs:274,321) is a peer-visible rejection (worker → client) and isn't in the table at all. Concrete cases today: "unknown server", "history not served", "query failed". These map to a mix of Invalid, NotSyncProvider, and ServerError, but the spec needs to say which.
  • ApplyResult::Rejected("proposal not found") (crates/state/src/materialize.rs:161) and Rejected("author is not a member") (line 494) are real rejection paths the spec lumps under "PermissionDenied/Restricted depending on cause" without saying which.
  • IdentityError::PeerMismatch is mapped to Invalid("peer_id mismatch") but per crates/identity/src/lib.rs:78–84 the error carries structured claimed/signer fields. Stuffing them into a string discards the structure the original error went out of its way to preserve. Worth a PeerMismatch { claimed, signer } variant.

Also: EventKind is 22 variants per crates/state/src/event.rs:68 (the doc comment says so explicitly), not 17 — the spec's framing implies a closer mapping than exists. Worth a sweep.

3. The #[non_exhaustive] + bincode forward-compat story is wrong

The spec claims "bincode encodes enums with a u32 discriminant. A receiver that hits an unknown discriminant fails unpack_envelope cleanly". This is misleading. Bincode returns a generic ErrorKind::InvalidValue / ErrorKind::Custom("invalid value: integer N, expected variant index 0..=K") indistinguishable from "the bytes were corrupt mid-stream". You cannot programmatically tell "this is a future variant I should ignore" from "the message is corrupted and I should drop the connection".

#[non_exhaustive] only helps the Rust compiler force a wildcard arm on the consumer; it does nothing for the bincode wire. If the goal is true forward compat, the spec needs an explicit discriminant prefix on the wire (e.g. wrap the reason in RejectPayload { reason_tag: u32, reason_body: Vec<u8>, ... } so unknown tags can be skipped without failing the whole envelope), or accept that adding a variant is a breaking wire change and bump PROTOCOL_VERSION. Either is fine, but the spec needs to pick one and stop hand-waving.

4. Authority on Reject envelopes — answer must be "yes, signed"

Open question 1 frames signing as optional ("Probably yes"). Given how Willow gossip works (crates/network/src/iroh.rs, broadcast topic), any peer subscribed to the same topic can publish a MessageType::Reject with a forged RejectContext::Event(victim_hash). An attacker doesn't need to be the relay — they only need to be in the gossip mesh. Without signing, a single malicious peer can:

  • Pop up ui.prompt_reauth() modals on every send (DoS the UX)
  • Trigger spurious sync.request_history(context) storms via fake ParentHashMismatch
  • Suppress legitimate sends by faking PermissionDenied(SendMessages) against another peer's hash

The spec should mandate signing of the RejectPayload by the rejecting peer's identity, and clients should drop unsigned Rejects or Rejects from peers whose claim of authority over the topic isn't credible (e.g., not the relay's EndpointId and not a SyncProvider-granted member). This isn't an open question — it's a load-bearing decision and the spec should make it.

5. Correlation is a real blocker for RejectContext::Envelope

For Event(hash) rejections this is fine — the event already has a content hash before send, and the client can keep an outbox keyed by hash. But envelope-level rejections (UnsupportedVersion, PayloadTooLarge) fire before the inner payload deserialises, so the rejecter doesn't have an event hash to echo back. The spec's RejectContext::Envelope is just a unit variant — clients have no way to know which outbound envelope this Reject corresponds to.

Open question 2 proposes a send_id: u64 stamp. Endorsing that: a 64-bit per-connection-monotonic send-id added to Envelope is cheap, would make all three context variants correlatable, and is a one-time wire change that pairs naturally with bumping PROTOCOL_VERSION for MessageType::Reject introduction anyway. Recommend rolling it into this spec rather than punting.

6. ACK symmetry — current state is "fire and pray"

Open question 5: today there is no application-level ACK. The client treats successful TopicHandle::broadcast (crates/network/src/iroh.rs) as success, but iroh-gossip only confirms local submission to the gossip layer, not that any peer received or accepted the event. The "no reject within N seconds" heuristic is unworkable on a partitioned mesh — the only honest signal of acceptance today is seeing your own event echoed back via gossip.

Recommendation: defer MessageType::Ack but call out explicitly that without it, Reject is negative-only. The client UX must continue to use "saw my event in the ordered stream" as the success signal. Otherwise users will get UIs that show "delivered" the moment a single peer accepted, which is misleading on a P2P mesh.

7. Rate-limit ownership

Open question 4: the table only assigns RateLimited to the relay's connection cap. But per crates/storage/src/role.rs:74 and crates/replay/src/role.rs:321, worker nodes already have a WorkerResponse::Denied path and could trivially throttle expensive History/Sync requests. Recommend the spec say: any node CAN emit RateLimited; relay does it for transport-layer load, workers do it for query-layer load. Currently neither has actual rate-limit infrastructure — the spec should note that this is a prerequisite, not a side-effect.

8. PermissionDenied(Permission) leak — a position

Spec asks if naming the permission leaks info. Position: leakage is real but small, and the alternative is worse.

  • The signed event already carries (author, kind), so any third party in the gossip mesh already knows what was attempted. The rejecter merely confirms what's structurally derivable.
  • The exception is SyncProvider and CreateInvite — these are not bound to specific event kinds in the same 1:1 way (crates/state/src/materialize.rs:266-300), so a PermissionDenied(SyncProvider) against a sync-attempt does narrowly confirm "this peer is not a sync provider on this server", which an attacker probing for relay topology might find useful.
  • Mitigation: collapse permission disclosure to a coarser bucket for the wire (PermissionDenied(Coarse) where Coarse ∈ {Read, Write, Manage}) and let the local human string carry the precise permission for the rejected peer's own log. The third party sees "denied, write tier"; the rejected peer sees "you lack SendMessages".

9. ParentHashMismatch { expected, actual } — lower risk than the spec implies

Per crates/state/src/dag.rs:166-172, expected is the offending author's own current head in the rejecter's view, not a global state hash. Echoing it back tells the sender "I think you're at hash X" — which is exactly the information the sender needs to recover, and which a sync request would have to disclose anyway. Not a leak.

10. Sibling-PR consistency

  • PR #214 (history-sync EOSE) mentions HistorySyncFailed as a future Reject variant. The spec should at minimum reserve a HistorySyncFailed variant or pin down whether Invalid("history…") is the canonical encoding — otherwise both specs ship and we end up with two ways to say the same thing.
  • PR #220 (epoch rotation) explicitly mentions "rejection of bad-epoch RotateChannelKey" composing with this spec. There's no WrongEpoch { expected, actual } variant. Either add it now or note that Invalid("epoch: …") is the agreed encoding.
  • PR #215 (capability doc) suggests advertising supports_machine_readable_errors: bool. Spec should reciprocate by mentioning how a client behaves against a relay that doesn't support MessageType::Reject (silent old behaviour, presumably).

11. Minor

  • managed_dag.rs mentioned in the review brief — actual file is crates/state/src/managed.rs. Spec doesn't reference either, just calling it out.
  • The iroh gossip receive error → ServerError (local-only; not sent) row at the bottom of the mapping table is correct but worth pulling into a top-level "what we explicitly do NOT send on the wire" section, alongside any decode failure on the inner payload (you literally cannot send a Reject to a peer whose bytes you couldn't decode, so the recovery has to be transport-level).
  • RejectPayload.human: Option<String> is unbounded — should have a max length cap (e.g. 256 bytes) enforced on serialization, since this is attacker-controlled data hitting the receiver's logger and possibly UI.

Suggestions

  1. Move WireRejectReason to a new willow-wire-types leaf crate to break the cycle with Permission.
  2. Add send_id: u64 to Envelope in this spec; it's a one-time breaking change that pays for itself in correlation.
  3. Mandate signing of RejectPayload by the rejecting peer; resolve open question 1 firmly as "yes".
  4. Add WrongEpoch, DecryptionFailed, RatchetCounterOutOfRange, and HistorySyncFailed variants to align with sibling specs and cover the missing rejection sites.
  5. Pick an explicit forward-compat strategy: either tag-prefixed reasons (skip-unknown) or accept that variant additions bump PROTOCOL_VERSION. Don't claim bincode handles this for you.
  6. Add a max-length cap on human.
  7. Resolve the PermissionDenied leak by sending only a coarse tier on the wire and putting the precise permission in human for the rejected peer's own consumption.

Happy to look at a v2.


Generated by Claude Code

Copy link
Copy Markdown
Owner Author

@intendednull intendednull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline notes on specific lines, paired with the top-level review.


Generated by Claude Code

Comment thread docs/specs/2026-04-24-error-prefixes.md Outdated
Duplicate,
Invalid(String),
RateLimited { retry_after_ms: u64 },
PermissionDenied(Permission), // re-exported from willow-state
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dependency cycle: WireRejectReason lives in willow-transport, but Permission is defined in willow-state (crates/state/src/event.rs:21), which already depends on willow-transport per the dep graph in CLAUDE.md. Re-exporting Permission "from willow-state" as the comment suggests is structurally impossible.

Recommend a new willow-wire-types leaf crate that both state and transport depend on. It would also be the natural home for RejectContext (which already wants both EventHash from state and TopicId from network).


Generated by Claude Code

TopicInvalid(String),
Capacity, // relay MAX_TOPICS or similar cap hit
ServerError, // generic last-resort
}
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variant gaps. The 16-variant set misses several real peer-visible rejection paths I found grepping Err( across the workspace:

  • CryptoError::DecryptionFailed (crates/crypto/src/lib.rs:296,401) — recoverable via key request, not by retry; needs its own variant.
  • CryptoError::RatchetCounterOutOfRange { claimed, max } (crates/crypto/src/lib.rs:279) — first-class case once PR spec: epoch-driven channel key rotation for PCS #220 (epoch rotation) lands.
  • WorkerResponse::Denied { reason } from storage/replay (crates/storage/src/role.rs:74,82, crates/replay/src/role.rs:274,321) — peer-visible worker→client rejections, not in the table at all.
  • IdentityError::PeerMismatch { claimed, signer } is mapped to Invalid("peer_id mismatch"), discarding the structured fields the original error preserves. Worth PeerMismatch { claimed, signer }.
  • ApplyResult::Rejected("proposal not found") (crates/state/src/materialize.rs:161) and Rejected("not a member") (line 494) are real paths the table lumps under "PermissionDenied/Restricted depending on cause" without specifying which.
  • Sibling PR spec: epoch-driven channel key rotation for PCS #220 mentions bad-epoch RotateChannelKey rejection; needs WrongEpoch { expected, actual }.
  • Sibling PR spec: history sync completion signal (EOSE-equivalent) #214 mentions HistorySyncFailed; either reserve the variant or pick the canonical encoding now.

Also: the spec implies a 17-variant EventKind mapping, but crates/state/src/event.rs:68 says explicitly "All possible state mutations — 22 variants". Worth a re-sweep.


Generated by Claude Code

Comment on lines +170 to +180
## Extensibility & versioning

- `#[non_exhaustive]` forces downstream `match` to carry a wildcard,
so adding a variant is never a SemVer break.
- bincode encodes enums with a `u32` discriminant. A receiver that
hits an unknown discriminant fails `unpack_envelope` cleanly; the
client treats the failed decode as a local `ServerError` with
`human = Some("unknown reason variant")`, logs the raw bytes at
`debug!`, and leaves the original outbound event in its retry
queue — safe default, because a newer peer that accepts is still
reachable elsewhere in the mesh.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This claim about #[non_exhaustive] + bincode forward compat is wrong. #[non_exhaustive] only forces a wildcard arm at the Rust compile site; it does nothing for the wire. Bincode does not "fail cleanly" with a useful error on an unknown discriminant — it returns a generic ErrorKind::InvalidValue / ErrorKind::Custom that is indistinguishable from corrupt bytes mid-stream. The receiver cannot programmatically tell "future variant, ignore" from "this whole envelope is junk, drop the connection".

If real forward compat is the goal, the spec needs an explicit length-prefixed body so unknown reason tags can be skipped without failing the envelope, e.g.:

struct RejectPayload {
    reason_tag: u32,           // canonical discriminant
    reason_body: Vec<u8>,      // bincoded variant body, skippable on unknown tag
    context: RejectContext,
    human: Option<String>,
}

Or accept that adding a variant is a wire-breaking change and bump PROTOCOL_VERSION. Either is fine; please pick one.


Generated by Claude Code

Comment thread docs/specs/2026-04-24-error-prefixes.md Outdated
Comment on lines +225 to +229
1. Should `RejectPayload` be authenticated? A relay that forges
`PermissionDenied` against a peer's legitimate event could
suppress that peer's messaging UI. Probably yes — sign the
payload with the rejecting peer's identity so clients can decide
whether to trust it.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to graduate from "open question" to "decided: yes, must be signed". Willow gossip is broadcast (crates/network/src/iroh.rs), so any peer subscribed to the topic — not just the relay — can publish a MessageType::Reject with a forged RejectContext::Event(victim_hash). Concrete attacks an unsigned Reject enables for a single mesh participant:

  • Pop ui.prompt_reauth() modals on every send (UX DoS via fake AuthRequired).
  • Trigger sync.request_history(context) storms via fake ParentHashMismatch.
  • Suppress legitimate sends by faking PermissionDenied(SendMessages) against another peer's hash.

Mandate: RejectPayload must be Ed25519-signed by the rejecting peer's identity (use the existing willow_identity::pack envelope at crates/identity/src/lib.rs:426). Clients drop unsigned Rejects, and the consume arm at line 153 should additionally weight Rejects by the signer's authority on the topic (relay endpoint vs. random member vs. SyncProvider-granted peer).


Generated by Claude Code

Comment thread docs/specs/2026-04-24-error-prefixes.md Outdated
Comment on lines +104 to +109
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum RejectContext {
Event(EventHash),
Topic(TopicId),
Envelope, // predates any event hash
}
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RejectContext::Envelope is a unit variant, so envelope-level rejections (UnsupportedVersion, PayloadTooLarge, the line 162 Invalid("deser: …") shape error) cannot be correlated with the offending send — the rejecter never decoded an event hash.

Endorsing open question 2's proposal: stamp a send_id: u64 on every Envelope (crates/transport/src/lib.rs:99) and echo it in the Reject. This is a one-time wire change that pairs naturally with bumping PROTOCOL_VERSION for MessageType::Reject = 7 introduction anyway, and it makes all three context variants correlatable. Suggest folding it into this spec rather than punting — without it, half the new Reject variants are useless to the client.

Also: human is unbounded attacker-controlled data hitting the receiver's logger and possibly UI. Cap it (e.g. 256 bytes) at serialization time.


Generated by Claude Code

Copy link
Copy Markdown
Owner Author

@intendednull intendednull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2: Comparative research (non-Nostr prior art)

Surveyed gRPC, RFC 7807/9457, Matrix, AMQP 0-9-1 and 1.0, TLS 1.3 alerts, and Cap'n Proto RPC. The spec lands in a defensible spot, but a few of the choices put it on the minority side of decisions every other protocol made differently. Worth interrogating before bincode locks the shape in.

1. The dominant shape is "tagged union + open dictionary," not "closed enum"

Almost every system I looked at separates a small machine-readable tag from an open extension bag:

System Tag Extension
gRPC Code enum (closed, ~17 values) google.rpc.Status.details: repeated AnyRetryInfo, ErrorInfo, BadRequest, QuotaFailure, PreconditionFailure, ResourceInfo, Help, LocalizedMessage
RFC 9457 type: URI arbitrary JSON extension members; clients MUST ignore unknown
Matrix errcode: "M_LIMIT_EXCEEDED" arbitrary sibling fields (retry_after_ms, soft_logout, room_version, …)
AMQP 1.0 error-condition: symbol ("amqp:not-found") info: map
Cap'n Proto Type enum of 4 values free reason: Text
TLS 1.3 fixed u8 AlertDescription none

Willow's WireRejectReason puts payload inside each variant (RateLimited { retry_after_ms }, PermissionDenied(Permission), SeqGap { expected, actual }). That's ergonomic Rust, but it means every new piece of metadata is a wire-format change. If next year you want retry_after_ms on Capacity too, or a quota_remaining on RateLimited, you mint a new variant.

The { tag: String, details: BTreeMap<String, Value> } shape — gRPC's design — decouples the two evolution axes. New context fields ride along on existing tags; new tags don't break old receivers because they fall through the wildcard arm. Spec already pays the wildcard tax via #[non_exhaustive], so the type-safety win over a string tag is smaller than it looks: callers must _ => either way.

A middle ground worth costing out:

pub struct RejectPayload {
    pub tag: WireRejectTag,            // small closed enum, just the tag
    pub details: RejectDetails,        // typed-but-optional fields
    pub human: Option<String>,
}
pub struct RejectDetails {
    pub retry_after_ms: Option<u64>,
    pub permission: Option<Permission>,
    pub expected_hash: Option<EventHash>,
    pub actual_hash: Option<EventHash>,
    pub limit: Option<u64>,
    pub actual: Option<u64>,
    // …
}

Adding a field is now backward-compatible at the bincode level (still risky — see #4). Adding a tag still requires the wildcard. You keep exhaustive matching where it matters (the tag) and lose the per-variant struct surface area.

2. Nobody binds the tag to a URI — and the only ones who tried (RFC 7807) regret it

RFC 7807/9457's type is a URI specifically so it's globally unique and dereferenceable to docs. In practice clients SHOULD NOT auto-dereference it and rarely do. The "human eventually" promise mostly doesn't pay off — Matrix's M_LIMIT_EXCEEDED-style namespaced strings are the de-facto winner because they're cheaper to type and just as unique-in-practice.

For Willow specifically: a closed Rust enum is already globally unique within the project. Going to URIs only helps if third-party Willow forks want to mint reasons without coordinating — which the spec's #[non_exhaustive] + bincode discriminants explicitly don't support. So URI tags would be cargo-culting RFC 9457's worst feature. Skip.

3. Why the "rich details" pattern fails in practice (and what would make Willow stick)

gRPC's biggest lesson is unflattering: most clients ignore Status.details entirely because it's Any-typed, requires a second protobuf import to unpack, and is inconsistently populated across server SDKs. RFC 7807's type field is similarly under-used — Spring Boot et al. surface title/detail to humans and hand-wave the URI. Matrix's errcode is the success story: clients do branch on it, because the spec lists every code, gives each one a single canonical meaning, and the human field is genuinely human (no JSON nested inside it).

What that suggests for Willow:

  • The mapping table is the most important asset in the spec. Keep it in-tree as the canonical source, not in prose. Generate the enum from it if feasible.
  • Resist Invalid(String) and Restricted(String). Those are exactly the "stringly reason" pattern this spec is supposed to retire. The "absorb future sub-categories" justification (lines 181–183) is the slippery slope — Matrix's error: "..." field is openly described as "for the user", not for branching, and it works. Willow's two String variants will silently accumulate ad-hoc categories that the matching client code can't enumerate. Either commit to a tag and add new tags as needed, or fold these into a generic Other { tag: String } so the stringliness is at least named.
  • One canonical retry_after_ms location. See #5.

4. TLS 1.3 alert-style brittleness vs. bincode-discriminant brittleness

TLS 1.3's fixed AlertDescription u8 is fast but strictly additive — old clients seeing a new alert just see decode_error (or worse, drop the connection). New alerts shipped in TLS 1.3 (missing_extension, certificate_required, no_application_protocol, …) are simply invisible to TLS 1.2 implementations. That's been acceptable for TLS because alerts are advisory; the connection is dead either way.

Bincode enum mismatch has the same property: a v2 sender emits Capacity (discriminant 14), a v1 receiver decoding into a 12-variant enum gets a deserialization failure on the whole envelope. Spec acknowledges this on line 178–183 and degrades gracefully to a synthetic ServerError. That's fine only if Reject payloads are advisory and never gate retries. The moment a client uses RateLimited.retry_after_ms to throttle, an unknown new "RateLimitedV2 { retry_after_ms, jitter_ms }" variant becomes a v1 retry-loop. Worth stating explicitly in the spec: receivers MUST default to "no retry, surface to user" on any decode failure, never to "retry immediately". Open-question 4 ("advisory or enforced retry_after") is downstream of this.

If long-term forward compat matters, the gRPC/Matrix style (small closed tag, optional typed details) survives a v1↔v2 mismatch because the tag space is small and stable; the details bag is ignorable. The current spec's per-variant payloads do not.

5. Cross-cutting retry primitive — gRPC vs HTTP teach a real lesson

gRPC encodes RetryInfo.retry_delay as a protobuf message inside trailing metadata. HTTP uses the Retry-After header. Different transports, but both kept the retry hint out of the status code itself so it could ride along with non-rate-limit errors too (overloaded backends, maintenance windows, quota refills).

Willow currently puts retry_after_ms only on RateLimited. But the mapping table already shows Capacity triggering at the relay's connection cap (line 139) and it's a perfectly valid retry case — the relay will have capacity in 30s. Same for ServerError during a deploy. Recommend hoisting retry_after_ms: Option<u64> to RejectPayload itself (or to the shared RejectDetails if you adopt #1), so any reason can carry a hint and the client backoff logic has one place to look:

ClientEvent::Rejected(payload) => {
    if let Some(ms) = payload.retry_after_ms { backoff.schedule(ms); }
    match payload.tag {}
}

This also kills the awkwardness of RateLimited { retry_after_ms } being the only variant whose presence in the enum implies a numeric field — every other variant is opaque.

Concrete recommendations

  1. Decide the tag/details split before merging. Either commit to per-variant payloads (current) and accept that wire-format evolution = new variants only, OR move payload to a shared RejectDetails and document the forward-compat envelope.
  2. Drop Invalid(String) and Restricted(String) in favor of named tags; if a stringly escape hatch is needed, name it Other { tag: String } so it's obviously the loose case.
  3. Hoist retry_after_ms out of RateLimited into the envelope so any reason can carry it.
  4. Add a normative line: receivers MUST treat unknown discriminants / decode failures as non-retryable.
  5. Don't go URI-typed. RFC 9457's worst feature.

What's actually fine as-is

  • Single human field, separate from machine tag (Matrix/RFC 9457 agree).
  • Decoupling RejectContext from WireRejectReason (gRPC does the same with RequestInfo).
  • Locating the type in willow-transport to avoid willow-state dep cycles.
  • #[non_exhaustive] on the enum.

Sources:


Generated by Claude Code

intendednull pushed a commit that referenced this pull request Apr 25, 2026
Apply review decisions to the relay capability document spec:

- Promote signing to v1 MUST (inline signature, RFC 8785 JCS canonical
  bytes, signature field excluded from canonicalisation).
- Specify dispatch surgery: explicit branch in dispatch_connection for
  /.well-known/willow plus OPTIONS preflight; reuse BOOTSTRAP_IO_TIMEOUT
  and MAX_CONCURRENT_BOOTSTRAP_CONNECTIONS; extend (not mirror) the
  handle_bootstrap_connection pattern.
- Drop event_schema_range (no EVENT_SCHEMA_VERSION exists in
  willow-state); list as future work.
- Resolve multi-tenant question: one shared doc per host, relay is
  topic-agnostic.
- Soften operator-metadata leakage: version is coarse semver, software
  is project name, both MAY be omitted.
- Two-tier caching by status: ok=300s, degraded/read_only=5s with
  must-revalidate.
- Recommend WS clients also send Sec-WebSocket-Protocol; JSON is
  advisory pre-connect.
- Fix port framing: relay binds one port multiplexing TCP+WS, not two.
- Drop sync_provider_only (operator vibes without a concrete
  pre-handshake check).
- Add Cross-spec coordination table pinning feature tags for #214,
  #216, #217, #218, #219, #220, #221.
- Rewrite Open Questions to keep only genuinely-open items (paid-relay
  semantics, utilisation telemetry, relay discovery, feature registry).

https://claude.ai/code/session_01XmbVXWnKTRVjPp9kmKRSBn
intendednull and others added 3 commits April 25, 2026 01:59
- fix relay/lib.rs:155 → :156 (Err arm of try_acquire_owned)
- fix identity/lib.rs:51 → :52 (InvalidSignature variant decl)
- cite transport/lib.rs:138 (pack) instead of :99 (Envelope) for bincode evidence
- call out the got → actual field rename vs existing internal types
- add three additional materialize.rs rejection rows (admin-only, missing proposal, RotateChannelKey non-member) and note the table is illustrative, not exhaustive
- singularize the "local-only" sentence (only one such row exists)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ia WireMessage - round 2

- WireRejectReason/RejectPayload now live in willow-common (transport→state cycle avoided)
- Re-targeted to a new WireMessage::Reject variant; described receiver-side wiring
- Mapping table line cite for MissingGovernanceDep: dag.rs:37 → :38
- Motivation: Identity::verify (no such method) → SignedMessage::verify

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… - round 4

- PR #214 reframed as co-proposed (not landed precedent); merge-order
  is independent, conflicts are trivial.
- PermissionDenied: explicitly part of the work to thread typed
  Permission through check_permission and InsertError, instead of
  parsing it back out of a formatted string.
- OQ1 dropped: the envelope is already authenticated via pack_wire,
  so the body is correct and the question was incoherent.
- target_peer: EndpointId added to RejectPayload; clients filter by
  self.endpoint_id() (mirrors VoiceSignal/JoinResponse/JoinDenied).
- Wording: "dropped silently" -> "logged-only and not signaled".
- Dep graph: state -> identity -> transport (transitive, not direct).
- TopicId is not actually re-exported by willow-network; spec now
  says so and uses [u8; 32] on the wire.
- PartialEq, Eq added to RejectPayload + RejectContext.
- Relay connection-cap mapped to Capacity (logged-only; semaphore
  has no retry_after_ms basis); RateLimited moved to future producers.
- Future-producer rows split into their own subsection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intendednull intendednull merged commit 5b65513 into main Apr 26, 2026
5 checks passed
@intendednull intendednull deleted the claude/spec-error-prefixes branch April 26, 2026 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants