spec: machine-readable rejection reasons on wire protocol#216
Conversation
Adds a design spec for a typed WireRejectReason enum carried in a new MessageType::Reject envelope, inspired by Nostr NIP-01's OK/CLOSED prefix convention. Maps every current rejection site in willow-state, willow-transport, willow-identity, and willow-relay to a canonical reason variant so clients can programmatically retry, re-auth, chunk, or silence rejections instead of matching on free-form strings. Co-authored-by: Claude <noreply@anthropic.com>
|
|
||
| NIP-01 solves this for Nostr by prefixing OK/CLOSED messages with a | ||
| single-word machine-readable tag (`duplicate:`, `pow:`, `blocked:`, | ||
| `rate-limited:`, `invalid:`, `restricted:`, `mute:`, `auth-required:`, |
There was a problem hiding this comment.
auth-required: is from NIP-42 (auth), not NIP-01. The NIP-01 OK/CLOSED list is 8 prefixes: duplicate, pow, blocked, rate-limited, invalid, restricted, mute, error. Either drop auth-required: from this sentence or footnote that it comes from NIP-42 — otherwise readers cross-checking will be confused.
Generated by Claude Code
| produce the type without depending on `willow-state`: | ||
|
|
||
| ```rust | ||
| #[non_exhaustive] |
There was a problem hiding this comment.
Two concerns about this enum living in willow-transport:
-
The comment on
PermissionDenied(Permission)says "re-exported fromwillow-state", but that inverts the dependency graph fromCLAUDE.md:willow-transportis at the bottom, withwillow-stateandwillow-identityabove it. ImportingPermissionfromwillow-stateintowillow-transportwould create a cycle (state already depends transitively on transport via identity). Options: (a) movePermissioninto a new bottom-of-stack crate (willow-types?), (b) keepWireRejectReasoninwillow-stateand accept thatwillow-transport/willow-identitycallers can only produce a string-typed reason, or (c) carrypermission: u8here with the mapping table living inwillow-state. -
Invalid(String)andRestricted(String)undermine the "machine-readable, never matched on" discipline. The mapping table at line 122–137 already carries five distinct sub-cases asInvalid("…")strings (duplicate genesis, first-event-not-CreateServer, missing governance dep, peer_id mismatch, serde failure). Either promote the load-bearing ones to typed variants (DuplicateGenesis,NotGenesis,MissingGovernanceDep,PeerIdMismatch) or be explicit that those strings are advisory and clients must not match on them — otherwise a relay-forwarded "duplicate genesis" string becomes a de-facto API.
Generated by Claude Code
| Capacity, // relay MAX_TOPICS or similar cap hit | ||
| ServerError, // generic last-resort | ||
| } | ||
| ``` |
There was a problem hiding this comment.
No Unauthenticated / BadEnvelope distinct from SignatureInvalid? Today IdentityError has InvalidSignature, PeerMismatch, Serde, and PublicKeyDecode as separate variants — collapsing all of them into SignatureInvalid + Invalid("serde: …") loses the PublicKeyDecode and PeerMismatch distinctions that the existing code carefully maintains. PeerMismatch in particular is a spoofing signal (different code path from a corrupted-bytes InvalidSignature) and the receiver may want to treat it differently (e.g. score the sender for abuse).
Also: there's no variant for "I don't know who you are" / connection-level reject. Useful for the future case where the relay grows its own auth.
Generated by Claude Code
| Topic(TopicId), | ||
| Envelope, // predates any event hash | ||
| } | ||
| ``` |
There was a problem hiding this comment.
RejectContext::Envelope without a hash makes correlation impossible — open question 2 already flags this, but it's worth deciding here rather than punting. Without a send_id the Envelope arm is essentially "something you sent recently failed", which the client can't act on (which event do I retry? which UI bubble do I flip?). Suggest either (a) require all outbound envelopes to carry a u64 send_id from the start so this variant is well-defined, or (b) drop Envelope and accept that pre-event-hash failures are local-only (the spec already does this for the iroh receive error).
Generated by Claude Code
| | Relay `MAX_TOPICS` cap reached | [`relay/lib.rs:398`](../../crates/relay/src/lib.rs) | `Capacity` | | ||
| | Relay connection cap reached | [`relay/lib.rs:155`](../../crates/relay/src/lib.rs) | `RateLimited { retry_after_ms }` | | ||
| | Relay not granted `SyncProvider` | (future history-serve guard) | `NotSyncProvider` | | ||
| | iroh gossip receive error | [`network/iroh.rs:164`](../../crates/network/src/iroh.rs) | `ServerError` (local-only; not sent) | |
There was a problem hiding this comment.
A few mapping-table accuracy nits:
dag.rs:19is the doc comment forInvalidSignature, notNotGenesis(which is at line 20 of the variant list). Same fordag.rs:36/37—DuplicateGenesisis line 36,MissingGovernanceDepline 38. These are off-by-one.- The "Relay connection cap reached" →
RateLimitedmapping atrelay/lib.rs:155(actually line 156–164 for the bootstrap cap and 337–347 for the proxy cap) is currently adrop(stream)with no response sent at all — there's no transport channel for sending aRejectback at this layer because the connection itself is being refused. Worth noting this is "future work" or moving the entry to the local-only column. materialize.rs:24is theRejected(String)variant ofApplyResult, butApplyResultis internal to the state crate — it never crosses the wire today. The mapping is correct in spirit (this is where the strings come from), but worth saying explicitly that this requires plumbingApplyResult::Rejectedoutward through the client/network layer, which is a real (non-trivial) change.
Generated by Claude Code
| `human = Some("unknown reason variant")`, logs the raw bytes at | ||
| `debug!`, and leaves the original outbound event in its retry | ||
| queue — safe default, because a newer peer that accepts is still | ||
| reachable elsewhere in the mesh. |
There was a problem hiding this comment.
"bincode encodes enums with a u32 discriminant. A receiver that hits an unknown discriminant fails unpack_envelope cleanly" — this is the load-bearing claim of the section, and it's actually the opposite of forward compatibility. A v1 receiver that sees a v2 variant cannot read any of the surrounding fields (context, human) because the enum decode failure aborts the whole RejectPayload decode at crates/transport/src/lib.rs:198. The "leaves the original outbound event in its retry queue" mitigation is fine, but the information in the new variant — most critically a RateLimited.retry_after_ms from a newer peer — is lost.
Strongly consider: WireRejectReason::Unknown { tag: u16, payload: Vec<u8> } or a top-level RejectPayload { tag: u16, … } where tag is parsed first and reason is Option<WireRejectReason> (decoded best-effort). That gives you the actual NIP-01 property — old peers can render the prefix even if they don't understand the variant.
Also worth noting: the spec also says PROTOCOL_VERSION is reserved for "breaking" changes, but adding a variant that older peers can't decode is a breaking wire change under the current bincode setup. The two bullets contradict each other.
Generated by Claude Code
| subscriber`. Metrics counters key on the enum's `discriminant()` for | ||
| a flat histogram of rejection causes over time. Operators get | ||
| "how many `RateLimited` per minute?" for free, with no string | ||
| parsing. |
There was a problem hiding this comment.
std::mem::discriminant() does not implement Hash directly across serde boundaries — it's not stable across compilations and not a counter key. For metrics you want either an explicit fn name(&self) -> &'static str on WireRejectReason (cheap, stable, log-friendly) or #[repr(u8)] with explicit discriminants. As written this won't compile against typical metrics / prometheus counter APIs.
Generated by Claude Code
| 3. Is `PermissionDenied` leaking too much to an untrusted relay? | ||
| Telling the rejector which `Permission` they lack is fine; | ||
| telling a third party could help an attacker enumerate roles. | ||
| Route: relays forward rejections verbatim, clients filter. |
There was a problem hiding this comment.
"Route: relays forward rejections verbatim, clients filter" is the wrong default. A misbehaving relay or client peer can spam fabricated PermissionDenied(KickMembers) / PermissionDenied(ManageRoles) rejections at unrelated peers to enumerate who holds which permission against state they haven't seen yet. Better default: a Reject payload sent to a peer who is not the original event author MUST coarsen to Restricted("permission denied"). The original author's client can be told the typed Permission because they already know what they tried to do; nobody else needs that detail.
Generated by Claude Code
intendednull
left a comment
There was a problem hiding this comment.
Solid framing — the typed-enum-over-string-prefix justification is well argued and the mapping table is the right shape. A few load-bearing issues to resolve before this becomes implementable, plus several gaps in the rejection-site survey.
Strengths
- Picking a typed enum over Nostr's strings is the correct call given Willow's bincode wire format. Compile-time exhaustive matching at the client is a real win the Nostr ecosystem cannot get.
- The
human: Option<String>escape hatch for free-formDisplayoutput is a nice middle ground — it preserves operator/log fidelity without re-introducing string-matching semantics. - Splitting
RejectContext::{Event, Topic, Envelope}correctly recognises that some rejections fire before any event hash exists. Good instinct. - Tests section names the right three layers (produced / serialised / consumed) and the forward-compat synthetic-discriminant test is exactly the kind of thing usually forgotten.
Concerns
1. The crate placement creates a dependency cycle
The spec puts WireRejectReason in willow-transport but has the PermissionDenied(Permission) variant import Permission from willow-state. Per crates/state/Cargo.toml and the dep graph in CLAUDE.md, willow-state already depends on willow-transport — so re-exporting Permission from transport is a cycle.
Three options: (a) move Permission down into willow-transport (cleanest, but Permission is genuinely a state-machine concept); (b) keep WireRejectReason in willow-transport but make the variant PermissionDenied(u8) with a discriminant table maintained in willow-state; (c) put the enum in a new willow-wire-types leaf crate that both state and transport can depend on. I'd push for (c) — it also gives you a natural home for RejectContext which already wants EventHash (state) + TopicId (network).
2. Variant coverage misses real rejection sites
Grepping Err( across the workspace turns up several peer-visible rejection paths the 16-variant table doesn't account for:
CryptoError::DecryptionFailed(crates/crypto/src/lib.rs:401, also returned fromopen_contentat line 296). When a client receives aContent::Encryptedit can't decrypt — wrong epoch, missing channel key, tampered ciphertext — this is recoverable by requesting the key, not by retrying the send. Needs its own variant or it gets bucketed into the uselessInvalid("…").CryptoError::RatchetCounterOutOfRange { claimed, max }(crates/crypto/src/lib.rs:279). Specifically interesting because PR #220 (epoch rotation) makes this a first-class case the client must distinguish fromDecryptionFailed.StoreError::DuplicateId(crates/messaging/src/store.rs:97). Different semantic fromInsertError::Duplicate— this is the message store, not the DAG. Today both collapse toDuplicatein the table; that's probably fine, but call it out.WorkerResponse::Denied { reason }(crates/storage/src/role.rs:74,82,crates/replay/src/role.rs:274,321) is a peer-visible rejection (worker → client) and isn't in the table at all. Concrete cases today: "unknown server", "history not served", "query failed". These map to a mix ofInvalid,NotSyncProvider, andServerError, but the spec needs to say which.ApplyResult::Rejected("proposal not found")(crates/state/src/materialize.rs:161) andRejected("author is not a member")(line 494) are real rejection paths the spec lumps under "PermissionDenied/Restricted depending on cause" without saying which.IdentityError::PeerMismatchis mapped toInvalid("peer_id mismatch")but percrates/identity/src/lib.rs:78–84the error carries structuredclaimed/signerfields. Stuffing them into a string discards the structure the original error went out of its way to preserve. Worth aPeerMismatch { claimed, signer }variant.
Also: EventKind is 22 variants per crates/state/src/event.rs:68 (the doc comment says so explicitly), not 17 — the spec's framing implies a closer mapping than exists. Worth a sweep.
3. The #[non_exhaustive] + bincode forward-compat story is wrong
The spec claims "bincode encodes enums with a u32 discriminant. A receiver that hits an unknown discriminant fails unpack_envelope cleanly". This is misleading. Bincode returns a generic ErrorKind::InvalidValue / ErrorKind::Custom("invalid value: integer N, expected variant index 0..=K") indistinguishable from "the bytes were corrupt mid-stream". You cannot programmatically tell "this is a future variant I should ignore" from "the message is corrupted and I should drop the connection".
#[non_exhaustive] only helps the Rust compiler force a wildcard arm on the consumer; it does nothing for the bincode wire. If the goal is true forward compat, the spec needs an explicit discriminant prefix on the wire (e.g. wrap the reason in RejectPayload { reason_tag: u32, reason_body: Vec<u8>, ... } so unknown tags can be skipped without failing the whole envelope), or accept that adding a variant is a breaking wire change and bump PROTOCOL_VERSION. Either is fine, but the spec needs to pick one and stop hand-waving.
4. Authority on Reject envelopes — answer must be "yes, signed"
Open question 1 frames signing as optional ("Probably yes"). Given how Willow gossip works (crates/network/src/iroh.rs, broadcast topic), any peer subscribed to the same topic can publish a MessageType::Reject with a forged RejectContext::Event(victim_hash). An attacker doesn't need to be the relay — they only need to be in the gossip mesh. Without signing, a single malicious peer can:
- Pop up
ui.prompt_reauth()modals on every send (DoS the UX) - Trigger spurious
sync.request_history(context)storms via fakeParentHashMismatch - Suppress legitimate sends by faking
PermissionDenied(SendMessages)against another peer's hash
The spec should mandate signing of the RejectPayload by the rejecting peer's identity, and clients should drop unsigned Rejects or Rejects from peers whose claim of authority over the topic isn't credible (e.g., not the relay's EndpointId and not a SyncProvider-granted member). This isn't an open question — it's a load-bearing decision and the spec should make it.
5. Correlation is a real blocker for RejectContext::Envelope
For Event(hash) rejections this is fine — the event already has a content hash before send, and the client can keep an outbox keyed by hash. But envelope-level rejections (UnsupportedVersion, PayloadTooLarge) fire before the inner payload deserialises, so the rejecter doesn't have an event hash to echo back. The spec's RejectContext::Envelope is just a unit variant — clients have no way to know which outbound envelope this Reject corresponds to.
Open question 2 proposes a send_id: u64 stamp. Endorsing that: a 64-bit per-connection-monotonic send-id added to Envelope is cheap, would make all three context variants correlatable, and is a one-time wire change that pairs naturally with bumping PROTOCOL_VERSION for MessageType::Reject introduction anyway. Recommend rolling it into this spec rather than punting.
6. ACK symmetry — current state is "fire and pray"
Open question 5: today there is no application-level ACK. The client treats successful TopicHandle::broadcast (crates/network/src/iroh.rs) as success, but iroh-gossip only confirms local submission to the gossip layer, not that any peer received or accepted the event. The "no reject within N seconds" heuristic is unworkable on a partitioned mesh — the only honest signal of acceptance today is seeing your own event echoed back via gossip.
Recommendation: defer MessageType::Ack but call out explicitly that without it, Reject is negative-only. The client UX must continue to use "saw my event in the ordered stream" as the success signal. Otherwise users will get UIs that show "delivered" the moment a single peer accepted, which is misleading on a P2P mesh.
7. Rate-limit ownership
Open question 4: the table only assigns RateLimited to the relay's connection cap. But per crates/storage/src/role.rs:74 and crates/replay/src/role.rs:321, worker nodes already have a WorkerResponse::Denied path and could trivially throttle expensive History/Sync requests. Recommend the spec say: any node CAN emit RateLimited; relay does it for transport-layer load, workers do it for query-layer load. Currently neither has actual rate-limit infrastructure — the spec should note that this is a prerequisite, not a side-effect.
8. PermissionDenied(Permission) leak — a position
Spec asks if naming the permission leaks info. Position: leakage is real but small, and the alternative is worse.
- The signed event already carries
(author, kind), so any third party in the gossip mesh already knows what was attempted. The rejecter merely confirms what's structurally derivable. - The exception is
SyncProviderandCreateInvite— these are not bound to specific event kinds in the same 1:1 way (crates/state/src/materialize.rs:266-300), so aPermissionDenied(SyncProvider)against a sync-attempt does narrowly confirm "this peer is not a sync provider on this server", which an attacker probing for relay topology might find useful. - Mitigation: collapse permission disclosure to a coarser bucket for the wire (
PermissionDenied(Coarse)whereCoarse ∈ {Read, Write, Manage}) and let the localhumanstring carry the precise permission for the rejected peer's own log. The third party sees "denied, write tier"; the rejected peer sees "you lackSendMessages".
9. ParentHashMismatch { expected, actual } — lower risk than the spec implies
Per crates/state/src/dag.rs:166-172, expected is the offending author's own current head in the rejecter's view, not a global state hash. Echoing it back tells the sender "I think you're at hash X" — which is exactly the information the sender needs to recover, and which a sync request would have to disclose anyway. Not a leak.
10. Sibling-PR consistency
- PR #214 (history-sync EOSE) mentions
HistorySyncFailedas a future Reject variant. The spec should at minimum reserve aHistorySyncFailedvariant or pin down whetherInvalid("history…")is the canonical encoding — otherwise both specs ship and we end up with two ways to say the same thing. - PR #220 (epoch rotation) explicitly mentions "rejection of bad-epoch
RotateChannelKey" composing with this spec. There's noWrongEpoch { expected, actual }variant. Either add it now or note thatInvalid("epoch: …")is the agreed encoding. - PR #215 (capability doc) suggests advertising
supports_machine_readable_errors: bool. Spec should reciprocate by mentioning how a client behaves against a relay that doesn't supportMessageType::Reject(silent old behaviour, presumably).
11. Minor
managed_dag.rsmentioned in the review brief — actual file iscrates/state/src/managed.rs. Spec doesn't reference either, just calling it out.- The
iroh gossip receive error → ServerError (local-only; not sent)row at the bottom of the mapping table is correct but worth pulling into a top-level "what we explicitly do NOT send on the wire" section, alongside any decode failure on the inner payload (you literally cannot send a Reject to a peer whose bytes you couldn't decode, so the recovery has to be transport-level). RejectPayload.human: Option<String>is unbounded — should have a max length cap (e.g. 256 bytes) enforced on serialization, since this is attacker-controlled data hitting the receiver's logger and possibly UI.
Suggestions
- Move
WireRejectReasonto a newwillow-wire-typesleaf crate to break the cycle withPermission. - Add
send_id: u64toEnvelopein this spec; it's a one-time breaking change that pays for itself in correlation. - Mandate signing of
RejectPayloadby the rejecting peer; resolve open question 1 firmly as "yes". - Add
WrongEpoch,DecryptionFailed,RatchetCounterOutOfRange, andHistorySyncFailedvariants to align with sibling specs and cover the missing rejection sites. - Pick an explicit forward-compat strategy: either tag-prefixed reasons (skip-unknown) or accept that variant additions bump
PROTOCOL_VERSION. Don't claim bincode handles this for you. - Add a max-length cap on
human. - Resolve the
PermissionDeniedleak by sending only a coarse tier on the wire and putting the precise permission inhumanfor the rejected peer's own consumption.
Happy to look at a v2.
Generated by Claude Code
| Duplicate, | ||
| Invalid(String), | ||
| RateLimited { retry_after_ms: u64 }, | ||
| PermissionDenied(Permission), // re-exported from willow-state |
There was a problem hiding this comment.
Dependency cycle: WireRejectReason lives in willow-transport, but Permission is defined in willow-state (crates/state/src/event.rs:21), which already depends on willow-transport per the dep graph in CLAUDE.md. Re-exporting Permission "from willow-state" as the comment suggests is structurally impossible.
Recommend a new willow-wire-types leaf crate that both state and transport depend on. It would also be the natural home for RejectContext (which already wants both EventHash from state and TopicId from network).
Generated by Claude Code
| TopicInvalid(String), | ||
| Capacity, // relay MAX_TOPICS or similar cap hit | ||
| ServerError, // generic last-resort | ||
| } |
There was a problem hiding this comment.
Variant gaps. The 16-variant set misses several real peer-visible rejection paths I found grepping Err( across the workspace:
CryptoError::DecryptionFailed(crates/crypto/src/lib.rs:296,401) — recoverable via key request, not by retry; needs its own variant.CryptoError::RatchetCounterOutOfRange { claimed, max }(crates/crypto/src/lib.rs:279) — first-class case once PR spec: epoch-driven channel key rotation for PCS #220 (epoch rotation) lands.WorkerResponse::Denied { reason }from storage/replay (crates/storage/src/role.rs:74,82,crates/replay/src/role.rs:274,321) — peer-visible worker→client rejections, not in the table at all.IdentityError::PeerMismatch { claimed, signer }is mapped toInvalid("peer_id mismatch"), discarding the structured fields the original error preserves. WorthPeerMismatch { claimed, signer }.ApplyResult::Rejected("proposal not found")(crates/state/src/materialize.rs:161) andRejected("not a member")(line 494) are real paths the table lumps under "PermissionDenied/Restricted depending on cause" without specifying which.- Sibling PR spec: epoch-driven channel key rotation for PCS #220 mentions bad-epoch
RotateChannelKeyrejection; needsWrongEpoch { expected, actual }. - Sibling PR spec: history sync completion signal (EOSE-equivalent) #214 mentions
HistorySyncFailed; either reserve the variant or pick the canonical encoding now.
Also: the spec implies a 17-variant EventKind mapping, but crates/state/src/event.rs:68 says explicitly "All possible state mutations — 22 variants". Worth a re-sweep.
Generated by Claude Code
| ## Extensibility & versioning | ||
|
|
||
| - `#[non_exhaustive]` forces downstream `match` to carry a wildcard, | ||
| so adding a variant is never a SemVer break. | ||
| - bincode encodes enums with a `u32` discriminant. A receiver that | ||
| hits an unknown discriminant fails `unpack_envelope` cleanly; the | ||
| client treats the failed decode as a local `ServerError` with | ||
| `human = Some("unknown reason variant")`, logs the raw bytes at | ||
| `debug!`, and leaves the original outbound event in its retry | ||
| queue — safe default, because a newer peer that accepts is still | ||
| reachable elsewhere in the mesh. |
There was a problem hiding this comment.
This claim about #[non_exhaustive] + bincode forward compat is wrong. #[non_exhaustive] only forces a wildcard arm at the Rust compile site; it does nothing for the wire. Bincode does not "fail cleanly" with a useful error on an unknown discriminant — it returns a generic ErrorKind::InvalidValue / ErrorKind::Custom that is indistinguishable from corrupt bytes mid-stream. The receiver cannot programmatically tell "future variant, ignore" from "this whole envelope is junk, drop the connection".
If real forward compat is the goal, the spec needs an explicit length-prefixed body so unknown reason tags can be skipped without failing the envelope, e.g.:
struct RejectPayload {
reason_tag: u32, // canonical discriminant
reason_body: Vec<u8>, // bincoded variant body, skippable on unknown tag
context: RejectContext,
human: Option<String>,
}Or accept that adding a variant is a wire-breaking change and bump PROTOCOL_VERSION. Either is fine; please pick one.
Generated by Claude Code
| 1. Should `RejectPayload` be authenticated? A relay that forges | ||
| `PermissionDenied` against a peer's legitimate event could | ||
| suppress that peer's messaging UI. Probably yes — sign the | ||
| payload with the rejecting peer's identity so clients can decide | ||
| whether to trust it. |
There was a problem hiding this comment.
This needs to graduate from "open question" to "decided: yes, must be signed". Willow gossip is broadcast (crates/network/src/iroh.rs), so any peer subscribed to the topic — not just the relay — can publish a MessageType::Reject with a forged RejectContext::Event(victim_hash). Concrete attacks an unsigned Reject enables for a single mesh participant:
- Pop
ui.prompt_reauth()modals on every send (UX DoS via fakeAuthRequired). - Trigger
sync.request_history(context)storms via fakeParentHashMismatch. - Suppress legitimate sends by faking
PermissionDenied(SendMessages)against another peer's hash.
Mandate: RejectPayload must be Ed25519-signed by the rejecting peer's identity (use the existing willow_identity::pack envelope at crates/identity/src/lib.rs:426). Clients drop unsigned Rejects, and the consume arm at line 153 should additionally weight Rejects by the signer's authority on the topic (relay endpoint vs. random member vs. SyncProvider-granted peer).
Generated by Claude Code
| #[derive(Clone, Debug, Serialize, Deserialize)] | ||
| pub enum RejectContext { | ||
| Event(EventHash), | ||
| Topic(TopicId), | ||
| Envelope, // predates any event hash | ||
| } |
There was a problem hiding this comment.
RejectContext::Envelope is a unit variant, so envelope-level rejections (UnsupportedVersion, PayloadTooLarge, the line 162 Invalid("deser: …") shape error) cannot be correlated with the offending send — the rejecter never decoded an event hash.
Endorsing open question 2's proposal: stamp a send_id: u64 on every Envelope (crates/transport/src/lib.rs:99) and echo it in the Reject. This is a one-time wire change that pairs naturally with bumping PROTOCOL_VERSION for MessageType::Reject = 7 introduction anyway, and it makes all three context variants correlatable. Suggest folding it into this spec rather than punting — without it, half the new Reject variants are useless to the client.
Also: human is unbounded attacker-controlled data hitting the receiver's logger and possibly UI. Cap it (e.g. 256 bytes) at serialization time.
Generated by Claude Code
intendednull
left a comment
There was a problem hiding this comment.
Round 2: Comparative research (non-Nostr prior art)
Surveyed gRPC, RFC 7807/9457, Matrix, AMQP 0-9-1 and 1.0, TLS 1.3 alerts, and Cap'n Proto RPC. The spec lands in a defensible spot, but a few of the choices put it on the minority side of decisions every other protocol made differently. Worth interrogating before bincode locks the shape in.
1. The dominant shape is "tagged union + open dictionary," not "closed enum"
Almost every system I looked at separates a small machine-readable tag from an open extension bag:
| System | Tag | Extension |
|---|---|---|
| gRPC | Code enum (closed, ~17 values) |
google.rpc.Status.details: repeated Any — RetryInfo, ErrorInfo, BadRequest, QuotaFailure, PreconditionFailure, ResourceInfo, Help, LocalizedMessage |
| RFC 9457 | type: URI |
arbitrary JSON extension members; clients MUST ignore unknown |
| Matrix | errcode: "M_LIMIT_EXCEEDED" |
arbitrary sibling fields (retry_after_ms, soft_logout, room_version, …) |
| AMQP 1.0 | error-condition: symbol ("amqp:not-found") |
info: map |
| Cap'n Proto | Type enum of 4 values |
free reason: Text |
| TLS 1.3 | fixed u8 AlertDescription |
none |
Willow's WireRejectReason puts payload inside each variant (RateLimited { retry_after_ms }, PermissionDenied(Permission), SeqGap { expected, actual }). That's ergonomic Rust, but it means every new piece of metadata is a wire-format change. If next year you want retry_after_ms on Capacity too, or a quota_remaining on RateLimited, you mint a new variant.
The { tag: String, details: BTreeMap<String, Value> } shape — gRPC's design — decouples the two evolution axes. New context fields ride along on existing tags; new tags don't break old receivers because they fall through the wildcard arm. Spec already pays the wildcard tax via #[non_exhaustive], so the type-safety win over a string tag is smaller than it looks: callers must _ => either way.
A middle ground worth costing out:
pub struct RejectPayload {
pub tag: WireRejectTag, // small closed enum, just the tag
pub details: RejectDetails, // typed-but-optional fields
pub human: Option<String>,
}
pub struct RejectDetails {
pub retry_after_ms: Option<u64>,
pub permission: Option<Permission>,
pub expected_hash: Option<EventHash>,
pub actual_hash: Option<EventHash>,
pub limit: Option<u64>,
pub actual: Option<u64>,
// …
}Adding a field is now backward-compatible at the bincode level (still risky — see #4). Adding a tag still requires the wildcard. You keep exhaustive matching where it matters (the tag) and lose the per-variant struct surface area.
2. Nobody binds the tag to a URI — and the only ones who tried (RFC 7807) regret it
RFC 7807/9457's type is a URI specifically so it's globally unique and dereferenceable to docs. In practice clients SHOULD NOT auto-dereference it and rarely do. The "human eventually" promise mostly doesn't pay off — Matrix's M_LIMIT_EXCEEDED-style namespaced strings are the de-facto winner because they're cheaper to type and just as unique-in-practice.
For Willow specifically: a closed Rust enum is already globally unique within the project. Going to URIs only helps if third-party Willow forks want to mint reasons without coordinating — which the spec's #[non_exhaustive] + bincode discriminants explicitly don't support. So URI tags would be cargo-culting RFC 9457's worst feature. Skip.
3. Why the "rich details" pattern fails in practice (and what would make Willow stick)
gRPC's biggest lesson is unflattering: most clients ignore Status.details entirely because it's Any-typed, requires a second protobuf import to unpack, and is inconsistently populated across server SDKs. RFC 7807's type field is similarly under-used — Spring Boot et al. surface title/detail to humans and hand-wave the URI. Matrix's errcode is the success story: clients do branch on it, because the spec lists every code, gives each one a single canonical meaning, and the human field is genuinely human (no JSON nested inside it).
What that suggests for Willow:
- The mapping table is the most important asset in the spec. Keep it in-tree as the canonical source, not in prose. Generate the enum from it if feasible.
- Resist
Invalid(String)andRestricted(String). Those are exactly the "stringly reason" pattern this spec is supposed to retire. The "absorb future sub-categories" justification (lines 181–183) is the slippery slope — Matrix'serror: "..."field is openly described as "for the user", not for branching, and it works. Willow's twoStringvariants will silently accumulate ad-hoc categories that the matching client code can't enumerate. Either commit to a tag and add new tags as needed, or fold these into a genericOther { tag: String }so the stringliness is at least named. - One canonical
retry_after_mslocation. See #5.
4. TLS 1.3 alert-style brittleness vs. bincode-discriminant brittleness
TLS 1.3's fixed AlertDescription u8 is fast but strictly additive — old clients seeing a new alert just see decode_error (or worse, drop the connection). New alerts shipped in TLS 1.3 (missing_extension, certificate_required, no_application_protocol, …) are simply invisible to TLS 1.2 implementations. That's been acceptable for TLS because alerts are advisory; the connection is dead either way.
Bincode enum mismatch has the same property: a v2 sender emits Capacity (discriminant 14), a v1 receiver decoding into a 12-variant enum gets a deserialization failure on the whole envelope. Spec acknowledges this on line 178–183 and degrades gracefully to a synthetic ServerError. That's fine only if Reject payloads are advisory and never gate retries. The moment a client uses RateLimited.retry_after_ms to throttle, an unknown new "RateLimitedV2 { retry_after_ms, jitter_ms }" variant becomes a v1 retry-loop. Worth stating explicitly in the spec: receivers MUST default to "no retry, surface to user" on any decode failure, never to "retry immediately". Open-question 4 ("advisory or enforced retry_after") is downstream of this.
If long-term forward compat matters, the gRPC/Matrix style (small closed tag, optional typed details) survives a v1↔v2 mismatch because the tag space is small and stable; the details bag is ignorable. The current spec's per-variant payloads do not.
5. Cross-cutting retry primitive — gRPC vs HTTP teach a real lesson
gRPC encodes RetryInfo.retry_delay as a protobuf message inside trailing metadata. HTTP uses the Retry-After header. Different transports, but both kept the retry hint out of the status code itself so it could ride along with non-rate-limit errors too (overloaded backends, maintenance windows, quota refills).
Willow currently puts retry_after_ms only on RateLimited. But the mapping table already shows Capacity triggering at the relay's connection cap (line 139) and it's a perfectly valid retry case — the relay will have capacity in 30s. Same for ServerError during a deploy. Recommend hoisting retry_after_ms: Option<u64> to RejectPayload itself (or to the shared RejectDetails if you adopt #1), so any reason can carry a hint and the client backoff logic has one place to look:
ClientEvent::Rejected(payload) => {
if let Some(ms) = payload.retry_after_ms { backoff.schedule(ms); }
match payload.tag { … }
}This also kills the awkwardness of RateLimited { retry_after_ms } being the only variant whose presence in the enum implies a numeric field — every other variant is opaque.
Concrete recommendations
- Decide the tag/details split before merging. Either commit to per-variant payloads (current) and accept that wire-format evolution = new variants only, OR move payload to a shared
RejectDetailsand document the forward-compat envelope. - Drop
Invalid(String)andRestricted(String)in favor of named tags; if a stringly escape hatch is needed, name itOther { tag: String }so it's obviously the loose case. - Hoist
retry_after_msout ofRateLimitedinto the envelope so any reason can carry it. - Add a normative line: receivers MUST treat unknown discriminants / decode failures as non-retryable.
- Don't go URI-typed. RFC 9457's worst feature.
What's actually fine as-is
- Single human field, separate from machine tag (Matrix/RFC 9457 agree).
- Decoupling
RejectContextfromWireRejectReason(gRPC does the same withRequestInfo). - Locating the type in
willow-transportto avoidwillow-statedep cycles. #[non_exhaustive]on the enum.
Sources:
- gRPC error model
- google.rpc error_details.proto
- RFC 9457 — Problem Details
- Matrix standard error response
- TLS 1.3 alerts (RFC 8446 §6)
- Cap'n Proto rpc.capnp Exception type
- AMQP 0-9-1 reference
- RFC 7807 retrospective
Generated by Claude Code
Apply review decisions to the relay capability document spec: - Promote signing to v1 MUST (inline signature, RFC 8785 JCS canonical bytes, signature field excluded from canonicalisation). - Specify dispatch surgery: explicit branch in dispatch_connection for /.well-known/willow plus OPTIONS preflight; reuse BOOTSTRAP_IO_TIMEOUT and MAX_CONCURRENT_BOOTSTRAP_CONNECTIONS; extend (not mirror) the handle_bootstrap_connection pattern. - Drop event_schema_range (no EVENT_SCHEMA_VERSION exists in willow-state); list as future work. - Resolve multi-tenant question: one shared doc per host, relay is topic-agnostic. - Soften operator-metadata leakage: version is coarse semver, software is project name, both MAY be omitted. - Two-tier caching by status: ok=300s, degraded/read_only=5s with must-revalidate. - Recommend WS clients also send Sec-WebSocket-Protocol; JSON is advisory pre-connect. - Fix port framing: relay binds one port multiplexing TCP+WS, not two. - Drop sync_provider_only (operator vibes without a concrete pre-handshake check). - Add Cross-spec coordination table pinning feature tags for #214, #216, #217, #218, #219, #220, #221. - Rewrite Open Questions to keep only genuinely-open items (paid-relay semantics, utilisation telemetry, relay discovery, feature registry). https://claude.ai/code/session_01XmbVXWnKTRVjPp9kmKRSBn
- fix relay/lib.rs:155 → :156 (Err arm of try_acquire_owned) - fix identity/lib.rs:51 → :52 (InvalidSignature variant decl) - cite transport/lib.rs:138 (pack) instead of :99 (Envelope) for bincode evidence - call out the got → actual field rename vs existing internal types - add three additional materialize.rs rejection rows (admin-only, missing proposal, RotateChannelKey non-member) and note the table is illustrative, not exhaustive - singularize the "local-only" sentence (only one such row exists) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ia WireMessage - round 2 - WireRejectReason/RejectPayload now live in willow-common (transport→state cycle avoided) - Re-targeted to a new WireMessage::Reject variant; described receiver-side wiring - Mapping table line cite for MissingGovernanceDep: dag.rs:37 → :38 - Motivation: Identity::verify (no such method) → SignedMessage::verify Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… - round 4 - PR #214 reframed as co-proposed (not landed precedent); merge-order is independent, conflicts are trivial. - PermissionDenied: explicitly part of the work to thread typed Permission through check_permission and InsertError, instead of parsing it back out of a formatted string. - OQ1 dropped: the envelope is already authenticated via pack_wire, so the body is correct and the question was incoherent. - target_peer: EndpointId added to RejectPayload; clients filter by self.endpoint_id() (mirrors VoiceSignal/JoinResponse/JoinDenied). - Wording: "dropped silently" -> "logged-only and not signaled". - Dep graph: state -> identity -> transport (transitive, not direct). - TopicId is not actually re-exported by willow-network; spec now says so and uses [u8; 32] on the wire. - PartialEq, Eq added to RejectPayload + RejectContext. - Relay connection-cap mapped to Capacity (logged-only; semaphore has no retry_after_ms basis); RateLimited moved to future producers. - Future-producer rows split into their own subsection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Part of a set of 8 specs drawing lessons from Nostr's protocol and ecosystem. Use this PR to discuss the design — not proposing implementation, only the spec.
What & why
Nostr's NIP-01 defines machine-readable prefixes on OK/CLOSED messages (
duplicate:,rate-limited:,invalid:,auth-required:,restricted:, ...) so clients can react programmatically — retry onrate-limited:, prompt for auth onauth-required:, silently drop onduplicate:. Willow currently has no equivalent; rejection handling is ad-hoc string-matching or silent drops.Since Willow's wire format is bincode rather than JSON, this spec proposes a typed enum (
WireRejectReason) instead of string prefixes — safer, still extensible via#[non_exhaustive]. NewMessageType::Reject { reason, context }variant, context carrying event hash / topic id / envelope seq. 16 variants justified by concrete rejection sites inwillow-state(InsertError,check_permission),willow-transport(UnsupportedVersion, size cap),willow-identity(InvalidSignature,PeerMismatch), andwillow-relay(topic validation,MAX_TOPICS, connection cap).Spec file:
docs/specs/2026-04-24-error-prefixes.mdOpen questions for review
Rejectenvelope need a signature, or is the raw-transport-level delivery trust enough?Rejectto the specific send that triggered itPermissionDenied(Permission)— does naming the specific permission leak information?MessageType::Ack?Composition with sibling specs
HistorySyncFailedcould be a Reject variantRotateChannelKeyCommit is unsigned due to harness signing backend failure (same as sibling PRs in this set).
Generated by Claude Code