diff --git a/docs/specs/2026-04-24-epoch-key-rotation.md b/docs/specs/2026-04-24-epoch-key-rotation.md
new file mode 100644
index 00000000..bc61bbad
--- /dev/null
+++ b/docs/specs/2026-04-24-epoch-key-rotation.md
@@ -0,0 +1,496 @@
+# Epoch-Driven Channel Key Rotation
+
+> **One-sentence summary:** Derive a fresh channel encryption epoch from
+> every membership-changing state event, so compromise of a channel key
+> expires the next time someone joins, leaves, or is kicked.
+
+Lessons taken from Nostr's trajectory — NIP-04/44/17 all lack forward
+secrecy because the conversation key is a deterministic `HKDF(ECDH)`
+output, and NIP-EE/Marmot's attempt to bolt MLS on top is heavy enough
+that uptake has stalled. Willow's `ChannelKey`
+(`crates/crypto/src/lib.rs:96–112`) is a long-term symmetric secret
+intended for distribution at invite time. Production code does not yet
+call `seal_content` — `Content::Encrypted` is never produced today —
+nor does any production path emit `RotateChannelKey` or call
+`generate_channel_key()` (today the only callers are tests; see
+`crates/state/src/tests.rs`, `crates/crypto/src/lib.rs` test
+modules, and the `#[cfg(test)]` helpers in
+`crates/client/src/invite.rs:171` and
+`crates/client/src/lib.rs:1261`). The state-machine handler, the wire
+variant, and the crypto
+primitive all exist as plumbing; what is missing is a producer in
+`willow-client` (in the channel-creation path —
+`Client::create_server` / `Client::create_channel` — and the
+invite-issuance path — `generate_invite` / `accept_invite`)
+that actually generates and distributes a key. So the post-compromise
+security (PCS) gap is
+currently latent: once message encryption *and* the channel-key
+producer are wired up, a compromise of any member's `ChannelKey` would
+leak every past and future message until someone manually rotates,
+and there is no client API that exposes such a rotation today.
+This spec closes that gap *before* encryption ships, by introducing a new
+`RotateChannelKeyV2` event whose application is triggered by existing
+membership-changing state events: `Propose { KickMember } + Vote`,
+`RevokePermission`, `GrantPermission`, `AssignRole`, and explicit
+out-of-band rotations.
+
+## Threat model
+
+**Provides:**
+
+- **Post-compromise security.** After a compromise *and* at least one
+  membership change, the old key no longer decrypts new messages.
+- **Weak forward secrecy.** Past ciphertext is safe only if every
+  member actually deletes old epoch keys after use. Willow cannot
+  enforce that — it is a client-policy matter.
+- **Partial metadata hiding.** Derived TopicIds rotate per epoch, so a
+  passive gossip observer loses membership continuity across
+  rotations.
+
+**Does not provide:**
+
+- **Full forward secrecy** of in-flight messages — that would need a
+  double-ratchet, out of scope here.
+- **Post-quantum confidentiality** — X25519 only.
+- **IP-level or timing privacy** — that is a transport concern.
+- **Protection of pre-join history** from a new member — the default
+  policy grants new members the current epoch key only. See the
+  "Joining" section.
+
+## Epoch definition
+
+An `Epoch` is `(channel_id, epoch_number: u32)`. The width matches the
+existing `SealedContent.key_epoch: u32` and `KeyRatchet::epoch: u32`
+plumbing; `u32` (about 4.3 billion epochs) is far more than any channel
+will exhaust,
+and widening every encrypt/decrypt path to `u64` for a theoretical
+ceiling we will never hit is unjustified churn. Epoch 0 is the key
+generated by the channel creator (under this spec:
+`willow_crypto::generate_channel_key()` will be called by the client
+at channel-creation time and distributed via the invite flow / a
+follow-up genesis `RotateChannelKeyV2`; the `CreateChannel` event
+itself carries no key material — only `name`, `channel_id`, and
+`kind`).
+Epoch N+1 is produced by authoring a `RotateChannelKeyV2`
+event that points at the triggering membership event.
+
+A `RotateChannelKeyV2` event is *valid* (and required) only when its
+`trigger` references one of the membership-changing events below. An
+explicit out-of-band rotation is allowed with `trigger = None`. The
+state machine — not the trigger event itself — increments
+`epoch_number` when it applies the `RotateChannelKeyV2` event.
+
+| Trigger event referenced by `RotateChannelKeyV2.trigger` | Rotates? | Why |
+|------------------------------------------------------------------|----------|------------------------------------------------------|
+| Channel creation (out-of-band key gen + genesis `RotateChannelKeyV2`) | Genesis | Establishes epoch 0. `trigger = None`. |
+| `None` (explicit out-of-band rotation, `epoch ≥ 1`) | Yes | Manual admin-driven rotation. |
+| `Propose { action: ProposedAction::KickMember { .. } }` | Yes | Kicked member must lose future read access. Kicks are governance-only — no direct `KickMember` EventKind exists. The trigger is the **`Propose` hash**, regardless of how many votes ratify it (see "Trigger identity" below). |
+| `RevokePermission { SendMessages }` | Yes | Revoked writer must also lose read on same rotation. |
+| `RevokePermission { SyncProvider }` | Yes | Former provider must not silently keep decrypting. |
+| `AssignRole` (membership-changing — see below) | Yes | Let rotation double as the join-key handoff. |
+| `AssignRole` (no membership change) | N/A | Spec rejects `RotateChannelKeyV2` triggered by a no-op assignment. |
+| `GrantPermission { SendMessages }` | Yes | Mirror of revoke. |
+| `DeleteChannel` | N/A | No further epochs; rotation rejected. |
+| `Message`, `EditMessage`, `Reaction`, … | No | Content events are never valid triggers. |
+| `SetProfile`, pins, renames | No | Nothing membership-sensitive; rejected as triggers.
|
+
+**What "membership-changing `AssignRole`" means.** Today there is no
+per-channel membership concept — channel access is gated entirely by
+who holds the channel key (entries in `state.channel_keys`). For the
+purposes of this spec, an `AssignRole` is "membership-changing" iff it
+results in the assignee newly satisfying the role-permission predicate
+for `SendMessages` on this channel (analogous logic for new
+`SyncProvider`s). A no-op assignment of a role the peer already holds
+is not a valid `trigger`. Note that `AssignRole` is itself a no-op for
+any peer not already in `state.members`
+(`crates/state/src/materialize.rs:381–387`) — only `GrantPermission`
+auto-creates a `Member` entry today — so the predicate is well-defined
+on the post-state. Once a richer per-channel ACL lands, this predicate
+moves to the new ACL surface.
+
+**Event model: rotation is its own DAG event.** A
+`RotateChannelKeyV2` is a normal author-signed, content-addressed
+event. Membership events do **not** mutate `state.channel_keys` or
+`epoch_number` as a side effect; only `apply_event` for
+`RotateChannelKeyV2` does, after validating that the referenced
+`trigger` is (a) already applied, (b) of an admissible kind from the
+table above, and (c) appears post-state to actually have caused a
+membership change. The `required_permission()` changes for
+`RotateChannelKeyV2` land alongside the enum changes in
+`crates/state/src/materialize.rs`. This keeps the existing "reject
+before sign" flow (see
+`docs/specs/2026-04-12-state-authority-and-mutations.md`) intact: an
+author who lacks `ManageChannels`, or who points `trigger` at a
+non-existent / wrong-kind event, gets their event rejected before it
+joins the DAG.
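
The three trigger checks (a) to (c) can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the Willow implementation: `Permission`, `TriggerKind`, `TriggerError`, `is_admissible`, and `validate_trigger` are hypothetical names, the `[u8; 32]` array stands in for the real event hash type, and the `changed_membership` flag is assumed to be computed elsewhere from the post-state.

```rust
use std::collections::BTreeSet;

/// Permissions that matter for trigger classification (simplified, hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Permission {
    SendMessages,
    SyncProvider,
    Other,
}

/// Hypothetical, flattened view of the event kinds listed in the trigger table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TriggerKind {
    ProposeKickMember,
    RevokePermission(Permission),
    GrantPermission(Permission),
    AssignRole,
    DeleteChannel,
    /// Message, EditMessage, Reaction, SetProfile, pins, renames, ...
    Other,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TriggerError {
    NotApplied,
    InadmissibleKind,
    NoMembershipChange,
}

/// Check (b): is this kind allowed to appear in `trigger` at all?
pub fn is_admissible(kind: TriggerKind) -> bool {
    matches!(
        kind,
        TriggerKind::ProposeKickMember
            | TriggerKind::RevokePermission(Permission::SendMessages)
            | TriggerKind::RevokePermission(Permission::SyncProvider)
            | TriggerKind::GrantPermission(Permission::SendMessages)
            | TriggerKind::AssignRole
    )
}

/// Runs checks (a), (b), (c) in order for a rotation with `trigger = Some(hash)`.
pub fn validate_trigger(
    applied_events: &BTreeSet<[u8; 32]>,
    trigger_hash: &[u8; 32],
    kind: TriggerKind,
    changed_membership: bool,
) -> Result<(), TriggerError> {
    // (a) the referenced event must already be applied
    if !applied_events.contains(trigger_hash) {
        return Err(TriggerError::NotApplied);
    }
    // (b) the referenced event must be of an admissible kind
    if !is_admissible(kind) {
        return Err(TriggerError::InadmissibleKind);
    }
    // (c) the event must actually have changed membership in the post-state
    if !changed_membership {
        return Err(TriggerError::NoMembershipChange);
    }
    Ok(())
}
```

Reporting an unapplied trigger as its own `NotApplied` case, rather than folding it into a generic rejection, leaves a caller free to hold the rotation and retry later instead of dropping it.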
+
+**Trigger identity for vote-driven rotations.** The proposal-and-vote
+flow makes "the event that drove the kick" ambiguous: the threshold can
+be met by the original Vote or re-met retroactively after a
+`RevokeAdmin` shrinks the admin set
+(`crates/state/src/materialize.rs:242–258` —
+`reevaluate_all_proposals`, reached via the
+`apply_proposed_action` → `cleanup_votes_and_reevaluate`
+chain at `crates/state/src/materialize.rs:234–239`), and an
+owner-override may apply on the `Propose` itself. To avoid that
+ambiguity, **`trigger` for any
+vote-driven rotation MUST be the `Propose` event hash, never a
+specific Vote**. The Propose hash is:
+
+- stable: it is content-addressed at proposal-creation time and
+  never changes;
+- available early: any potential rotator knows it the moment they
+  see the Propose;
+- race-free: it does not depend on which Vote happened to cross the
+  threshold;
+- single-valued: there is exactly one Propose per `KickMember` action.
+
+The state machine validates the rotation by checking that the
+referenced `Propose` is in `state.applied_events` AND that the
+proposal is no longer in `state.pending_proposals` (i.e. it has been
+ratified — pending proposals get removed from
+`state.pending_proposals` only on threshold crossing). The
+(rejected) alternative was a synthetic
+`hash(Propose.hash || sorted_ratifying_vote_hashes)` identifier; it
+adds determinism work for no extra security and would have to be
+recomputed every time `reevaluate_all_proposals` shifts the ratifying
+set.
+
+## Key derivation
+
+Both derivations use the standard HKDF Extract+Expand flow with an
+explicit, versioned domain separator in `info` — following the same
+versioned `info` convention as the existing `HKDF_*_DOMAIN` constants
+(`crates/crypto/src/lib.rs:55–63` — `HKDF_RATCHET_MSG_DOMAIN`,
+`HKDF_KEYWRAP_DOMAIN`) and the Extract→Expand discipline in
+`KeyRatchet::next_key` (`crates/crypto/src/lib.rs:158–190`).
+The explicit Extract `salt` (`b"willow-crypto/v1/epoch/salt"`) is **new**
+to this spec — all four existing `Hkdf::<Sha256>::new(...)` call sites
+in `crates/crypto` use either `None` (`crates/crypto/src/lib.rs:159,
+403, 435`) or an unkeyed advance label (`crates/crypto/src/lib.rs:180`,
+which passes `Some(&info)` — the same `info` bytes already used as the
+Expand label) as salt; this is the first use of an explicit, versioned,
+fixed-string salt. Using a non-empty, versioned salt for epoch
+derivation is a deliberate hardening: even if `epoch_key[N]` is ever
+reused as IKM in another context, the domain-separated PRK will not
+collide. The salt convention follows
+the same `willow-crypto/v1/...` versioning so a future semantic change
+bumps the `v1` segment, exactly like the `info` strings.
+
+```text
+epoch_key[0]   = CSPRNG at channel creation (client-side
+                 generate_channel_key(); CreateChannel carries no key)
+
+prk            = HKDF-Extract(
+                   salt = b"willow-crypto/v1/epoch/salt",
+                   ikm  = epoch_key[N] || triggering_event.hash
+                 )                                   // 32 bytes
+epoch_key[N+1] = HKDF-Expand(
+                   prk  = prk,
+                   info = b"willow-crypto/v1/epoch/key",
+                   L    = 32
+                 )                                   // 32 bytes
+epoch_key_id   = HKDF-Expand(
+                   prk  = prk,
+                   info = b"willow-crypto/v1/epoch/id",
+                   L    = 16
+                 )                                   // 16 bytes
+```
+
+- SHA-256 is the HKDF hash throughout — matches Willow's existing
+  `KeyRatchet` in `crates/crypto/src/lib.rs:136–208`.
+- The `info` strings live alongside the existing
+  `HKDF_*_DOMAIN` constants.
+- `triggering_event.hash` is sufficient — the parent DAG context is
+  already folded in because the event hash commits to `prev` and
+  `deps`. Folding the full state hash in addition was considered and
+  rejected: it forces ordering determinism inside the derivation, and
+  HLC/DAG merge can momentarily disagree on state hash even when the
+  set of events is identical.
+- `epoch_key_id` is a public, 128-bit identifier safe to appear on
+  the wire, unlike the raw key.
+
+## Distribution
+
+Rotation needs two things on the DAG: the *fact* that rotation
+happened (so everyone increments `epoch_number`), and the *ciphertext*
+of the new key under each remaining member's public key.
+
+**Wire-compat strategy: new variant.** Willow events are serialized
+via `bincode::serialize(&signable)` for both hashing and signing
+(`crates/state/src/event.rs:252,278`). Bincode is positional, not
+field-named; `#[serde(default)]` on a new struct field does **not**
+make a payload that omits the field round-trip — deserialization
+fails at EOF, and even if it didn't, re-serializing the in-memory
+value would produce a different byte length and break the SHA-256
+hash check inside `Event::verify`. Adding `epoch` and `trigger`
+fields to the existing `EventKind::RotateChannelKey` is therefore not
+a viable path. Instead this spec introduces a new EventKind variant:
+
+```rust
+// New variant — added to `EventKind` in crates/state/src/event.rs.
+RotateChannelKeyV2 {
+    channel_id: String,
+    /// Epoch this rotation establishes. MUST equal `0` for the genesis
+    /// rotation (no `RotateChannelKeyV2` applied yet for this channel)
+    /// and `prev_epoch + 1` otherwise, where `prev_epoch` is the
+    /// channel's current epoch.
+    /// Matches `SealedContent.key_epoch` width.
+    epoch: u32,
+    /// Hash of the membership event that triggered this rotation,
+    /// or `None` for the genesis rotation (`epoch == 0`) and for
+    /// explicit admin-initiated out-of-band rotations.
+    // NOTE: the type parameter is missing in the source text;
+    // `EventHash` is an assumed name for the event hash type.
+    trigger: Option<EventHash>,
+    /// `encrypt_channel_key_for` blobs, one per intended recipient.
+    encrypted_keys: Vec<(EndpointId, Vec<u8>)>,
+}
+```
+
+The legacy `EventKind::RotateChannelKey` variant
+(`crates/state/src/event.rs:152–155`) is **kept verbatim** — its
+serialized shape never changes, so any historical event that may
+have been persisted continues to deserialize and verify.
+Under this spec, however, the legacy variant carries no epoch semantics: when
+`apply_event` encounters a `RotateChannelKey`, it treats it as an
+opaque epoch-0 key seed (it inserts entries into
+`state.channel_keys` for the listed peers but does not advance
+`epoch_number`). All new rotation traffic — including the genesis
+rotation produced by `generate_channel_key()` at channel creation —
+MUST use `RotateChannelKeyV2`. Choosing a brand-new variant over an
+explicit pre-1.0 wire-break (the alternative considered, and the
+path taken historically for the HKDF-prefix change documented at
+`crates/crypto/src/lib.rs:51–53`) was the cleaner path here because:
+
+- it preserves any persisted history without a one-shot migration;
+- it gives a clean place to put the new validation rules (epoch
+  monotonicity, trigger validation, kicked-peer exclusion) without
+  entangling them with legacy semantics;
+- the in-memory cost is one extra enum variant.
+
+**Genesis convention.** The first `RotateChannelKeyV2` for a freshly
+created channel carries `epoch = 0` and `trigger = None` (the
+genesis key produced by `generate_channel_key()` at channel-creation
+time). The next rotation caused by a membership event carries
+`epoch = 1` and `trigger = Some(...)`, and so on.
+
+**Validation rules** for `RotateChannelKeyV2` (enforced in a new
+`apply_event` arm, separate from the legacy `RotateChannelKey`
+handler):
+
+- Author MUST hold `ManageChannels` for this channel (mirrors the
+  legacy variant's permission gate).
+- `epoch` MUST equal `prev_epoch + 1` for non-genesis rotations, and
+  MUST equal `0` for the genesis rotation (only allowed when no
+  `RotateChannelKeyV2` has been applied yet for this channel).
+  `prev_epoch` is tracked in a new state field
+  `channel_epochs: BTreeMap<String, u32>` on `ServerState` (parallel
+  to the existing `channel_keys` field — see
+  `crates/state/src/server.rs:55`).
+- If `trigger` is `Some(hash)`:
+  - `hash` MUST be in `state.applied_events`
+    (`crates/state/src/server.rs:84`);
+  - the referenced event's kind MUST be one of the admissible
+    kinds from the trigger table above;
+  - for `Propose { KickMember }`, the proposal MUST have already
+    been accepted (i.e. removed from `state.pending_proposals` by
+    the threshold-crossing path);
+  - the rotation MUST NOT include the kicked / revoked peer in
+    `encrypted_keys`.
+- If `trigger` is `None`: only allowed for `epoch == 0` (genesis) or
+  for explicit out-of-band rotations authored by an admin
+  (server admin per `state.admins`) — this prevents non-admins from
+  silently bypassing the trigger requirement.
+- Every `(peer_id, key_bytes)` in `encrypted_keys` MUST have
+  `peer_id` in the post-state member set. (The legacy handler at
+  `crates/state/src/materialize.rs:487–505` checks that the *author*
+  is a member but does not validate each `(peer_id, key_bytes)`
+  recipient against the post-state member set; per-recipient
+  validation is the new check this spec introduces.)
+- `encrypted_keys` continues to use `encrypt_channel_key_for`
+  (`crates/crypto/src/lib.rs:388–422`) — ephemeral X25519 + HKDF +
+  ChaCha20-Poly1305.
+
+**Out-of-order: hold-and-defer.** Willow's insert flow tolerates
+missing deps. If a `RotateChannelKeyV2` arrives before its `trigger`
+event has been applied, the state machine does **not** reject it
+outright: it holds the rotation in a per-channel "pending rotations"
+queue and re-runs validation each time `state.applied_events` grows.
+Once the trigger applies, the pending rotation applies in the same
+pass. To bound memory and avoid keeping a stale rotation alive
+indefinitely, a configurable timeout (default: 5 minutes of
+wall-clock time after first observation) drops the rotation; the
+client surfaces this as a warning and SHOULD re-author a fresh
+rotation.
+The (rejected) alternative was reject-on-arrival, which
+forces every well-behaved peer to retransmit on every transient
+out-of-order delivery and gives an attacker a trivial way to grief
+rotations by reordering gossip.
+
+**Pairing on the DAG.** A membership event and its follow-up
+`RotateChannelKeyV2` are separate DAG entries but logically paired.
+Peers SHOULD emit them back-to-back. If a membership event is
+applied and no subsequent rotation appears within a configurable
+timeout, clients MUST surface a warning — the channel is running on
+the pre-change key.
+
+## Topic ID rotation
+
+The runtime topic string is built by `make_topic` at
+`crates/client/src/util.rs:55–58` (`format!("{}/{}", server_id,
+channel_name)`) and then hashed by `topic_id` at
+`crates/network/src/topics.rs:12` (BLAKE3 over the resulting string).
+The defined-but-unused `channel_topic` helper at `topics.rs:42` uses
+the same `format!` shape but with a `ChannelId(Uuid)` instead of the
+human-readable channel name; both feed into the same `topic_id` hash.
+The runtime topic is stable for the channel's life, so passive gossip
+observers can correlate traffic volume with membership. Under this
+spec:
+
+```text
+TopicId(channel, epoch) = blake3(
+    b"willow-topic-v1"
+    || channel_id_bytes
+    || epoch_key_id
+)
+```
+
+Using `epoch_key_id` (not `epoch_number`) means a non-member cannot
+predict future topic IDs. Members transition topics on each epoch
+event — they already have `epoch_key[N+1]`, so they know
+`epoch_key_id[N+1]`, so they can subscribe to the new topic
+atomically. The old topic stays alive briefly for in-flight messages
+and is then abandoned.
+
+## SealedContent integration
+
+`SealedContent.key_epoch: u32` in `crates/messaging/src/lib.rs:159–172`
+already exists and is plumbed through `seal_content` /
+`seal_content_with_counter` / `open_content_bounded`
+(`crates/crypto/src/lib.rs:251–284`).
+The sender writes `key_epoch`
+inside `seal_content_with_counter` (`crates/crypto/src/lib.rs:281`);
+the receiver reads it at `crates/crypto/src/lib.rs:330` to call
+`derive_message_key`. It is
+always zero in production today only because no production caller
+currently invokes `seal_content` — the field is wired but unused.
+Under this spec the field becomes authoritative once message
+encryption is wired up: the sender sets it to the epoch number whose
+key encrypted the payload; the receiver indexes into their local
+`BTreeMap<(String /* channel_id */, u32), EpochKey>`. Note: this
+spec uses the state-side channel identifier (a `String` matching
+`EventKind::CreateChannel.channel_id` and the keys of
+`state.channel_keys`), NOT the messaging-layer `ChannelId(Uuid)` from
+`willow-messaging`. We deliberately key off the state identifier so
+the same lookup table works for both ratchet derivation and
+state-machine validation; a future unification of the two `ChannelId`
+representations is out of scope for this spec.
+
+`ratchet_counter` continues to work for within-epoch per-message
+derivation via `KeyRatchet`.
+
+## Joining
+
+A member is added via `AssignRole` (direct) or via accepted
+`Propose { AddMember }` (if/when that's added). Either way:
+
+1. Membership event applied at the DAG head.
+2. The author of the membership event (or any other member with
+   `ManageChannels`) emits the follow-up `RotateChannelKeyV2`
+   referencing the membership event in `trigger` and including the
+   new member in `encrypted_keys`.
+3. The new member decrypts their entry and learns `epoch_key[N+1]`.
+4. They subscribe to the new `TopicId`.
+
+**Past-message access policy (default):** new members receive
+`epoch_key[N+1]` only. They cannot decrypt epochs 0..=N. This matches
+MLS-style "post-join confidentiality" and is the safer default. An
+opt-in `ShareHistoricalKeys` channel setting could loosen this — left
+out of scope for this spec; see open questions.
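
As a rough sketch of the default join policy and of the `(channel_id, key_epoch)` lookup described under "SealedContent integration", the local key table could look like the Rust below. All names here (`EpochKey` as a bare 32-byte wrapper, `EpochKeyStore`, `key_for`, `prune_before`) are hypothetical and chosen for illustration; the real types in `willow-crypto` may differ.

```rust
use std::collections::BTreeMap;

/// Placeholder for a 32-byte epoch key (assumed shape, not the real type).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct EpochKey(pub [u8; 32]);

/// Local table keyed by (state-side channel_id, epoch number).
#[derive(Default)]
pub struct EpochKeyStore {
    keys: BTreeMap<(String, u32), EpochKey>,
}

impl EpochKeyStore {
    /// Record the key learned from a rotation's `encrypted_keys` entry.
    pub fn insert(&mut self, channel_id: &str, epoch: u32, key: EpochKey) {
        self.keys.insert((channel_id.to_string(), epoch), key);
    }

    /// Look up the key named by a message's `key_epoch`.
    /// Returns `None` for epochs this client never received.
    pub fn key_for(&self, channel_id: &str, key_epoch: u32) -> Option<&EpochKey> {
        self.keys.get(&(channel_id.to_string(), key_epoch))
    }

    /// Delete every key of `channel_id` with an epoch below `keep_from`.
    /// Returns how many keys were removed.
    pub fn prune_before(&mut self, channel_id: &str, keep_from: u32) -> usize {
        let before = self.keys.len();
        self.keys
            .retain(|(ch, ep), _| ch.as_str() != channel_id || *ep >= keep_from);
        before - self.keys.len()
    }
}
```

A member who joins at epoch 3 only ever calls `insert` for epoch 3 and later, so `key_for("general", 2)` returns `None` and pre-join ciphertext stays unreadable. Pruning is shown because deleting old epoch keys is the client-side step that the threat model's weak forward secrecy depends on.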
+
+## Identity-key vs signing-key separation
+
+NIP-EE's hard rule — "the MLS signing key MUST differ from the Nostr
+identity key" — is sound. Willow's Ed25519 identity currently signs
+events AND is the X25519 peer for channel-key wrapping. This spec
+does **not** split them, but recommends that a follow-up spec
+introduce a per-session signing key chained to the long-term identity
+via a `RegisterSessionKey` event. That lets rotation extend to
+signing material without losing account continuity.
+
+## Relay / worker trust
+
+Relays and storage workers never see epoch keys — only ciphertext and
+the `encrypted_keys` blobs, which are themselves encrypted to member
+public keys. A `SyncProvider` that replays events cannot read channel
+content regardless of epoch.
+
+A compromised storage worker can withhold a rotation event to keep
+clients on a stale epoch. Mitigations: the "no rotation seen since
+membership event" client warning (the state machine itself enforces
+no clock cap — `timestamp_hint_ms` is display-only — so the warning
+must live in the client, driven by wall-clock comparison against the
+applied membership event); and multi-provider state-hash agreement
+(already in use), which catches most withholding because divergent
+peers will see different `state.channel_keys` contents and therefore
+different state hashes.
+
+## Tests
+
+- **Unit:** `epoch_key[N+1]` matches the `HKDF-Extract` spec vector
+  for known inputs; `epoch_key_id` derivation stable.
+- **State:** each entry in the "rotates?" table produces a valid
+  `RotateChannelKeyV2` that applies; non-admissible kinds in
+  `trigger` are rejected.
+- **State:** `RotateChannelKeyV2` with `encrypted_keys` for a
+  not-in-member-set peer is rejected.
+- **State:** `RotateChannelKeyV2` whose `trigger` references an
+  unapplied event is held in the per-channel pending queue and
+  applies once the trigger applies; the same rotation past the
+  timeout is dropped.
+- **State:** `RotateChannelKeyV2` whose `trigger` references a
+  `Propose { KickMember }` that is still in `state.pending_proposals`
+  (i.e. not yet ratified) is rejected. (Note: there is no `Rejected`
+  terminal state for proposals — proposals that fail to cross the
+  threshold simply remain pending until they do, or forever if they
+  never do.)
+- **State:** `epoch` monotonicity — a `RotateChannelKeyV2` with
+  `epoch != prev + 1` is rejected; the genesis rotation is rejected
+  if applied to a channel that already has any
+  `RotateChannelKeyV2`.
+- **State:** the legacy `RotateChannelKey` variant continues to
+  apply (no epoch advance); a `RotateChannelKey` followed by a
+  genesis `RotateChannelKeyV2` is accepted (the V2 establishes
+  epoch 0).
+- **Wire (bincode round-trip):** an `Event` whose `kind` is the
+  legacy `RotateChannelKey` variant serialized before this spec
+  bincode-deserializes byte-identically and `Event::verify()`
+  returns true. An `Event` whose `kind` is `RotateChannelKeyV2`
+  bincode-round-trips and `verify()` returns true.
+- **Integration:** kick scenario — kicked peer's pre-kick ciphertext
+  decrypts, post-kick ciphertext does not, even though they retained
+  `epoch_key[N]`.
+- **Integration:** join-and-catch-up — new member decrypts post-join
+  messages, cannot decrypt pre-join messages (default policy).
+- **Browser:** UI surfaces a warning when a membership change sits
+  unaccompanied by a rotation past the timeout.
+
+## Interaction with other specs
+
+- **Seal + gift-wrap DMs** (separate spec): DMs don't use channel
+  keys, so epoch rotation doesn't apply directly. DMs need their own
+  FS/PCS story.
+- **Negentropy history sync** (separate spec): rotation events are
+  normal DAG entries; no special sync handling.
+- **Relay capability doc**: consider advertising
+  `supports_epoch_rotation: bool` so clients can warn operators of
+  old relays.
+
+## Open questions
+
+1. 
**Past-message access policy.** Default is "new members cannot
+   decrypt pre-join." Some communities will want the opposite for
+   onboarding ("read the archive before joining"). Do we add an
+   opt-in `ShareHistoricalKeys` channel flag, or defer entirely?
+2. **Identity vs signing key separation.** Land the split now, or in
+   a follow-up? The sooner we split, the less churn later — but it
+   touches `willow-identity` and every signing path.
+3. **Derivation input.** `prev_key || trigger.hash` vs
+   `prev_key || server_state_hash_after_trigger`. The former is
+   simpler; the latter commits to more context but may diverge during
+   DAG merge.
+4. **Retention of old epoch keys.** Needed for history replay and
+   late-arriving messages; deleting them is what actually delivers
+   forward secrecy. Who decides the TTL, and is it per-client?
+5. **Rotation storm.** A rapid sequence of kicks produces a rotation
+   per kick. Do we batch — e.g., coalesce rotations within a short
+   window — or accept the overhead for clarity?