diff --git a/.claude/skills/resolving-issues/SKILL.md b/.claude/skills/resolving-issues/SKILL.md
index 99ee970f..962b81c8 100644
--- a/.claude/skills/resolving-issues/SKILL.md
+++ b/.claude/skills/resolving-issues/SKILL.md
@@ -59,6 +59,10 @@ Why this shape:
    - if closed issue is a sub-issue of a parent, run parent close-out check (see ## Parent issue close-out) — close parent too if this was its last open child
    Closing GH issues = metadata work, not code work; allowed under "coordinator never codes." Implementers are dispatched only for *unresolved* issues. This pass typically clears audit lessons (already folded into skills), structural-deps follow-ups (still structural), and audit findings closed by intervening fixes — saves dispatch overhead on no-op work.
+
+   **When the audit-to-fix gap is same-day, expect ~zero already-fixed hits.** The sweep yields the most when the audit ran days or weeks before the fix run (intervening PRs land between audit and fix). When the audit was filed earlier the same day against the current `main`, no fixes can have landed in between by definition — the sweep correctly returns empty. Don't over-invest sweep effort in this case; one quick `git log ..origin/main` check on each pick's path is enough to confirm.
+
+   **Ambiguous-fix-path (audit premise real, fix is a design call) — coordinator-skip without close.** Some audit findings flag a real concern, but the prescribed fix is ambiguous by design: ≥ 2 legitimate approaches, needing a design decision the coordinator can't make unilaterally. Common shape: "audit-glob doesn't catch in-file `mod tests` — move to external test file OR update glob." Both are valid; picking either silently is wrong. **Action:** during step 6 picks, if the audit's literal fix conflicts with HEAD reality (e.g. the code already satisfies the surface premise) AND the underlying concern requires a design call, **skip the issue from this run's dispatch queue without closing it**. Note it in the run-end `## Lessons Learned` so the next session has full context. Don't comment on the issue — the audit description already captures the concern; a "skipped, design call needed" comment adds noise without progress. The next run, with fresh eyes and accumulated context, can decide.
 7. Per remaining issue, sequential, max 10 per run:
    - **Pre-dispatch sync:** before spawning each implementer, in the coordinator's checkout:
      ```bash
diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
index 9ed9b83b..22c1c244 100644
--- a/.github/workflows/deploy.yml
+++ b/.github/workflows/deploy.yml
@@ -72,6 +72,17 @@ jobs:
       - name: Verify deployment
         run: |
-          sleep 3
-          curl -sf -o /dev/null -w "Web: HTTP %{http_code}\n" https://willow.intendednull.com/
+          # Poll the web endpoint until it serves (max ~30s) instead of a
+          # blind 3s sleep — avoids both flaking when the relay is slow to
+          # bind and masking a silent deploy failure.
+          for i in $(seq 1 15); do
+            if curl -sf -o /dev/null -w "Web: HTTP %{http_code}\n" https://willow.intendednull.com/; then
+              break
+            fi
+            if [ "$i" -eq 15 ]; then
+              echo "verify: HTTP probe never succeeded after 15 attempts (~30s)" >&2
+              exit 1
+            fi
+            sleep 2
+          done
           sshpass -p '${{ secrets.DEPLOY_PASSWORD }}' ssh -o StrictHostKeyChecking=no root@willow.intendednull.com 'systemctl is-active willow-relay'
diff --git a/CLAUDE.md b/CLAUDE.md
index a984800b..1c90bd3f 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -342,7 +342,7 @@ All shared state derived from per-author Merkle DAG of signed events.
 `willow-state`:
 - **EventDag** (`crates/state/src/dag.rs`): in-memory store of all known events, indexed by `EventHash`. `EventDag::insert` validates signature, genesis, `prev`/`deps` linkage. No explicit "merge" — DAG converges as events arrive.
 - **ServerState** (`crates/state/src/server.rs`): materialized state derived by walking the DAG.
 - **`materialize::apply_incremental(state, event)`**: ONLY public mutation entry point. Pure. Internally calls `apply_event` + `required_permission` for authority checks.
-- **Permission model**: Owner = root of trust. Fine-grained permissions (SyncProvider, ManageChannels, ManageRoles, KickMembers, SendMessages, CreateInvite, Administrator) granted via `GrantPermission` events.
+- **Permission model**: Owner = root of trust. Fine-grained permissions (SyncProvider, ManageChannels, ManageRoles, SendMessages, CreateInvite) granted via `GrantPermission` events. Admin status is structurally separate — managed exclusively through the `ProposedAction` + vote path, never via `GrantPermission`. Kicks are an admin-only `ProposedAction` (no granular "can kick" permission).
 - **Sync** (`crates/state/src/sync.rs`): `HeadsSummary` = compact per-author DAG state for efficient sync; `PendingBuffer` holds events arriving before their `prev` chain predecessors.

 ### Trust Model
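Aside on the state-model contract the CLAUDE.md hunk leans on: `materialize::apply_incremental` is documented as the only public mutation entry point, with `EventDag::insert` doing structural validation first. A minimal sketch of that ordering, under assumed signatures (the `ingest` helper is invented for illustration; only the names `EventDag::insert`, `apply_incremental`, `ServerState`, `Event`, and `InsertError` come from the docs and imports in this diff):

```rust
// Sketch only: `ingest` is hypothetical, and the exact signatures of
// `EventDag::insert` / `apply_incremental` are assumptions, not the
// crate's API. The point is the ordering contract from CLAUDE.md.
use willow_state::{apply_incremental, Event, EventDag, InsertError, ServerState};

fn ingest(dag: &mut EventDag, state: &mut ServerState, event: Event) -> Result<(), InsertError> {
    // Structural checks first: signature, genesis rules, and `prev`/`deps`
    // linkage all live in `EventDag::insert`.
    dag.insert(event.clone())?;
    // Then materialize through the single public entry point; authority
    // checks (`apply_event` + `required_permission`) happen inside it, so
    // callers never mutate `ServerState` directly.
    apply_incremental(state, &event);
    Ok(())
}
```

Keeping materialization behind one entry point is what lets the DAG "converge as events arrive" without an explicit merge step.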
diff --git a/Cargo.lock b/Cargo.lock
index ccf03abd..24d93c8b 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -6151,7 +6151,6 @@ dependencies = [
  "serde",
  "tempfile",
  "thiserror 2.0.18",
- "tokio",
  "willow-transport",
  "zeroize",
 ]
@@ -6224,6 +6223,7 @@ dependencies = [
  "tracing",
  "tracing-subscriber",
  "uuid",
+ "willow-common",
  "willow-identity",
  "willow-network",
  "willow-state",
diff --git a/crates/client/src/worker_cache.rs b/crates/client/src/worker_cache.rs
index c633d781..1515aadf 100644
--- a/crates/client/src/worker_cache.rs
+++ b/crates/client/src/worker_cache.rs
@@ -58,7 +58,15 @@ impl WorkerCache {
     /// Evict workers that haven't sent a heartbeat within the TTL.
     pub fn evict_stale(&mut self) {
-        let cutoff = Instant::now() - self.ttl;
+        self.evict_stale_at(Instant::now());
+    }
+
+    /// Evict workers stale relative to an injected `now`.
+    ///
+    /// Used by tests to remove timing dependencies; production callers should
+    /// use [`Self::evict_stale`].
+    pub(crate) fn evict_stale_at(&mut self, now: Instant) {
+        let cutoff = now - self.ttl;
         self.workers.retain(|_, info| info.last_seen > cutoff);
     }
@@ -171,10 +179,11 @@ mod tests {
     #[test]
     fn evict_stale_removes_expired() {
-        let mut cache = WorkerCache::new(Duration::from_millis(1));
+        let mut cache = WorkerCache::new(Duration::from_secs(30));
         cache.update(&make_replay_announcement(gen_id(), vec!["srv-1"]));
-        std::thread::sleep(Duration::from_millis(10));
-        cache.evict_stale();
+        // Simulate "now" being well past the TTL boundary instead of sleeping.
+        let future = Instant::now() + Duration::from_secs(60);
+        cache.evict_stale_at(future);
         assert!(cache.is_empty());
     }
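The `evict_stale` / `evict_stale_at` split above is the injectable-clock test idiom in its smallest form: the production path reads the real clock once, then delegates to a seam that accepts `now`. A self-contained sketch of the same pattern (the `Cache` type here is illustrative, not willow code):

```rust
use std::time::{Duration, Instant};

// Illustrative only: a minimal stand-in isolating the injected-`now` idiom
// used by `evict_stale` / `evict_stale_at` above.
struct Cache {
    ttl: Duration,
    last_seen: Instant,
}

impl Cache {
    /// Production path: read the real clock, then delegate.
    fn is_stale(&self) -> bool {
        self.is_stale_at(Instant::now())
    }

    /// Test seam: accept `now` so tests can jump time instead of sleeping.
    fn is_stale_at(&self, now: Instant) -> bool {
        // `saturating_duration_since` avoids a panic if `now` somehow
        // precedes `last_seen`.
        now.saturating_duration_since(self.last_seen) > self.ttl
    }
}

fn main() {
    let cache = Cache { ttl: Duration::from_secs(30), last_seen: Instant::now() };
    // Deterministic and instant: no `std::thread::sleep`, no flaky timing.
    let future = Instant::now() + Duration::from_secs(60);
    assert!(cache.is_stale_at(future));
    assert!(!cache.is_stale());
}
```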
diff --git a/crates/common/src/lib.rs b/crates/common/src/lib.rs
index e9ab8c3a..eab85f82 100644
--- a/crates/common/src/lib.rs
+++ b/crates/common/src/lib.rs
@@ -24,3 +24,21 @@ pub use worker_types::*;
 /// `willow-common` (already a dep of both crates) guarantees they stay
 /// aligned at compile time.
 pub const SYNC_BATCH_LIMIT: usize = 10_000;
+
+/// Maximum number of `(author, head)` entries accepted from a peer-supplied
+/// [`willow_state::HeadsSummary`] in a single sync / history call.
+///
+/// Single source of truth shared by:
+/// - `willow-storage` — `sync_since` / `history` build SQL by concatenating
+///   one `(author = ? AND seq > ?)` fragment per entry; an oversize
+///   summary would either exceed rusqlite's bind-parameter limit
+///   (default 32766) or waste CPU compiling a giant prepared statement.
+/// - `willow-replay` — `handle_request(Sync)` iterates per-author into a
+///   `BTreeMap` then walks the in-memory DAG; an oversize summary forces
+///   O(N) work and is the same DoS shape as the storage path.
+///
+/// 256 is well above any plausible honest server's distinct-author count
+/// while keeping bind-parameter and BTreeMap-construction costs bounded.
+/// Both production sites MUST agree on this value, so the canonical
+/// definition lives here alongside [`SYNC_BATCH_LIMIT`].
+pub const MAX_AUTHORS_PER_SYNC: usize = 256;
diff --git a/crates/identity/Cargo.toml b/crates/identity/Cargo.toml
index 6942b9ce..0ecae1fd 100644
--- a/crates/identity/Cargo.toml
+++ b/crates/identity/Cargo.toml
@@ -20,5 +20,4 @@ getrandom_02 = { package = "getrandom", version = "0.2", features = ["js"] }
 getrandom_03 = { package = "getrandom", version = "0.3", features = ["wasm_js"] }

 [dev-dependencies]
-tokio = { workspace = true }
 tempfile = "3"
diff --git a/crates/replay/Cargo.toml b/crates/replay/Cargo.toml
index 0e648e25..2f489bd8 100644
--- a/crates/replay/Cargo.toml
+++ b/crates/replay/Cargo.toml
@@ -12,6 +12,7 @@ willow-worker = { path = "../worker" }
 willow-state = { path = "../state" }
 willow-identity = { path = "../identity" }
 willow-network = { path = "../network" }
+willow-common = { path = "../common" }
 clap = { version = "4", features = ["derive"] }
 dirs = "6"
 anyhow = { workspace = true }
diff --git a/crates/replay/src/role.rs b/crates/replay/src/role.rs
index d51b56a7..1fe2da6d 100644
--- a/crates/replay/src/role.rs
+++ b/crates/replay/src/role.rs
@@ -7,6 +7,7 @@ use std::collections::{BTreeMap, HashMap};
 use tracing::warn;
+use willow_common::MAX_AUTHORS_PER_SYNC;
 use willow_state::{
     apply_incremental, Event, EventDag, EventHash, EventKind, HeadsSummary, InsertError,
     PendingBuffer, ServerState, Snapshot, DEFAULT_PENDING_MAX_AGE_MS, DEFAULT_PENDING_MAX_ENTRIES,
@@ -273,6 +274,21 @@ impl WorkerRole for ReplayRole {
     fn handle_request(&mut self, req: WorkerRequest) -> WorkerResponse {
         match req {
             WorkerRequest::Sync { server_id, heads } => {
+                // Reject peer-supplied summaries that would force O(N)
+                // BTreeMap construction and DAG walks (see
+                // `MAX_AUTHORS_PER_SYNC`). Mirrors the storage cap added in
+                // PR #507 / b075140; gated before any allocation so a hostile
+                // request fails fast.
+                if heads.heads.len() > MAX_AUTHORS_PER_SYNC {
+                    return WorkerResponse::Denied {
+                        reason: format!(
+                            "too many heads in sync request: {} > {}",
+                            heads.heads.len(),
+                            MAX_AUTHORS_PER_SYNC
+                        ),
+                    };
+                }
+
                 let data = match self.servers.get(&server_id) {
                     Some(d) => d,
                     None => {
@@ -1458,6 +1474,80 @@ mod tests {
         }
     }
+    // ── Issue #514: oversize HeadsSummary rejection ──────────────────────
+    //
+    // Mirrors the storage cap added by PR #507 / b075140. Without a guard
+    // here, a malicious peer could send a multi-thousand-entry HeadsSummary
+    // and force replay to do per-author BTreeMap inserts and DAG walks for
+    // every entry — same DoS shape as the storage path the sibling PR fixed.
+
+    /// Build a `HeadsSummary` with `n` distinct random authors. Mirrors the
+    /// helper in `crates/storage/src/store.rs` (sibling cap test infra).
+    fn heads_summary_with_authors(n: usize) -> HeadsSummary {
+        use willow_state::AuthorHead;
+        let mut heads = BTreeMap::new();
+        for _ in 0..n {
+            let id = Identity::generate();
+            heads.insert(
+                id.endpoint_id(),
+                AuthorHead {
+                    seq: 1,
+                    hash: EventHash::ZERO,
+                },
+            );
+        }
+        HeadsSummary { heads }
+    }
+
+    /// A peer-supplied `HeadsSummary` with more than `MAX_AUTHORS_PER_SYNC`
+    /// entries must be rejected by `handle_request(Sync)` before any
+    /// per-author BTreeMap construction or DAG walk occurs.
+    #[test]
+    fn sync_request_rejects_oversize_heads() {
+        use willow_common::MAX_AUTHORS_PER_SYNC;
+        let mut role = ReplayRole::new(ReplayConfig::default());
+        let (_, _) = setup_server(&mut role, "srv-1");
+
+        let oversize = heads_summary_with_authors(MAX_AUTHORS_PER_SYNC + 1);
+        let resp = role.handle_request(WorkerRequest::Sync {
+            server_id: "srv-1".to_string(),
+            heads: oversize,
+        });
+
+        match resp {
+            WorkerResponse::Denied { reason } => {
+                assert!(
+                    reason.contains("too many heads"),
+                    "denial reason should mention the cap; got: {reason}"
+                );
+            }
+            other => panic!("expected Denied for oversize heads, got: {other:?}"),
+        }
+    }
+
+    /// `handle_request(Sync)` must accept exactly `MAX_AUTHORS_PER_SYNC`
+    /// entries — the cap is inclusive on the legal side.
+    #[test]
+    fn sync_request_accepts_exact_cap_heads() {
+        use willow_common::MAX_AUTHORS_PER_SYNC;
+        let mut role = ReplayRole::new(ReplayConfig::default());
+        let (_, _) = setup_server(&mut role, "srv-1");
+
+        let at_cap = heads_summary_with_authors(MAX_AUTHORS_PER_SYNC);
+        let resp = role.handle_request(WorkerRequest::Sync {
+            server_id: "srv-1".to_string(),
+            heads: at_cap,
+        });
+
+        // The peer's heads mention authors we don't know, so events_since
+        // returns the genesis event; our store has 1 author the peer doesn't,
+        // so they_are_behind is true. Either branch is acceptable — the only
+        // forbidden outcome is Denied for at-cap input.
+        if let WorkerResponse::Denied { reason } = resp {
+            panic!("at-cap heads must not be denied; got reason: {reason}");
+        }
+    }
+
     /// When the configured `pending_max_entries` is exceeded, the oldest
     /// entries are evicted and `pending_count()` reflects the cap.
     #[test]
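The Denied guard above mirrors a storage-side cap (PR #507 / b075140) that is not itself part of this diff. A hypothetical sketch of that sibling check, where `SyncError`, `TooManyAuthors`, and `check_heads_cap` are invented names for illustration; only `MAX_AUTHORS_PER_SYNC`, the `heads.heads` field, and the reject-before-building-SQL contract come from the code and doc comments in this diff:

```rust
use willow_common::MAX_AUTHORS_PER_SYNC;
use willow_state::HeadsSummary;

// Hypothetical error type; the real store layer presumably has its own.
#[derive(Debug)]
pub enum SyncError {
    TooManyAuthors { got: usize, cap: usize },
}

/// Sketch of the store-layer guard: fail before any SQL is compiled, since
/// `sync_since` / `history` emit one `(author = ? AND seq > ?)` fragment
/// (two bind parameters) per entry.
fn check_heads_cap(heads: &HeadsSummary) -> Result<(), SyncError> {
    if heads.heads.len() > MAX_AUTHORS_PER_SYNC {
        return Err(SyncError::TooManyAuthors {
            got: heads.heads.len(),
            cap: MAX_AUTHORS_PER_SYNC,
        });
    }
    Ok(())
}
```

Per the doc comment being removed in the next hunk, the real store layer returns an error that the caller in `role.rs` maps to `WorkerResponse::Denied`.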
diff --git a/crates/storage/src/store.rs b/crates/storage/src/store.rs
index 03028aae..69026a12 100644
--- a/crates/storage/src/store.rs
+++ b/crates/storage/src/store.rs
@@ -20,25 +20,9 @@
 //! never reordered or rewritten so existing databases stay consistent.

 use rusqlite::{params, Connection};
-use willow_common::SYNC_BATCH_LIMIT;
+use willow_common::{MAX_AUTHORS_PER_SYNC, SYNC_BATCH_LIMIT};
 use willow_state::{Event, EventKind, HeadsSummary};

-/// Maximum number of `(author, head)` entries accepted from a peer-supplied
-/// [`HeadsSummary`] in a single `sync_since` / `history` call.
-///
-/// `sync_since` and `history` build their SQL clauses by concatenating one
-/// `(author = ? AND seq > ?)` fragment per entry. A peer-supplied summary
-/// with thousands of entries (still well within the transport envelope cap)
-/// would either exceed rusqlite's bind-parameter limit (default 32766) or
-/// waste CPU compiling a giant prepared statement.
-///
-/// 256 is well above any plausible honest server's distinct-author count
-/// while keeping the bind-parameter count (~512 per call) and the SQL
-/// expression-tree depth safely below SQLite's defaults
-/// (32766 binds, depth 1000). Requests over the cap are rejected at the store layer;
-/// the caller in `role.rs` maps the error to a `WorkerResponse::Denied`.
-pub const MAX_AUTHORS_PER_SYNC: usize = 256;
-
 /// Ordered list of schema migrations. Each entry is run inside its own
 /// transaction the first time the database is opened. Once a migration is
 /// shipped, never edit or reorder it — only append new entries.
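The trailing context above restates the migration discipline: entries run once, each in its own transaction, and shipped entries are never edited or reordered. A sketch of that shape, assuming a hypothetical `MIGRATIONS` list and an `applied` counter (the real `store.rs` schema and version bookkeeping are not shown in this diff):

```rust
use rusqlite::Connection;

// Illustrative only: not the real `store.rs` migration list. The discipline
// from the doc comment: shipped entries are frozen; new schema work is a new
// trailing entry, never an edit to an old one.
const MIGRATIONS: &[&str] = &[
    // v1: shipped; never edit or reorder.
    "CREATE TABLE events (author BLOB NOT NULL, seq INTEGER NOT NULL, body BLOB NOT NULL)",
    // v2: shipped; additive index, appended after v1.
    "CREATE INDEX idx_events_author_seq ON events (author, seq)",
];

/// Run every not-yet-applied migration, each in its own transaction, so a
/// mid-migration failure leaves the database at a well-defined older version.
fn migrate(conn: &mut Connection, applied: usize) -> rusqlite::Result<()> {
    for sql in &MIGRATIONS[applied..] {
        let tx = conn.transaction()?;
        tx.execute_batch(sql)?;
        tx.commit()?;
    }
    Ok(())
}
```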
diff --git a/crates/web/src/app.rs b/crates/web/src/app.rs
index b9444aab..da2926b8 100644
--- a/crates/web/src/app.rs
+++ b/crates/web/src/app.rs
@@ -922,7 +922,10 @@ pub fn App() -> impl IntoView {
                 // Request mic permission SYNCHRONOUSLY in the click handler
                 // to preserve the user gesture chain (required on mobile).
-                let window = web_sys::window().unwrap();
+                let Some(window) = web_sys::window() else {
+                    tracing::error!("No browser window context for getUserMedia");
+                    return;
+                };
                 let navigator = window.navigator();
                 let Ok(media_devices) = navigator.media_devices() else {
                     tracing::error!("No media devices available");
diff --git a/crates/web/src/components/lifecycle.rs b/crates/web/src/components/lifecycle.rs
index aaea2921..64e6e470 100644
--- a/crates/web/src/components/lifecycle.rs
+++ b/crates/web/src/components/lifecycle.rs
@@ -52,15 +52,68 @@ pub const fn advance(state: LifecycleState) -> LifecycleState {
     }
 }

+/// Maximum transition duration (in seconds) that we treat as effectively
+/// zero — i.e. "skip the wait, snap synchronously". 1ms is the threshold
+/// because the global reduced-motion override in `style.css` writes
+/// `transition-duration: 0.01ms !important` for every element, and we must
+/// honour that as the authoritative reduced-motion contract. Any user CSS
+/// using ≤1ms transitions is also indistinguishable from "no transition"
+/// for the purpose of `transitionend` firing reliably across engines.
+const ZERO_DURATION_EPSILON_SECONDS: f64 = 0.001;
+
+/// Parse a single CSS `