fix(memory-conversations): release STORE_LOCK during cold index rebuild (#2849) #2881
Before this fix, `search_cross_thread_messages` held `CONVERSATION_STORE_LOCK` for the entire duration of the cold inverted index rebuild (`populate_index_unlocked`), which reads every per-thread JSONL file. On large workspaces this blocked every concurrent `append_message` / `get_messages` call for multiple seconds, which is the exact contention reported in tinyhumansai#2849.
Introduce `prime_index_if_cold()` which:
1. Takes `CONVERSATION_INDEX_CACHE` lock to fast-path if already warm.
2. Takes `CONVERSATION_STORE_LOCK` briefly to snapshot the live thread list via `list_threads_unlocked()`, then immediately releases it.
3. Reads all per-thread JSONL files with *no lock held* (safe because files are append-only; the worst case is one concurrently-written message is absent from this initial build, but `append_message` always updates a warm index so it becomes queryable on the next write).
4. Inserts the built index with `entry().or_insert()` so a concurrent prime that finished first wins.
`search_cross_thread_messages` now calls `prime_index_if_cold()` before acquiring `CONVERSATION_STORE_LOCK`, so the long JSONL read happens outside both locks. `with_index` retains its existing cold-build fallback as a safety net for any future callers.
Adds a regression test (`search_cold_rebuild_does_not_block_concurrent_append`) that races a cold search against a concurrent append and asserts the append completes within 5 s, a timeout that would be violated under the old serialised code on large workspaces.
Fixes tinyhumansai#2849.
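The four steps of `prime_index_if_cold()` described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Store` struct, its fields, `read_thread`, and the token-to-thread-position index are all hypothetical stand-ins, with one `Mutex` playing the role of `CONVERSATION_STORE_LOCK` and another playing `CONVERSATION_INDEX_CACHE`.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the inverted index: token -> thread positions.
type InvertedIndex = HashMap<String, Vec<usize>>;

/// Illustrative store; field names are assumptions, not the PR's types.
pub struct Store {
    /// Plays the role of CONVERSATION_STORE_LOCK (guards the thread list).
    threads: Mutex<Vec<String>>,
    /// Plays the role of CONVERSATION_INDEX_CACHE (keyed by workspace).
    cache: Mutex<HashMap<String, InvertedIndex>>,
}

impl Store {
    pub fn new(threads: Vec<String>) -> Self {
        Store { threads: Mutex::new(threads), cache: Mutex::new(HashMap::new()) }
    }

    /// Stand-in for reading one per-thread JSONL file.
    fn read_thread(thread_id: &str) -> Vec<String> {
        vec![format!("hello from {thread_id}")]
    }

    pub fn prime_index_if_cold(&self, workspace: &str) {
        // 1. Fast path: nothing to do if the index is already warm.
        if self.cache.lock().unwrap().contains_key(workspace) {
            return;
        }
        // 2. Snapshot the thread list; the guard drops at the end of this statement.
        let snapshot: Vec<String> = self.threads.lock().unwrap().clone();
        // 3. Build the index with no lock held.
        let mut idx = InvertedIndex::new();
        for (thread_no, thread_id) in snapshot.iter().enumerate() {
            for msg in Self::read_thread(thread_id) {
                for token in msg.split_whitespace() {
                    idx.entry(token.to_lowercase()).or_default().push(thread_no);
                }
            }
        }
        // 4. Insert only if absent, so a concurrent prime that finished first wins.
        self.cache.lock().unwrap().entry(workspace.to_string()).or_insert(idx);
    }

    pub fn search(&self, workspace: &str, token: &str) -> Vec<usize> {
        self.prime_index_if_cold(workspace);
        let cache = self.cache.lock().unwrap();
        cache.get(workspace).and_then(|i| i.get(token)).cloned().unwrap_or_default()
    }
}
```

In this sketch the two locks are never held at the same time, so the slow step 3 cannot stall a writer that only needs the thread list.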
📝 Walkthrough
Primes per-workspace inverted indexes outside the store lock.
Changes
Lock surgery for cross-thread index cold rebuild
Sequence Diagram(s)
sequenceDiagram
participant Client
participant ConversationStore
participant CONVERSATION_STORE_LOCK
participant FileSystem
participant CONVERSATION_INDEX_CACHE
Client->>ConversationStore: search_cross_thread_messages()
ConversationStore->>CONVERSATION_INDEX_CACHE: fast-path cache check
alt cache miss
ConversationStore->>CONVERSATION_STORE_LOCK: acquire (snapshot thread IDs)
ConversationStore-->>CONVERSATION_STORE_LOCK: release
ConversationStore->>FileSystem: read per-thread JSONL (rebuild index)
ConversationStore->>CONVERSATION_INDEX_CACHE: insert index if absent
end
CONVERSATION_INDEX_CACHE-->>ConversationStore: cache ready
ConversationStore-->>Client: return search results
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/openhuman/memory_conversations/store_tests.rs (1)
908-961: ⚡ Quick win: This regression test likely passes on the unfixed code too.
Rebuilding 200 small messages takes milliseconds, so even the old behavior (rebuild fully under CONVERSATION_STORE_LOCK) would let the concurrent append finish far inside the 5 s budget; the assertion can't distinguish fixed from broken. The comment at Lines 955-957 anticipates ">1 s on large workspaces," but the seed is nowhere near that. A wall-clock timeout is also inherently flaky under CI load.
Consider seeding a substantially larger corpus (and/or longer per-message content) so a fully-serialized rebuild would credibly exceed the timeout, or assert interleaving deterministically rather than on elapsed time. Note this is compounded by the backfill-under-lock issue flagged in store.rs: with that present, the "fixed" path still holds the store lock during the snapshot backfill, so this test does not exercise the lock-free walk it intends to guard.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory_conversations/store_tests.rs` around lines 908 - 961, The test currently seeds only 200 tiny messages so a serialized rebuild under CONVERSATION_STORE_LOCK finishes too fast and the 5s timeout can't distinguish fixed vs broken behavior; either (A) make the rebuild work credibly long by seeding a much larger corpus and/or larger per-message payloads (e.g., increase the loop in the test that calls store.append_message to tens of thousands or repeat a large string in ConversationMessage.content) so a full serialized rebuild would exceed the 5s window, or (B) make the test deterministic by adding a synchronization hook around the rebuild path (expose a rebuild_started barrier/event in the code under test or instrument the backfill path in store.rs) and use that barrier to force the concurrent append_message call to occur while search_cross_thread_messages is mid-rebuild; reference the test’s use of append_message and search_cross_thread_messages and the lock named CONVERSATION_STORE_LOCK/backfill in store.rs to locate where to change.
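Option (B) above, asserting interleaving deterministically instead of on elapsed time, could look roughly like the sketch below. Everything here is hypothetical: the `mid_rebuild` hook, `try_append`, and both rebuild functions are illustrative stand-ins rather than anything in store.rs. The idea is that a hook fired in the middle of the rebuild tries to take the store lock without blocking, which succeeds only if the rebuild is not holding it.

```rust
use std::sync::Mutex;

/// Hypothetical store with a test hook; not the PR's actual API.
pub struct Store {
    /// Stands in for CONVERSATION_STORE_LOCK guarding the message log.
    store_lock: Mutex<Vec<String>>,
}

impl Store {
    pub fn new() -> Self {
        Store { store_lock: Mutex::new(Vec::new()) }
    }

    /// Append that reports whether the store lock was free (never blocks).
    pub fn try_append(&self, msg: &str) -> bool {
        match self.store_lock.try_lock() {
            Ok(mut log) => {
                log.push(msg.to_string());
                true
            }
            Err(_) => false,
        }
    }

    /// Fixed shape: snapshot under the lock, release, then do the slow walk.
    pub fn rebuild_lock_free(&self, mid_rebuild: impl FnOnce()) -> usize {
        let snapshot: Vec<String> = self.store_lock.lock().unwrap().clone();
        mid_rebuild(); // fires while no lock is held
        snapshot.len()
    }

    /// Old shape: the lock is held for the whole walk.
    pub fn rebuild_under_lock(&self, mid_rebuild: impl FnOnce()) -> usize {
        let guard = self.store_lock.lock().unwrap();
        mid_rebuild(); // fires while the store lock is still held
        guard.len()
    }
}
```

A test built this way fails on the serialized shape regardless of corpus size or CI load, because it checks lock state at a known point instead of measuring wall-clock time.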
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory_conversations/store.rs`:
- Around line 47-53: Update the INVARIANT comment to correctly state that
prime_index_if_cold briefly holds both locks: it first acquires
CONVERSATION_STORE_LOCK and while that guard is held it later acquires
CONVERSATION_INDEX_CACHE (in that order), so although both may be held briefly
the acquisition order preserves the deadlock invariant; reference
prime_index_if_cold, CONVERSATION_STORE_LOCK and CONVERSATION_INDEX_CACHE in the
text and remove the inaccurate "never holds both at the same time" claim.
- Around line 231-262: The comment above the lock-free index build is incorrect:
a message appended during the build window can be permanently omitted from the
in-memory index until a restart due to the race between the prime block (which
reads files and then cache.entry(key).or_insert(idx)) and append_message (which
may write the file but skip updating the cache if it’s still cold); update the
comment in src/openhuman/memory_conversations/store.rs (the prime block that
constructs InvertedIndex and inserts via
CONVERSATION_INDEX_CACHE.lock().entry(key).or_insert(idx)) to explicitly state
that a raced message may be missed in-process until a restart OR alter
append_message to ensure it always merges new messages into an existing index
when present (i.e., after writing the JSONL, check CONVERSATION_INDEX_CACHE and
insert/merge the single message into the index if the cache is now present) so
no message can be permanently omitted.
- Around line 219-230: prime_index_if_cold currently holds
CONVERSATION_STORE_LOCK while calling list_threads_unlocked(), which triggers
per-thread backfill work (measure_messages_unlocked and appending
ThreadLogEntry::Stats) and causes long stalls; change it to acquire
CONVERSATION_STORE_LOCK only to snapshot the thread index via
thread_index_unlocked() and to re-check CONVERSATION_INDEX_CACHE (keeping the
existing cache re-check), then release the lock and perform the expensive
per-thread work (calling list_threads_unlocked() or measure_messages_unlocked
and appending Stats) outside the lock so CONVERSATION_STORE_LOCK is not held
during backfill.
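The race described in the second inline comment (lines 231-262) can be replayed step by step in a single thread. The sketch below is a hypothetical miniature, assuming an `append_message` that updates the index only when it is already warm; `Store`, `read_for_build`, `insert_if_absent`, and `replay_race` are illustrative names, and the "index" is just a list of indexed messages.

```rust
use std::sync::Mutex;

/// Hypothetical miniature of the store; names are illustrative only.
pub struct Store {
    /// Stand-in for the on-disk per-thread JSONL contents.
    files: Mutex<Vec<String>>,
    /// Stand-in for the index cache: None while cold, Some(index) once warm.
    index: Mutex<Option<Vec<String>>>,
}

impl Store {
    pub fn new() -> Self {
        Store { files: Mutex::new(Vec::new()), index: Mutex::new(None) }
    }

    /// Writes the file, then updates the index only if it is already warm.
    pub fn append_message(&self, msg: &str) {
        self.files.lock().unwrap().push(msg.to_string());
        let mut index = self.index.lock().unwrap();
        if let Some(idx) = index.as_mut() {
            idx.push(msg.to_string());
        }
    }

    /// First half of the prime: read the files for a lock-free build.
    pub fn read_for_build(&self) -> Vec<String> {
        self.files.lock().unwrap().clone()
    }

    /// Second half of the prime: insert the built index only if still cold.
    pub fn insert_if_absent(&self, built: Vec<String>) {
        let mut index = self.index.lock().unwrap();
        if index.is_none() {
            *index = Some(built);
        }
    }

    pub fn on_disk(&self, msg: &str) -> bool {
        self.files.lock().unwrap().iter().any(|m| m == msg)
    }

    pub fn indexed(&self, msg: &str) -> bool {
        let index = self.index.lock().unwrap();
        index.as_ref().map_or(false, |idx| idx.iter().any(|m| m == msg))
    }
}

/// Replays the interleaving; returns (on disk?, in index?) for the raced message.
pub fn replay_race() -> (bool, bool) {
    let store = Store::new();
    store.append_message("first"); // index cold: file write only
    let built = store.read_for_build(); // primer has read the files
    store.append_message("raced"); // written to disk, index still cold
    store.insert_if_absent(built); // primer installs an index without "raced"
    (store.on_disk("raced"), store.indexed("raced"))
}
```

Under these assumptions the raced message ends up on disk but not in the in-memory index, and later appends do not bring it back because they add only their own message.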
graycyrus
left a comment
@staimoorulhassan — CI is still pending on this, so I'll hold off on a formal approval. I did spot one thing while reading through, and it's blocking.
The major CodeRabbit finding about list_threads_unlocked (line 230) is correct and needs to be fixed before this lands. The problem: prime_index_if_cold calls self.list_threads_unlocked() while holding CONVERSATION_STORE_LOCK. On workspaces where any thread has a None message count (common pre-Stats workspaces), list_threads_unlocked triggers measure_messages_unlocked per-thread — which reads every per-thread JSONL and appends ThreadLogEntry::Stats to threads.jsonl. All of that runs under the store lock, reintroducing exactly the multi-second stall this PR set out to fix. The fix is to snapshot only thread IDs via thread_index_unlocked() (JSONL-header-only, no per-thread reads) under the lock, then call list_threads_unlocked (or equivalent backfill logic) outside the lock before building the inverted index.
The regression test won't catch this either — it uses a fresh TempDir with threads created via ensure_thread, so message_count is always populated from the start. The backfill path never triggers in that setup. A test that exercises an existing workspace with None message counts (simulating a pre-Stats workspace) would catch this.
The rest of the PR is solid. The prime_index_if_cold design is correct for the normal path, the entry().or_insert() idempotency is right, the double-check pattern is correct, and the test structure is good. The two minor comment fixes CodeRabbit flagged (lines 53 and 262) are accurate — worth addressing while you're in there.
Fix the list_threads_unlocked call + add a backfill-path test case and this should be good to go.
…f_cold
Fixes the backfill regression flagged in review. prime_index_if_cold was calling list_threads_unlocked() under CONVERSATION_STORE_LOCK. On pre-Stats workspaces (threads.jsonl with only Upsert entries and no MessageAppended/Stats history, which is common for data written before the Stats log), list_threads_unlocked triggers measure_messages_unlocked per-thread (reads every per-thread JSONL file) and appends a Stats entry to threads.jsonl, all while holding the outer lock. This reintroduced exactly the multi-second stall the PR was designed to eliminate.
Fix: replace list_threads_unlocked with thread_index_unlocked inside the locked snapshot. thread_index_unlocked reads only threads.jsonl (O(threads), no per-thread I/O), so CONVERSATION_STORE_LOCK is held only for the fast header scan. The per-thread JSONL reads happen outside the lock as intended.
Also updates the lock-ordering comment in the CONVERSATION_INDEX_CACHE static to document the list_threads_unlocked constraint explicitly, so future readers know not to revert this choice.
Adds two new tests:
- prime_index_cold_build_works_on_legacy_workspace_without_stats: seeds a workspace with Upsert-only threads.jsonl (no Stats/MessageAppended), verifies the cold build indexes all messages.
- legacy_workspace_cold_rebuild_does_not_block_concurrent_append: races a cold search on a pre-Stats workspace against a concurrent append, asserts the append completes within 5 s (would have blocked under the old list_threads_unlocked path).
The original concurrency test (search_cold_rebuild_does_not_block_concurrent_append) used ensure_thread + append_message, which always produce MessageAppended entries, so the legacy backfill path was never exercised and the test could not catch this regression.
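The cost difference between the two snapshot calls can be illustrated with a small model. This is a sketch under stated assumptions, not the store's real code: `Workspace`, `ThreadMeta`, `thread_index`, `list_threads`, and the read counter are hypothetical, and they only mimic the behaviour the commit message attributes to thread_index_unlocked and list_threads_unlocked.

```rust
use std::cell::Cell;

/// Hypothetical thread record; `None` models a pre-Stats (legacy) thread.
pub struct ThreadMeta {
    pub id: String,
    pub message_count: Option<usize>,
}

/// Illustrative workspace that counts simulated per-thread file reads.
pub struct Workspace {
    threads: Vec<ThreadMeta>,
    per_thread_reads: Cell<usize>,
}

impl Workspace {
    /// Builds a legacy workspace where no thread has a message count yet.
    pub fn legacy(ids: &[&str]) -> Self {
        Workspace {
            threads: ids
                .iter()
                .map(|id| ThreadMeta { id: id.to_string(), message_count: None })
                .collect(),
            per_thread_reads: Cell::new(0),
        }
    }

    pub fn per_thread_reads(&self) -> usize {
        self.per_thread_reads.get()
    }

    /// Cheap path: thread ids only, no per-thread I/O.
    pub fn thread_index(&self) -> Vec<String> {
        self.threads.iter().map(|t| t.id.clone()).collect()
    }

    /// Stand-in for measuring one thread by reading its JSONL file.
    fn measure_messages(&self, _id: &str) -> usize {
        self.per_thread_reads.set(self.per_thread_reads.get() + 1);
        0
    }

    /// Expensive path: backfills every missing count with a per-thread read.
    pub fn list_threads(&mut self) -> Vec<(String, usize)> {
        let mut out = Vec::new();
        for i in 0..self.threads.len() {
            let existing = self.threads[i].message_count;
            let count = match existing {
                Some(n) => n,
                None => {
                    let n = self.measure_messages(&self.threads[i].id);
                    self.threads[i].message_count = Some(n);
                    n
                }
            };
            out.push((self.threads[i].id.clone(), count));
        }
        out
    }
}
```

In this model the first `list_threads` call on a legacy workspace performs one read per thread, which is the work that should not happen under the store lock, while `thread_index` performs none.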
🧹 Nitpick comments (1)
src/openhuman/memory_conversations/store_tests.rs (1)
975-976: ⚖️ Poor tradeoff: File now substantially exceeds the ~500-line guideline.
This test module is ~1,166 lines after the addition. Consider splitting the search/concurrency/legacy regression tests into a dedicated submodule (e.g., a store_tests/ directory with focused files) to keep each file within the size budget.
As per coding guidelines: "File size: prefer ≤ ~500 lines per source file; split modules when growing to maintain readability and single responsibility".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory_conversations/store_tests.rs` around lines 975 - 976, The test module src/openhuman/memory_conversations/store_tests.rs has grown far beyond the ~500-line guideline; split it into focused submodules (e.g., create a store_tests/ directory with files like search_tests.rs, concurrency_tests.rs, legacy_regression_tests.rs) and move the corresponding test blocks (the "search" tests, the "concurrency" tests, and the "legacy workspace (pre-Stats backfill path)" regression tests) into those files; then update the parent module to declare pub mod search_tests; pub mod concurrency_tests; pub mod legacy_regression_tests; (or use mod ...; and re-export helpers) and ensure shared fixtures/helpers from store_tests.rs are either moved to a common helpers.rs in the new directory or made pub so the split files can import them.
graycyrus
left a comment
@staimoorulhassan hey! the code looks good to me — the third commit switching to thread_index_unlocked in prime_index_if_cold properly addresses the backfill issue, and the two new legacy-workspace tests cover exactly the scenario I was worried about. CI still has some pending checks, so once those are green i'll come back and approve this. let me know if you need any help!
staimoorulhassan
left a comment
resolved
…ith_index
The with_index function had stale code from the pre-prime_index_if_cold
implementation left in after merging main. Lines that belonged to the
old search_cross_thread_messages fast-path+cold-rebuild block were
left inside with_index, leaving the if-block unclosed and the function
missing its return statement — causing a compile error.
Remove the orphaned fast-path block and restore the correct closing:
        }
        let idx = cache.get_mut(&key).expect("inserted above if absent");
        Ok(f(idx))
    }
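For context, the closing lines quoted above would fit a function shaped roughly like the sketch below. Only the last three statements come from the commit message; the `IndexCache` type, the `build_index` helper, the `String` error type, and the signature of `with_index` are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the inverted index type.
type InvertedIndex = HashMap<String, Vec<usize>>;

/// Illustrative cache wrapper; not the PR's actual struct.
pub struct IndexCache {
    cache: Mutex<HashMap<String, InvertedIndex>>,
}

impl IndexCache {
    pub fn new() -> Self {
        IndexCache { cache: Mutex::new(HashMap::new()) }
    }

    /// Hypothetical cold build; the real code walks per-thread JSONL files.
    fn build_index(key: &str) -> Result<InvertedIndex, String> {
        let mut idx = InvertedIndex::new();
        idx.insert(key.to_string(), vec![0]);
        Ok(idx)
    }

    /// Runs `f` against the index for `key`, building it first if it is cold.
    pub fn with_index<R>(
        &self,
        key: &str,
        f: impl FnOnce(&mut InvertedIndex) -> R,
    ) -> Result<R, String> {
        let key = key.to_string();
        let mut cache = self.cache.lock().map_err(|e| e.to_string())?;
        if !cache.contains_key(&key) {
            // Cold-build fallback kept as a safety net.
            let idx = Self::build_index(&key)?;
            cache.insert(key.clone(), idx);
        }
        let idx = cache.get_mut(&key).expect("inserted above if absent");
        Ok(f(idx))
    }
}
```

The `expect` cannot fire in this shape because the preceding block either finds the key or inserts it while the same guard is held.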
Fixed compile error from bad merge. The merge commit
        }
        let idx = cache.get_mut(&key).expect("inserted above if absent");
        Ok(f(idx))
    }
All other CI checks (TypeScript, Frontend tests, Rust tests) should now pass.
Summary
- A `search_cross_thread_messages` call on a large workspace blocked all concurrent `append_message` / `get_messages` calls for multiple seconds because `CONVERSATION_STORE_LOCK` was held for the entire cold JSONL rebuild.
- Adds `prime_index_if_cold()`, which snapshots the thread list under `CONVERSATION_STORE_LOCK` (fast, releases immediately), then reads all per-thread JSONL files with no lock held (safe because the files are append-only), then inserts with `entry().or_insert()` (idempotent against concurrent primes).
- `search_cross_thread_messages` calls `prime_index_if_cold()` before acquiring `CONVERSATION_STORE_LOCK`, so the slow JSONL walk never blocks writers.
- `with_index` retains its existing cold-build fallback as a safety net.
Root cause
Before this fix the call graph was:
Fix
After the fix:
Test plan
- `search_cold_rebuild_does_not_block_concurrent_append`: races a cold search against a concurrent append behind a `Barrier`, asserts the append completes within 5 s (would block under old code on large workspaces)
- `store_tests` continue to pass
- `cargo check` clean (CI)