[Bug]: `_process_memory_directory` launches hundreds of concurrent coroutines via `asyncio.gather`, causing 100% CPU on low-core machines

Bug Description
On every restart, OpenViking's SemanticProcessor._process_memory_directory() detects all memory files as changed (reused 0 cached) and launches all pending files at once via asyncio.gather(). On a single-core machine with 148 files, this creates 148+ concurrent coroutines competing for the event loop, causing the Python process to spin at 100% CPU indefinitely — without ever actually issuing VLM API calls.
The process stays in this state permanently. No summaries are generated, no API requests reach the backend, and no error logs are produced. The health endpoint continues to respond (200 OK), masking the deadlock.
Actual Behavior

- `SemanticMsg` tasks arrive from multiple agents (main, designer, default, etc.)
- Each msg triggers `_process_memory_directory()`, which lists 148 files
- All 148 files are marked as pending (`reused 0 cached` — see "Cache invalidation" below)
- `asyncio.gather(*[_gen(i, fp) for i, fp in pending_indices])` creates 148 coroutines simultaneously
- Each coroutine calls `_generate_single_file_summary()` → `_generate_text_summary()` → `async with llm_sem: await vlm.get_completion_async(prompt)`
- With `max_concurrent_llm=100` (the default `vlm.max_concurrent`), 100 coroutines enter the semaphore, all trying to make HTTP requests through `httpx.AsyncClient`
- The asyncio event loop becomes CPU-bound from coroutine-scheduling overhead alone — no actual I/O completes
- Result: 100% CPU, 0 API calls sent, 0 summaries generated, no progress, forever
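The concurrency pattern above can be reproduced in isolation. This is a minimal sketch, not OpenViking code: `fake` sleep calls stand in for the real VLM HTTP requests, and the counters only illustrate that with the default semaphore limit of 100, one hundred coroutines are in flight the moment the batch is gathered:

```python
import asyncio

async def main() -> int:
    llm_sem = asyncio.Semaphore(100)   # default max_concurrent_llm
    active = peak = 0

    async def _gen(i: int) -> None:
        nonlocal active, peak
        async with llm_sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the VLM HTTP request
            active -= 1

    # As in the report: all 148 coroutines are created and gathered at once.
    await asyncio.gather(*[_gen(i) for i in range(148)])
    return peak

print(asyncio.run(main()))  # 100: coroutines inside the semaphore at peak
```

On a single-core machine the scheduler must juggle all 148 tasks (100 of them holding the semaphore) on one thread, which is where the scheduling overhead described above comes from.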
Log evidence (note: no lines after the initial batch logging):
```
13:00:57 - Generating summaries for 148 changed files concurrently (reused 0 cached)
13:00:57 - Generating summaries for 121 changed files concurrently (reused 0 cached)
13:00:57 - Generating summaries for 148 changed files concurrently (reused 0 cached)
... (×20 messages from different agents)
13:00:58 - GET /health HTTP/1.1 200
(no further log output, CPU stays at 100%)
```
One API (upstream gateway) shows 0 VLM calls received during this period, confirming the requests never leave the Python process.
Expected Behavior
- Summaries should be generated in controlled batches
- CPU should remain at reasonable levels during startup reindexing
- The system should make steady progress through the queue
Root Cause Analysis
The issue is in `semantic_processor.py`, line ~490 (`_process_memory_directory`):

```python
# Current code: launches ALL pending files at once
await asyncio.gather(*[_gen(i, fp) for i, fp in pending_indices])
```
On a single-core machine, when `pending_indices` has 148 items and the semaphore allows 100 concurrent, the asyncio event loop is overwhelmed by:

- 148 coroutines created simultaneously per `_process_memory_directory` call
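Combined with the log evidence (around 20 `SemanticMsg` tasks, each listing 148 files), the rough scale of the storm, assuming every logged message triggers a full pass, works out to:

```python
# Counts taken from the report's logs; the multiplication is illustrative.
messages = 20          # SemanticMsg tasks logged at startup
files_per_call = 148   # files listed by each _process_memory_directory call
total_coroutines = messages * files_per_call
print(total_coroutines)  # 2960 coroutines scheduled onto a single core
```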
Suggested Improvements

1. Batch `asyncio.gather` in `_process_memory_directory` — the fix applied under "Fix / Workaround" below. Prevents coroutine-scheduling storms regardless of the `max_concurrent_llm` value.
2. Lower the default `vlm.max_concurrent` — 100 is unsuitable for most deployments. Consider defaulting to 10, or making it proportional to CPU count.
3. Fix cache invalidation on restart — every restart shows `reused 0 cached` for all files, triggering a full reindex. If the overview.md content is preserved across restarts, the cache should work. This may be related to how `changed_files` is populated from `msg.changes` vs. the actual file diff.
4. Add a startup reindex rate limiter — even with batching, 20+ `SemanticMsg` tasks concurrently reindexing the same directories wastes resources. Consider deduplicating or debouncing startup-time semantic recomputation (partially addressed by "Bug: repeated parent-directory semantic recomputation on each new memory write" #769, but it still occurs).
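One way to realize the rate limiter suggested in point 4 is to collapse bursts of reindex requests for the same directory into a single run after a short quiet period. This is a sketch with hypothetical names (`ReindexDebouncer`, `request`); nothing here exists in OpenViking today:

```python
import asyncio

class ReindexDebouncer:
    """Collapse bursts of reindex requests for one directory into a
    single run after a short quiet period (hypothetical helper)."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self._pending: dict[str, asyncio.Task] = {}

    def request(self, directory: str, reindex) -> None:
        # A newer request supersedes any not-yet-started run.
        old = self._pending.get(directory)
        if old is not None:
            old.cancel()
        self._pending[directory] = asyncio.create_task(
            self._run_later(directory, reindex))

    async def _run_later(self, directory: str, reindex) -> None:
        await asyncio.sleep(self.delay)   # wait out the burst
        self._pending.pop(directory, None)
        await reindex(directory)          # only the last request runs
```

With this in place, 20 near-simultaneous `SemanticMsg` arrivals for the same memory directory would trigger one reindex instead of 20.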
Contributing factors

- `reused 0 cached` still occurs on every restart because the cache matching logic (`file_path not in changed_files and file_name in existing_summaries`) fails when `changes` contains the full set
- `vlm.max_concurrent=100` is far too high for low-resource deployments

Fix / Workaround
I applied a two-part fix that completely resolves the issue:
1. Configuration change (`ov.conf`):

```json
{ "vlm": { "max_concurrent": 5 } }
```

2. Code patch: batch processing in `_process_memory_directory`. Replace the unbounded `asyncio.gather` with batched execution:
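The patch body itself is not reproduced above; the following is a sketch of batched execution, keeping the `_gen` and `pending_indices` names from the report (the `batch_size` default of 10 is an assumption, not the author's value):

```python
import asyncio

async def process_in_batches(pending_indices, _gen, batch_size=10):
    """Run _gen over the pending files in bounded batches instead of
    one unbounded asyncio.gather over all of them (sketch)."""
    results = []
    for start in range(0, len(pending_indices), batch_size):
        batch = pending_indices[start:start + batch_size]
        # At most batch_size coroutines exist at any moment, so the
        # event loop never has to schedule hundreds of tasks at once.
        results.extend(await asyncio.gather(*(_gen(i, fp) for i, fp in batch)))
    return results
```

Each `await asyncio.gather(...)` completes a whole batch before the next batch's coroutines are created, so the peak coroutine count is `batch_size` rather than `len(pending_indices)`.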
Related Issues
- `_max_concurrent_semantic` not used in queue worker (fixed, but increased exposure to this bug)