[Bug]: `_process_memory_directory` launches hundreds of concurrent coroutines via `asyncio.gather`, causing 100% CPU on low-core machines

Bug Description
On every restart, OpenViking's SemanticProcessor._process_memory_directory() detects all memory files as changed (reused 0 cached) and launches all pending files at once via asyncio.gather(). On a single-core machine with 148 files, this creates 148+ concurrent coroutines competing for the event loop, causing the Python process to spin at 100% CPU indefinitely — without ever actually issuing VLM API calls.
The process stays in this state permanently. No summaries are generated, no API requests reach the backend, and no error logs are produced. The health endpoint continues to respond (200 OK), masking the deadlock.
Actual Behavior

- `SemanticMsg` tasks arrive from multiple agents (main, designer, default, etc.)
- Each msg triggers `_process_memory_directory()`, which lists 148 files
- All 148 files are marked as pending (`reused 0 cached` — see "Cache invalidation" below)
- `asyncio.gather(*[_gen(i, fp) for i, fp in pending_indices])` creates 148 coroutines simultaneously
- Each coroutine calls `_generate_single_file_summary()` → `_generate_text_summary()` → `async with llm_sem: await vlm.get_completion_async(prompt)`
- With `max_concurrent_llm=100` (the default `vlm.max_concurrent`), 100 coroutines enter the semaphore, all trying to make HTTP requests through `httpx.AsyncClient`
- The asyncio event loop becomes CPU-bound from coroutine-scheduling overhead alone — no actual I/O completes
- Result: 100% CPU, 0 API calls sent, 0 summaries generated, no progress, forever
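The concurrency pattern above can be reproduced in isolation. This is a minimal sketch, not OpenViking code: `fake` sleep calls stand in for the real VLM HTTP requests, and the counters only illustrate that with the default semaphore limit of 100, one hundred coroutines are in flight the moment the batch is gathered:

```python
import asyncio

async def main() -> int:
    llm_sem = asyncio.Semaphore(100)   # default max_concurrent_llm
    active = peak = 0

    async def _gen(i: int) -> None:
        nonlocal active, peak
        async with llm_sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the VLM HTTP request
            active -= 1

    # As in the report: all 148 coroutines are created and gathered at once.
    await asyncio.gather(*[_gen(i) for i in range(148)])
    return peak

print(asyncio.run(main()))  # 100: coroutines inside the semaphore at peak
```

On a single-core machine the scheduler must juggle all 148 tasks (100 of them holding the semaphore) on one thread, which is where the scheduling overhead described above comes from.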
Log evidence (note: no lines after the initial batch logging):
```
13:00:57 - Generating summaries for 148 changed files concurrently (reused 0 cached)
13:00:57 - Generating summaries for 121 changed files concurrently (reused 0 cached)
13:00:57 - Generating summaries for 148 changed files concurrently (reused 0 cached)
... (×20 messages from different agents)
13:00:58 - GET /health HTTP/1.1 200
(no further log output, CPU stays at 100%)
```
One API (upstream gateway) shows 0 VLM calls received during this period, confirming the requests never leave the Python process.
Expected Behavior
- Summaries should be generated in controlled batches
- CPU should remain at reasonable levels during startup reindexing
- The system should make steady progress through the queue
Root Cause Analysis
The issue is in `semantic_processor.py`, line ~490 (`_process_memory_directory`):

```python
# Current code: launches ALL pending files at once
await asyncio.gather(*[_gen(i, fp) for i, fp in pending_indices])
```
On a single-core machine, when `pending_indices` has 148 items and the semaphore allows 100 concurrent, the asyncio event loop is overwhelmed by:

- 148 coroutines created simultaneously per `_process_memory_directory` call
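Combined with the log evidence (around 20 `SemanticMsg` tasks, each listing 148 files), the rough scale of the storm, assuming every logged message triggers a full pass, works out to:

```python
# Counts taken from the report's logs; the multiplication is illustrative.
messages = 20          # SemanticMsg tasks logged at startup
files_per_call = 148   # files listed by each _process_memory_directory call
total_coroutines = messages * files_per_call
print(total_coroutines)  # 2960 coroutines scheduled onto a single core
```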
Suggested Improvements

1. Batch `asyncio.gather` in `_process_memory_directory` — the fix applied under "Fix / Workaround" below. Prevents coroutine-scheduling storms regardless of the `max_concurrent_llm` value.
2. Lower the default `vlm.max_concurrent` — 100 is unsuitable for most deployments. Consider defaulting to 10, or making it proportional to CPU count.
3. Fix cache invalidation on restart — every restart shows `reused 0 cached` for all files, triggering a full reindex. If the overview.md content is preserved across restarts, the cache should work. This may be related to how `changed_files` is populated from `msg.changes` vs. the actual file diff.
4. Add a startup reindex rate limiter — even with batching, 20+ `SemanticMsg` tasks concurrently reindexing the same directories wastes resources. Consider deduplicating or debouncing startup-time semantic recomputation (partially addressed by "Bug: repeated parent-directory semantic recomputation on each new memory write" #769, but it still occurs).
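One way to realize the rate limiter suggested in point 4 is to collapse bursts of reindex requests for the same directory into a single run after a short quiet period. This is a sketch with hypothetical names (`ReindexDebouncer`, `request`); nothing here exists in OpenViking today:

```python
import asyncio

class ReindexDebouncer:
    """Collapse bursts of reindex requests for one directory into a
    single run after a short quiet period (hypothetical helper)."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self._pending: dict[str, asyncio.Task] = {}

    def request(self, directory: str, reindex) -> None:
        # A newer request supersedes any not-yet-started run.
        old = self._pending.get(directory)
        if old is not None:
            old.cancel()
        self._pending[directory] = asyncio.create_task(
            self._run_later(directory, reindex))

    async def _run_later(self, directory: str, reindex) -> None:
        await asyncio.sleep(self.delay)   # wait out the burst
        self._pending.pop(directory, None)
        await reindex(directory)          # only the last request runs
```

With this in place, 20 near-simultaneous `SemanticMsg` arrivals for the same memory directory would trigger one reindex instead of 20.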
Contributing factors

- `reused 0 cached` still occurs on every restart because the cache matching logic (`file_path not in changed_files and file_name in existing_summaries`) fails when `changes` contains the full set
- `vlm.max_concurrent=100` is far too high for low-resource deployments

Fix / Workaround
I applied a two-part fix that completely resolves the issue:
1. Configuration change (`ov.conf`):

```json
{ "vlm": { "max_concurrent": 5 } }
```

2. Code patch: batch processing in `_process_memory_directory`. Replace the unbounded `asyncio.gather` with batched execution:
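The patch body itself is not reproduced above; the following is a sketch of batched execution, keeping the `_gen` and `pending_indices` names from the report (the `batch_size` default of 10 is an assumption, not the author's value):

```python
import asyncio

async def process_in_batches(pending_indices, _gen, batch_size=10):
    """Run _gen over the pending files in bounded batches instead of
    one unbounded asyncio.gather over all of them (sketch)."""
    results = []
    for start in range(0, len(pending_indices), batch_size):
        batch = pending_indices[start:start + batch_size]
        # At most batch_size coroutines exist at any moment, so the
        # event loop never has to schedule hundreds of tasks at once.
        results.extend(await asyncio.gather(*(_gen(i, fp) for i, fp in batch)))
    return results
```

Each `await asyncio.gather(...)` completes a whole batch before the next batch's coroutines are created, so the peak coroutine count is `batch_size` rather than `len(pending_indices)`.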
Related Issues
- `_max_concurrent_semantic` not used in queue worker (fixed, but increased exposure to this bug)