Startup hangs before /health when redo session_memory triggers blocking VLM call #1222

## Summary
OpenViking startup can hang before `/health` is available when crash-recovery redo includes `session_memory` extraction. In this state the process stays alive but port `1933` is never bound, so host integrations (e.g., the OpenClaw plugin) hit their startup timeout and kill the process.

## Environment
- OpenViking: `0.3.3`
- OS: Windows 11 (10.0.22621)
- Launch mode: `openviking.server.bootstrap` (also reproduced when launched by OpenClaw plugin in local mode)
- Config: `ov.conf` with VLM/embedding via OpenAI-compatible local endpoint (`http://127.0.0.1:1130/v1`)

## Observed behavior
- Startup logs stop after storage/queue initialization, e.g.:
  - `mounted serverinfofs at /serverinfo`
  - `mounted queuefs at /queue`
  - `mounted localfs at /local`
  - `Created queue 'Embedding' / 'Semantic'`
- `/health` remains unreachable (`connection refused`) even though the Python process is still running.
- When launched by a supervisor/plugin, startup eventually fails with a health-check timeout and the process is terminated.
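
To tell "process alive but port never bound" apart from "server up but slow", a plain TCP probe is enough. This is a minimal sketch: the helper name `port_is_listening` is hypothetical, and the host `127.0.0.1` is an assumption (only port `1933` comes from this report).

```python
import socket


def port_is_listening(host: str = "127.0.0.1", port: int = 1933,
                      timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is bound to the port, or when the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "listening" if port_is_listening() else "not listening"
    print(f"port 1933: {state}")
```

During the hang described above, this probe would be expected to report "not listening" while the Python process is still visible in the task list.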

## Root-cause evidence
Await-chain tracing during the hang consistently shows startup blocked in the redo recovery path:

`OpenVikingService.initialize`
-> `LockManager.start`
-> `LockManager._recover_pending_redo`
-> `LockManager._redo_session_memory`
-> `SessionCompressor.extract_long_term_memories`
-> `MemoryExtractor.extract`
-> `VLMConfig.get_completion_async`
-> `OpenAIVLM.get_completion_async`
-> HTTP wait (`AsyncHTTP11Connection._receive_response_headers`)

In other words, startup waits on an outbound VLM request while replaying redo, before the server's health endpoint is available.

## Trigger condition
A pending redo marker existed under:
`~/.openviking/data/viking/_system/redo/<task-id>/redo.json`

Example payload:
```json
{
  "archive_uri": "viking://session/default/.../history/archive_004",
  "session_uri": "viking://session/default/...",
  "account_id": "default",
  "user_id": "default",
  "agent_id": "main",
  "role": "root"
}
```

Removing this pending redo marker immediately allowed clean startup and healthy `/health`.
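
To check whether an installation is in this state before starting the server, the pending markers can be listed with a short script. This is a sketch only: `list_pending_redo` is a hypothetical helper, and it assumes the `<task-id>/redo.json` layout and default data path shown above.

```python
import json
from pathlib import Path

# Path taken from this report; adjust if the data directory differs.
REDO_ROOT = Path.home() / ".openviking" / "data" / "viking" / "_system" / "redo"


def list_pending_redo(root: Path = REDO_ROOT) -> list[tuple[str, dict]]:
    """Return (task_id, payload) for every <task-id>/redo.json under root."""
    pending: list[tuple[str, dict]] = []
    if not root.is_dir():
        return pending
    for marker in sorted(root.glob("*/redo.json")):
        try:
            payload = json.loads(marker.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            payload = {}  # unreadable marker: still report that it exists
        if not isinstance(payload, dict):
            payload = {}
        pending.append((marker.parent.name, payload))
    return pending


if __name__ == "__main__":
    for task_id, payload in list_pending_redo():
        print(task_id, payload.get("archive_uri"), payload.get("session_uri"))
```

Any output means a redo task will be replayed on the next start, which is the condition that triggered the hang here.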

## Expected behavior
- Startup should not block health availability on long/slow external VLM calls during redo replay.
- Redo replay should be bounded, asynchronous, or deferred so the server can become healthy first.

## Suggested fixes
1. **Do not perform blocking LLM extraction in startup path** (`LockManager.start`):
   - Start server + health first.
   - Process redo replay in background worker.
2. Add timeout/circuit-breaker around redo `session_memory` extraction.
3. On timeout/failure, enqueue semantic fallback and continue startup.
4. Add explicit logs/metrics for redo task start/end/fail reason and per-task duration.
5. Consider configurable startup mode: `fast-start` (defer redo) vs `strict-recovery`.
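
Fixes 1 to 4 can be sketched together with plain `asyncio`. None of the names below are OpenViking APIs: `replay_redo_bounded`, `start_then_replay`, and the `extract`, `fallback`, and `bind_server` callables are hypothetical stand-ins for the real extraction call, the semantic fallback enqueue, and the step that binds the port. The 30-second default is likewise an arbitrary assumption.

```python
import asyncio
import logging
import time

logger = logging.getLogger("redo")


async def replay_redo_bounded(tasks, extract, fallback, timeout_s=30.0):
    """Replay redo tasks one by one, each bounded by a per-task timeout.

    tasks: iterable of (task_id, payload); extract/fallback: async callables.
    Returns a dict mapping task_id to "ok", "timeout", or "failed".
    """
    results = {}
    for task_id, payload in tasks:
        started = time.monotonic()
        try:
            # wait_for cancels the extraction if it exceeds timeout_s.
            await asyncio.wait_for(extract(payload), timeout=timeout_s)
            results[task_id] = "ok"
        except asyncio.TimeoutError:
            await fallback(payload)  # e.g. enqueue a semantic fallback
            results[task_id] = "timeout"
        except Exception:
            logger.exception("redo %s failed", task_id)
            await fallback(payload)
            results[task_id] = "failed"
        logger.info("redo %s: %s after %.2fs", task_id, results[task_id],
                    time.monotonic() - started)
    return results


async def start_then_replay(tasks, extract, fallback, bind_server,
                            timeout_s=30.0):
    """Bind the server first, then replay redo as a background task."""
    await bind_server()  # health endpoint becomes reachable here
    # The caller should keep a reference to the returned task so it is
    # not garbage-collected before it finishes.
    return asyncio.create_task(
        replay_redo_bounded(tasks, extract, fallback, timeout_s))
```

With this ordering a slow or unresponsive model endpoint delays only the redo task, not the health check. A `strict-recovery` mode (fix 5) could simply await the returned task before reporting healthy.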

## Impact
This can cause repeated startup failure loops under production supervisors (the process appears alive but never becomes healthy), especially when the redo payload requires slow or unresponsive model calls.

