Summary
OpenViking startup can hang before `/health` is available when crash-recovery redo includes `session_memory` extraction. In this state, the process stays alive, but port 1933 never binds, so host integrations (e.g., the OpenClaw plugin) hit their startup timeout and kill the process.
Environment
- OpenViking: 0.3.3
- OS: Windows 11 (10.0.22621)
- Launch mode: `openviking.server.bootstrap` (also reproduced when launched by the OpenClaw plugin in local mode)
- Config: `ov.conf` with VLM/embedding via an OpenAI-compatible local endpoint (`http://127.0.0.1:1130/v1`)
Observed behavior
- Startup logs stop after storage/queue initialization, e.g.:

  ```
  mounted serverinfofs at /serverinfo
  mounted queuefs at /queue
  mounted localfs at /local
  Created queue 'Embedding' / 'Semantic'
  ```

- `/health` remains unreachable (connection refused) even though the Python process is still running.
- When launched by a supervisor/plugin, startup eventually fails with health-check timeout and process termination.
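The hang is easy to detect from the outside: the process stays alive, but nothing ever listens on the health port. A minimal probe, assuming port 1933 is the health port (the function name is illustrative, not OpenViking API):

```python
import socket


def health_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something is listening on host:port.

    During the hang described above, the process is alive but this
    check keeps failing, because the health port never binds.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A supervisor that polls this (instead of only checking process liveness) can distinguish "starting up" from "stuck before bind".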
Root-cause evidence
Await-chain tracing during the hang consistently shows startup blocked in the redo-recovery path:

```
OpenVikingService.initialize
  -> LockManager.start
  -> LockManager._recover_pending_redo
  -> LockManager._redo_session_memory
  -> SessionCompressor.extract_long_term_memories
  -> MemoryExtractor.extract
  -> VLMConfig.get_completion_async
  -> OpenAIVLM.get_completion_async
  -> HTTP wait (AsyncHTTP11Connection._receive_response_headers)
```

In other words, startup waits on an outbound VLM request while replaying redo, before the health endpoint is available.
Trigger condition
A pending redo marker existed under:
`~/.openviking/data/viking/_system/redo/<task-id>/redo.json`
Example payload:

```json
{
  "archive_uri": "viking://session/default/.../history/archive_004",
  "session_uri": "viking://session/default/...",
  "account_id": "default",
  "user_id": "default",
  "agent_id": "main",
  "role": "root"
}
```
Removing this pending redo marker immediately allowed clean startup and a healthy `/health`.
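The marker layout above makes pending redo tasks easy to inspect before deciding whether to clear them. A hedged sketch that scans the redo directory (the function name and return shape are illustrative; only the `<redo-root>/<task-id>/redo.json` layout comes from this report):

```python
import json
from pathlib import Path


def pending_redo_tasks(redo_root: Path) -> list[dict]:
    """Collect pending redo markers under <redo_root>/<task-id>/redo.json.

    Returns each marker's payload with the owning task-id attached,
    so an operator can see what a stuck startup is trying to replay.
    """
    tasks = []
    for marker in sorted(redo_root.glob("*/redo.json")):
        payload = json.loads(marker.read_text(encoding="utf-8"))
        payload["_task_id"] = marker.parent.name  # directory name is the task id
        tasks.append(payload)
    return tasks
```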
Expected behavior
- Startup should not block health availability on long or slow external VLM calls during redo replay.
- Redo replay should be bounded, asynchronous, or deferred so the server can become healthy first.
Suggested fixes
- Do not perform blocking LLM extraction in the startup path (`LockManager.start`):
  - Start the server and health endpoint first.
  - Process redo replay in a background worker.
- Add a timeout/circuit-breaker around redo `session_memory` extraction:
  - On timeout or failure, enqueue a semantic fallback and continue startup.
- Add explicit logs/metrics for redo task start/end/failure reason and per-task duration.
- Consider a configurable startup mode: fast-start (defer redo) vs. strict-recovery.
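The first two fixes can be sketched as a health-first startup with a bounded background replay. Everything below is an assumption about what such a fix could look like, not OpenViking code: `replay_redo_task`, `start_service`, and the timeout value are hypothetical, and `extract` stands in for the slow VLM-backed extraction seen in the await chain:

```python
import asyncio

REDO_TIMEOUT_S = 30.0  # illustrative bound, not an existing OpenViking setting


async def replay_redo_task(task, extract, timeout=REDO_TIMEOUT_S):
    """Run one redo extraction with a hard timeout.

    On timeout the task is marked deferred so it can be re-enqueued
    later (e.g., as a semantic fallback) instead of blocking startup.
    """
    try:
        await asyncio.wait_for(extract(task), timeout=timeout)
        return "done"
    except asyncio.TimeoutError:
        return "deferred"


async def start_service(redo_tasks, extract, serve_health):
    """Health-first startup: bind /health, then replay redo concurrently."""
    await serve_health()  # supervisors see a live server immediately
    # Redo replay runs in the background; startup no longer waits on the VLM.
    return asyncio.gather(*(replay_redo_task(t, extract) for t in redo_tasks))
```

With this shape, a hung VLM endpoint degrades one redo task to "deferred" after the timeout instead of keeping the whole process unhealthy.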
Impact
This can cause repeated startup-failure loops under production supervisors (the process appears alive but never becomes healthy), especially when the redo payload requires slow or unresponsive model calls.