Startup hangs before /health when redo session_memory triggers blocking VLM call #1222

@dikotiledon

Summary

OpenViking startup can hang before /health becomes available when crash-recovery redo includes session_memory extraction. In this state the process stays alive but port 1933 never binds, so host integrations (e.g., the OpenClaw plugin) hit their startup timeout and kill the process.

Environment

  • OpenViking: 0.3.3
  • OS: Windows 11 (10.0.22621)
  • Launch mode: openviking.server.bootstrap (also reproduced when launched by OpenClaw plugin in local mode)
  • Config: ov.conf with VLM/embedding via OpenAI-compatible local endpoint (http://127.0.0.1:1130/v1)

Observed behavior

  • Startup logs stop after storage/queue initialization, e.g.:
    • mounted serverinfofs at /serverinfo
    • mounted queuefs at /queue
    • mounted localfs at /local
    • Created queue 'Embedding' / 'Semantic'
  • /health remains unreachable (connection refused) even though the Python process is still running.
  • When launched by a supervisor/plugin, startup eventually fails with health-check timeout and process termination.
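
To tell "process alive but port not bound" apart from a slow /health handler, a plain TCP connect is enough. The sketch below is a hypothetical helper, not part of OpenViking; it only assumes the server listens on a TCP port (1933 in this report).

```python
import socket


def port_accepts_connections(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError or a
        # timeout) when nothing is listening or the host is unreachable.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

During the hang described here, `port_accepts_connections("127.0.0.1", 1933)` would be expected to return False while the Python process is still listed as running.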

Root-cause evidence

Await-chain tracing during the hang consistently shows startup blocked in the redo recovery path:

OpenVikingService.initialize
-> LockManager.start
-> LockManager._recover_pending_redo
-> LockManager._redo_session_memory
-> SessionCompressor.extract_long_term_memories
-> MemoryExtractor.extract
-> VLMConfig.get_completion_async
-> OpenAIVLM.get_completion_async
-> HTTP wait (AsyncHTTP11Connection._receive_response_headers)

In other words, startup waits on an outbound VLM request while replaying redo, before the server's health endpoint is available.
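
The ordering problem can be reproduced in isolation. The following is a minimal sketch with stand-in functions (`slow_recovery`, `blocking_startup`, and `probe` are illustrative names, not OpenViking code): because startup awaits the slow call first, the "listening" step is never reached while that call is pending.

```python
import asyncio


async def slow_recovery(delay_s: float) -> None:
    # Stand-in for the outbound VLM request made while replaying redo.
    await asyncio.sleep(delay_s)


async def blocking_startup(delay_s: float, events: list) -> None:
    await slow_recovery(delay_s)   # startup is suspended here ...
    events.append("listening")     # ... so the listener step runs only afterwards


async def probe(delay_s: float, check_after_s: float) -> bool:
    """Return True if startup reached 'listening' check_after_s after launch."""
    events: list = []
    task = asyncio.create_task(blocking_startup(delay_s, events))
    await asyncio.sleep(check_after_s)
    ready = "listening" in events
    task.cancel()  # mimic the supervisor killing a process that never became healthy
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ready
```

With a long `delay_s` and a short `check_after_s`, `probe` reports that the service is not ready, which mirrors the health-check timeout seen by the supervisor.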

Trigger condition

A pending redo marker existed under:
~/.openviking/data/viking/_system/redo/<task-id>/redo.json

Example payload:

{
  "archive_uri": "viking://session/default/.../history/archive_004",
  "session_uri": "viking://session/default/...",
  "account_id": "default",
  "user_id": "default",
  "agent_id": "main",
  "role": "root"
}

Removing this pending redo marker immediately allowed a clean startup, after which /health responded normally.
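
To check whether an installation is in this state before deleting anything, the pending markers can be listed read-only. This is a sketch that assumes the directory layout shown above (`<redo-root>/<task-id>/redo.json`); the helper name is hypothetical and the data directory may differ per configuration.

```python
import json
from pathlib import Path


def list_pending_redo(redo_root: Path) -> list:
    """Return (task_id, payload) pairs for every redo.json under redo_root.

    Unreadable or malformed markers are reported with an empty payload
    instead of raising, so one bad file does not hide the others.
    """
    found = []
    for marker in sorted(redo_root.glob("*/redo.json")):
        try:
            payload = json.loads(marker.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            payload = {}
        found.append((marker.parent.name, payload))
    return found


# Example call, using the path from this report (assumed default location):
# list_pending_redo(Path.home() / ".openviking/data/viking/_system/redo")
```

The function only reads files; removing a marker remains a manual, deliberate step.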

Expected behavior

  • Startup should not block health availability on long/slow external VLM calls during redo replay.
  • Redo replay should be bounded/async/deferred so server can become healthy first.

Suggested fixes

  1. Do not perform blocking LLM extraction in startup path (LockManager.start):
    • Start server + health first.
    • Process redo replay in background worker.
  2. Add timeout/circuit-breaker around redo session_memory extraction.
  3. On timeout/failure, enqueue semantic fallback and continue startup.
  4. Add explicit logs/metrics for redo task start/end/fail reason and per-task duration.
  5. Consider configurable startup mode: fast-start (defer redo) vs strict-recovery.
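
Fixes 1 to 3 could be combined roughly as follows. This is only a sketch under stated assumptions: `extract`, `fallback`, and `serve_health` are hypothetical callables standing in for the session_memory extraction, the semantic fallback enqueue, and the server bind; none of these names come from the OpenViking codebase.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def replay_redo_bounded(
    task_ids: Iterable[str],
    extract: Callable[[str], Awaitable[None]],
    fallback: Callable[[str], Awaitable[None]],
    timeout_s: float,
) -> Dict[str, str]:
    """Replay redo tasks one by one, each bounded by timeout_s.

    On timeout or any other error, run the fallback and move on, so a
    single slow model call cannot stall the remaining tasks.
    """
    outcome: Dict[str, str] = {}
    for task_id in task_ids:
        try:
            await asyncio.wait_for(extract(task_id), timeout=timeout_s)
            outcome[task_id] = "ok"
        except asyncio.TimeoutError:
            await fallback(task_id)
            outcome[task_id] = "timeout"
        except Exception:
            await fallback(task_id)
            outcome[task_id] = "failed"
    return outcome


async def start(serve_health, task_ids, extract, fallback, timeout_s: float = 30.0):
    """Bind the health endpoint first, then replay redo in a background task."""
    await serve_health()
    return asyncio.create_task(
        replay_redo_bounded(task_ids, extract, fallback, timeout_s)
    )
```

The returned outcome map is a natural place to hang the per-task logs and durations from fix 4, and a strict-recovery mode (fix 5) could simply await the returned task before serving traffic.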

Impact

This can cause repeated startup failure loops under production supervisors (the process appears alive but never becomes healthy), especially when the redo payload requires calls to a slow or unresponsive model.
