[Feature]: Make CircuitBreaker parameters configurable via ov.conf & reduce default reset_timeout #1246

@Jiahao-ZHU

Summary

The CircuitBreaker in openviking/utils/circuit_breaker.py (introduced to fix #729) uses hardcoded defaults (failure_threshold=5, reset_timeout=300). When the embedding provider experiences a transient rate limit (e.g., OpenRouter 429), the breaker trips and stays OPEN for 5 minutes. During this window:

  1. All pending embeddings are re-enqueued every ~30 seconds (retry_after cap), producing heavy log spam
  2. Even if the upstream provider recovers in seconds, the server cannot benefit until the full 300s elapses
  3. If the HALF_OPEN probe also fails, the 300s timer resets, creating a potentially long recovery cycle

This was observed on v0.3.3 with qwen/qwen3-embedding-4b via OpenRouter. The OpenRouter dashboard confirmed the API had recovered within ~1 minute, but the OpenViking circuit breaker remained OPEN for the entire 300s window.

Related issues: #729, #527

Proposal

1. Make circuit breaker configurable via ov.conf

{
  "embedding": {
    "circuit_breaker": {
      "failure_threshold": 5,
      "reset_timeout": 60,
      "max_reset_timeout": 600
    }
  }
}
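A minimal sketch of how the block above could be read into typed settings, assuming ov.conf has already been parsed into a dict. `CircuitBreakerConfig` and `load_circuit_breaker_config` are hypothetical names for illustration, not existing OpenViking APIs; the defaults follow this proposal.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class CircuitBreakerConfig:
    """Hypothetical settings holder; field names mirror the proposed ov.conf keys."""

    failure_threshold: int = 5
    reset_timeout: float = 60.0
    max_reset_timeout: float = 600.0

    def __post_init__(self) -> None:
        # Reject values that would make the breaker behave nonsensically.
        if self.failure_threshold < 1:
            raise ValueError("failure_threshold must be >= 1")
        if self.reset_timeout <= 0:
            raise ValueError("reset_timeout must be > 0")
        if self.max_reset_timeout < self.reset_timeout:
            raise ValueError("max_reset_timeout must be >= reset_timeout")


def load_circuit_breaker_config(conf: Mapping[str, Any]) -> CircuitBreakerConfig:
    """Read embedding.circuit_breaker from a parsed config, falling back to defaults."""
    section = conf.get("embedding", {}).get("circuit_breaker", {})
    defaults = CircuitBreakerConfig()
    return CircuitBreakerConfig(
        failure_threshold=int(section.get("failure_threshold", defaults.failure_threshold)),
        reset_timeout=float(section.get("reset_timeout", defaults.reset_timeout)),
        max_reset_timeout=float(section.get("max_reset_timeout", defaults.max_reset_timeout)),
    )
```

Falling back to defaults for missing keys would keep existing ov.conf files working unchanged.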

2. Reduce default reset_timeout from 300s to 60s

For transient rate-limit errors (429), 300s is unnecessarily long. A 60s default better matches typical provider recovery times while still protecting against sustained outages.

3. Consider exponential backoff for repeated HALF_OPEN failures

If the HALF_OPEN probe fails, increase the next reset_timeout exponentially (e.g., 60 → 120 → 240, capped at max_reset_timeout). This avoids fixed-interval probing during extended outages while recovering quickly from short blips.
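The backoff idea could look roughly like the sketch below. This is not the current code in openviking/utils/circuit_breaker.py; `BackoffCircuitBreaker` and its method names are assumptions made for illustration, and the clock is injectable only so the behaviour can be tested without sleeping.

```python
import time
from typing import Callable


class BackoffCircuitBreaker:
    """Illustrative breaker: doubles the open interval after each failed HALF_OPEN probe."""

    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 60.0,
        max_reset_timeout: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.base_reset_timeout = reset_timeout
        self.max_reset_timeout = max_reset_timeout
        self.current_timeout = reset_timeout
        self._clock = clock
        self._state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    @property
    def state(self) -> str:
        # OPEN becomes HALF_OPEN once the current timeout has elapsed.
        if self._state == self.OPEN and self._clock() - self._opened_at >= self.current_timeout:
            self._state = self.HALF_OPEN
        return self._state

    def allow_request(self) -> bool:
        return self.state != self.OPEN

    def record_success(self) -> None:
        # Any success closes the breaker and restores the base timeout.
        self._state = self.CLOSED
        self._failures = 0
        self.current_timeout = self.base_reset_timeout

    def record_failure(self) -> None:
        state = self.state
        if state == self.OPEN:
            return  # already open; do not restart the timer
        if state == self.HALF_OPEN:
            # Failed probe: back off exponentially, capped at max_reset_timeout.
            self.current_timeout = min(self.current_timeout * 2, self.max_reset_timeout)
            self._trip()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self._state = self.OPEN
        self._opened_at = self._clock()
```

With the proposed defaults, repeated probe failures would wait 60s, 120s, 240s, 480s, and then 600s, while a single successful probe returns the breaker to the 60s base.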

4. Reduce log verbosity for re-enqueue events

The "Circuit breaker is open, re-enqueueing embedding: {uuid}" message floods the journal at WARNING level. Consider:

  • Logging the first occurrence + a periodic summary (e.g., "N embeddings re-enqueued in last 30s")
  • Or downgrading the message to DEBUG after the first occurrence
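Both options can be combined in a small throttle, sketched here with the standard logging module. `ReenqueueLogThrottle` is a hypothetical helper, not part of OpenViking, and the 30s interval is only the example value from the bullet above.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ReenqueueLogThrottle:
    """Logs the first re-enqueue at WARNING, later ones at DEBUG, plus a periodic summary."""

    def __init__(
        self,
        interval: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
        log: logging.Logger = logger,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._log = log
        self._window_start: Optional[float] = None
        self._suppressed = 0

    def record(self, uuid: str) -> None:
        now = self._clock()
        if self._window_start is None:
            # First occurrence since the breaker opened: keep it visible.
            self._window_start = now
            self._log.warning("Circuit breaker is open, re-enqueueing embedding: %s", uuid)
            return
        self._suppressed += 1
        elapsed = now - self._window_start
        if elapsed >= self.interval:
            # Emit one WARNING summary per interval and start a new window.
            self._log.warning("%d embeddings re-enqueued in last %.0fs", self._suppressed, elapsed)
            self._window_start = now
            self._suppressed = 0
        else:
            self._log.debug("Circuit breaker is open, re-enqueueing embedding: %s", uuid)

    def reset(self) -> None:
        # Call when the breaker closes so the next outage logs its first event again.
        self._window_start = None
        self._suppressed = 0
```

The re-enqueue path would call `record(uuid)` in place of the direct WARNING log, and `reset()` once the breaker returns to CLOSED.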

Environment

  • OpenViking: v0.3.3
  • Embedding: qwen/qwen3-embedding-4b via OpenRouter
  • VLM: openai/gpt-oss-20b:free via OpenRouter
  • OS: Ubuntu 22.04 (systemd)

Reproduction

  1. Configure the embedding provider to use a free or rate-limited OpenRouter model
  2. Store multiple memories in quick succession to trigger rate limiting
  3. Observe that the circuit breaker trips and remains OPEN despite upstream recovery

Metadata

Labels: security (for security and safety issues)

Status: Done