## Summary
The `CircuitBreaker` in `openviking/utils/circuit_breaker.py` (introduced to fix #729) uses hardcoded defaults (`failure_threshold=5`, `reset_timeout=300`). When the embedding provider experiences a transient rate limit (e.g., an OpenRouter 429), the breaker trips and stays OPEN for 5 minutes. During this window:
- All pending embeddings are re-enqueued every ~30 seconds (the `retry_after` cap), producing heavy log spam
- Even if the upstream provider recovers in seconds, the server cannot benefit until the full 300s elapses
- If the HALF_OPEN probe also fails, the 300s timer resets, creating a potentially long recovery cycle
This was observed on v0.3.3 with `qwen/qwen3-embedding-4b` via OpenRouter. The OpenRouter dashboard confirmed the API had recovered within ~1 minute, but the OpenViking circuit breaker remained OPEN for the entire 300s window.
Related issues: #729, #527
## Proposal
### 1. Make the circuit breaker configurable via `ov.conf`

```json
{
  "embedding": {
    "circuit_breaker": {
      "failure_threshold": 5,
      "reset_timeout": 60,
      "max_reset_timeout": 600
    }
  }
}
```
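As a minimal sketch of how these keys could be consumed, something like the following `from_config` constructor would keep existing configs working while letting users override the defaults (the class shape here is illustrative; the real openviking `CircuitBreaker` internals may differ):

```python
# Illustrative sketch only: the constructor parameters mirror the proposed
# ov.conf keys; this is not openviking's actual implementation.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60, max_reset_timeout=600):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.max_reset_timeout = max_reset_timeout
        self.failure_count = 0

    @classmethod
    def from_config(cls, config):
        # Fall back to defaults when the section is absent, so existing
        # ov.conf files keep working unchanged.
        cb = config.get("embedding", {}).get("circuit_breaker", {})
        return cls(
            failure_threshold=cb.get("failure_threshold", 5),
            reset_timeout=cb.get("reset_timeout", 60),
            max_reset_timeout=cb.get("max_reset_timeout", 600),
        )
```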
### 2. Reduce the default `reset_timeout` from 300s to 60s

For transient rate-limit errors (429), 300s is unnecessarily long. A 60s default better matches typical provider recovery times while still protecting against sustained outages.
### 3. Consider exponential backoff for repeated HALF_OPEN failures

If the HALF_OPEN probe fails, increase the next `reset_timeout` exponentially (e.g., 60 → 120 → 240, capped at `max_reset_timeout`). This avoids fixed-interval probing during extended outages while still recovering quickly from short blips.
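The backoff step can be sketched as a pure function (names here are illustrative, not openviking's actual API):

```python
def next_reset_timeout(current_timeout, max_reset_timeout=600):
    """Return the OPEN-state timeout to use after a failed HALF_OPEN probe.

    Doubles the current timeout, capped at max_reset_timeout. On a
    successful probe, the breaker would instead reset to its configured
    base reset_timeout.
    """
    return min(current_timeout * 2, max_reset_timeout)
```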
### 4. Reduce log verbosity for re-enqueue events

The "Circuit breaker is open, re-enqueueing embedding: {uuid}" message floods the journal at WARNING level. Consider:
- Logging the first occurrence + a periodic summary (e.g., "N embeddings re-enqueued in last 30s")
- Or downgrading to DEBUG after the first occurrence
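One way to implement the summary approach, as an illustrative sketch (the `ReenqueueLogThrottle` helper is hypothetical, not part of openviking):

```python
import time


class ReenqueueLogThrottle:
    """Emit the first re-enqueue message immediately, then only a
    periodic summary. Returns the message to log, or None to stay quiet,
    so the caller decides the log level."""

    def __init__(self, interval_s=30.0):
        self.interval_s = interval_s
        self.count = 0
        self.window_start = None

    def record(self, uuid, now=None):
        now = time.monotonic() if now is None else now
        if self.window_start is None:
            # First event in a window: log it verbatim.
            self.window_start = now
            self.count = 1
            return f"Circuit breaker is open, re-enqueueing embedding: {uuid}"
        self.count += 1
        if now - self.window_start >= self.interval_s:
            # Window elapsed: emit one summary line and start a new window.
            msg = f"{self.count} embeddings re-enqueued in last {self.interval_s:.0f}s"
            self.window_start = now
            self.count = 0
            return msg
        return None
```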
## Environment

- OpenViking: v0.3.3
- Embedding: `qwen/qwen3-embedding-4b` via OpenRouter
- VLM: `openai/gpt-oss-20b:free` via OpenRouter
- OS: Ubuntu 22.04 (systemd)
## Reproduction

- Configure embedding to use a free or rate-limited OpenRouter model
- Store multiple memories in quick succession to trigger rate limiting
- Observe the circuit breaker trip and remain OPEN despite upstream recovery