Summary
pi.ruv.io (Cloud Run service ruvbrain) was returning 504 upstream request timeouts on all endpoints. Every HTTP request timed out at exactly 300s (Cloud Run max timeout) while background cognitive cycles continued running normally.
Root Cause
The mcp_brain_server cognitive cycle (runs every ~3 min) processes heavy workloads on the same async runtime as HTTP handlers:
- 200 auto-votes per cycle
- LoRA auto-submissions from SONA patterns
- Inference processing (11-21 inferences per cycle)
- Strange loop calculations
This appears to starve the HTTP handler, preventing it from serving any requests. The process stays alive (cognitive logs continue) but all HTTP responses hang until the 300s Cloud Run timeout.
Evidence
- All requests returned
504 with latency: 299.97s (Cloud Run timeout ceiling)
- App logs showed healthy cognitive cycles (
Cognitive cycle #22-24) running every ~3 min
SIGTERM graceful shutdown observed at 2026-03-27T13:58:50Z, followed by restart
- After restart, cognitive cycles resumed but HTTP remained unresponsive
- Two instance IDs visible in logs — both instances affected
curl confirmed TCP connect succeeds (0.003s) but app never sends response
Resolution (Temporary)
Redeployed fresh revision (ruvbrain-00159-cjl) — service immediately responsive (77ms).
Permanent Fix Needed
- Move cognitive cycle to a separate tokio task with budget/yield — ensure HTTP handlers are never starved by background processing
- Add a
/healthz liveness probe that Cloud Run can use to detect and restart unresponsive instances
- Consider
tokio::task::spawn_blocking for heavy compute (auto-votes, LoRA) to avoid blocking the async executor
- Add request timeout middleware (e.g., 30s) so individual requests fail fast instead of hanging to 300s
Environment
- Service:
ruvbrain / us-central1
- Image:
gcr.io/ruv-dev/ruvbrain:latest
- Revision:
ruvbrain-00158-2lb (affected), ruvbrain-00159-cjl (redeployed)
- Config: 2 CPU, 2Gi memory, minScale=1, maxScale=10, session affinity enabled
Labels
- priority: high
- component: mcp-brain-server
- type: bug
Summary
pi.ruv.io(Cloud Run serviceruvbrain) was returning 504 upstream request timeouts on all endpoints. Every HTTP request timed out at exactly 300s (Cloud Run max timeout) while background cognitive cycles continued running normally.Root Cause
The
mcp_brain_servercognitive cycle (runs every ~3 min) processes heavy workloads on the same async runtime as HTTP handlers:This appears to starve the HTTP handler, preventing it from serving any requests. The process stays alive (cognitive logs continue) but all HTTP responses hang until the 300s Cloud Run timeout.
Evidence
504withlatency: 299.97s(Cloud Run timeout ceiling)Cognitive cycle #22-24) running every ~3 minSIGTERMgraceful shutdown observed at2026-03-27T13:58:50Z, followed by restartcurlconfirmed TCP connect succeeds (0.003s) but app never sends responseResolution (Temporary)
Redeployed fresh revision (
ruvbrain-00159-cjl) — service immediately responsive (77ms).Permanent Fix Needed
/healthzliveness probe that Cloud Run can use to detect and restart unresponsive instancestokio::task::spawn_blockingfor heavy compute (auto-votes, LoRA) to avoid blocking the async executorEnvironment
ruvbrain/us-central1gcr.io/ruv-dev/ruvbrain:latestruvbrain-00158-2lb(affected),ruvbrain-00159-cjl(redeployed)Labels