HTTP listener dies under sustained multiplayer load (process stays alive) #384

@intel352

Problem

The workflow engine's HTTP listener stops accepting connections under sustained multiplayer load while the Go process stays alive. HTTP port 8090 returns connection refused; gRPC port 9090 may still respond.

Reproduction

Run the cardgame server with 5+ simultaneous games across different game types. After roughly 10-15 minutes of active play, with 12+ concurrent MCP clients making gRPC calls (which are proxied internally through HTTP pipelines), the HTTP listener stops accepting connections.

Observed Behavior

  • Process alive (ps aux shows it running, ~60MB RSS)
  • HTTP port 8090: connection refused
  • gRPC port 9090: sometimes still responds, sometimes also dead
  • No panic in stdout/stderr logs
  • Happens consistently after sustained load (2 crashes in ~20 minutes)
  • All in-memory game state lost on restart (no persistence)

Suspected Causes

  1. Goroutine leak: each pipeline execution spawns goroutines for step execution. Under high concurrency, leaked goroutines may exhaust memory or other runtime resources.
  2. Unrecovered panic: a panic in the HTTP handler path. Note that net/http recovers panics raised in handler goroutines and closes only the affected connection, while an unrecovered panic in any other goroutine terminates the whole process, so a panic alone would not explain a dead listener with a live process.
  3. File descriptor exhaustion: many concurrent HTTP connections + gRPC connections + SQLite stores may hit ulimit.
  4. HTTP server error handler: the module.StandardHTTPServer may have an error channel that fills up and blocks the accept loop.
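
Causes 1 and 3 can be checked by sampling two counters while the load test runs. This is a rough sketch under stated assumptions: `resourceSnapshot` and `snapshot` are hypothetical names, and the file descriptor count relies on `/proc/self/fd`, which exists on Linux only.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// resourceSnapshot holds the two counters most relevant to causes 1 and 3.
type resourceSnapshot struct {
	Goroutines int
	OpenFDs    int // -1 when the count is unavailable (non-Linux systems)
}

// snapshot reads the current goroutine count and, on Linux, the number of
// open file descriptors by listing /proc/self/fd. Listing the directory
// briefly opens one descriptor itself, so the figure is slightly inflated.
func snapshot() resourceSnapshot {
	s := resourceSnapshot{Goroutines: runtime.NumGoroutine(), OpenFDs: -1}
	if entries, err := os.ReadDir("/proc/self/fd"); err == nil {
		s.OpenFDs = len(entries)
	}
	return s
}

func main() {
	s := snapshot()
	fmt.Printf("goroutines=%d open_fds=%d\n", s.Goroutines, s.OpenFDs)
}
```

A goroutine count that climbs steadily and never returns to baseline between games would point at cause 1; an open descriptor count approaching the `ulimit -n` value would point at cause 3.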

Impact

  • Server crashes wipe all game state (in-memory only)
  • Multiplayer games interrupted mid-play
  • Clients get connection refused with no error message

Suggested Investigation

  1. Add recover() in the HTTP handler to catch panics
  2. Add goroutine count logging (runtime.NumGoroutine()) on each request
  3. Add graceful shutdown with state persistence
  4. Check whether the error returned by http.Server.ListenAndServe is being swallowed

Context

Discovered during multiplayer QA of workflow-cardgame (7 game types, 12 concurrent agents). Two crashes in 20 minutes of sustained play.
