fix: restore RotatingFileHandler to prevent OOM from unbounded log growth #1492

Merged
spomichter merged 1 commit into dev from fix/logging-oom-unbounded-file-handler
Mar 9, 2026

@spomichter

Problem

The daemon PR (#1436) replaced RotatingFileHandler with plain FileHandler in logging_config.py to avoid a theoretical race when forkserver workers rotate the same file. However, this removed the 10 MiB × 20 backup size cap entirely.

Blueprints running cameras + LCM at 30 fps write ~100 MB/min of JSON logs. After 5-10 minutes the log file grows to 500 MB–1 GB, triggering the OOM killer and crashing the entire machine. Multiple people have reported this on current dev.

Solution

Restore RotatingFileHandler with the original limits (10 MiB per file, 20 backups, so about 200 MiB of backups plus the active file). Both the main setup_logger() handler and the set_run_log_dir() migration handler are fixed.

The forkserver rotation race (multiple workers rotating the same file) can theoretically lose a few interleaved log lines, but this never caused issues in practice. Unbounded growth is a production-breaking OOM.

Changes:

  • setup_logger(): FileHandler → RotatingFileHandler(maxBytes=10 MiB, backupCount=20)
  • set_run_log_dir(): migrated handlers also use RotatingFileHandler with same limits
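
A minimal sketch of the handler configuration listed above, using only the standard library. The helper name `make_rotating_handler` and its parameters are hypothetical and are not taken from logging_config.py; only the `maxBytes` and `backupCount` values come from this PR.

```python
import logging
import logging.handlers
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # 10 MiB per file (value from this PR)
BACKUP_COUNT = 20             # 20 rotated backups (value from this PR)


def make_rotating_handler(log_path: Path,
                          max_bytes: int = MAX_BYTES,
                          backup_count: int = BACKUP_COUNT) -> logging.Handler:
    """Hypothetical helper: build a size-capped file handler."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    return logging.handlers.RotatingFileHandler(
        log_path,
        maxBytes=max_bytes,        # roll over before the file exceeds this size
        backupCount=backup_count,  # keep at most this many rotated files
        encoding="utf-8",
    )
```

Once the active file would exceed `maxBytes`, it is renamed to `main.jsonl.1`, older backups shift up by one, and anything beyond `backupCount` is deleted, which is what keeps disk usage bounded.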

Breaking Changes

None. Restores pre-#1436 behavior.

How to Test

cd /home/ubuntu/dimos
git fetch origin && git checkout fix/logging-oom-unbounded-file-handler
source .venv/bin/activate
# Run any blueprint with camera streams for >5 minutes — should no longer OOM
dimos run unitree-go2-basic --simulation
# Check log files stay bounded:
ls -lh ~/.local/state/dimos/logs/*/main.jsonl*
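
The `ls` check above can also be scripted. The following is a sketch, assuming the log layout shown in the `ls` command (one directory per run, each holding `main.jsonl` plus rotated backups); the function name `total_log_bytes` and the cap calculation (active file plus 20 backups) are illustrative assumptions.

```python
from pathlib import Path


def total_log_bytes(run_dir: Path, base_name: str = "main.jsonl") -> int:
    """Sum the active log file and its rotated backups in one run directory."""
    return sum(p.stat().st_size
               for p in run_dir.glob(base_name + "*") if p.is_file())


if __name__ == "__main__":
    logs_root = Path.home() / ".local/state/dimos/logs"  # path from the steps above
    cap = 21 * 10 * 1024 * 1024  # assumed cap: active file + 20 backups at 10 MiB
    if logs_root.is_dir():
        for run_dir in sorted(p for p in logs_root.iterdir() if p.is_dir()):
            size = total_log_bytes(run_dir)
            status = "OK" if size <= cap else "OVER CAP"
            print(f"{run_dir.name}: {size / 2**20:.1f} MiB {status}")
```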

Contributor License Agreement

  • I have read and approved the CLA

…owth

The daemon PR (#1436) replaced RotatingFileHandler with plain FileHandler
to avoid a theoretical race when forkserver workers rotate the same file.
However, this removed the 10 MiB × 20 backup cap entirely, causing
unbounded log growth. Camera + LCM at 30 fps writes ~100 MB/min of JSON
logs, leading to OOM and full system crashes after 5-10 minutes.

Restore RotatingFileHandler with the original limits. The forkserver
rotation race can lose a few interleaved lines but never caused issues
in practice, whereas unbounded growth is a production-breaking OOM.
@greptile-apps

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR restores RotatingFileHandler (10 MiB per file, 20 backups = 200 MiB cap) in logging_config.py, reverting the FileHandler change from #1436 that eliminated the size cap and caused OOM crashes on resource-constrained devices running camera/LCM workloads at high log volumes (~100 MB/min).

Changes:

  • setup_logger() (lines 287-293): Replaces logging.FileHandler with logging.handlers.RotatingFileHandler(maxBytes=10 MiB, backupCount=20) and updates the comment to explicitly document the multi-process trade-off.
  • set_run_log_dir() (lines 65-71): The handler-migration path likewise creates RotatingFileHandler instances when re-pointing loggers to a new run directory.
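
The handler-migration path can be sketched roughly as follows. This is an illustration under assumptions, not the code in logging_config.py: the function name `migrate_file_handlers` is hypothetical, and the real set_run_log_dir() may track handlers differently.

```python
import logging
import logging.handlers
from pathlib import Path


def migrate_file_handlers(logger: logging.Logger, new_dir: Path,
                          max_bytes: int = 10 * 1024 * 1024,
                          backup_count: int = 20) -> None:
    """Hypothetical sketch: re-point a logger's file handlers at new_dir."""
    new_dir.mkdir(parents=True, exist_ok=True)
    for old in list(logger.handlers):
        if not isinstance(old, logging.FileHandler):
            continue  # leave console and other non-file handlers alone
        new = logging.handlers.RotatingFileHandler(
            new_dir / Path(old.baseFilename).name,
            maxBytes=max_bytes,
            backupCount=backup_count,
        )
        new.setLevel(old.level)
        new.setFormatter(old.formatter)
        logger.removeHandler(old)
        old.close()
        logger.addHandler(new)
```

Because `RotatingFileHandler` subclasses `FileHandler`, the `isinstance` check matches both plain and rotating handlers, so the replacement always carries the size limits.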

The PR correctly bounds total disk usage to 200 MiB, which is the critical property needed to prevent OOM on production devices. The known architectural issue (multiple forkserver workers holding independent handler instances to the same file) is acknowledged in code comments as an acceptable trade-off.

Confidence Score: 5/5

  • Safe to merge — correctly bounds disk usage and restores verified pre-regression behavior.
  • This PR is a minimal, targeted revert of a known regression (#1436, "feat(cli): daemon mode, stop, status, per-run logs (DIM-681, DIM-682, DIM-684, DIM-685)") that caused production-breaking OOM crashes. The change is straightforward and correct: it restores RotatingFileHandler with appropriate size limits (10 MiB × 20 backups) in both handler creation paths. The architectural issue (multi-handler rotation race) is acknowledged in code comments and is an existing design pattern, not a new regression introduced by this PR. No logic errors, security issues, or correctness problems were found.
  • No files require special attention beyond the single changed file.

Last reviewed commit: e600aa4

@spomichter spomichter merged commit 3ba6501 into dev Mar 9, 2026
12 checks passed
@spomichter spomichter deleted the fix/logging-oom-unbounded-file-handler branch March 9, 2026 16:13
@spomichter spomichter mentioned this pull request Mar 11, 2026
1 task
spomichter added a commit that referenced this pull request Mar 12, 2026
Release v0.0.11

82 PRs, 10 contributors, 396 files changed.

This release brings a production CLI, MCP tooling, temporal memory, and first-class support for coding agents. Dask has been removed. The entire stack now runs from `dimos run` through `dimos stop`.

### Agent-Native Development

DimOS is now built to be driven by coding agents. Point OpenClaw, Claude Code, or Cursor at [AGENTS.md](AGENTS.md) and they can build, run, and debug Dimensional applications using the CLI and MCP interfaces directly.

- **AGENTS.md** — comprehensive onboarding doc: architecture, CLI reference, skill rules, blueprint quick-reference. Your agent reads this and starts coding.
- **MCP server** — all `@skill` methods exposed as HTTP tools. External agents call `dimos mcp call relative_move --arg forward=0.5` or connect via JSON-RPC.
- **MCP CLI** — `dimos mcp list-tools`, `dimos mcp call`, `dimos mcp status`, `dimos mcp modules`
- **Agent context logging** — MCP tool calls and agent messages logged to per-run JSONL for debugging and replay.

### CLI & Daemon

Full process lifecycle — no more Ctrl-C in tmux.

- `dimos run --daemon` — background execution with health checks and run registry
- `dimos stop [--force]` — graceful shutdown with SIGTERM → SIGKILL fallback
- `dimos restart` — replays the original CLI arguments
- `dimos status` — PID, blueprint, uptime, MCP port
- `dimos log -f` — structured per-run logs with follow, JSON output, filtering
- `dimos show-config` — resolved GlobalConfig with source tracing
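
The SIGTERM → SIGKILL fallback mentioned for `dimos stop` follows a common pattern, sketched below with the standard library. This is not the DimOS implementation; the function name `stop_process` and the grace period are assumptions made for illustration.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask the process to exit, then force-kill it if it does not comply in time."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running daemon: a child that sleeps for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_process(child, grace_seconds=2.0))
```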

### Temporal-Spatial Memory

Robots in physical space ingest hours of video and lidar. Temporal-spatial memory gives them a human-like understanding of the world — causal object relationships, entity tracking through time and physical space, and the ability to answer complex temporal queries:

*Who spends the most time in the kitchen? What time on average do I wake up? Which set of switches toggles the main lights? Who was at the office at 9am last Thursday?*

Traditional frame-level embeddings (CLIP, ViT) lose temporal context and don't scale beyond a handful of frames. Video transformers are expensive and don't operate in RGB-D. Dimensional agents work with video + lidar natively, tracking entities across hours and days.

```bash
dimos --replay --replay-dir unitree_go2_office_walk2 run unitree-go2-temporal-memory
```

### Interactive Viewer

Custom Rerun fork (`dimos-viewer`) is now the default. Click-to-navigate: click a point in the 3D view → PointStamped → A* planner → robot moves.

- Camera | 3D split layout on Go2, G1, and drone blueprints
- Native keyboard teleop in the viewer
- `--viewer rerun|rerun-web|rerun-connect|foxglove|none`

### Drone Support

Drone blueprints modernized to match Go2 composition pattern. `drone-basic` and `drone-agentic` work with replay, Rerun, and the full CLI.

```bash
dimos --replay run drone-basic
dimos --replay run drone-agentic
```

### More

- **Go2 fleet control** — multi-robot with `--robot-ips` (#1487)
- **Replay `--replay-dir`** — select dataset, loops by default (#1519, #1494)
- **Interactive install** — `curl -fsSL .../install.sh | bash` (#1395)
- **Nix on non-Debian Linux** (#1472)
- **Remove Dask** — native worker pool (#1365)
- **Remove asyncio dependency** (#1367)
- **Perceive loop** — continuous observation module for agents (#1411)
- **Worker resource monitor** — `dtop` TUI (#1378)
- **G1 agent wiring fix** (#1518)
- **Rerun rate limiting** — prevents viewer OOM on continuous streams (#1509, #1521)
- **RotatingFileHandler** — prevents unbounded log growth (#1492)
- **Test coverage** (#1397), draft PR CI skip (#1398), manipulation test fixes (#1522)

### Breaking Changes

- `--viewer-backend` renamed to `--viewer`
- Dask removed — blueprints using Dask workers need migration to native worker pool
- Default viewer changed from `rerun-web` to `rerun` (native dimos-viewer)

### Contributors

@spomichter, @PaulNechifor, @ruthwikdasyam, @summeryang, @MustafaBhadsorawala, @leshy, @sambull, @JeffHykin, @RadientBrain

## Contributor License Agreement

- [x] I have read and approved the [CLA](https://github.com/dimensionalOS/dimos/blob/main/CLA.md).