fix: rate-limit rerun bridge to prevent viewer OOM #1509
Conversation
Add per-entity-path rate limiting (default 10 Hz) to `_on_message`. Without this, 30fps camera streams flood the viewer with data faster than its memory eviction can keep up, causing unbounded memory growth in the renderer's texture cache (not governed by `memory_limit`). Configurable via `min_interval_sec` on the bridge Config. Set to 0 to disable rate limiting for low-bandwidth deployments.
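For illustration, configuring the bridge might look like the sketch below. `Config`, `min_interval_sec`, and `RerunBridgeModule` come from this PR; the exact constructor shape is an assumption.

```python
from dimos.visualization.rerun.bridge import Config, RerunBridgeModule

# Default: at most one logged frame per entity path every 0.1 s (10 Hz).
bridge = RerunBridgeModule(Config(min_interval_sec=0.1))

# Low-bandwidth deployment: 0 disables rate limiting entirely.
bridge = RerunBridgeModule(Config(min_interval_sec=0))
```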
Greptile Summary

This PR adds per-entity-path rate limiting to `_on_message`. Confidence Score: 4/5.
Sequence Diagram

```mermaid
sequenceDiagram
    participant PS as PubSub
    participant BM as RerunBridgeModule._on_message()
    participant RL as Rate Limiter (_last_log)
    participant VO as _visual_override_for_entity_path()
    participant RR as rerun (viewer)
    PS->>BM: message + topic
    BM->>BM: _get_entity_path(topic) → entity_path
    alt min_interval_sec > 0
        BM->>RL: get last timestamp for entity_path
        alt now - last < min_interval_sec
            BM-->>PS: return (frame dropped)
        else interval elapsed
            RL-->>BM: update _last_log[entity_path] = now
        end
    end
    BM->>VO: lookup converter for entity_path (lru_cache)
    VO-->>BM: converter fn
    BM->>BM: converter(msg) → rerun_data
    alt rerun_data is None (suppressed)
        BM-->>PS: return
    else is_rerun_multi(rerun_data)
        loop each (path, archetype)
            BM->>RR: rr.log(path, archetype)
        end
    else single Archetype
        BM->>RR: rr.log(entity_path, rerun_data)
    end
```
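Read as code, the diagram corresponds roughly to this Python sketch, reconstructed from the diagram and the diffs below. Exact signatures are assumptions, and `_last_log` is assumed to already exist.

```python
import time
import rerun as rr

def _on_message(self, msg, topic):
    entity_path = self._get_entity_path(topic)

    # Per-entity-path rate limit; min_interval_sec == 0 disables it.
    if self.config.min_interval_sec > 0:
        now = time.monotonic()
        if now - self._last_log.get(entity_path, 0.0) < self.config.min_interval_sec:
            return  # frame dropped
        self._last_log[entity_path] = now

    converter = self._visual_override_for_entity_path(entity_path)  # lru_cache'd
    rerun_data = converter(msg)
    if rerun_data is None:  # topic suppressed
        return
    if is_rerun_multi(rerun_data):  # converter yielded several (path, archetype) pairs
        for path, archetype in rerun_data:
            rr.log(path, archetype)
    else:  # single Archetype
        rr.log(entity_path, rerun_data)
```

Note that the timestamp is recorded before the converter runs; the second review comment below flags exactly that ordering.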
Last reviewed commit: 4b37511
Review comment on `dimos/visualization/rerun/bridge.py` (marked outdated):
```python
if not hasattr(self, "_last_log"):
    self._last_log: dict[str, float] = {}
```
Lazy `_last_log` init in hot message path

`_on_message` is called on every incoming pubsub message (potentially hundreds per second), so the `hasattr(self, "_last_log")` check is evaluated on every call. Beyond the minor per-call overhead, there is a benign but avoidable race: if two pubsub threads deliver a message simultaneously before `_last_log` is set, both branches execute and one thread's empty dict silently overwrites the other's.

Prefer initialising `_last_log` once in `start()`, before subscriptions begin, so the attribute always exists and the hot path is reduced to a single dict lookup:
```python
# inside start(), before the subscribe_all loop:
self._last_log: dict[str, float] = {}
```

Then the hot-path block simplifies to:
```diff
-if not hasattr(self, "_last_log"):
-    self._last_log: dict[str, float] = {}
 if self.config.min_interval_sec > 0:
     now = time.monotonic()
     last = self._last_log.get(entity_path, 0.0)
     if now - last < self.config.min_interval_sec:
         return
     self._last_log[entity_path] = now
```
```python
last = self._last_log.get(entity_path, 0.0)
if now - last < self.config.min_interval_sec:
    return
self._last_log[entity_path] = now
```
Timestamp updated even when message is suppressed downstream

`_last_log[entity_path]` is written here, before `_visual_override_for_entity_path` is called. If the override converter returns `None` (i.e., the topic is suppressed), the slot is still "consumed" for this interval: the next arriving message for that entity path will be dropped by the rate limiter even though no frame was actually logged.

For purely suppressed topics this is harmless (nothing ever logs anyway), but for topics whose override sometimes returns `None` (e.g., conditional filtering), frames can be silently skipped at a lower effective rate than `1/min_interval_sec`.

Consider moving the timestamp update to just before the `rr.log` calls, after confirming `rerun_data` is not `None`:
```python
if self.config.min_interval_sec > 0:
    now = time.monotonic()
    last = self._last_log.get(entity_path, 0.0)
    if now - last < self.config.min_interval_sec:
        return
    # defer: update timestamp only after data is confirmed non-None
```

Move rate limiter dict initialization from per-message `hasattr` check to `start()`, before subscriptions begin. Removes per-call overhead and a potential race where two threads could overwrite each other's dict.
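Combining both suggestions, the reordered hot path might look like the sketch below. This is illustrative only, not the merged code; the converter lookup is as in the diagram above.

```python
def _on_message(self, msg, topic):
    entity_path = self._get_entity_path(topic)

    # Rate-limit check first, but don't consume the slot yet.
    now = None
    if self.config.min_interval_sec > 0:
        now = time.monotonic()
        if now - self._last_log.get(entity_path, 0.0) < self.config.min_interval_sec:
            return  # frame dropped

    converter = self._visual_override_for_entity_path(entity_path)
    rerun_data = converter(msg)
    if rerun_data is None:
        return  # suppressed: rate-limit slot is NOT consumed

    # The frame is now guaranteed to reach rr.log, so record the timestamp.
    if now is not None:
        self._last_log[entity_path] = now
```

This keeps suppressed messages from lowering the effective rate of later frames.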
…1521)

* fix(rerun): only rate-limit heavy message types (Image, PointCloud2)

  The blanket per-entity-path rate limiter (PR #1509) was dropping low-frequency but critical messages like navigation Path and PointStamped (click-to-nav). Only rate-limit types with large payloads that actually cause viewer OOM: Image (~1 MB/frame) and PointCloud2 (~600-800 KB/frame). Light messages (Path, Twist, TF, EntityMarkers, etc.) now pass through unthrottled.

* Fix import pointcloud2
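A sketch of the type gate #1521 describes, where only payload-heavy messages are throttled. The import path and the `_should_rate_limit` helper name are hypothetical.

```python
from sensor_msgs.msg import Image, PointCloud2  # hypothetical import path

# Only types with large payloads (~1 MB Image, ~600-800 KB PointCloud2)
# are rate-limited; Path, Twist, TF, etc. pass through unthrottled.
HEAVY_TYPES = (Image, PointCloud2)

def _should_rate_limit(self, msg) -> bool:
    return self.config.min_interval_sec > 0 and isinstance(msg, HEAVY_TYPES)
```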
Release v0.0.11

82 PRs, 10 contributors, 396 files changed. This release brings a production CLI, MCP tooling, temporal memory, and first-class support for coding agents. Dask has been removed. The entire stack now runs from `dimos run` through `dimos stop`.

### Agent-Native Development

DimOS is now built to be driven by coding agents. Point OpenClaw, Claude Code, or Cursor at [AGENTS.md](AGENTS.md) and they can build, run, and debug Dimensional applications using the CLI and MCP interfaces directly.

- **AGENTS.md** — comprehensive onboarding doc: architecture, CLI reference, skill rules, blueprint quick-reference. Your agent reads this and starts coding.
- **MCP server** — all `@skill` methods exposed as HTTP tools. External agents call `dimos mcp call relative_move --arg forward=0.5` or connect via JSON-RPC.
- **MCP CLI** — `dimos mcp list-tools`, `dimos mcp call`, `dimos mcp status`, `dimos mcp modules`
- **Agent context logging** — MCP tool calls and agent messages logged to per-run JSONL for debugging and replay.

### CLI & Daemon

Full process lifecycle — no more Ctrl-C in tmux.

- `dimos run --daemon` — background execution with health checks and run registry
- `dimos stop [--force]` — graceful shutdown with SIGTERM → SIGKILL fallback
- `dimos restart` — replays the original CLI arguments
- `dimos status` — PID, blueprint, uptime, MCP port
- `dimos log -f` — structured per-run logs with follow, JSON output, filtering
- `dimos show-config` — resolved GlobalConfig with source tracing

### Temporal-Spatial Memory

Robots in physical space ingest hours of video and lidar. Temporal-spatial memory gives them a human-like understanding of the world — causal object relationships, entity tracking through time and physical space, and the ability to answer complex temporal queries: *Who spends the most time in the kitchen? What time on average do I wake up? Which set of switches toggles the main lights? Who was at the office at 9am last Thursday?*

Traditional frame-level embeddings (CLIP, ViT) lose temporal context and don't scale beyond a handful of frames. Video transformers are expensive and don't operate in RGB-D. Dimensional agents work with video + lidar natively, tracking entities across hours and days.

```bash
dimos --replay --replay-dir unitree_go2_office_walk2 run unitree-go2-temporal-memory
```

### Interactive Viewer

Custom Rerun fork (`dimos-viewer`) is now the default. Click-to-navigate: click a point in the 3D view → PointStamped → A* planner → robot moves.

- Camera | 3D split layout on Go2, G1, and drone blueprints
- Native keyboard teleop in the viewer
- `--viewer rerun|rerun-web|rerun-connect|foxglove|none`

### Drone Support

Drone blueprints modernized to match Go2 composition pattern. `drone-basic` and `drone-agentic` work with replay, Rerun, and the full CLI.
```bash
dimos --replay run drone-basic
dimos --replay run drone-agentic
```

### More

- **Go2 fleet control** — multi-robot with `--robot-ips` (#1487)
- **Replay `--replay-dir`** — select dataset, loops by default (#1519, #1494)
- **Interactive install** — `curl -fsSL .../install.sh | bash` (#1395)
- **Nix on non-Debian Linux** (#1472)
- **Remove Dask** — native worker pool (#1365)
- **Remove asyncio dependency** (#1367)
- **Perceive loop** — continuous observation module for agents (#1411)
- **Worker resource monitor** — `dtop` TUI (#1378)
- **G1 agent wiring fix** (#1518)
- **Rerun rate limiting** — prevents viewer OOM on continuous streams (#1509, #1521)
- **RotatingFileHandler** — prevents unbounded log growth (#1492)
- **Test coverage** (#1397), draft PR CI skip (#1398), manipulation test fixes (#1522)

### Breaking Changes

- `--viewer-backend` renamed to `--viewer`
- Dask removed — blueprints using Dask workers need migration to native worker pool
- Default viewer changed from `rerun-web` to `rerun` (native dimos-viewer)

### Contributors

@spomichter, @PaulNechifor, @ruthwikdasyam, @summeryang, @MustafaBhadsorawala, @leshy, @sambull, @JeffHykin, @RadientBrain

## Contributor License Agreement

- [x] I have read and approved the [CLA](https://github.com/dimensionalOS/dimos/blob/main/CLA.md).
### Problem

After ~1 hour of running `unitree-go2-agentic` with the native Rerun viewer (stock or dimos-viewer), the system locks up. Killing the viewer process recovers the system, proving the viewer process is the memory hog.

### Root Cause

Rerun's `memory_limit` only governs the chunk store (data it can GC). It does not control GPU texture memory, render buffers, or texture caches. When 30fps camera images are logged unthrottled, the renderer's texture cache grows unbounded even though the chunk store stays within limits.

This is upstream Rerun behavior; it affects both stock `rerun` and `dimos-viewer`.

### Fix

Per-entity-path rate limiting in `RerunBridgeModule._on_message()`:

- Default 10 Hz (`min_interval_sec=0.1`)
- Set `min_interval_sec=0` to disable for low-bandwidth deployments

### Changes

- `bridge.py`: Add `min_interval_sec` to Config, rate-limit in `_on_message()`
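The Config change could be as small as one field; a minimal sketch, assuming Config is a dataclass:

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Minimum seconds between logged frames per entity path.
    # 0.1 → at most 10 Hz per path; 0 disables rate limiting.
    min_interval_sec: float = 0.1
    # ... other bridge settings elided
```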