
fix: rate-limit rerun bridge to prevent viewer OOM#1509

Merged

spomichter merged 2 commits into dev from fix/rerun-bridge-rate-limit on Mar 10, 2026

Conversation

@spomichter
Contributor

Problem

After ~1 hour of running unitree-go2-agentic with the native Rerun viewer (stock or dimos-viewer), the system locks up:

  • All CPU workers at 80%+
  • RAM 30GB/31GB, swap 30GB/30GB
  • Renderer crashpad handler messages

Killing the viewer process recovers the system, confirming that the viewer is the memory hog.

Root Cause

Rerun's memory_limit only governs the chunk store (data it can GC). It does not control GPU texture memory, render buffers, or texture caches. When 30fps camera images are logged unthrottled, the renderer's texture cache grows unbounded even though the chunk store stays within limits.

This is upstream Rerun behavior — affects both stock rerun and dimos-viewer.
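A rough back-of-envelope shows the scale of the problem. The ~1 MB/frame figure is an assumption (typical for a moderately sized RGB camera frame), not a measured value from this PR:

```python
# Estimate how much image data an unthrottled 30 fps camera stream
# pushes into the viewer per hour. Frame size is an assumption.
frame_bytes = 1 * 1024 * 1024          # ~1 MiB per camera frame (assumed)
fps = 30                               # unthrottled camera rate
ingest_gib_per_hour = frame_bytes * fps * 3600 / 1024**3
print(f"~{ingest_gib_per_hour:.0f} GiB logged per hour")  # prints "~105 GiB logged per hour"
```

At that rate, any cache that is not aggressively evicted will exhaust 30 GB of RAM well within the observed ~1 hour window.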

Fix

Per-entity-path rate limiting in RerunBridgeModule._on_message():

  • Default: 10 Hz max per entity path (min_interval_sec=0.1)
  • Drops frames between intervals instead of logging every one
  • Configurable: set min_interval_sec=0 to disable for low-bandwidth deployments
  • 30fps camera → 10fps to viewer = 3x less texture memory pressure

Changes

  • bridge.py: Add min_interval_sec to Config, rate-limit in _on_message()
  • No viewer rebuild needed
  • No breaking changes (new config field has default value)

Add per-entity-path rate limiting (default 10 Hz) to _on_message.
Without this, 30fps camera streams flood the viewer with data faster
than its memory eviction can keep up, causing unbounded memory growth
in the renderer's texture cache (not governed by memory_limit).

Configurable via min_interval_sec on the bridge Config. Set to 0 to
disable rate limiting for low-bandwidth deployments.
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR adds per-entity-path rate limiting to RerunBridgeModule._on_message() to prevent the Rerun viewer's GPU texture cache from growing unbounded when high-frequency camera streams are logged, which was causing OOM crashes after ~1 hour of operation.

Key changes:

  • New min_interval_sec: float = 0.1 field on Config (default 10 Hz cap per entity path; set to 0 to disable)
  • Guard in _on_message that drops incoming frames for an entity path if the elapsed time since its last logged frame is less than min_interval_sec
  • No breaking changes — new config field defaults to the rate-limited behaviour

Issues found:

  • The _last_log dict is lazily created inside the hot message path via hasattr(self, "_last_log"), which is evaluated on every single incoming message. Initialising the dict once in start() (before subscriptions begin) removes the per-call check and eliminates a benign but unnecessary concurrent-write window.
  • _last_log[entity_path] is written before _visual_override_for_entity_path is called, so if an override converter conditionally returns None, the rate-limit slot is consumed even though no frame was actually logged to the viewer. For always-suppressed topics this is harmless, but for conditional filters it means the effective output rate can be lower than 1/min_interval_sec.

Confidence Score: 4/5

  • Safe to merge — the change is a targeted, well-documented fix with a sound default and no breaking changes.
  • The logic is correct and the OOM root cause is well understood. The only concerns are a minor hot-path hasattr pattern and a pre-update timestamp ordering issue that only affects conditional suppression scenarios; neither will cause incorrect behaviour in the primary use case.
  • No files require special attention beyond the two style-level suggestions on bridge.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/visualization/rerun/bridge.py | Adds per-entity-path rate limiting (default 10 Hz) to _on_message via a new min_interval_sec config field; implementation is correct and non-breaking, with minor style concerns around lazy dict initialisation in the hot path and updating the rate-limit timestamp before confirming data is non-null. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant PS as PubSub
    participant BM as RerunBridgeModule._on_message()
    participant RL as Rate Limiter (_last_log)
    participant VO as _visual_override_for_entity_path()
    participant RR as rerun (viewer)

    PS->>BM: message + topic
    BM->>BM: _get_entity_path(topic) → entity_path

    alt min_interval_sec > 0
        BM->>RL: get last timestamp for entity_path
        alt now - last < min_interval_sec
            BM-->>PS: return (frame dropped)
        else interval elapsed
            RL-->>BM: update _last_log[entity_path] = now
        end
    end

    BM->>VO: lookup converter for entity_path (lru_cache)
    VO-->>BM: converter fn

    BM->>BM: converter(msg) → rerun_data

    alt rerun_data is None (suppressed)
        BM-->>PS: return
    else is_rerun_multi(rerun_data)
        loop each (path, archetype)
            BM->>RR: rr.log(path, archetype)
        end
    else single Archetype
        BM->>RR: rr.log(entity_path, rerun_data)
    end
```

Last reviewed commit: 4b37511

Comment on lines +264 to +265

```python
if not hasattr(self, "_last_log"):
    self._last_log: dict[str, float] = {}
```

Lazy _last_log init in hot message path

_on_message is called on every incoming pubsub message (potentially hundreds per second). The hasattr(self, "_last_log") check is evaluated on each call, every time rate-limiting is enabled. Beyond the minor per-call overhead, there is a benign but avoidable race: if two pubsub threads deliver a message simultaneously before _last_log is set, both branches execute and one thread's empty dict silently overwrites the other's.

Prefer initialising _last_log once in start(), before subscriptions begin, so the attribute always exists and the hot path is reduced to a single dict lookup:

```python
# inside start(), before the subscribe_all loop:
self._last_log: dict[str, float] = {}
```

Then the hot-path block simplifies to:

Suggested change

```diff
-if not hasattr(self, "_last_log"):
-    self._last_log: dict[str, float] = {}
 if self.config.min_interval_sec > 0:
     now = time.monotonic()
     last = self._last_log.get(entity_path, 0.0)
     if now - last < self.config.min_interval_sec:
         return
     self._last_log[entity_path] = now
```

Timestamp updated even when message is suppressed downstream

_last_log[entity_path] is written here — before _visual_override_for_entity_path is called. If the override converter returns None (i.e., the topic is suppressed), the slot is still "consumed" for this interval: the next arriving message for that entity path will be dropped by the rate limiter even though no frame was actually logged.

For purely suppressed topics this is harmless (nothing ever logs anyway), but for topics whose override sometimes returns None (e.g., conditional filtering), frames can be silently skipped at a lower effective rate than 1/min_interval_sec.

Consider moving the timestamp update to just before the rr.log calls, after confirming rerun_data is not None:

```python
if self.config.min_interval_sec > 0:
    now = time.monotonic()
    last = self._last_log.get(entity_path, 0.0)
    if now - last < self.config.min_interval_sec:
        return
    # defer: update timestamp only after data is confirmed non-None
```

Move rate limiter dict initialization from per-message hasattr check
to start(), before subscriptions begin. Removes per-call overhead and
a potential race where two threads could overwrite each other's dict.
@spomichter spomichter merged commit 1ea9500 into dev Mar 10, 2026
12 checks passed
@spomichter spomichter deleted the fix/rerun-bridge-rate-limit branch March 10, 2026 14:42
spomichter added a commit that referenced this pull request Mar 11, 2026
…1521)

* fix(rerun): only rate-limit heavy message types (Image, PointCloud2)

The blanket per-entity-path rate limiter (PR #1509) was dropping
low-frequency but critical messages like navigation Path and
PointStamped (click-to-nav).

Only rate-limit types with large payloads that actually cause viewer
OOM: Image (~1 MB/frame) and PointCloud2 (~600-800 KB/frame).
Light messages (Path, Twist, TF, EntityMarkers, etc.) now pass
through unthrottled.

* Fix import pointcloud2
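The follow-up's type-gated throttling can be sketched as below. This is a hypothetical illustration, not the actual bridge.py code: message types are represented as plain strings, and the helper name `should_drop` is invented for the example.

```python
# Only payload types heavy enough to OOM the viewer are throttled;
# light messages (Path, Twist, TF, ...) always pass through.
HEAVY_TYPES = {"Image", "PointCloud2"}  # ~1 MB and ~600-800 KB per frame
MIN_INTERVAL_SEC = 0.1                  # 10 Hz cap for heavy types
_last_log: dict[str, float] = {}

def should_drop(entity_path: str, msg_type: str, now: float) -> bool:
    """Return True if this frame should be dropped by the rate limiter."""
    if msg_type not in HEAVY_TYPES:
        return False                    # low-frequency critical messages pass
    last = _last_log.get(entity_path, 0.0)
    if now - last < MIN_INTERVAL_SEC:
        return True                     # heavy frame inside the interval: drop
    _last_log[entity_path] = now
    return False
```

This keeps click-to-nav's PointStamped and planner Path messages unthrottled while still bounding the viewer's texture-cache growth.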
@spomichter spomichter mentioned this pull request Mar 11, 2026
1 task
spomichter added a commit that referenced this pull request Mar 12, 2026
Release v0.0.11

82 PRs, 10 contributors, 396 files changed.

This release brings a production CLI, MCP tooling, temporal memory, and first-class support for coding agents. Dask has been removed. The entire stack now runs from `dimos run` through `dimos stop`.

### Agent-Native Development

DimOS is now built to be driven by coding agents. Point OpenClaw, Claude Code, or Cursor at [AGENTS.md](AGENTS.md) and they can build, run, and debug Dimensional applications using the CLI and MCP interfaces directly.

- **AGENTS.md** — comprehensive onboarding doc: architecture, CLI reference, skill rules, blueprint quick-reference. Your agent reads this and starts coding.
- **MCP server** — all `@skill` methods exposed as HTTP tools. External agents call `dimos mcp call relative_move --arg forward=0.5` or connect via JSON-RPC.
- **MCP CLI** — `dimos mcp list-tools`, `dimos mcp call`, `dimos mcp status`, `dimos mcp modules`
- **Agent context logging** — MCP tool calls and agent messages logged to per-run JSONL for debugging and replay.

### CLI & Daemon

Full process lifecycle — no more Ctrl-C in tmux.

- `dimos run --daemon` — background execution with health checks and run registry
- `dimos stop [--force]` — graceful shutdown with SIGTERM → SIGKILL fallback
- `dimos restart` — replays the original CLI arguments
- `dimos status` — PID, blueprint, uptime, MCP port
- `dimos log -f` — structured per-run logs with follow, JSON output, filtering
- `dimos show-config` — resolved GlobalConfig with source tracing

### Temporal-Spatial Memory

Robots in physical space ingest hours of video and lidar. Temporal-spatial memory gives them a human-like understanding of the world — causal object relationships, entity tracking through time and physical space, and the ability to answer complex temporal queries:

*Who spends the most time in the kitchen? What time on average do I wake up? Which set of switches toggles the main lights? Who was at the office at 9am last Thursday?*

Traditional frame-level embeddings (CLIP, ViT) lose temporal context and don't scale beyond a handful of frames. Video transformers are expensive and don't operate in RGB-D. Dimensional agents work with video + lidar natively, tracking entities across hours and days.

```bash
dimos --replay --replay-dir unitree_go2_office_walk2 run unitree-go2-temporal-memory
```

### Interactive Viewer

Custom Rerun fork (`dimos-viewer`) is now the default. Click-to-navigate: click a point in the 3D view → PointStamped → A* planner → robot moves.

- Camera | 3D split layout on Go2, G1, and drone blueprints
- Native keyboard teleop in the viewer
- `--viewer rerun|rerun-web|rerun-connect|foxglove|none`

### Drone Support

Drone blueprints modernized to match Go2 composition pattern. `drone-basic` and `drone-agentic` work with replay, Rerun, and the full CLI.

```bash
dimos --replay run drone-basic
dimos --replay run drone-agentic
```

### More

- **Go2 fleet control** — multi-robot with `--robot-ips` (#1487)
- **Replay `--replay-dir`** — select dataset, loops by default (#1519, #1494)
- **Interactive install** — `curl -fsSL .../install.sh | bash` (#1395)
- **Nix on non-Debian Linux** (#1472)
- **Remove Dask** — native worker pool (#1365)
- **Remove asyncio dependency** (#1367)
- **Perceive loop** — continuous observation module for agents (#1411)
- **Worker resource monitor** — `dtop` TUI (#1378)
- **G1 agent wiring fix** (#1518)
- **Rerun rate limiting** — prevents viewer OOM on continuous streams (#1509, #1521)
- **RotatingFileHandler** — prevents unbounded log growth (#1492)
- **Test coverage** (#1397), draft PR CI skip (#1398), manipulation test fixes (#1522)

### Breaking Changes

- `--viewer-backend` renamed to `--viewer`
- Dask removed — blueprints using Dask workers need migration to native worker pool
- Default viewer changed from `rerun-web` to `rerun` (native dimos-viewer)

### Contributors

@spomichter, @PaulNechifor, @ruthwikdasyam, @summeryang, @MustafaBhadsorawala, @leshy, @sambull, @JeffHykin, @RadientBrain

## Contributor License Agreement

- [x] I have read and approved the [CLA](https://github.com/dimensionalOS/dimos/blob/main/CLA.md).