
fix: rate-limit rerun bridge to prevent viewer OOM#1509

Merged

spomichter merged 2 commits into dev from fix/rerun-bridge-rate-limit on Mar 10, 2026

Conversation

@spomichter
Contributor

Problem

After ~1 hour of running unitree-go2-agentic with the native Rerun viewer (stock or dimos-viewer), the system locks up:

  • All CPU workers at 80%+
  • RAM 30GB/31GB, swap 30GB/30GB
  • Renderer crashpad handler messages

Killing the viewer process recovers the system, confirming that the viewer is the memory hog.

Root Cause

Rerun's memory_limit only governs the chunk store (data it can GC). It does not control GPU texture memory, render buffers, or texture caches. When 30fps camera images are logged unthrottled, the renderer's texture cache grows unbounded even though the chunk store stays within limits.

This is upstream Rerun behavior — affects both stock rerun and dimos-viewer.
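A rough back-of-envelope shows the scale of the problem. The ~1 MB/frame figure is an assumption (typical for a moderately sized RGB camera frame), not a measured value from this PR:

```python
# Estimate how much image data an unthrottled 30 fps camera stream
# pushes into the viewer per hour. Frame size is an assumption.
frame_bytes = 1 * 1024 * 1024          # ~1 MiB per camera frame (assumed)
fps = 30                               # unthrottled camera rate
ingest_gib_per_hour = frame_bytes * fps * 3600 / 1024**3
print(f"~{ingest_gib_per_hour:.0f} GiB logged per hour")  # prints "~105 GiB logged per hour"
```

At that rate, any cache that is not aggressively evicted will exhaust 30 GB of RAM well within the observed ~1 hour window.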

Fix

Per-entity-path rate limiting in RerunBridgeModule._on_message():

  • Default: 10 Hz max per entity path (min_interval_sec=0.1)
  • Drops frames between intervals instead of logging every one
  • Configurable: set min_interval_sec=0 to disable for low-bandwidth deployments
  • 30fps camera → 10fps to viewer = 3x less texture memory pressure

Changes

  • bridge.py: Add min_interval_sec to Config, rate-limit in _on_message()
  • No viewer rebuild needed
  • No breaking changes (new config field has default value)

Add per-entity-path rate limiting (default 10 Hz) to _on_message.
Without this, 30fps camera streams flood the viewer with data faster
than its memory eviction can keep up, causing unbounded memory growth
in the renderer's texture cache (not governed by memory_limit).

Configurable via min_interval_sec on the bridge Config. Set to 0 to
disable rate limiting for low-bandwidth deployments.
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR adds per-entity-path rate limiting to RerunBridgeModule._on_message() to prevent the Rerun viewer's GPU texture cache from growing unbounded when high-frequency camera streams are logged, which was causing OOM crashes after ~1 hour of operation.

Key changes:

  • New min_interval_sec: float = 0.1 field on Config (default 10 Hz cap per entity path; set to 0 to disable)
  • Guard in _on_message that drops incoming frames for an entity path if the elapsed time since its last logged frame is less than min_interval_sec
  • No breaking changes — new config field defaults to the rate-limited behaviour

Issues found:

  • The _last_log dict is lazily created inside the hot message path via hasattr(self, "_last_log"), which is evaluated on every single incoming message. Initialising the dict once in start() (before subscriptions begin) removes the per-call check and eliminates a benign but unnecessary concurrent-write window.
  • _last_log[entity_path] is written before _visual_override_for_entity_path is called, so if an override converter conditionally returns None, the rate-limit slot is consumed even though no frame was actually logged to the viewer. For always-suppressed topics this is harmless, but for conditional filters it means the effective output rate can be lower than 1/min_interval_sec.

Confidence Score: 4/5

  • Safe to merge — the change is a targeted, well-documented fix with a sound default and no breaking changes.
  • The logic is correct and the OOM root cause is well understood. The only concerns are a minor hot-path hasattr pattern and a pre-update timestamp ordering issue that only affects conditional suppression scenarios; neither will cause incorrect behaviour in the primary use case.
  • No files require special attention beyond the two style-level suggestions on bridge.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/visualization/rerun/bridge.py | Adds per-entity-path rate limiting (default 10 Hz) to _on_message via a new min_interval_sec config field; implementation is correct and non-breaking, with minor style concerns around lazy dict initialisation in the hot path and updating the rate-limit timestamp before confirming data is non-null. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant PS as PubSub
    participant BM as RerunBridgeModule._on_message()
    participant RL as Rate Limiter (_last_log)
    participant VO as _visual_override_for_entity_path()
    participant RR as rerun (viewer)

    PS->>BM: message + topic
    BM->>BM: _get_entity_path(topic) → entity_path

    alt min_interval_sec > 0
        BM->>RL: get last timestamp for entity_path
        alt now - last < min_interval_sec
            BM-->>PS: return (frame dropped)
        else interval elapsed
            RL-->>BM: update _last_log[entity_path] = now
        end
    end

    BM->>VO: lookup converter for entity_path (lru_cache)
    VO-->>BM: converter fn

    BM->>BM: converter(msg) → rerun_data

    alt rerun_data is None (suppressed)
        BM-->>PS: return
    else is_rerun_multi(rerun_data)
        loop each (path, archetype)
            BM->>RR: rr.log(path, archetype)
        end
    else single Archetype
        BM->>RR: rr.log(entity_path, rerun_data)
    end
```

Last reviewed commit: 4b37511

Comment on lines +264 to +265

```python
if not hasattr(self, "_last_log"):
    self._last_log: dict[str, float] = {}
```

Lazy _last_log init in hot message path

_on_message is called on every incoming pubsub message (potentially hundreds per second). The hasattr(self, "_last_log") check is evaluated on each call, every time rate-limiting is enabled. Beyond the minor per-call overhead, there is a benign but avoidable race: if two pubsub threads deliver a message simultaneously before _last_log is set, both branches execute and one thread's empty dict silently overwrites the other's.

Prefer initialising _last_log once in start(), before subscriptions begin, so the attribute always exists and the hot path is reduced to a single dict lookup:

```python
# inside start(), before the subscribe_all loop:
self._last_log: dict[str, float] = {}
```

Then the hot-path block simplifies to:

Suggested change

```diff
-if not hasattr(self, "_last_log"):
-    self._last_log: dict[str, float] = {}
 if self.config.min_interval_sec > 0:
     now = time.monotonic()
     last = self._last_log.get(entity_path, 0.0)
     if now - last < self.config.min_interval_sec:
         return
     self._last_log[entity_path] = now
```

Timestamp updated even when message is suppressed downstream

_last_log[entity_path] is written here — before _visual_override_for_entity_path is called. If the override converter returns None (i.e., the topic is suppressed), the slot is still "consumed" for this interval: the next arriving message for that entity path will be dropped by the rate limiter even though no frame was actually logged.

For purely suppressed topics this is harmless (nothing ever logs anyway), but for topics whose override sometimes returns None (e.g., conditional filtering), frames can be silently skipped at a lower effective rate than 1/min_interval_sec.

Consider moving the timestamp update to just before the rr.log calls, after confirming rerun_data is not None:

```python
if self.config.min_interval_sec > 0:
    now = time.monotonic()
    last = self._last_log.get(entity_path, 0.0)
    if now - last < self.config.min_interval_sec:
        return
    # defer: update timestamp only after data is confirmed non-None
```

Move rate limiter dict initialization from per-message hasattr check
to start(), before subscriptions begin. Removes per-call overhead and
a potential race where two threads could overwrite each other's dict.
@spomichter spomichter merged commit 1ea9500 into dev Mar 10, 2026
12 checks passed
@spomichter spomichter deleted the fix/rerun-bridge-rate-limit branch March 10, 2026 14:42
spomichter added a commit that referenced this pull request Mar 11, 2026
…1521)

* fix(rerun): only rate-limit heavy message types (Image, PointCloud2)

The blanket per-entity-path rate limiter (PR #1509) was dropping
low-frequency but critical messages like navigation Path and
PointStamped (click-to-nav).

Only rate-limit types with large payloads that actually cause viewer
OOM: Image (~1 MB/frame) and PointCloud2 (~600-800 KB/frame).
Light messages (Path, Twist, TF, EntityMarkers, etc.) now pass
through unthrottled.

* Fix import pointcloud2
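The follow-up's type-gated throttling can be sketched as below. This is a hypothetical illustration, not the actual bridge.py code: message types are represented as plain strings, and the helper name `should_drop` is invented for the example.

```python
# Only payload types heavy enough to OOM the viewer are throttled;
# light messages (Path, Twist, TF, ...) always pass through.
HEAVY_TYPES = {"Image", "PointCloud2"}  # ~1 MB and ~600-800 KB per frame
MIN_INTERVAL_SEC = 0.1                  # 10 Hz cap for heavy types
_last_log: dict[str, float] = {}

def should_drop(entity_path: str, msg_type: str, now: float) -> bool:
    """Return True if this frame should be dropped by the rate limiter."""
    if msg_type not in HEAVY_TYPES:
        return False                    # low-frequency critical messages pass
    last = _last_log.get(entity_path, 0.0)
    if now - last < MIN_INTERVAL_SEC:
        return True                     # heavy frame inside the interval: drop
    _last_log[entity_path] = now
    return False
```

This keeps click-to-nav's PointStamped and planner Path messages unthrottled while still bounding the viewer's texture-cache growth.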
@spomichter spomichter mentioned this pull request Mar 11, 2026
1 task
spomichter added a commit that referenced this pull request Mar 12, 2026
Release v0.0.11

82 PRs, 10 contributors, 396 files changed.

This release brings a production CLI, MCP tooling, temporal memory, and first-class support for coding agents. Dask has been removed. The entire stack now runs from `dimos run` through `dimos stop`.

### Agent-Native Development

DimOS is now built to be driven by coding agents. Point OpenClaw, Claude Code, or Cursor at [AGENTS.md](AGENTS.md) and they can build, run, and debug Dimensional applications using the CLI and MCP interfaces directly.

- **AGENTS.md** — comprehensive onboarding doc: architecture, CLI reference, skill rules, blueprint quick-reference. Your agent reads this and starts coding.
- **MCP server** — all `@skill` methods exposed as HTTP tools. External agents call `dimos mcp call relative_move --arg forward=0.5` or connect via JSON-RPC.
- **MCP CLI** — `dimos mcp list-tools`, `dimos mcp call`, `dimos mcp status`, `dimos mcp modules`
- **Agent context logging** — MCP tool calls and agent messages logged to per-run JSONL for debugging and replay.

### CLI & Daemon

Full process lifecycle — no more Ctrl-C in tmux.

- `dimos run --daemon` — background execution with health checks and run registry
- `dimos stop [--force]` — graceful shutdown with SIGTERM → SIGKILL fallback
- `dimos restart` — replays the original CLI arguments
- `dimos status` — PID, blueprint, uptime, MCP port
- `dimos log -f` — structured per-run logs with follow, JSON output, filtering
- `dimos show-config` — resolved GlobalConfig with source tracing

### Temporal-Spatial Memory

Robots in physical space ingest hours of video and lidar. Temporal-spatial memory gives them a human-like understanding of the world — causal object relationships, entity tracking through time and physical space, and the ability to answer complex temporal queries:

*Who spends the most time in the kitchen? What time on average do I wake up? Which set of switches toggles the main lights? Who was at the office at 9am last Thursday?*

Traditional frame-level embeddings (CLIP, ViT) lose temporal context and don't scale beyond a handful of frames. Video transformers are expensive and don't operate in RGB-D. Dimensional agents work with video + lidar natively, tracking entities across hours and days.

```bash
dimos --replay --replay-dir unitree_go2_office_walk2 run unitree-go2-temporal-memory
```

### Interactive Viewer

Custom Rerun fork (`dimos-viewer`) is now the default. Click-to-navigate: click a point in the 3D view → PointStamped → A* planner → robot moves.

- Camera | 3D split layout on Go2, G1, and drone blueprints
- Native keyboard teleop in the viewer
- `--viewer rerun|rerun-web|rerun-connect|foxglove|none`

### Drone Support

Drone blueprints modernized to match Go2 composition pattern. `drone-basic` and `drone-agentic` work with replay, Rerun, and the full CLI.

```bash
dimos --replay run drone-basic
dimos --replay run drone-agentic
```

### More

- **Go2 fleet control** — multi-robot with `--robot-ips` (#1487)
- **Replay `--replay-dir`** — select dataset, loops by default (#1519, #1494)
- **Interactive install** — `curl -fsSL .../install.sh | bash` (#1395)
- **Nix on non-Debian Linux** (#1472)
- **Remove Dask** — native worker pool (#1365)
- **Remove asyncio dependency** (#1367)
- **Perceive loop** — continuous observation module for agents (#1411)
- **Worker resource monitor** — `dtop` TUI (#1378)
- **G1 agent wiring fix** (#1518)
- **Rerun rate limiting** — prevents viewer OOM on continuous streams (#1509, #1521)
- **RotatingFileHandler** — prevents unbounded log growth (#1492)
- **Test coverage** (#1397), draft PR CI skip (#1398), manipulation test fixes (#1522)

### Breaking Changes

- `--viewer-backend` renamed to `--viewer`
- Dask removed — blueprints using Dask workers need migration to native worker pool
- Default viewer changed from `rerun-web` to `rerun` (native dimos-viewer)

### Contributors

@spomichter, @PaulNechifor, @ruthwikdasyam, @summeryang, @MustafaBhadsorawala, @leshy, @sambull, @JeffHykin, @RadientBrain

## Contributor License Agreement

- [x] I have read and approved the [CLA](https://github.com/dimensionalOS/dimos/blob/main/CLA.md).