Rerun issue956 by Nabla7 · Pull Request #959 · dimensionalOS/dimos

Nabla7 · 2026-01-07T12:52:38Z

This PR addresses issue #956.
Changes:
TF visualization now polls the TF buffer (self.tf.buffers) at a configurable rate and logs a snapshot of the latest transforms. No direct LCM subscription inside the TF viz module.
GO2Connection no longer manually logs robot/camera transforms to Rerun (no composing transforms, no hand-built rr.Transform3D). It only publishes TF via self.tf.publish(*transforms) and logs static assets (URDF, axes, camera pinhole) + camera images.
TF logs are written under world/tf/{child_frame} so they’re visible under the default world origin without mixing semantic entity paths with TF.
Removed queue/thread “async rerun worker” patterns from CostMapper and VoxelGridMapper; logging is synchronous again. Freezing/perf is expected to be handled by voxel size (O(n) costmap/mesh generation), not background threads.
Added/updated documentation: inventory of all Rerun usage in the codebase and a short “how to use Rerun on dev” section with TF/entity path conventions.
Notes:
URDF loading remains (intentionally) for future robot model usage.
Camera distortion: Rerun pinhole uses intrinsics; distortion must be handled by undistorting images upstream if needed.

greptile-apps

Greptile Overview

Greptile Summary

This PR refactors Rerun visualization architecture to use polling-based TF snapshots and removes async background threads from mappers. The changes address issue #956 by simplifying the visualization pipeline and improving maintainability.

Key changes:

TF visualization now polls the TF buffer at 30Hz (configurable) instead of subscribing to /tf LCM topic directly, providing rate-limited stable updates
GO2Connection no longer manually composes and logs transforms to Rerun; robot and camera motion is now driven entirely through TF system attachments to named frames (base_link, camera_optical)
TF visualization logs are isolated under world/tf/{child} paths to separate debugging views from semantic entity paths
Removed async queue/thread patterns from CostMapper and VoxelGridMapper; Rerun logging is now synchronous (performance controlled via voxel size tuning)
Added comprehensive documentation covering all Rerun usage, TF conventions, and entity path organization

Issues found:

VoxelGridMapper has unguarded rr.log() calls for metrics (lines 177-179) that will fail when Rerun backend is not active
TFRerunModule accesses internal self.tf.buffers attribute; consider documenting this as intentional or using a public API

Confidence Score: 4/5

This PR is safe to merge with one critical bug fix needed for non-Rerun backends
The refactoring simplifies the architecture and is well-documented, but contains one logic bug in VoxelGridMapper where metrics logging is unguarded and will crash when Rerun backend is not selected. The architectural changes are sound and remove complexity, but the unguarded rr.log() calls need fixing before deployment.
dimos/mapping/voxels.py requires immediate attention for the unguarded metrics logging bug

Important Files Changed

File Analysis

Filename	Score	Overview
dimos/dashboard/tf_rerun_module.py	4/5	Replaced LCM subscription with polling-based TF buffer snapshot at configurable rate (30Hz default), logs to `world/tf/{child}` paths
dimos/robot/unitree/connection/go2.py	5/5	Removed manual TF/transform logging from odometry callback; robot/camera motion now driven entirely via TF system attachments
dimos/mapping/costmapper.py	4/5	Removed async Rerun queue/thread; switched to synchronous logging with 2D image panel added to views
dimos/mapping/voxels.py	4/5	Removed async Rerun queue/thread; switched to synchronous logging directly in data pipeline

Sequence Diagram

sequenceDiagram
    participant GO2 as GO2Connection
    participant TFSys as TF System (self.tf)
    participant TFBuf as TF Buffer (MultiTBuffer)
    participant TFViz as TFRerunModule
    participant RR as Rerun

    Note over GO2,RR: Initialization Phase
    GO2->>RR: Log static world coordinates
    GO2->>RR: Log URDF at world/robot
    GO2->>RR: Attach world → world frame
    GO2->>RR: Attach world/robot → base_link frame
    GO2->>RR: Attach world/robot/camera → camera_optical frame
    GO2->>RR: Log static camera pinhole
    TFViz->>TFSys: start(sub=True)
    TFSys->>TFBuf: Subscribe to /tf topic
    TFViz->>TFViz: Start polling thread (30Hz)

    Note over GO2,RR: Runtime Phase (per odometry update)
    GO2->>GO2: Receive odometry message
    GO2->>TFSys: publish(world→base_link, base_link→camera_link, camera_link→camera_optical)
    TFSys->>TFBuf: Store transforms in buffer
    TFSys->>TFSys: Publish TFMessage to /tf topic
    GO2->>RR: Log camera image at world/robot/camera/rgb

    Note over TFViz,RR: TF Visualization Loop (30Hz polling)
    loop Every 33ms
        TFViz->>TFBuf: Snapshot self.tf.buffers.items()
        loop For each (parent, child) edge
            TFViz->>TFBuf: buffer.get() - get latest transform
            alt New or updated transform
                TFViz->>RR: rr.log(f"world/tf/{child}", transform.to_rerun())
                TFViz->>TFViz: Update _last_ts_by_edge[(parent, child)]
            else Same timestamp
                TFViz->>TFViz: Skip (no change)
            end
        end
    end

    Note over GO2,RR: Result
    Note right of RR: world/robot moves via TF attachment to base_link<br/>world/robot/camera moves via TF attachment to camera_optical<br/>world/tf/* shows transform tree for debugging

greptile-apps · 2026-01-07T12:56:44Z

dimos/dashboard/tf_rerun_module.py

+        while not self._stop_event.is_set():
+            try:
+                # Snapshot keys to avoid concurrent modification while TF buffer updates.
+                items = list(self.tf.buffers.items())  # type: ignore[attr-defined]


Accessing self.tf.buffers directly relies on implementation details of the TF service. Consider using a public API if available, or add a comment explaining this is intentional internal access.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

What greptile wrote makes sense, but I'm not sure what the proper fix for this is.

Maybe the best way would be to add a get_buffers method in MultiTBuffer which returns a clone, but such a method would require a lock as well and none of the code there uses locks, so it's a big change. But it looks like the only way.

You're absolutely right that adding a get_buffers() method would be the cleanest solution, but as you noted, it would require introducing locking to the MultiTBuffer class, which is a significant change.

Looking at the current code, there are a few interim approaches you could take:

Document the intentional access with a detailed comment:

Suggested change

items = list(self.tf.buffers.items()) # type: ignore[attr-defined]

# INTENTIONAL: Direct access to self.tf.buffers internal state.

# This is necessary to snapshot the current TF buffer state for visualization.

# TODO: Consider adding a public get_buffers() method to MultiTBuffer with proper locking.

items = list(self.tf.buffers.items()) # type: ignore[attr-defined]

Wrap the access in a try-catch to handle potential concurrent modification issues:

Suggested change

items = list(self.tf.buffers.items()) # type: ignore[attr-defined]

try:

# Snapshot keys to avoid concurrent modification while TF buffer updates.

# Note: Direct buffer access - consider adding public API in future

items = list(self.tf.buffers.items()) # type: ignore[attr-defined]

except RuntimeError:

# Handle potential concurrent modification during iteration

logger.debug("TFRerunModule: concurrent buffer modification, skipping poll")

continue

Add a method to the TF service that returns a safe snapshot (if you control that interface).

Given that this is visualization code that can tolerate occasional missed updates, option 2 might be the most pragmatic short-term solution while you plan the proper locking implementation for MultiTBuffer.

What's your preference? Would you like me to create a rule to remember this pattern for future reviews?

What greptile wrote makes sense, but I'm not sure what the proper fix for this is.

Maybe the best way would be to add a get_buffers method in MultiTBuffer which returns a clone, but such a method would require a lock as well and none of the code there uses locks, so it's a big change. But it looks like the only way.

I am also unsure, is there a public API that is available? before I was doing some hacky stuff but this may be equally hacky.

greptile-apps · 2026-01-07T12:56:46Z

Additional Comments (1)

dimos/mapping/voxels.py
Metrics logging calls rr.log() without checking if Rerun backend is active. If viewer_backend is not Rerun, these calls will fail.

        # Log detailed timing breakdown to Rerun
        if self._global_config.viewer_backend.startswith("rerun"):
            total_ms = (time.perf_counter() - start_total) * 1000
            rr.log("metrics/voxel_map/publish_ms", rr.Scalars(total_ms))
            rr.log("metrics/voxel_map/extract_ms", rr.Scalars(extract_ms))
            rr.log("metrics/voxel_map/transport_ms", rr.Scalars(publish_ms))

spomichter · 2026-01-07T18:30:12Z

docs/VIEWER_BACKENDS.md

-# or explicitly:
+
+# Explicitly select the viewer backend:
+dimos --viewer-backend rerun-native run unitree-go2


yeah cleaner for sure

paul-nechifor · 2026-01-08T02:09:29Z

dimos/dashboard/tf_rerun_module.py

+            except Exception as e:
+                logger.warning("TFRerunModule: TF poll loop error", error=str(e))


What exception are you expecting here? It's probably not a good idea to wrap things in try-except just in case.

will remove, no good reason indeed for being there.

spomichter · 2026-01-08T10:33:21Z

dimos/robot/unitree/connection/go2.py

-        video_store = TimedSensorReplay(f"{self.dir_name}/video")  # type: ignore[var-annotated]
+        # Legacy Unitree recordings can have RGB bytes that were tagged/assumed as BGR.
+        # Fix at replay-time by coercing everything to RGB before publishing/logging.
+        def _autocast_video(x):  # type: ignore[no-untyped-def]


Cant go here

greptile-apps · 2026-01-08T13:59:47Z