
Rerun issue956 #959

Merged

Nabla7 merged 11 commits into dev from rerun-issue956 on Jan 13, 2026

Conversation


Nabla7 (Contributor) commented Jan 7, 2026

This PR addresses issue #956.
Changes:

  • TF visualization now polls the TF buffer (self.tf.buffers) at a configurable rate and logs a snapshot of the latest transforms. There is no direct LCM subscription inside the TF viz module.
  • GO2Connection no longer manually logs robot/camera transforms to Rerun (no composing transforms, no hand-built rr.Transform3D). It only publishes TF via self.tf.publish(*transforms) and logs static assets (URDF, axes, camera pinhole) plus camera images.
  • TF logs are written under world/tf/{child_frame} so they are visible under the default world origin without mixing semantic entity paths with TF.
  • Removed the queue/thread "async rerun worker" patterns from CostMapper and VoxelGridMapper; logging is synchronous again. Freezing/perf is expected to be handled by voxel size (O(n) costmap/mesh generation), not background threads.
  • Added/updated documentation: an inventory of all Rerun usage in the codebase and a short "how to use Rerun on dev" section with TF/entity path conventions.
Notes:

  • URDF loading remains (intentionally) for future robot model usage.
  • Camera distortion: the Rerun pinhole uses intrinsics only; distortion must be handled by undistorting images upstream if needed.
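To make the pinhole note concrete, here is a minimal sketch (not from this repo; `K` and `project` are illustrative) of what a pinhole camera model captures: a linear intrinsics matrix and nothing else, so lens distortion has to be removed from the pixels upstream before logging.

```python
import numpy as np

# Hypothetical intrinsics; the real values come from the robot camera's calibration.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(point_cam: np.ndarray) -> np.ndarray:
    # A pinhole is purely the linear map K: there is no distortion term in
    # the model, which is why distorted images must be undistorted upstream.
    uvw = K @ point_cam
    return uvw[:2] / uvw[2]
```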

@Nabla7 Nabla7 requested a review from a team January 7, 2026 12:52
@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR refactors Rerun visualization architecture to use polling-based TF snapshots and removes async background threads from mappers. The changes address issue #956 by simplifying the visualization pipeline and improving maintainability.

Key changes:

  • TF visualization now polls the TF buffer at 30Hz (configurable) instead of subscribing to /tf LCM topic directly, providing rate-limited stable updates
  • GO2Connection no longer manually composes and logs transforms to Rerun; robot and camera motion is now driven entirely through TF system attachments to named frames (base_link, camera_optical)
  • TF visualization logs are isolated under world/tf/{child} paths to separate debugging views from semantic entity paths
  • Removed async queue/thread patterns from CostMapper and VoxelGridMapper; Rerun logging is now synchronous (performance controlled via voxel size tuning)
  • Added comprehensive documentation covering all Rerun usage, TF conventions, and entity path organization
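The polling-with-deduplication pattern summarized above can be sketched roughly like this (class, attribute, and method names are illustrative, not the actual TFRerunModule API):

```python
import threading
import time

class TFPollerSketch:
    """Sketch of the polling pattern: snapshot the TF buffers at a fixed
    rate and log only edges whose latest timestamp changed."""

    def __init__(self, buffers, log_fn, hz: float = 30.0):
        self.buffers = buffers      # {(parent, child): buffer with .get() -> (ts, transform)}
        self.log_fn = log_fn        # e.g. lambda child, tf: rr.log(f"world/tf/{child}", ...)
        self.period = 1.0 / hz
        self._last_ts = {}          # (parent, child) -> last logged timestamp
        self._stop = threading.Event()

    def poll_once(self):
        # Snapshot items so concurrent buffer updates don't break iteration.
        for (parent, child), buf in list(self.buffers.items()):
            ts, transform = buf.get()
            if self._last_ts.get((parent, child)) == ts:
                continue            # same timestamp: nothing new to log
            self.log_fn(child, transform)
            self._last_ts[(parent, child)] = ts

    def run(self):
        while not self._stop.is_set():
            self.poll_once()
            time.sleep(self.period)

    def stop(self):
        self._stop.set()
```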

Issues found:

  • VoxelGridMapper has unguarded rr.log() calls for metrics (lines 177-179) that will fail when Rerun backend is not active
  • TFRerunModule accesses internal self.tf.buffers attribute; consider documenting this as intentional or using a public API

Confidence Score: 4/5

  • This PR is safe to merge once one critical fix lands for non-Rerun backends
  • The refactoring simplifies the architecture and is well documented, but contains one logic bug in VoxelGridMapper where metrics logging is unguarded and will crash when the Rerun backend is not selected. The architectural changes are sound and remove complexity, but the unguarded rr.log() calls need fixing before deployment.
  • dimos/mapping/voxels.py requires immediate attention for the unguarded metrics logging bug

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| dimos/dashboard/tf_rerun_module.py | 4/5 | Replaced LCM subscription with a polling-based TF buffer snapshot at a configurable rate (30Hz default); logs to world/tf/{child} paths |
| dimos/robot/unitree/connection/go2.py | 5/5 | Removed manual TF/transform logging from the odometry callback; robot/camera motion is now driven entirely via TF system attachments |
| dimos/mapping/costmapper.py | 4/5 | Removed the async Rerun queue/thread; switched to synchronous logging, with a 2D image panel added to views |
| dimos/mapping/voxels.py | 4/5 | Removed the async Rerun queue/thread; switched to synchronous logging directly in the data pipeline |

Sequence Diagram

sequenceDiagram
    participant GO2 as GO2Connection
    participant TFSys as TF System (self.tf)
    participant TFBuf as TF Buffer (MultiTBuffer)
    participant TFViz as TFRerunModule
    participant RR as Rerun

    Note over GO2,RR: Initialization Phase
    GO2->>RR: Log static world coordinates
    GO2->>RR: Log URDF at world/robot
    GO2->>RR: Attach world → world frame
    GO2->>RR: Attach world/robot → base_link frame
    GO2->>RR: Attach world/robot/camera → camera_optical frame
    GO2->>RR: Log static camera pinhole
    TFViz->>TFSys: start(sub=True)
    TFSys->>TFBuf: Subscribe to /tf topic
    TFViz->>TFViz: Start polling thread (30Hz)

    Note over GO2,RR: Runtime Phase (per odometry update)
    GO2->>GO2: Receive odometry message
    GO2->>TFSys: publish(world→base_link, base_link→camera_link, camera_link→camera_optical)
    TFSys->>TFBuf: Store transforms in buffer
    TFSys->>TFSys: Publish TFMessage to /tf topic
    GO2->>RR: Log camera image at world/robot/camera/rgb

    Note over TFViz,RR: TF Visualization Loop (30Hz polling)
    loop Every 33ms
        TFViz->>TFBuf: Snapshot self.tf.buffers.items()
        loop For each (parent, child) edge
            TFViz->>TFBuf: buffer.get() - get latest transform
            alt New or updated transform
                TFViz->>RR: rr.log(f"world/tf/{child}", transform.to_rerun())
                TFViz->>TFViz: Update _last_ts_by_edge[(parent, child)]
            else Same timestamp
                TFViz->>TFViz: Skip (no change)
            end
        end
    end

    Note over GO2,RR: Result
    Note right of RR: world/robot moves via TF attachment to base_link<br/>world/robot/camera moves via TF attachment to camera_optical<br/>world/tf/* shows transform tree for debugging

```python
while not self._stop_event.is_set():
    try:
        # Snapshot keys to avoid concurrent modification while TF buffer updates.
        items = list(self.tf.buffers.items())  # type: ignore[attr-defined]
```
greptile-apps bot:

Accessing self.tf.buffers directly relies on implementation details of the TF service. Consider using a public API if available, or add a comment explaining this is intentional internal access.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Contributor:

What greptile wrote makes sense, but I'm not sure what the proper fix for this is.

Maybe the best way would be to add a get_buffers method to MultiTBuffer which returns a clone, but such a method would require a lock as well, and none of the code there uses locks, so it's a big change. But it looks like the only way.
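A locked snapshot accessor along the lines proposed here might look like the following sketch; the class shape and method names are hypothetical, since the real MultiTBuffer currently has no locking:

```python
import threading

class MultiTBufferSketch:
    """Sketch only: a locked snapshot accessor like the one proposed above.
    Field and method names are illustrative, not the real MultiTBuffer API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buffers = {}  # (parent_frame, child_frame) -> per-edge buffer

    def set(self, edge, buf):
        with self._lock:
            self._buffers[edge] = buf

    def get_buffers(self):
        # Return a shallow copy so callers can iterate safely without
        # holding the lock while they process each edge.
        with self._lock:
            return dict(self._buffers)
```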

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right that adding a get_buffers() method would be the cleanest solution, but as you noted, it would require introducing locking to the MultiTBuffer class, which is a significant change.

Looking at the current code, there are a few interim approaches you could take:

  1. Document the intentional access with a detailed comment:

```python
# INTENTIONAL: Direct access to self.tf.buffers internal state.
# This is necessary to snapshot the current TF buffer state for visualization.
# TODO: Consider adding a public get_buffers() method to MultiTBuffer with proper locking.
items = list(self.tf.buffers.items())  # type: ignore[attr-defined]
```

  2. Wrap the access in a try-except to handle potential concurrent modification issues:

```python
try:
    # Snapshot keys to avoid concurrent modification while TF buffer updates.
    # Note: Direct buffer access - consider adding a public API in the future.
    items = list(self.tf.buffers.items())  # type: ignore[attr-defined]
except RuntimeError:
    # Handle potential concurrent modification during iteration.
    logger.debug("TFRerunModule: concurrent buffer modification, skipping poll")
    continue
```

  3. Add a method to the TF service that returns a safe snapshot (if you control that interface).

Given that this is visualization code that can tolerate occasional missed updates, option 2 might be the most pragmatic short-term solution while you plan the proper locking implementation for MultiTBuffer.

What's your preference? Would you like me to create a rule to remember this pattern for future reviews?

@Nabla7 (author):

> What greptile wrote makes sense, but I'm not sure what the proper fix for this is.
>
> Maybe the best way would be to add a get_buffers method to MultiTBuffer which returns a clone, but such a method would require a lock as well, and none of the code there uses locks, so it's a big change. But it looks like the only way.

I am also unsure; is there a public API available? Before, I was doing some hacky stuff, but this may be equally hacky.


greptile-apps bot commented Jan 7, 2026

Additional Comments (1)

dimos/mapping/voxels.py
Metrics logging calls rr.log() without checking if Rerun backend is active. If viewer_backend is not Rerun, these calls will fail.

```python
# Log detailed timing breakdown to Rerun
if self._global_config.viewer_backend.startswith("rerun"):
    total_ms = (time.perf_counter() - start_total) * 1000
    rr.log("metrics/voxel_map/publish_ms", rr.Scalars(total_ms))
    rr.log("metrics/voxel_map/extract_ms", rr.Scalars(extract_ms))
    rr.log("metrics/voxel_map/transport_ms", rr.Scalars(publish_ms))
```

```shell
# Explicitly select the viewer backend:
dimos --viewer-backend rerun-native run unitree-go2
```
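The backend guard discussed here can also be factored into a small helper so call sites stay clean. This is only a sketch: `make_metric_logger` and its arguments are hypothetical, and the real code reads the backend from its global config.

```python
# Sketch: build a metric-logging function that is a no-op unless a Rerun
# backend is selected, so call sites never need their own guard.
def make_metric_logger(viewer_backend: str, rr_log):
    if viewer_backend.startswith("rerun"):
        def _log(path: str, value: float) -> None:
            rr_log(path, value)      # forward to rr.log(path, rr.Scalars(value)) in real code
    else:
        def _log(path: str, value: float) -> None:
            pass                     # no-op when Rerun is not active
    return _log
```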
Contributor:

yeah cleaner for sure

Comment on lines +115 to +116:

```python
except Exception as e:
    logger.warning("TFRerunModule: TF poll loop error", error=str(e))
```
Contributor:

What exception are you expecting here? It's probably not a good idea to wrap things in try-except just in case.

@Nabla7 (author):

Will remove; indeed there's no good reason for it to be there.

```python
video_store = TimedSensorReplay(f"{self.dir_name}/video")  # type: ignore[var-annotated]

# Legacy Unitree recordings can have RGB bytes that were tagged/assumed as BGR.
# Fix at replay-time by coercing everything to RGB before publishing/logging.
def _autocast_video(x):  # type: ignore[no-untyped-def]
```
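For illustration only, a replay-time BGR-to-RGB coercion along the lines this snippet introduces could look like the sketch below. The real `_autocast_video` operates on the repo's frame type; here a frame is just an (H, W, 3) array plus a color-format tag.

```python
import numpy as np

# Hedged sketch of a BGR -> RGB autocast applied at replay time.
def autocast_to_rgb(frame: np.ndarray, fmt: str) -> tuple:
    if fmt == "BGR":
        # Reverse the channel axis: BGR becomes RGB.
        return frame[..., ::-1], "RGB"
    return frame, fmt
```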
Contributor:

Can't go here


greptile-apps bot commented Jan 8, 2026

Too many files changed for review.


Nabla7 and others added 9 commits January 9, 2026 12:04
TF visualization now uses snapshot polling (self.tf.buffers) instead of
LCM subscriptions.

Changes:
- TFRerunModule: poll TF buffers at configurable Hz, log changed transforms
- CostMapper: remove _rerun_worker thread/queue, log synchronously
- VoxelGridMapper: remove _rerun_worker thread/queue, log synchronously
- GO2Connection: remove manual transform logging (robot/camera poses),
  keep only static axes + pinhole; TF module handles dynamic transforms
- TFMessage.to_rerun: log to 'world/tf/{child}' instead of 'world/{child}'

The queue/thread patterns were removed because:
- costmapper is O(n) and voxel_size tuning addresses perf, not async magic
- threads added complexity without benefit for visualization logging


- Add ImageFormat autocast to ReplayConnection.video_stream() that
  relabels BGR-tagged frames as RGB (fixes red/blue swap in Rerun)



leshy commented Jan 12, 2026

Why do we still have Rerun transforms manually published in go2 connection.py?

go2 should do go2 stuff; the rerun transform module should do rerun transform stuff

```python
# Attach semantic entity paths to the named TF frame graph without explicitly
# introducing `tf#/...` strings into the code.
#
# - `world` entity path: its implicit frame (tf#/world) is a child of the named frame "world".
# - `world/robot` entity path: its implicit frame (tf#/world/robot) is a child of the named frame "base_link".
rr.log(
    "world",
    rr.Transform3D(
        translation=[0.0, 0.0, 0.0],
        rotation=rr.Quaternion(xyzw=[0.0, 0.0, 0.0, 1.0]),
        parent_frame="world",  # type: ignore[call-arg]
    ),
    static=True,
)
rr.log(
    "world/robot",
    rr.Transform3D(
        translation=[0.0, 0.0, 0.0],
        rotation=rr.Quaternion(xyzw=[0.0, 0.0, 0.0, 1.0]),
        parent_frame="base_link",  # type: ignore[call-arg]
    ),
    static=True,
)

# Load robot URDF
if _GO2_URDF.exists():
    rr.log_file_from_path(
        str(_GO2_URDF),
        entity_path_prefix="world/robot",
        static=True,
    )
    logger.info(f"Loaded URDF from {_GO2_URDF}")
```

- Add RerunSceneWiringModule for static Rerun setup (view coords, entity
  attachments, URDF, axes, camera pinholes)
- Refactor tf_rerun() to compose TF polling + scene wiring via autoconnect
- Remove _init_rerun_world() from GO2Connection (now only TF + sensor data)
- Update GO2 blueprint to pass URDF + camera config via tf_rerun(...)
- Update VIEWER_BACKENDS.md to document the new architecture

Nabla7 commented Jan 13, 2026

This new commit moves most of the transform logic out of go2. This had to do with the fact that Rerun expects an entity to be defined together with a Rerun primitive in order to log it; it ended up in the GO2 connection because a lot of stuff came together there. I moved it to the blueprint instead. Usage in the blueprint now looks like:

```python
basic = autoconnect(
    go2_connection(),
    linux if platform.system() == "Linux" else mac,
    websocket_vis(),
    tf_rerun(
        urdf_path=str(_GO2_URDF),
        cameras=[
            ("world/robot/camera", "camera_optical", GO2Connection.camera_info_static),
        ],
    ),
).global_config(n_dask_workers=4, robot_model="unitree_go2")
```

This pattern will also allow us to attach multiple cameras to a robot in Rerun in the future.

@Nabla7 Nabla7 merged commit 70a3fe8 into dev Jan 13, 2026
24 of 25 checks passed