Skip to content

Dimos Unity Simulator#1539

Open
jeff-hykin wants to merge 6 commits intodevfrom
jeff/feat/untiy_sim
Open

Dimos Unity Simulator#1539
jeff-hykin wants to merge 6 commits intodevfrom
jeff/feat/untiy_sim

Conversation

@jeff-hykin
Copy link
Member

Problem

Testing g1 stuff is hard and the mujoco sim for the g1 is bad. (It would be nice to have automated tests using the unity sim from the ros nav stack)

Solution

Port the unity simulator as a DimOS module

  • Auto downloads the executable if needed
  • Has tests

Breaking Changes

None

How to Test

On linux x86 only:

dimos run unity-sim

Should download the unity simulator with a big message about it, then should open up the graphical sim window and a rerun window. Clicking won't navigate (its just a sim not the full nav stack)

Contributor License Agreement

  • I have read and approved the CLA.

Replace gdown/Google Drive auto-download with get_data() LFS asset
(cmu_unity_sim_x86, 128MB compressed). Simplify config by removing
unity_scene, unity_cache_dir, auto_download fields. Clean up blueprint
(remove __main__.py, rename to unity_sim, remove resolve_unity_binary
requirement hook).
module=module_name,
cmd=" ".join(cmd),
cwd=cwd,
)
Copy link
Member Author

@jeff-hykin jeff-hykin Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to native modules here cause I'm testing the unity sim with the livox native modules and got VERY undescriptive error messages

# "visual_override": {"world/camera_info": UnityBridgeModule.rerun_suppress_camera_info},
# }
@staticmethod
def rerun_static_pinhole(rr: Any) -> list[Any]:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whenever our rerun API gets improved this will need to be changed

# See the License for the specific language governing permissions and
# limitations under the License.

"""ROS1 binary message deserialization — no ROS1 installation required.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unity sim was made for ROS1 so we needed some tooling to convert those messages

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR ports the CMU VLA Challenge Unity simulator as a first-class DimOS module (UnityBridgeModule). It implements the ROS-TCP-Endpoint binary protocol from scratch in pure Python (no ROS installation required), auto-downloads the Linux x86_64 binary via LFS on first use, integrates kinematic odometry at 200 Hz, and wires everything into a Rerun viewer via a one-liner dimos run unity-sim blueprint. A new dimos/utils/ros1.py utility adds reusable ROS1 binary serialization helpers.

Key issues found:

  • Broken crash stderr in native_module.py — The new last_stderr collection reads from a pipe that has already been fully consumed and closed by _read_log_stream. The stderr_t.join() waits for the reader thread to finish draining the stream, so the subsequent .read() always returns empty bytes. Crash logs will always show last_stderr=None, making the new diagnostic field a no-op.
  • Server socket leak in _unity_loop — The TCP server socket is not guarded by a try/finally, so if bind() fails (e.g. port 10000 already in use) the file descriptor leaks and the port stays occupied until the process exits.
  • Stale messages on Unity reconnect — The unbounded _send_queue is shared across reconnections. If Unity crashes and reconnects, the queued poses and system commands from the prior session are delivered to the new connection before its handshake completes.
  • Minor race condition in _on_terrainself._x and self._y are read without _state_lock in the terrain callback while _sim_loop writes them in a separate thread.

Confidence Score: 2/5

  • Needs fixes before merge — one new diagnostic feature is silently broken and the socket/queue issues can cause problems under reconnect scenarios.
  • The core Unity bridge functionality is well-structured and the test coverage is solid. However, three logic bugs were found: the crash-report last_stderr in native_module.py will always be empty (the stderr pipe is already closed when the read is attempted), the TCP server socket in _unity_loop lacks a try/finally and will leak on bind failure, and stale queue messages will be replayed to a reconnected Unity session. These bugs are not catastrophic for the happy path but do undermine reliability and the intended diagnostic value of the changes.
  • dimos/core/native_module.py (broken stderr collection) and dimos/simulation/unity/module.py (socket leak + stale queue) need the most attention before merge.

Important Files Changed

Filename Overview
dimos/core/native_module.py Enriches crash-report logging with module name, exe name, and last stderr — but the last_stderr collection is broken: the stderr reader thread fully consumes and closes the pipe before the read attempt, so crash logs will always show last_stderr=None.
dimos/simulation/unity/module.py Core Unity TCP bridge module. Has three issues: server socket not wrapped in try/finally (fd leak on bind failure), send queue not drained between reconnections (stale messages to new session), and position state read without lock in terrain callback.
dimos/utils/ros1.py Clean pure-Python ROS1 binary serialization/deserialization utilities. Handles PointCloud2, CompressedImage, and PoseStamped correctly with both fast (standard XYZI layout) and slow (arbitrary offsets) paths.
dimos/simulation/unity/test_unity_sim.py Comprehensive test suite covering config, platform validation, pickle round-trip, TCP bridge, kinematic sim, and live Unity integration. Minor fragility from time.sleep-based synchronization in TCP and kinematic tests.
dimos/simulation/unity/blueprint.py Standalone blueprint wiring UnityBridgeModule to a Rerun viewer with correct static pinhole and camera info suppression. Clean and straightforward.
dimos/robot/all_blueprints.py Registers the new unity-sim blueprint entry point. Single-line addition, alphabetically ordered correctly.
data/.lfs/cmu_unity_sim_x86.tar.gz Git LFS pointer for the 127 MB Unity simulator binary. Correctly formatted LFS stub; binary is fetched on demand via get_data().

Sequence Diagram

sequenceDiagram
    participant User
    participant UnityBridgeModule
    participant TCPServer as TCP Server (_unity_loop)
    participant SimLoop as Kinematic Sim (_sim_loop)
    participant UnityBinary as Unity Binary Process
    participant Rerun

    User->>UnityBridgeModule: start()
    UnityBridgeModule->>SimLoop: spawn thread
    UnityBridgeModule->>TCPServer: spawn thread
    UnityBridgeModule->>UnityBinary: Popen (auto-download via get_data if needed)
    TCPServer->>TCPServer: bind(:10000) + listen

    UnityBinary-->>TCPServer: TCP connect
    TCPServer->>UnityBinary: __handshake {version, protocol}
    UnityBinary->>TCPServer: __topic_list request
    TCPServer->>UnityBinary: __topic_list response [/unity_sim/set_model_state, /tf]

    loop Sensor data stream
        UnityBinary->>TCPServer: /registered_scan (ROS1 PointCloud2)
        TCPServer->>UnityBridgeModule: registered_scan.publish()
        UnityBridgeModule->>Rerun: lidar point cloud

        UnityBinary->>TCPServer: /color/image_raw/compressed (ROS1 CompressedImage)
        TCPServer->>UnityBridgeModule: color_image.publish()
        UnityBridgeModule->>Rerun: decoded RGB image
    end

    loop Kinematic sim at 200 Hz
        SimLoop->>SimLoop: integrate cmd_vel → (x, y, z, yaw)
        SimLoop->>UnityBridgeModule: odometry.publish()
        SimLoop->>UnityBridgeModule: tf.publish()
        SimLoop->>TCPServer: enqueue /unity_sim/set_model_state (PoseStamped)
        TCPServer->>UnityBinary: send pose update
    end

    User->>UnityBridgeModule: stop()
    UnityBridgeModule->>UnityBinary: SIGTERM
    UnityBridgeModule->>SimLoop: _running = False (join)
    UnityBridgeModule->>TCPServer: _running = False (join)
Loading

Last reviewed commit: bcc33ef

Comment on lines +210 to +218
# Collect any remaining stderr for the crash report
last_stderr = ""
if self._process.stderr and not self._process.stderr.closed:
try:
remaining = self._process.stderr.read()
if remaining:
last_stderr = remaining.decode("utf-8", errors="replace").strip()
except Exception:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last_stderr will always be empty — crash report has no stderr

The stderr is already fully consumed and closed by the time this read is attempted. _read_log_stream (called via _start_reader) iterates over the entire stream and explicitly calls stream.close() at the end. After stderr_t.join(timeout=2) completes, the stream is exhausted and closed, so self._process.stderr.read() will always return b"".

As a result, last_stderr will always be None in the crash log, making the new crash-report feature a no-op.

A common fix for this pattern is to buffer the last N lines inside the reader thread itself and expose them via a shared variable, rather than trying to re-read the already-closed pipe.

# Example fix: buffer the last few lines in _read_log_stream
# and expose via an instance variable, e.g. self._last_stderr_lines

Comment on lines +438 to +465
def _unity_loop(self) -> None:
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_sock.bind((self.config.unity_host, self.config.unity_port))
server_sock.listen(1)
server_sock.settimeout(2.0)
logger.info(f"TCP server on :{self.config.unity_port}")

while self._running:
try:
conn, addr = server_sock.accept()
logger.info(f"Unity connected from {addr}")
try:
self._bridge_connection(conn)
except Exception as e:
logger.info(f"Unity connection ended: {e}")
finally:
with self._state_lock:
self._unity_connected = False
conn.close()
except TimeoutError:
continue
except Exception as e:
if self._running:
logger.warning(f"TCP server error: {e}")
time.sleep(1.0)

server_sock.close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Server socket leaked if bind() or listen() raises

The server socket is created and set up before the while self._running: loop, but server_sock.close() only appears at the very bottom (line 465). If bind() raises (e.g., OSError: [Errno 98] Address already in use on port 10000) or any other exception is thrown during setup, the socket's file descriptor is leaked and the port may remain occupied.

Wrap the socket in a try/finally:

def _unity_loop(self) -> None:
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((self.config.unity_host, self.config.unity_port))
    server_sock.listen(1)
    server_sock.settimeout(2.0)
    logger.info(f"TCP server on :{self.config.unity_port}")

    try:
        while self._running:
            ...
    finally:
        server_sock.close()

Comment on lines +495 to +499
finally:
halt.set()
sender.join(timeout=2.0)
with self._state_lock:
self._unity_connected = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale send-queue messages delivered to a reconnected Unity session

The _send_queue is a shared, module-level Queue that is never cleared when a connection ends. _unity_loop accepts new connections in a loop, so if Unity disconnects and reconnects (e.g., after a crash), any messages that were queued during the old session — including stale odometry poses and topic-list responses — will be dequeued and sent to the fresh connection before it has completed its handshake.

The sender thread reads from the queue in _unity_sender, which runs per-connection (created in _bridge_connection). When a connection ends and the sender thread terminates, queued items remain in the shared queue. The next time _bridge_connection spawns a sender, those leftovers are delivered immediately.

A simple fix is to drain the queue at the start of _bridge_connection:

def _bridge_connection(self, sock: socket.socket) -> None:
    # Drain any stale messages from a previous session
    while not self._send_queue.empty():
        try:
            self._send_queue.get_nowait()
        except Empty:
            break
    ...

Comment on lines +425 to +434
def _on_terrain(self, cloud: PointCloud2) -> None:
points, _ = cloud.as_numpy()
if len(points) == 0:
return
dx = points[:, 0] - self._x
dy = points[:, 1] - self._y
near = points[np.sqrt(dx * dx + dy * dy) < 0.5]
if len(near) >= 10:
with self._state_lock:
self._terrain_z = 0.8 * self._terrain_z + 0.2 * near[:, 2].mean()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._x / self._y read without lock in terrain callback

_on_terrain reads self._x and self._y at lines 429–430 directly (without _state_lock), while _sim_loop writes them at lines 612–613 in a separate thread. The GIL prevents true corruption for individual float reads/writes, but the pair (self._x, self._y) is not read atomically — _on_terrain could observe an _x from one tick and a _y from the next, producing a slightly off distance calculation for the terrain Z filter.

Consider snapping both values under _state_lock, or expanding _state_lock to also cover the position state as it already does for _terrain_z and _unity_connected:

def _on_terrain(self, cloud: PointCloud2) -> None:
    points, _ = cloud.as_numpy()
    if len(points) == 0:
        return
    with self._state_lock:
        cur_x, cur_y = self._x, self._y
    dx = points[:, 0] - cur_x
    dy = points[:, 1] - cur_y
    ...

@jeff-hykin jeff-hykin changed the title Jeff/feat/untiy sim Dimos Unity Simulator Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant