
Person follow skill with EdgeTAM #1042

Merged
leshy merged 6 commits into dev from pauln-person-follow-edgetam
Jan 20, 2026

Conversation


@paul-nechifor paul-nechifor commented Jan 16, 2026

Usage

uv sync --all-extras
uv run dimos --robot-ip 192.168.X.XXX run unitree-go2-agentic
uv run humancli

Write "follow the man in blue" or some description of yourself.

E2E

Run the end-to-end test:

uv run pytest dimos/e2e_tests/test_person_follow.py

How it works

  • Uses EdgeTAM for detection.
  • Estimates the person's distance from the width of the detection. (I use width rather than height because a person's full height isn't always visible from the robot's low camera angle.)
  • Rotates or moves forward/back to keep the detection in the center.
  • Does not check for free space. This is especially risky when it moves backwards.
  • Cannot follow you around tight corners, either because it loses track or because it tries to move through the corner.
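The width-to-distance step above can be sketched with a pinhole-camera model. This is a hypothetical sketch, not the PR's actual code; the assumed person width and the function name are illustrative:

```python
# Pinhole model: bbox_width_px ≈ (person_width_m * fx) / distance_m,
# inverted here to estimate distance from the detection width.
ASSUMED_PERSON_WIDTH_M = 0.5  # illustrative shoulder-width guess


def estimate_distance(bbox_width_px: float, fx: float,
                      person_width_m: float = ASSUMED_PERSON_WIDTH_M) -> float:
    """Estimate distance (m) from detection width (px) and focal length fx (px)."""
    if bbox_width_px <= 0:
        raise ValueError("bounding box width must be positive")
    return (person_width_m * fx) / bbox_width_px
```

Under these assumptions, with fx = 600 px a 100 px wide detection puts the person roughly 3 m away.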

@paul-nechifor paul-nechifor requested a review from a team January 16, 2026 06:11

greptile-apps bot commented Jan 16, 2026

Greptile Summary

This PR adds a person-following skill using EdgeTAM (SAM2-based video tracking) for visual servoing control on the Unitree GO2 robot. The implementation includes a new skill container, EdgeTAM processor for frame tracking, and integration into the agentic blueprint system.

Key Changes:

  • PersonFollowSkillContainer: Implements visual servoing with proportional control for distance and heading
  • EdgeTAMProcessor: Wraps SAM2 video predictor with frame buffer management to prevent OOM
  • Dependencies: Added edgetam, omegaconf, hydra-core for model configuration and video tracking
  • Blueprint integration: Wired person_follow_skill into agentic configurations
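The proportional control for distance and heading mentioned above can be sketched as follows; the gains, target distance, and sign conventions are illustrative guesses, not values from this PR:

```python
def compute_twist(bbox_cx: float, image_width: int, distance_m: float,
                  target_m: float = 1.5, k_lin: float = 0.6,
                  k_ang: float = 1.2) -> tuple[float, float]:
    """Return (linear, angular) velocity from the detection centre and distance.

    Heading error is the horizontal offset of the detection centre,
    normalized to [-1, 1]; distance error is the offset from the target
    following distance. Each feeds a simple proportional term.
    """
    heading_err = (bbox_cx - image_width / 2) / (image_width / 2)
    dist_err = distance_m - target_m
    # Move forward when too far, backward when too close; turn toward the
    # person (negative angular z for a detection right of image centre).
    return k_lin * dist_err, -k_ang * heading_err
```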

Known Limitations (Per PR Description):

  • No obstacle avoidance or free space checking
  • Cannot follow around tight corners
  • May lose track at unexpected angles

Issues Found:

  • Line 202: Assert will crash the control loop if image streaming pauses (should use error handling)
  • Line 215: Potential ValueError if detections.detections is empty despite len(detections) == 0 check
  • Line 316: Magic number for angular gain should be parameterized for tuning

Confidence Score: 3/5

  • This PR introduces person-following functionality with video tracking, but has critical logic issues that could cause runtime crashes during control loop execution.
  • The score reflects two critical logic issues in the control loop: (1) An assert statement that will crash if image streaming pauses, and (2) A potential ValueError when calling max() on detections without proper list validation. These issues will cause the follow_person skill to fail during normal operation. The EdgeTAM processor and blueprint integration are solid, but the skill container has insufficient error handling for asynchronous image streams.
  • dimos/agents/skills/person_follow.py requires attention for error handling in the control loop, particularly around image availability assertions and detection list validation.

Important Files Changed

Filename Overview
dimos/agents/skills/person_follow.py New skill container implementing person-following using EdgeTAM for tracking. Uses visual servoing with proportional control. Has potential issues: assert statement at line 202 could fail without proper handling, and no validation that detections.detections is not empty before calling max() at line 215.
dimos/models/segmentation/edge_tam.py New EdgeTAM processor for video object tracking. Properly handles CUDA requirement, model loading with state dict validation, and frame buffer management. Memory cleanup logic is sound.
dimos/robot/unitree_webrtc/unitree_go2_blueprints.py Integrated person_follow_skill into the agentic blueprints. Passes camera_info_static correctly. No issues found.
pyproject.toml Added EdgeTAM, omegaconf, and hydra-core dependencies. Dependencies are appropriate for the new ML features.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant PersonFollowSkill
    participant EdgeTAMProcessor
    participant GO2Connection
    participant ColorImageStream
    
    Agent->>PersonFollowSkill: "follow_person(query)"
    PersonFollowSkill->>PersonFollowSkill: "detect person via QwenVL"
    PersonFollowSkill->>PersonFollowSkill: "spawn daemon thread"
    
    Note over PersonFollowSkill: "_follow_loop thread"
    PersonFollowSkill->>EdgeTAMProcessor: "init_track(image, bbox)"
    EdgeTAMProcessor->>EdgeTAMProcessor: "SAM2 initializes"
    
    loop "Control Loop (20 Hz)"
        ColorImageStream->>PersonFollowSkill: "new frame"
        PersonFollowSkill->>EdgeTAMProcessor: "process_image()"
        
        alt "Tracking Success"
            EdgeTAMProcessor-->>PersonFollowSkill: "detections"
            PersonFollowSkill->>PersonFollowSkill: "compute twist"
            PersonFollowSkill->>GO2Connection: "move(twist)"
        else "Lost Track"
            EdgeTAMProcessor-->>PersonFollowSkill: "empty detections"
            PersonFollowSkill->>GO2Connection: "move(zero)"
        end
        
        PersonFollowSkill->>PersonFollowSkill: "sleep"
    end
    
    Agent->>PersonFollowSkill: "stop_following()"
    PersonFollowSkill->>GO2Connection: "move(zero)"


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 3 comments


return
else:
lost_count = 0
best_detection = max(detections.detections, key=lambda d: d.bbox_2d_volume())
Contributor

logic: Calling max(detections.detections, ...) without checking if the list is empty will raise ValueError. Even though len(detections) == 0 is checked, the len() behavior on ImageDetections2D may differ from checking detections.detections length. Add explicit check: if detections.detections: before calling max().
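A minimal sketch of the suggested guard (pick_best is a hypothetical helper name; the argument duck-types the detections shape used in the snippet):

```python
def pick_best(detections):
    """Return the detection with the largest 2D bbox volume, or None."""
    if not detections.detections:  # explicit check on the inner list
        return None
    return max(detections.detections, key=lambda d: d.bbox_2d_volume())
```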

leshy
leshy previously approved these changes Jan 16, 2026

@leshy leshy left a comment


looks ok as a first sketch, would like to understand global planner issues etc.
will need to plug in reid also, Alex found some good model

self._stop_following()

try:
move_rpc = self.get_rpc_calls("GO2Connection.move")
Contributor

any reason we aren't just emitting twist from this as OUT? since this would make this work cross-embodiment

Contributor Author

It slipped my mind. 😅 Fixed now.

cached_features: dict[int, Any]


class EdgeTAMProcessor(Detector):
Contributor

isn't this some new model, for tracking only? so it would be good to have maybe a new dir for it, since it's not segmentation? and a new type for trackers, since this is not strictly a Detector, so we can swap trackers? not sure, what do you think?

Contributor

Actually, the way I implemented it before, I conformed it to the same structure as YOLO and YOLOE. Also, in detection/types/detection3d/seg.py there is a from_sam2_result() that is for this EdgeTAM output; it should output ImageDetection2D[Detection2DSeg]. I think it's better to put the general EdgeTAMTracker code (which conforms to the same detector type) under perception/, and keep the inference and model-loading code separate in models/ like I did before.

def __init__(self, camera_info: CameraInfo, global_config: GlobalConfig) -> None:
super().__init__()
self._global_config: GlobalConfig = global_config
self._latest_image: Image | None = None
Contributor

we have this fancy system of hot/cold getters

https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/advanced_streams.md#getter_hot---background-subscription-instant-reads

so you can do getter = getter_hot(self.image().observable())

idk if you want this, just letting you know it exists. also the getter is not disposable, should make it one

super().stop()

@skill()
def follow_person(self, query: str) -> str:
Contributor

generally I'd much prefer this as a reactive pipeline in which you

self.image_frame.observable().pipe(
  ops.map(track),
  ops.map(detection_to_twist),
).subscribe(self.twist.publish)

so you don't need to do this self._image thing, thread management etc etc. much cleaner, just letting you know, this is an ok first test

Contributor Author

But doing it that way ties the frequency of the cmd_vel to the frequency of the camera, right? And you still need some state like counting how many frames are without a detection, or knowing when to stop.

Contributor

@leshy leshy Jan 18, 2026

there are nice tools for all of this - like sample() maybe? it samples the latest value at a fixed hz, so you can emit twist at a faster fixed rate than the input image

https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/reactivex.md#sampleinterval---emit-latest-value-every-n-seconds

you still need some state like counting how many frames are without a detection, or knowing when to stop.

yeah you either start propagating more complex types (worst case but still nice)

{
  image: Image
  detection: Detection2d
  frames: n
}

and this gives you full power, and clear typing also, since you can transform between those complex types at each step

def bla(in: ComplexType1) -> ComplexType2

ops.map(bla)

but as an easier shortcut, before going for complex types, you can have a map with a side effect that counts empty Detection2Ds and starts returning false/exiting early.

detection_to_twist in the above example can have a side effect of counting the empty detections it got, and start outputting twist(0,0,0) - after the initial few twist(0,0,0) you can also choose to terminate the tracking loop as well, the observable just finishes

to be clear I'm not asking for a rewrite, just mentioning how this can be structured in a way in which it's easy and fun to quickly iterate (add twist easing etc); rotation towards a person and walking towards a person can be separate maps() that layer onto the same Twist, etc
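The side-effect shortcut described in this thread can be sketched as a small stateful mapper; the class name, twist values, and lost-frame threshold are all illustrative:

```python
class DetectionToTwist:
    """Map detections to twists, counting consecutive empty detections."""

    def __init__(self, max_lost: int = 5):
        self.lost = 0
        self.max_lost = max_lost

    def __call__(self, detection):
        if detection is None:  # tracker lost the person this frame
            self.lost += 1
            return (0.0, 0.0)  # stop in place while lost
        self.lost = 0
        return (0.5, 0.0)  # placeholder "follow" twist

    @property
    def should_stop(self) -> bool:
        """True once enough consecutive frames had no detection."""
        return self.lost >= self.max_lost
```

Used as a mapper in the pipeline, the stream can be terminated once should_stop flips.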

fx = self._camera_info.K[0] # focal length x in pixels
estimated_distance = (self._assumed_person_width * fx) / bbox_width

return estimated_distance
Contributor

@leshy leshy Jan 16, 2026

again first sketch so all good, we should make sure that the API for merging this with lidar and estimating segmentation distance (turning it into detection3d) is super easy. detections/ stuff implements this for projecting 2d into 3d

perception/detection/reid/module.py

this is for re-assembling fancy Detection2D types from ros msgs + image, but you'd do the same for aligning lidar + camera image

    def detections_stream(self) -> Observable[ImageDetections2D]:
        return backpressure(
            align_timestamped(
                self.image.pure_observable(),
                self.detections.pure_observable().pipe(
                    ops.filter(lambda d: d.detections_length > 0)  # type: ignore[attr-defined]
                ),
                match_tolerance=0.0,
                buffer_size=2.0,
            ).pipe(ops.map(lambda pair: ImageDetections2D.from_ros_detection2d_array(*pair)))  # type: ignore[misc, arg-type]
        )

documented at https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/temporal_alignment.md

also this might be interesting
https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/quality_filter.md

Contributor Author

Yeah, makes sense. I'll use the lidar to estimate the distance.

Contributor

@leshy leshy Jan 18, 2026

no need for first merge as well, your call how to proceed with this, it could turn out that lidar method is less robust also

@paul-nechifor
Contributor Author

@leshy

will need to plug in reid also, Alex found some good model

EdgeTAM does not require ReID.

@paul-nechifor paul-nechifor force-pushed the pauln-person-follow-edgetam branch from 8f5eb1a to 5d02818 Compare January 19, 2026 23:03
@leshy leshy merged commit af69f26 into dev Jan 20, 2026
13 checks passed
spomichter added a commit that referenced this pull request Jan 23, 2026
… Unitree Go2 Navigation & Exploration Beta

Pre-Release v0.0.8: Unitree Go2 Navigation & Exploration Beta, Transport Updates, Documentation updates, Rerun fixes, Person follow, Readme updates

## What's Changed
* Small docs clarification about stream getters by @leshy in #1043
* Fix split view on wide monitors by @jeff-hykin in #1048
* Docs: Install & Develop  by @jeff-hykin in #1022
* Add uv to nix and fix resulting problems by @jeff-hykin in #1021
* v0.0.8 by @paul-nechifor in #1050
* Style changes in docs by @paul-nechifor in #1051
* Revert "Add uv to nix and fix resulting problems" by @leshy in #1053
* Transport benchmarks + Raw ros transport by @leshy in #1038
* feat: default to rerun-web and auto-open browser on startup (browser … by @Nabla7 in #1019
* bbox detections visual check by @leshy in #1017
* fix: only auto-open browser for rerun-web viewer backend by @Nabla7 in #1066
* move slow tests to integration by @paul-nechifor in #1063
* Streamline transport start/stop methods by @Kaweees in #1062
* Person follow skill with EdgeTAM by @paul-nechifor in #1042
* fix: increase costmap floor z_offset to avoid z-fighting by @Nabla7 in #1073
* Fixed issue #1074 by @alexlin2 in #1075
* ROS transports initial by @leshy in #1057
* Fix System Config Values for LCM on MacOS and Refactor by @jeff-hykin in #1065
* SHM Transport basic fixes by @leshy in #1041
* commented out Mem Transport test case by @leshy in #1077
* Docs/advanced streams update 2 by @leshy in #1078
* Fix more tests by @paul-nechifor in #1071
* feat: navigation docker updates from bona_local_dev by @baishibona in #1081
* Fix missing dependencies by @Kaweees in #1085
* Release readme fixes by @spomichter in #1076

## New Contributors
* @baishibona made their first contribution in #1081

**Full Changelog**: v0.0.7...v0.0.8
