
Person follow skill with EdgeTAM #1042

Merged
leshy merged 6 commits into dev from pauln-person-follow-edgetam
Jan 20, 2026

Conversation


@paul-nechifor paul-nechifor commented Jan 16, 2026

Usage

uv sync --all-extras
uv run dimos --robot-ip 192.168.X.XXX run unitree-go2-agentic
uv run humancli

Write "follow the man in blue" or some description of yourself.

E2E

Run the end-to-end test:

uv run pytest dimos/e2e_tests/test_person_follow.py

How it works

  • Uses EdgeTAM for detection.
  • Estimates the person's distance from the width of the detection. (I use width rather than height because a person's full height isn't always visible from the robot's low camera angle.)
  • Rotates or moves forward/back to keep the detection in the center.
  • Does not check for free space. This is especially risky when it moves backwards.
  • Cannot follow you around tight corners, either because it loses track or because it tries to move through the corner.
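The width-to-distance step above can be sketched with a pinhole-camera model. This is a hypothetical sketch, not the PR's actual code; the assumed person width and the function name are illustrative:

```python
# Pinhole model: bbox_width_px ≈ (person_width_m * fx) / distance_m,
# inverted here to estimate distance from the detection width.
ASSUMED_PERSON_WIDTH_M = 0.5  # illustrative shoulder-width guess


def estimate_distance(bbox_width_px: float, fx: float,
                      person_width_m: float = ASSUMED_PERSON_WIDTH_M) -> float:
    """Estimate distance (m) from detection width (px) and focal length fx (px)."""
    if bbox_width_px <= 0:
        raise ValueError("bounding box width must be positive")
    return (person_width_m * fx) / bbox_width_px
```

Under these assumptions, with fx = 600 px a 100 px wide detection puts the person roughly 3 m away.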

@paul-nechifor paul-nechifor requested a review from a team January 16, 2026 06:11

greptile-apps bot commented Jan 16, 2026

Greptile Summary

This PR adds a person-following skill using EdgeTAM (SAM2-based video tracking) for visual servoing control on the Unitree GO2 robot. The implementation includes a new skill container, EdgeTAM processor for frame tracking, and integration into the agentic blueprint system.

Key Changes:

  • PersonFollowSkillContainer: Implements visual servoing with proportional control for distance and heading
  • EdgeTAMProcessor: Wraps SAM2 video predictor with frame buffer management to prevent OOM
  • Dependencies: Added edgetam, omegaconf, hydra-core for model configuration and video tracking
  • Blueprint integration: Wired person_follow_skill into agentic configurations
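The proportional control for distance and heading mentioned above can be sketched as follows; the gains, target distance, and sign conventions are illustrative guesses, not values from this PR:

```python
def compute_twist(bbox_cx: float, image_width: int, distance_m: float,
                  target_m: float = 1.5, k_lin: float = 0.6,
                  k_ang: float = 1.2) -> tuple[float, float]:
    """Return (linear, angular) velocity from the detection centre and distance.

    Heading error is the horizontal offset of the detection centre,
    normalized to [-1, 1]; distance error is the offset from the target
    following distance. Each feeds a simple proportional term.
    """
    heading_err = (bbox_cx - image_width / 2) / (image_width / 2)
    dist_err = distance_m - target_m
    # Move forward when too far, backward when too close; turn toward the
    # person (negative angular z for a detection right of image centre).
    return k_lin * dist_err, -k_ang * heading_err
```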

Known Limitations (Per PR Description):

  • No obstacle avoidance or free space checking
  • Cannot follow around tight corners
  • May lose track at unexpected angles

Issues Found:

  • Line 202: Assert will crash the control loop if image streaming pauses (should use error handling)
  • Line 215: Potential ValueError if detections.detections is empty despite len(detections) == 0 check
  • Line 316: Magic number for angular gain should be parameterized for tuning

Confidence Score: 3/5

  • This PR introduces person-following functionality with video tracking, but has critical logic issues that could cause runtime crashes during control loop execution.
  • The score reflects two critical logic issues in the control loop: (1) An assert statement that will crash if image streaming pauses, and (2) A potential ValueError when calling max() on detections without proper list validation. These issues will cause the follow_person skill to fail during normal operation. The EdgeTAM processor and blueprint integration are solid, but the skill container has insufficient error handling for asynchronous image streams.
  • dimos/agents/skills/person_follow.py requires attention for error handling in the control loop, particularly around image availability assertions and detection list validation.

Important Files Changed

Filename Overview
dimos/agents/skills/person_follow.py New skill container implementing person-following using EdgeTAM for tracking. Uses visual servoing with proportional control. Has potential issues: assert statement at line 202 could fail without proper handling, and no validation that detections.detections is not empty before calling max() at line 215.
dimos/models/segmentation/edge_tam.py New EdgeTAM processor for video object tracking. Properly handles CUDA requirement, model loading with state dict validation, and frame buffer management. Memory cleanup logic is sound.
dimos/robot/unitree_webrtc/unitree_go2_blueprints.py Integrated person_follow_skill into the agentic blueprints. Passes camera_info_static correctly. No issues found.
pyproject.toml Added EdgeTAM, omegaconf, and hydra-core dependencies. Dependencies are appropriate for the new ML features.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant PersonFollowSkill
    participant EdgeTAMProcessor
    participant GO2Connection
    participant ColorImageStream
    
    Agent->>PersonFollowSkill: "follow_person(query)"
    PersonFollowSkill->>PersonFollowSkill: "detect person via QwenVL"
    PersonFollowSkill->>PersonFollowSkill: "spawn daemon thread"
    
    Note over PersonFollowSkill: "_follow_loop thread"
    PersonFollowSkill->>EdgeTAMProcessor: "init_track(image, bbox)"
    EdgeTAMProcessor->>EdgeTAMProcessor: "SAM2 initializes"
    
    loop "Control Loop (20 Hz)"
        ColorImageStream->>PersonFollowSkill: "new frame"
        PersonFollowSkill->>EdgeTAMProcessor: "process_image()"
        
        alt "Tracking Success"
            EdgeTAMProcessor-->>PersonFollowSkill: "detections"
            PersonFollowSkill->>PersonFollowSkill: "compute twist"
            PersonFollowSkill->>GO2Connection: "move(twist)"
        else "Lost Track"
            EdgeTAMProcessor-->>PersonFollowSkill: "empty detections"
            PersonFollowSkill->>GO2Connection: "move(zero)"
        end
        
        PersonFollowSkill->>PersonFollowSkill: "sleep"
    end
    
    Agent->>PersonFollowSkill: "stop_following()"
    PersonFollowSkill->>GO2Connection: "move(zero)"


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 3 comments


return
else:
lost_count = 0
best_detection = max(detections.detections, key=lambda d: d.bbox_2d_volume())
Contributor

logic: Calling max(detections.detections, ...) without checking if the list is empty will raise ValueError. Even though len(detections) == 0 is checked, the len() behavior on ImageDetections2D may differ from checking detections.detections length. Add explicit check: if detections.detections: before calling max().
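A minimal sketch of the suggested guard (pick_best is a hypothetical helper name; the argument duck-types the detections shape used in the snippet):

```python
def pick_best(detections):
    """Return the detection with the largest 2D bbox volume, or None."""
    if not detections.detections:  # explicit check on the inner list
        return None
    return max(detections.detections, key=lambda d: d.bbox_2d_volume())
```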

leshy
leshy previously approved these changes Jan 16, 2026

@leshy leshy left a comment


looks ok as a first sketch, would like to understand global planner issues etc.
will need to plug in reid also, Alex found some good model

self._stop_following()

try:
move_rpc = self.get_rpc_calls("GO2Connection.move")
Contributor

any reason we aren't just emitting twist from this as OUT? since this would make this work cross-embodiment

Contributor Author

It slipped my mind. 😅 Fixed now.

cached_features: dict[int, Any]


class EdgeTAMProcessor(Detector):
Contributor

isn't this some new model, for tracking only? so it would be good to have maybe a new dir for it, since it's not segmentation? and a new type for trackers, since this is not strictly a Detector, so we can swap trackers? not sure, what do you think?

Contributor

Actually, the way I implemented it before, I conformed it to the same structure as YOLO and YOLOE. Also, in detection/types/detection3d/seg.py there is a from_sam2_result() that is for this EdgeTAM output; it should output ImageDetection2D[Detection2DSeg]. I think it's better to put the general EdgeTAMTracker code (which conforms to the same detector type) under perception/, and keep the inference and model-loading code separate in models/ like I did before.

def __init__(self, camera_info: CameraInfo, global_config: GlobalConfig) -> None:
super().__init__()
self._global_config: GlobalConfig = global_config
self._latest_image: Image | None = None
Contributor

we have this fancy system of hot/cold getters

https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/advanced_streams.md#getter_hot---background-subscription-instant-reads

so you can do getter = getter_hot(self.image().observable())

idk if you want this, just letting you know it exists. also the getter is not disposable, should make it one

super().stop()

@skill()
def follow_person(self, query: str) -> str:
Contributor

generally I'd much prefer this as a reactive pipeline in which you

self.image_frame.observable().pipe(
  ops.map(track),
  ops.map(detection_to_twist),
).subscribe(self.twist.publish)

so you don't need to do this self._image thing, thread management etc etc. much cleaner, just letting you know, this is an ok first test

Contributor Author

But doing it that way ties the frequency of the cmd_vel to the frequency of the camera, right? And you still need some state like counting how many frames are without a detection, or knowing when to stop.

Contributor

@leshy leshy Jan 18, 2026

there are nice tools for all of this - like sample() maybe? it samples the latest value at a fixed hz, so you can emit twist at a faster fixed rate than the input image

https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/reactivex.md#sampleinterval---emit-latest-value-every-n-seconds

you still need some state like counting how many frames are without a detection, or knowing when to stop.

yeah you either start propagating more complex types (worst case but still nice)

{
  image: Image
  detection: Detection2d
  frames: n
}

and this gives you full power, and clear typing also, since you can transform between those complex types at each step

def bla(in: ComplexType1) -> ComplexType2

ops.map(bla)

but as an easier shortcut, before going for complex types, you can have a map with a side effect that counts empty Detection2Ds and starts returning false/exiting early.

detection_to_twist in the above example can have a side effect of counting the empty detections it got, and start outputting twist(0,0,0) - after the initial few twist(0,0,0) you can also choose to terminate the tracking loop as well, the observable just finishes

to be clear I'm not asking for a rewrite, just mentioning how this can be structured in a way in which it's easy and fun to quickly iterate (add twist easing etc); rotation towards a person and walking towards a person can be separate maps() that layer onto the same Twist, etc
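The side-effect shortcut described in this thread can be sketched as a small stateful mapper; the class name, twist values, and lost-frame threshold are all illustrative:

```python
class DetectionToTwist:
    """Map detections to twists, counting consecutive empty detections."""

    def __init__(self, max_lost: int = 5):
        self.lost = 0
        self.max_lost = max_lost

    def __call__(self, detection):
        if detection is None:  # tracker lost the person this frame
            self.lost += 1
            return (0.0, 0.0)  # stop in place while lost
        self.lost = 0
        return (0.5, 0.0)  # placeholder "follow" twist

    @property
    def should_stop(self) -> bool:
        """True once enough consecutive frames had no detection."""
        return self.lost >= self.max_lost
```

Used as a mapper in the pipeline, the stream can be terminated once should_stop flips.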

fx = self._camera_info.K[0] # focal length x in pixels
estimated_distance = (self._assumed_person_width * fx) / bbox_width

return estimated_distance
Contributor

@leshy leshy Jan 16, 2026

again first sketch so all good, we should make sure that the API for merging this with lidar and estimating segmentation distance (turning it into detection3d) is super easy. detections/ stuff implements this for projecting 2d into 3d

perception/detection/reid/module.py

this is for re-assembling fancy Detection2D types from ros msgs + image, but you'd do the same for aligning lidar + camera image

    def detections_stream(self) -> Observable[ImageDetections2D]:
        return backpressure(
            align_timestamped(
                self.image.pure_observable(),
                self.detections.pure_observable().pipe(
                    ops.filter(lambda d: d.detections_length > 0)  # type: ignore[attr-defined]
                ),
                match_tolerance=0.0,
                buffer_size=2.0,
            ).pipe(ops.map(lambda pair: ImageDetections2D.from_ros_detection2d_array(*pair)))  # type: ignore[misc, arg-type]
        )

documented at https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/temporal_alignment.md

also this might be interesting
https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/quality_filter.md

Contributor Author

Yeah, makes sense. I'll use the lidar to estimate the distance.

Contributor

@leshy leshy Jan 18, 2026

no need for first merge as well, your call how to proceed with this, it could turn out that lidar method is less robust also

@paul-nechifor
Contributor Author

@leshy

will need to plug in reid also, Alex found some good model

EdgeTAM does not require ReID.

@paul-nechifor paul-nechifor force-pushed the pauln-person-follow-edgetam branch from 8f5eb1a to 5d02818 Compare January 19, 2026 23:03
@leshy leshy merged commit af69f26 into dev Jan 20, 2026
13 checks passed
spomichter added a commit that referenced this pull request Jan 23, 2026
… Unitree Go2 Navigation & Exploration Beta

Pre-Release v0.0.8: Unitree Go2 Navigation & Exploration Beta, Transport Updates, Documentation updates, Rerun fixes, Person follow, Readme updates

## What's Changed
* Small docs clarification about stream getters by @leshy in #1043
* Fix split view on wide monitors by @jeff-hykin in #1048
* Docs: Install & Develop  by @jeff-hykin in #1022
* Add uv to nix and fix resulting problems by @jeff-hykin in #1021
* v0.0.8 by @paul-nechifor in #1050
* Style changes in docs by @paul-nechifor in #1051
* Revert "Add uv to nix and fix resulting problems" by @leshy in #1053
* Transport benchmarks + Raw ros transport by @leshy in #1038
* feat: default to rerun-web and auto-open browser on startup (browser … by @Nabla7 in #1019
* bbox detections visual check by @leshy in #1017
* fix: only auto-open browser for rerun-web viewer backend by @Nabla7 in #1066
* move slow tests to integration by @paul-nechifor in #1063
* Streamline transport start/stop methods by @Kaweees in #1062
* Person follow skill with EdgeTAM by @paul-nechifor in #1042
* fix: increase costmap floor z_offset to avoid z-fighting by @Nabla7 in #1073
* Fixed issue #1074 by @alexlin2 in #1075
* ROS transports initial by @leshy in #1057
* Fix System Config Values for LCM on MacOS and Refactor by @jeff-hykin in #1065
* SHM Transport basic fixes by @leshy in #1041
* commented out Mem Transport test case by @leshy in #1077
* Docs/advanced streams update 2 by @leshy in #1078
* Fix more tests by @paul-nechifor in #1071
* feat: navigation docker updates from bona_local_dev by @baishibona in #1081
* Fix missing dependencies by @Kaweees in #1085
* Release readme fixes by @spomichter in #1076

## New Contributors
* @baishibona made their first contribution in #1081

**Full Changelog**: v0.0.7...v0.0.8
