## Conversation

**Greptile Summary**

This PR adds a person-following skill using EdgeTAM (SAM2-based video tracking) for visual servoing control on the Unitree GO2 robot. The implementation includes a new skill container, an EdgeTAM processor for frame tracking, and integration into the agentic blueprint system.

**Key Changes:**
**Known Limitations (per PR description):**

**Issues Found:**

**Confidence Score:** 3/5

**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Agent
    participant PersonFollowSkill
    participant EdgeTAMProcessor
    participant GO2Connection
    participant ColorImageStream
    Agent->>PersonFollowSkill: follow_person(query)
    PersonFollowSkill->>PersonFollowSkill: detect person via QwenVL
    PersonFollowSkill->>PersonFollowSkill: spawn daemon thread
    Note over PersonFollowSkill: _follow_loop thread
    PersonFollowSkill->>EdgeTAMProcessor: init_track(image, bbox)
    EdgeTAMProcessor->>EdgeTAMProcessor: SAM2 initializes
    loop Control Loop (20 Hz)
        ColorImageStream->>PersonFollowSkill: new frame
        PersonFollowSkill->>EdgeTAMProcessor: process_image()
        alt Tracking Success
            EdgeTAMProcessor-->>PersonFollowSkill: detections
            PersonFollowSkill->>PersonFollowSkill: compute twist
            PersonFollowSkill->>GO2Connection: move(twist)
        else Lost Track
            EdgeTAMProcessor-->>PersonFollowSkill: empty detections
            PersonFollowSkill->>GO2Connection: move(zero)
        end
        PersonFollowSkill->>PersonFollowSkill: sleep
    end
    Agent->>PersonFollowSkill: stop_following()
    PersonFollowSkill->>GO2Connection: move(zero)
```
```python
            return
        else:
            lost_count = 0
            best_detection = max(detections.detections, key=lambda d: d.bbox_2d_volume())
```
logic: Calling `max(detections.detections, ...)` without checking whether the list is empty will raise `ValueError`. Even though `len(detections) == 0` is checked, `len()` on `ImageDetections2D` may not reflect the length of `detections.detections`. Add an explicit check, `if detections.detections:`, before calling `max()`.
**leshy** left a comment:

looks ok as a first sketch, would like to understand global planner issues etc.

will need to plug in ReID also, Alex found some good model
`dimos/agents/skills/person_follow.py` (outdated)

```python
        self._stop_following()

        try:
            move_rpc = self.get_rpc_calls("GO2Connection.move")
```
any reason we aren't just emitting twist from this as OUT? since this would make this work cross-embodiment
It slipped my mind. 😅 Fixed now.
```python
    cached_features: dict[int, Any]


class EdgeTAMProcessor(Detector):
```
isn't this some new model, for tracking only? so it would be good to have maybe a new dir for it, since it's not segmentation? and a new type for trackers, since this is not strictly a Detector, so we can swap trackers? not sure, what do you think?
Actually, the way I implemented it before, I conformed it to the same structure as YOLO and YOLOE; also, in `detection/types/detection3d/seg.py` there is a `from_sam2_result()` that handles this EdgeTAM output. It should output `ImageDetection2D[Detection2DSeg]`. I think it's better to put the general EdgeTAMTracker code (which conforms to the same detector type) under `perception/`, and keep the inference and model-loading code separate in `models/` like I did before.
```python
    def __init__(self, camera_info: CameraInfo, global_config: GlobalConfig) -> None:
        super().__init__()
        self._global_config: GlobalConfig = global_config
        self._latest_image: Image | None = None
```
we have this fancy system of hot/cold getters, so you can do `getter = getter_hot(self.image().observable())`

idk if you want this, just letting you know it exists. also the getter is not disposable, should make it one
```python
        super().stop()


    @skill()
    def follow_person(self, query: str) -> str:
```
generally I'd much prefer this as a reactive pipeline in which you

```python
self.image_frame.observable().pipe(
    ops.map(track),
    ops.map(detection_to_twist),
).subscribe(self.twist.publish)
```

so you don't need to do this self._image thing, thread management etc etc. much cleaner, just letting you know, this is an ok first test
But doing it that way ties the frequency of the cmd_vel to the frequency of the camera, right? And you still need some state like counting how many frames are without a detection, or knowing when to stop.
there are nice tools for all of this - like sample() maybe? it samples the latest value at a fixed hz, so you can emit twist at a faster fixed rate than the input image

> you still need some state like counting how many frames are without a detection, or knowing when to stop.

yeah, you either start propagating more complex types (worst case but still nice)

```
{
    Image: image,
    Detection2d: ...,
    frames: n,
}
```

and this gives you full power, and clear typing also, since you can transform between those complex types at each step

```
def bla(in: ComplexType1) -> ComplexType2
ops.map(bla)
```

but as an easier shortcut before going for complex types, you can have a map with a side effect that counts empty Detection2Ds and starts returning false/exiting early.

detection_to_twist in the above example can have a side effect of counting the empty detections it got, and start outputting twist(0,0,0) - after an initial few twist(0,0,0) you can also choose to terminate the tracking loop as well, the observable just finishes

to be clear I'm not asking for a rewrite, just mentioning how this can be structured in a way in which it's easy and fun to quickly iterate (add twist easing etc); rotation towards a person and walking towards a person can be separate maps() that layer onto the same Twist, etc
`dimos/agents/skills/person_follow.py` (outdated)

```python
        fx = self._camera_info.K[0]  # focal length x in pixels
        estimated_distance = (self._assumed_person_width * fx) / bbox_width

        return estimated_distance
```
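The diff above is a standard pinhole-model width-to-distance estimate; here is a standalone sketch with assumed numbers (the focal length and person width below are made up, not values from the PR):

```python
ASSUMED_PERSON_WIDTH_M = 0.5  # assumed shoulder width of a person, metres
FX_PX = 910.0                 # assumed focal length in pixels, i.e. K[0]


def estimate_distance(bbox_width_px: float) -> float:
    # pinhole model: real_width / distance == pixel_width / focal_length,
    # so distance = (real_width * focal_px) / pixel_width
    return (ASSUMED_PERSON_WIDTH_M * FX_PX) / bbox_width_px


print(round(estimate_distance(150.0), 2))  # a person ~150 px wide reads as ~3.03 m
```

The estimate degrades when the person turns sideways (bbox width shrinks, distance is overestimated), which is one motivation for the lidar-based alternative discussed below.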
again, first sketch so all good; we should make sure that the API for merging this with lidar and estimating segmentation distance (turning it into a detection3d) is super easy. the detections/ stuff implements this for projecting 2d into 3d

`perception/detection/reid/module.py`

this is for re-assembling fancy Detection2D types from ros msgs + image, but you'd do the same for aligning lidar + camera image
```python
def detections_stream(self) -> Observable[ImageDetections2D]:
    return backpressure(
        align_timestamped(
            self.image.pure_observable(),
            self.detections.pure_observable().pipe(
                ops.filter(lambda d: d.detections_length > 0)  # type: ignore[attr-defined]
            ),
            match_tolerance=0.0,
            buffer_size=2.0,
        ).pipe(ops.map(lambda pair: ImageDetections2D.from_ros_detection2d_array(*pair)))  # type: ignore[misc, arg-type]
    )
```

documented at https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/temporal_alignment.md
also this might be interesting
https://github.com/dimensionalOS/dimos/blob/dev/docs/api/sensor_streams/quality_filter.md
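Conceptually, temporal alignment pairs each detection with the image closest in time. A minimal, dependency-free list-based sketch of that idea (the real dimos helper works on observables; this signature, the `tolerance` parameter, and the list representation are illustrative):

```python
def align_timestamped(images, detections, tolerance=0.05):
    """images, detections: lists of (timestamp_s, payload) tuples.
    Returns (image_payload, detection_payload) pairs matched in time."""
    pairs = []
    for t_det, det in detections:
        # nearest image by timestamp, accepted only within the tolerance window
        best = min(images, key=lambda im: abs(im[0] - t_det), default=None)
        if best is not None and abs(best[0] - t_det) <= tolerance:
            pairs.append((best[1], det))
    return pairs
```

The same nearest-within-tolerance matching would apply when aligning lidar scans with camera frames before projecting 2D detections into 3D.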
Yeah, makes sense. I'll use the lidar to estimate the distance.
no need for the first merge as well; your call how to proceed with this, it could turn out that the lidar method is less robust also

EdgeTAM does not require ReID.
Force-pushed from `8f5eb1a` to `5d02818`.
… Unitree Go2 Navigation & Exploration Beta

**Pre-Release v0.0.8: Unitree Go2 Navigation & Exploration Beta, Transport Updates, Documentation updates, Rerun fixes, Person follow, Readme updates**

## What's Changed

* Small docs clarification about stream getters by @leshy in #1043
* Fix split view on wide monitors by @jeff-hykin in #1048
* Docs: Install & Develop by @jeff-hykin in #1022
* Add uv to nix and fix resulting problems by @jeff-hykin in #1021
* v0.0.8 by @paul-nechifor in #1050
* Style changes in docs by @paul-nechifor in #1051
* Revert "Add uv to nix and fix resulting problems" by @leshy in #1053
* Transport benchmarks + Raw ros transport by @leshy in #1038
* feat: default to rerun-web and auto-open browser on startup (browser …) by @Nabla7 in #1019
* bbox detections visual check by @leshy in #1017
* fix: only auto-open browser for rerun-web viewer backend by @Nabla7 in #1066
* move slow tests to integration by @paul-nechifor in #1063
* Streamline transport start/stop methods by @Kaweees in #1062
* Person follow skill with EdgeTAM by @paul-nechifor in #1042
* fix: increase costmap floor z_offset to avoid z-fighting by @Nabla7 in #1073
* Fixed issue #1074 by @alexlin2 in #1075
* ROS transports initial by @leshy in #1057
* Fix System Config Values for LCM on MacOS and Refactor by @jeff-hykin in #1065
* SHM Transport basic fixes by @leshy in #1041
* commented out Mem Transport test case by @leshy in #1077
* Docs/advanced streams update 2 by @leshy in #1078
* Fix more tests by @paul-nechifor in #1071
* feat: navigation docker updates from bona_local_dev by @baishibona in #1081
* Fix missing dependencies by @Kaweees in #1085
* Release readme fixes by @spomichter in #1076

## New Contributors

* @baishibona made their first contribution in #1081

**Full Changelog**: v0.0.7...v0.0.8
## Usage

Write "follow the man in blue" or some description of yourself.

## E2E

Run the end-to-end test:

## How it works