
Image upgrades! Impls for CUDA + numpy, along with an abstraction and full backwards compatibility #612

Merged
mdaiter merged 35 commits into dev from matt-image-upgrades
Oct 8, 2025

Contributor

@mdaiter commented Sep 10, 2025

@paul-nechifor is leading this integration, according to @spomichter.

Full abstraction of Image, so it is no longer locked in to a single platform.

Added all the functions @alexlin2 requested as native CUDA implementations, with OpenCV/numpy fallbacks.

Also, CUDA IPC and CPU IPC are now native.

    # PnP – Gauss–Newton (no distortion in batch), iterative per-instance
    def solve_pnp(
        self,
        object_points: np.ndarray,
Contributor Author

(Yes, I know: a few object points are accepted as np.ndarray and then put on-device. I'm going to go after this in a bit.)

Contributor Author

The reason is that you might not need these parameters on the GPU by the time you're calling this function. It's basically a user convenience: I can shift them to the GPU or keep them on the CPU. @alexlin2, feedback please.
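The trade-off described above (accept host arrays for convenience, move them on-device only when asked) could be sketched roughly as follows. `as_backend_array` is a hypothetical helper written for illustration, not a function from this PR; the only library calls assumed are `cupy.asarray`, `cupy.asnumpy`, and `numpy.asarray`.

```python
import numpy as np

try:
    import cupy as cp  # optional dependency; absent on CPU-only installs
except ImportError:
    cp = None


def as_backend_array(points, use_gpu: bool):
    """Return `points` as float32 on the requested backend (hypothetical helper)."""
    if use_gpu:
        if cp is None:
            raise RuntimeError("CuPy is not installed; cannot place points on the GPU")
        # cp.asarray copies host data to the device (no copy if already on-device).
        return cp.asarray(points, dtype=cp.float32)
    if cp is not None and isinstance(points, cp.ndarray):
        # The caller asked for CPU data but handed us a device array: copy it back.
        return cp.asnumpy(points).astype(np.float32)
    return np.asarray(points, dtype=np.float32)


object_points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
host_points = as_backend_array(object_points, use_gpu=False)
```

With a helper like this the caller decides where the data lives, which is the convenience argument made in the comment; whether the real `solve_pnp` should work this way is still the open question put to @alexlin2.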

msg.header.stamp.nsec = int((now - int(now)) * 1e9)

arr = (
    self.to_opencv()
Contributor Author

nit: repeat

r = rgb[..., 0].astype(cp.float32) # type: ignore
g = rgb[..., 1].astype(cp.float32) # type: ignore
b = rgb[..., 2].astype(cp.float32) # type: ignore
y = 0.299 * r + 0.587 * g + 0.114 * b
Contributor Author

(It's a YUV conversion that takes the Y channel, i.e. the luma/brightness. Let me know if I should drop a comment here.)

Contributor

Yeah, you could mention that's the BT.601 standard YUV conversion to make the numbers less magical.
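As a rough illustration of what that suggestion might look like, here is a NumPy-only sketch in which the three constants are named after the standard they come from. `BT601_LUMA_WEIGHTS` and `rgb_to_luma` are illustrative names, not identifiers from the PR, and the PR's own version operates on CuPy arrays rather than NumPy ones.

```python
import numpy as np

# BT.601 luma weights for the R, G and B channels; they sum to 1.0,
# so a pure white pixel keeps its full brightness.
BT601_LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def rgb_to_luma(rgb: np.ndarray) -> np.ndarray:
    """Return the Y (luma) channel of an (..., 3) RGB array as float32."""
    return rgb[..., :3].astype(np.float32) @ BT601_LUMA_WEIGHTS


white = np.full((2, 2, 3), 255, dtype=np.uint8)
red = np.zeros((1, 1, 3), dtype=np.uint8)
red[..., 0] = 255

white_luma = rgb_to_luma(white)  # every pixel is close to 255
red_luma = rgb_to_luma(red)      # close to 0.299 * 255
```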

Contributor Author

@mdaiter commented Sep 16, 2025

Stats:

run.py

* CUDA run:
  * Wall clock (real): 1m09s → a full 30 seconds faster than CPU.
  * User CPU time: 3m56s → slightly lower than CPU (≈ −25s).
  * System CPU time: 0m42s → dramatically lower (≈ −1m25s vs CPU).
* CPU run:
  * Wall clock (real): 1m39s
  * User CPU time: 4m22s
  * System CPU time: 2m07s
* What it shows:
  * Throughput win: the CUDA pipeline finishes ~30% faster in wall time.
  * CPU relief:
    * User time drops modestly (Python + preprocessing still dominate).
    * System time drops by ~⅔, which is big: kernel overhead from memcopies, etc., is being offloaded to the GPU stack.
  * Total CPU time:
    * CPU: ~6m28s
    * CUDA: ~4m38s
    * → That's ~30% less CPU consumed overall.

unitree_go2.py

* CUDA: Real: 1m12s, User: 2m26s, Sys: 0m34s
* CPU: Real: 1m38s, User: 3m56s, Sys: 2m25s

To break these down:

* CUDA run
  * Real (wall time): 1m12s → finished faster overall
  * User CPU time: 2m26s → much less CPU spent than the CPU version
  * System CPU time: 0m34s → also lower (fewer memory copies / syscalls bottlenecking)
* CPU run
  * Real (wall time): 1m38s → slower
  * User CPU time: 3m56s → about 1.6x the CPU load of CUDA
  * System CPU time: 2m25s → kernel overhead way higher than CUDA
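The percentages quoted above can be rechecked from the raw `time`-style stamps. The sketch below does that arithmetic for the run.py numbers; `to_seconds` and `percent_saved` are small helpers written for this illustration and assume stamps of the form `XmYYs`.

```python
def to_seconds(stamp: str) -> int:
    """Convert a stamp such as '1m09s' into whole seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def percent_saved(cpu: str, cuda: str) -> float:
    """Percentage by which the CUDA figure is lower than the CPU figure."""
    cpu_s, cuda_s = to_seconds(cpu), to_seconds(cuda)
    return 100.0 * (cpu_s - cuda_s) / cpu_s


# (CPU run, CUDA run) pairs copied from the run.py stats above.
RUN_PY = {
    "real": ("1m39s", "1m09s"),
    "user": ("4m22s", "3m56s"),
    "sys": ("2m07s", "0m42s"),
}

for name, (cpu, cuda) in RUN_PY.items():
    print(f"{name}: {percent_saved(cpu, cuda):.1f}% lower with CUDA")
```

The same two helpers can be pointed at the unitree_go2.py figures to compare the two workloads.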

@spomichter
Contributor

@paul-nechifor I am reviewing as well, but can you take a look at this, specifically the changes to the Image type?

np.testing.assert_array_equal(cp.asnumpy(out_gpu), out_cpu)


def test_draw_bounding_box_cpu():
Contributor

Just a question. What's your opinion on using actual images for testing instead of zeros or noise?

That is:

  1. add a small sample image
  2. perform the modification on the image and save it as expected.png
  3. in the test assert the result image is similar to expected.png

This way we can see what it's supposed to look like. And if we change the implementation and a test fails, we can compare what it should have been and what the new code is generating.
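The three steps above can be sketched as a golden-file test. Everything here is a stand-in: `draw_box` is a toy replacement for the real drawing code, the sample is a synthetic gradient rather than a photo, and the expected result is stored as a `.npy` file in a temporary directory so the sketch runs without an image library. A real test would load a committed expected.png instead.

```python
import os
import tempfile

import numpy as np


def draw_box(image, x0, y0, x1, y1, value=255):
    """Toy stand-in for the operation under test: draw a 1-pixel rectangle."""
    out = image.copy()
    out[y0, x0:x1 + 1] = value
    out[y1, x0:x1 + 1] = value
    out[y0:y1 + 1, x0] = value
    out[y0:y1 + 1, x1] = value
    return out


def assert_images_similar(actual, expected, max_mean_abs_diff=1.0):
    """Fail if the images differ in shape or by more than the allowed mean error."""
    assert actual.shape == expected.shape, "image shapes differ"
    diff = np.abs(actual.astype(np.float32) - expected.astype(np.float32)).mean()
    assert diff <= max_mean_abs_diff, f"mean abs diff {diff:.3f} is too large"


# Step 1: a small sample image (a horizontal gradient here).
sample = np.tile(np.arange(16, dtype=np.uint8) * 16, (12, 1))

with tempfile.TemporaryDirectory() as tmp:
    expected_path = os.path.join(tmp, "expected.npy")
    # Step 2: done once, then the file is reviewed by eye and committed.
    np.save(expected_path, draw_box(sample, 2, 2, 10, 8))
    # Step 3: the test recomputes the result and compares it to the stored one.
    expected = np.load(expected_path)
    result = draw_box(sample, 2, 2, 10, 8)
    assert_images_similar(result, expected)
```

Comparing with a tolerance rather than exact equality leaves room for small CPU/GPU rounding differences; the threshold used here is an arbitrary choice.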

Contributor Author

We could do that! Hit me with an image, happy to include.


def sharpness(self) -> float:
    if cp is None:
        return 0.0
Contributor

@leshy, Oct 7, 2025

This should throw an error, not return a magic number. Is there a situation in which cp is None here?

Contributor

Is this a different algorithm than the CPU image sharpness? Does it return the same value? Is it fair to compare them in tests?

Contributor Author

  1. Agreed, that makes sense. Technically, a zero-sharpness image is just blurry, but I understand where this comment is coming from.
  2. Check the parity tests. It's explicitly compared.
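One way the first point might be addressed is to fail loudly when the GPU backend is missing instead of returning 0.0. This is a minimal sketch; `BackendUnavailableError` and `require_backend` are hypothetical names, not code from the PR.

```python
class BackendUnavailableError(RuntimeError):
    """Raised when a GPU-only code path runs without its backend module."""


def require_backend(module, name: str = "cupy"):
    """Return `module`, or raise if the optional import left it as None."""
    if module is None:
        raise BackendUnavailableError(
            f"{name} is required for this operation but is not installed"
        )
    return module


# Simulate the state after a failed optional import (`cp = None`).
cp = None
try:
    require_backend(cp)
    raised = False
except BackendUnavailableError:
    raised = True
```

A `sharpness` method could then start with `xp = require_backend(cp)`, so a missing dependency surfaces as an exception rather than as a value that looks like a very blurry image.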

_ = cpu.resize(320, 240)
cpu_t = time.perf_counter() - t0
t0 = time.perf_counter()
for _ in range(5):
Contributor

We need to measure allocation time as well, not init once and then measure only the calculation. This needs to be done for all tests, since every module will need to initialize an image before doing operations on it, and init will happen on every frame on ingest.
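A benchmark loop along those lines might look like the sketch below, where the image is constructed inside the timed region on every iteration. `bench`, the frame copy, and the strided downsample are placeholders for the real Image constructor and resize call, which are not reproduced here.

```python
import time

import numpy as np


def bench(make_image, op, iterations: int = 5) -> float:
    """Mean seconds per iteration, with image construction inside the timed region."""
    t0 = time.perf_counter()
    for _ in range(iterations):
        img = make_image()  # allocation/init is paid on every frame, so time it
        _ = op(img)
    return (time.perf_counter() - t0) / iterations


frame = np.zeros((480, 640, 3), dtype=np.uint8)

# Placeholders: copying the frame stands in for image init on ingest,
# and a strided view stands in for the resize under test.
mean_seconds = bench(lambda: frame.copy(), lambda img: img[::2, ::2])
print(f"mean per-frame time including init: {mean_seconds * 1e3:.3f} ms")
```

For a GPU backend the construction step would also include the host-to-device copy, and the device would need to be synchronized before reading the clock; both are omitted from this CPU-only sketch.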

@mdaiter merged commit 42f410f into dev Oct 8, 2025
@mdaiter deleted the matt-image-upgrades branch October 8, 2025 20:18
@paul-nechifor restored the matt-image-upgrades branch October 8, 2025 21:27
spomichter added a commit that referenced this pull request Oct 28, 2025
Release v0.0.5


## What's Changed
* Unitree WebRTC implementation on rebased dev by @leshy in #277
* Update ros_observable_topic timeout to 100s by @leshy in #273
* Updated README, more clear on API key requirements and updated go2_ros2_sdk remote by @spomichter in #272
* Release v0.0.4 Patch: readme changes by @spomichter in #292
* Readme patch v0.0.4 by @spomichter in #293
* Development container & CI by @leshy in #278
* env/devcontainer ruff formatting/typing by @leshy in #294
* Global reformat 100 line length  by @spomichter in #300
* Global code reformat with ruff by @leshy in #295
* Position/Vector type cleanup & tests by @leshy in #297
* Linelength100 by @leshy in #301
* Auto-delivery of binary data files for testing, rewrite of dev script by @leshy in #298
* pre-commit hooks in dev container & CI, automatic LFS upload by @leshy in #303
* Removed all submodules - Testing by @spomichter in #306
* Fixed v0.0.4 Unitree ROS runfile broken by WebRTC development, Vector.py fixes by @spomichter in #307
* test/mapper by @leshy in #305
* Reduced CI cleanup frequency to PRs only into dev/main by @spomichter in #312
* DimOS Manipulation Framework, ObjectDetectionStream Changes by @spomichter in #308
* Added auto-license header to pre-commit by @spomichter in #336
* Move thread fix for alex planner by @leshy in #334
* base typing cleanup, sensor reply tests+docs by @leshy in #309
* devcontainer docs by @leshy in #338
* ci docs by @leshy in #339
* Add Cerebras Agent by @joshuajerin in #310
* Repo cleanup by @leshy in #340
* noros builds by @leshy in #341
* Update testing_stream_reply.md by @leshy in #342
* ONNX conversions for YOLOv11 and FastSAM by @mdaiter in #350
* Test cicd fake ros change by @spomichter in #361
* Reverted cleanup workflow frequency to on any PUSH due to CICD docker workflow issues by @spomichter in #360
* Trigger docker ros rerun by @spomichter in #363
* Ros CI change detection by @leshy in #364
* trigger full rebuild by @leshy in #365
* Add CLIP ONNX conversion and support, with passing vision and text tests by @mdaiter in #353
* CI fix 3 by @leshy in #367
* ONNX Support for YOLO, SAM2 + Unit tests for CLIP, YOLO, SAM2 by @spomichter in #345
* LFS moved to utils from testing by @leshy in #368
* Contact graspnet integration on pytorch and pyproject build processes setup with cuda/manipulation tags by @spomichter in #370
* data/* deletions by @leshy in #369
* Ci pre-commit and docker builds run in parallel by @leshy in #372
* Ci shared docker cache by @leshy in #371
* Unitree WebRTC integrated with full functionality, remove all ROS dependency, refactored entire robot base class and connection interface, added explore skill by @alexlin2 in #279
* Unitree WebRTC only implementation, Exploration skills [Staging --> Dev] by @spomichter in #379
* Dask lcm multiprocess by @leshy in #377
* DimOS Packaging & Build Improvements for CPU-only, CUDA, Manipulation installations by @spomichter in #394
* Multitree go2 by @leshy in #381
* better LCM system checks, fixes bin/lfs_push by @leshy in #382
* UnitreeSpeak skill over webrtc, Voice Interface added on localhost, Voice interface on mobile device on network by @spomichter in #400
* FIX: multiprocess by @leshy in #402
* Lcmspy cli by @leshy in #404
* changed position type name to pose by @alexlin2 in #358
* WIP: foxglove bridge stub by @leshy in #411
* Create running_without_devcontainer.md by @leshy in #405
* new LCM class format support by @leshy in #417
* Fixed PoseStamped ros_msgs error in dimos-lcm by @spomichter in #457
* Fixes move stream issue, Odom receive issue by @leshy in #456
* Small stream/type fixes for unitree by @leshy in #460
* Local planner, Global Planner, Explore, SpatialMemory working via LCM/Dask Multiprocess by @spomichter in #467
* Added working runfile to Unitreego2Light class by @spomichter in #474
* Point Cloud Filtering and Segmentation, Full 6DOF Object pose estimation, Grasp generation, ZED driver support, Hosted grasp integration by @spomichter in #458
* Stream fixes, Twist, Pose, Quaternion updates by @leshy in #471
* Added self-hosted runner to full CICD by @spomichter in #484
* Full Unitree (Local planner, Explore, SpatialMemory) FakeRTC/WebRTC LCM modules working in self-hosted devcontainer  by @spomichter in #487
* Porting types/ LCM msgs/ new LCM types, Transform visualization by @leshy in #477
* Tracking streams lcm dask refactor by @spomichter in #488
* Pytransforms by @leshy in #491
* Fix python and dev docker builds for CICD by @spomichter in #489
* Remove PIL Image Usage by @alexlin2 in #490
* Added missing __init__.py's to transforms  by @spomichter in #493
* Added tofix pytest tag back to addopts by @spomichter in #494
* Added module docs by @spomichter in #495
* SpatialMemory converted to Dask module, input LCM odom and video streams by @spomichter in #481
* Run modules tests only on 16gb runner by @spomichter in #499
* Trigger CI only on PR or push to main/dev by @spomichter in #500
* Added more aggressive cleanup workflows by @spomichter in #501
* Visual Servoing for Pick and Place Demo by @alexlin2 in #476
* Testing run-tests container pull fix and removed modules tests by @spomichter in #505
* Fix permissions in pre-build-cleanup by @spomichter in #508
* Moved pre-build cleanup to build template by @spomichter in #509
* dimos lcm update to main branch latest commit by @leshy in #498
* RPC Kwargs by @leshy in #503
* Transform system, stream convinience features, type checking by @leshy in #504
* Dimoslcm bump by @leshy in #510
* Testing UV builds in docker by @spomichter in #513
* OccupancyGrid, Path types by @leshy in #511
* subscribing to transports/streams from main loop by @leshy in #524
* Alex Lin's version of ROS Nav2 by @alexlin2 in #514
* Agent refactor conversation history by @spomichter in #541
* Exposed optional memory_limit param in dimos core by @spomichter in #540
* Agent refactor by @spomichter in #535
* Validating transforms with ros examples by @leshy in #538
* rpc timeout by @leshy in #542
* MuJoCo Simulation by @paul-nechifor in #539
* Revert "MuJoCo Simulation" by @spomichter in #548
* perception refactor to be on parity with old architecture by @alexlin2 in #534
* Skill coordinator by @leshy in #536
* WIP Mujoco simulation by @paul-nechifor in #549
* Fix event loop leak by @paul-nechifor in #547
* Correct way to build package directly in non-editable mode, no manife… by @spomichter in #551
* Office environment mujoco by @paul-nechifor in #554
* Less bandwidth usage on LCM, bug fixed with navigation by @alexlin2 in #559
* disabled old agent tests by @leshy in #563
* Camera Module Refactor, added image rectification by @alexlin2 in #566
* long rpc timeout by @leshy in #569
* Twist message for all move command, added keyboard teleop for easy robot control in sim by @alexlin2 in #570
* numerical sort for sensor replay by @leshy in #564
* 2d detection module by @leshy in #567
* Stream timestamp alignment by @leshy in #557
* Sharpness for Images by @leshy in #560
* Jetson humanoid integration by @spomichter in #590
* 2d detection module + Agent2 - yolo demo by @leshy in #582
* jetson.md cleanup by @spomichter in #602
* Unitree b1 integration with continuous cmd_vel Twist interface, joystick control for testing, C++ UDP server for onboard B1 by @spomichter in #601
* Joystick integrated g1 humanoid by @spomichter in #603
* Unitree b1 manipulation pose integration by @spomichter in #604
* use SHM in Foxglove by @paul-nechifor in #607
* CPU isolated shared mem by @mdaiter in #589
* silence unnecessary unitree go 2 tricks by @paul-nechifor in #615
* Pshm to lcm by @paul-nechifor in #616
* Unitree agents2 skill integration paul by @paul-nechifor in #617
* Unitree go2 runfile integration tool call issues by @spomichter in #605
* gstreamer camera by @paul-nechifor in #613
* zed local node by @leshy in #623
* ROS Bridge for Unitree G1 and B1 Navigation, Working G1 navigation by @spomichter in #610
* B1 ros navigation rebase by @spomichter in #626
* Added build directory to gitignore by @yashas-salankimatt in #628
* 2D detection module + Pointcloud localization by @leshy in #583
* Camera calibration loading by @leshy in #629
* Agent2 nav skills by @paul-nechifor in #630
* WIP shared mem again by @paul-nechifor in #650
* Fix leaks by @paul-nechifor in #649
* Fix SHM leak by @paul-nechifor in #652
* Suppress echos with counter by @paul-nechifor in #653
* Removing websocket vis causing crazy lag by @spomichter in #656
* Suppress with UUID by @paul-nechifor in #655
* Modules navigate object bbox by @spomichter in #654
* Ros bridge test fix by @alexlin2 in #660
* video g1 spatial mem + detection - tomerge by @leshy in #651
* Update README.md by @spomichter in #664
* Image upgrades! Impls for CUDA + numpy, along with an abstraction and full backwards compatibility by @mdaiter in #612
* Revert "Image upgrades! Impls for CUDA + numpy, along with an abstraction and full backwards compatibility" by @leshy in #665
* Detection second pass by @leshy in #662
* CudaImage by @spomichter in #671
* Add start/stop to all modules and other resources by @paul-nechifor in #675
* forgotten context managers by @paul-nechifor in #676
* CUDAImage, NumpyImage, Image implementations with robust backend tests for image operations by @spomichter in #680
* CudaImage by @spomichter in #677
* alibaba env var fix by @leshy in #673
* Rename FakeRTC --> ReplayRTC by @spomichter in #681
* Fix websocketvis performance rebase by @spomichter in #682
* Alexl ros nav intergration by @alexlin2 in #632
* detection pipeline rewrite, embedding, vl model standardization, reid system by @leshy in #674
* cli tooling theme by @leshy in #687
* Fix spatial memory bug in g1  by @spomichter in #689
* Add autoconnect back2 by @paul-nechifor in #684
* Add ability to remap module connections name. by @paul-nechifor in #698
* Add transport which encodes images as JPEG to improve performance. by @paul-nechifor in #693
* New Ruff autofixes by @paul-nechifor in #694

## New Contributors
* @joshuajerin made their first contribution in #310
* @mdaiter made their first contribution in #350
* @yashas-salankimatt made their first contribution in #628

**Full Changelog**: https://github.com/dimensionalOS/dimos/commits/v0.0.5
spomichter pushed a commit that referenced this pull request Jan 8, 2026
Image upgrades! Impls for CUDA + numpy, along with an abstraction and full backwards compatibility

Former-commit-id: 42f410f
paul-nechifor pushed a commit that referenced this pull request Jan 8, 2026
Image upgrades! Impls for CUDA + numpy, along with an abstraction and full backwards compatibility

Former-commit-id: cae9d1b [formerly 42f410f]
Former-commit-id: 7b38a68
jeff-hykin pushed a commit that referenced this pull request Jan 9, 2026
Image upgrades! Impls for CUDA + numpy, along with an abstraction and full backwards compatibility

Former-commit-id: 967b376 [formerly 42f410f]
Former-commit-id: 7b38a68

4 participants