This pipeline pairs four synchronized cameras with 2D hand pose detection (via MMPose or MediaPipe) to reconstruct 3D hand landmarks and visualize reprojection quality. Camera 0 defines the world frame; all coordinates are reported in meters relative to that camera.
- 4 cameras mounted in the diamond layout (Cam0 bottom-left, Cam1 top-left, Cam2 top-right, Cam3 bottom-right)
- Rigid mounts with matching heights and slight inward tilt (≈10–15°)
- 9×6 inner-corner chessboard (square size 23 mm unless you change the scripts)
- Even, diffuse lighting across the workspace
```bash
pip install -r requirements.txt
mkdir -p video/camera video/hand
```

Download (or symlink) an MMDetection hand detector config/checkpoint and an MMPose top-down hand pose config/checkpoint; you will pass their paths to `hand_inference.py` via CLI flags.
1. **Record calibration videos**

   Place the chessboard throughout the capture volume while all four cameras record simultaneously. Save as `video/camera/cam0.mp4` … `cam3.mp4`.

2. **Run calibration**

   ```bash
   python calibration.py
   ```

   Produces `output/calibration/multi_camera_calib.npz` and `multi_camera_rectify.npz`. Target per-camera RMS < 0.5 px and stereo RMS < 1.0 px.

3. **Record hand-motion videos**

   Capture synchronized hand footage and store it in `video/hand/0.mp4` … `3.mp4` (or pass `--sequence` to change the folder).
4. **Run 2D hand inference**

   Pick the variant that best fits your use case:

   - MMPose (default / highest accuracy)

     ```bash
     python hand_inference.py \
         --det-config models/det/rtmdet_tiny_8xb32-300e_coco.py \
         --det-checkpoint models/det/rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth \
         --pose-config models/pose/rtmpose-m_8xb256-210e_hand5-256x256.py \
         --pose-checkpoint models/pose/rtmpose-m_simcc-hand5_pt-aic-coco_210e-256x256-74fb594_20230320.pth \
         --device cuda:0 \
         --sequence hand
     ```

     `hand_inference.py` is a thin wrapper around `hand_inference_mmpose.py`, so you can call either script with the same flags.

   - MediaPipe (no configs/checkpoints required)

     Useful for quick tests on CPU-only machines. Produces the same pickle format, so downstream steps remain unchanged.

     ```bash
     python hand_inference_mediapipe.py --sequence hand
     ```

   Both variants write cached detections to `output/detections/<sequence>_2d_detections.pkl`. With `--preview`, both show a real-time visualization of the detected hand skeletons (21 keypoints + connections) overlaid on each camera view.
5. **Triangulate 3D hands**

   ```bash
   python hand_triangulation.py \
       --detections output/detections/hand_2d_detections.pkl \
       --display
   ```

   Writes multi- and single-hand 3D trajectories under `output/tracking/`.
6. **Evaluate results (optional but recommended)**

   ```bash
   python evaluate.py
   python check_hand_consistency.py
   ```

   Generates a reprojection video and diagnostic plots under `output/evaluation/` and `output/visualization/`.
7. **Check calibration quality (optional)**

   ```bash
   python checkerboard_eval.py
   ```

   Summarizes checkerboard reprojection errors for sanity checking.
- `--det-config` / `--det-checkpoint`: MMDetection hand detector (e.g., RTMDet hand). Set `--det-cat-id` if your detector uses a different class index (default `0`).
- `--pose-config` / `--pose-checkpoint`: MMPose top-down hand pose model (e.g., RTMPose). Make sure the model predicts 21 keypoints that follow the MediaPipe ordering.
- `--device`: `cpu`, `cuda:0`, etc. Defaults to `cuda:0` if available, otherwise `cpu`.
- Optional `--det-score-thr` / `--pose-score-thr` tune per-camera detection filtering; `--max-hands-per-view` limits per-camera tracking.
- Use `--sequence` to target a different `video/<sequence>/cam.mp4` folder, and `--output` to rename the cached detection pickle.
- The MediaPipe variant ignores the detector/pose config flags; tune its behavior via `--det-score-thr`, `--pose-score-thr`, `--max-hands-per-view`, and `--max-frames`.
```
Hand_MoCap/
├── calibration.py              # Step 1: chessboard-based calibration
├── hand_inference_mmpose.py    # Step 2a: MMPose detection + pose estimation
├── hand_inference_mediapipe.py # Step 2b: MediaPipe detection + pose estimation
├── hand_triangulation.py       # Step 3: multi-view matching & 3D reconstruction
├── evaluate.py                 # Step 4: reprojection video generation
├── checkerboard_eval.py        # Optional: calibration quality diagnostics
├── video_utils.py              # Utility: video discovery helpers
├── video/                      # Input footage (camera + hand recordings)
│   ├── camera/                 # Calibration videos
│   └── hand/                   # Hand motion videos
├── output/                     # Generated artifacts (auto-created)
│   ├── calibration/            # Calibration parameters
│   ├── detections/             # Cached 2D detections
│   ├── tracking/               # 3D hand trajectories
│   ├── evaluation/             # Reprojection videos
│   └── visualization/          # Consistency/diagnostic plots
├── models/                     # MMPose/MMDetection model files
│   ├── det/                    # Detection model configs/checkpoints
│   └── pose/                   # Pose model configs/checkpoints
└── requirements.txt            # Python dependencies
```
- `output/calibration/`
  - `multi_camera_calib.npz` – intrinsics (K0–K3), distortion (D0–D3), rotations (R1–R3), and translations (T1–T3) expressed from camera 0
  - `multi_camera_rectify.npz` – rectification transforms (`R1_01`, `P2_03`, `Q_02`, …) for stereo matching
- `output/detections/`
  - `<sequence>_2d_detections.pkl` – cached per-frame, per-camera 2D keypoints produced by `hand_inference.py`
- `output/tracking/`
  - `hand_3d_positions_multi.pkl` – list of frames; each frame contains 0–2 hands with `(21, 3)` arrays in meters
  - `hand_3d_positions.npy` – single-hand array (first hand per frame) with NaNs when no hand is present
- `output/evaluation/`
  - `reprojection_4cam.mp4` – 2×2 grid showing original footage with reprojected landmarks
- `output/visualization/`
  - `hand_consistency_check.png` – coverage, motion, and ID-consistency plots
Delete `output/` to reset the workspace; scripts recreate folders as needed.
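The calibration archive can be consumed directly with NumPy. The sketch below demonstrates the camera-0-to-camera-i mapping `P_i = R_i @ P_0 + T_i` on synthetic extrinsics; `to_camera_frame` is an illustrative helper, not a function from the repository.

```python
import numpy as np

def to_camera_frame(points_cam0, R, T):
    """Map (N, 3) points from the camera-0 (world) frame into camera i's
    frame via P_i = R_i @ P_0 + T_i."""
    return points_cam0 @ R.T + T.reshape(1, 3)

# Synthetic extrinsics: a 90-degree yaw plus a 0.7 m sideways offset.
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])
T = np.array([0.7, 0.0, 0.0])
p = np.array([[0.0, 0.0, 0.5]])      # a point 0.5 m in front of camera 0
p_cam_i = to_camera_frame(p, R, T)   # -> [[1.2, 0, 0]]

# With real data (key names as documented above):
#   calib = np.load("output/calibration/multi_camera_calib.npz")
#   p_cam1 = to_camera_frame(p, calib["R1"], calib["T1"])
```

`T.reshape(1, 3)` accepts the translation whether the archive stores it flat or as a column vector.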
Top view (diamond layout):

```
     Cam1         Cam2
        \         /
         \       /
    45°   \     /   45°
           \   /
      [Hand Workspace]
           /   \
    45°   /     \   45°
         /       \
        /         \
     Cam0         Cam3
```
- Positioning: keep all cameras ~0.5 m from the workspace center at a common height (~30 cm above the surface) and tilt inward by ~10–15°.
- Baselines: expect ~0.7 m between adjacent cameras; Camera 1↔3 forms the widest pair (~1 m) and provides strong depth cues.
- Coverage: the most reliable capture volume is a 20 cm cube at the center; quality remains good out to ≈35 cm before dropping to two-camera coverage near the edges.
- Mounts are rigid and heights match
- Lighting is uniform with minimal glare or shadows
- Cameras share the same resolution and frame rate (≥30 FPS)
- Auto-exposure/white balance are consistent or locked
- Recording start times are tightly synchronized (<100 ms skew)
- Defaults: `chessboard_size = (9, 6)` inner corners, `square_size = 0.023` m.
- Samples every `sample_every_n_frames` frames (default 50) up to `max_frames` sets.
- Prints per-camera RMS and stereo RMS errors plus baselines. Recalibrate if RMS > 1.0 px or baselines are inconsistent.
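The per-camera stage of this calibration follows OpenCV's standard chessboard workflow. The sketch below is a simplified reference implementation, not the script itself — the real `calibration.py` additionally estimates stereo extrinsics between camera pairs, and `board_object_points` / `calibrate_one_camera` are illustrative names.

```python
import numpy as np

CHESSBOARD = (9, 6)   # inner corners (script default)
SQUARE = 0.023        # square size in metres (script default)

def board_object_points(size=CHESSBOARD, square=SQUARE):
    """Chessboard corner grid in board coordinates (z = 0), shape (54, 3)."""
    objp = np.zeros((size[0] * size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:size[0], 0:size[1]].T.reshape(-1, 2) * square
    return objp

def calibrate_one_camera(gray_frames):
    """Estimate intrinsics for one camera from sampled grayscale frames."""
    import cv2   # deferred so board_object_points works without OpenCV
    objpoints, imgpoints = [], []
    for gray in gray_frames:
        found, corners = cv2.findChessboardCorners(gray, CHESSBOARD)
        if not found:
            continue
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        objpoints.append(board_object_points())
        imgpoints.append(corners)
    rms, K, D, _, _ = cv2.calibrateCamera(
        objpoints, imgpoints, gray_frames[0].shape[::-1], None, None)
    return rms, K, D   # target: rms < 0.5 px, per the guidance above
```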
- Runs 2D hand detection per camera view to cache 2D joints for every frame in a synchronized sequence.
- MMPose variant: uses the MMDetection + MMPose pipeline; accepts `--det-config`, `--pose-config`, `--device`, and related flags.
- MediaPipe variant: CPU-friendly alternative requiring no model downloads; tune via `--det-score-thr` and `--pose-score-thr`.
- Both support `--preview` for real-time skeleton visualization (green lines + keypoints) and `--sequence` to select the input folder.
- Writes a pickle containing per-frame, per-camera detections (keypoints + confidences) under `output/detections/`.
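For reference, the sketch below shows roughly what a per-frame detection step looks like in the MediaPipe variant, using the public `mediapipe.solutions.hands` API. `detect_hands` and `landmarks_to_pixels` are illustrative helpers, not functions from the script, and a real loop would create the `Hands` object once per video rather than once per frame.

```python
import numpy as np

def landmarks_to_pixels(norm_xy, width, height):
    """Convert MediaPipe's normalized (21, 2) landmarks to pixel coords."""
    return np.asarray(norm_xy, dtype=np.float32) * [width, height]

def detect_hands(frame_bgr):
    """Run MediaPipe Hands on one BGR frame; returns a list of (21, 2)
    pixel-coordinate arrays, one per detected hand."""
    import cv2
    import mediapipe as mp   # deferred so the helper above stays importable
    with mp.solutions.hands.Hands(static_image_mode=True,
                                  max_num_hands=2) as hands:
        res = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not res.multi_hand_landmarks:
        return []
    h, w = frame_bgr.shape[:2]
    return [landmarks_to_pixels([(lm.x, lm.y) for lm in hand.landmark], w, h)
            for hand in res.multi_hand_landmarks]
```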
- Consumes the cached detections, camera calibration, and (optionally) the raw videos to match hands across views and triangulate them.
- Performs bundle-adjusted triangulation with per-landmark outlier rejection; enable `--debug-matching` for verbose pairing logs and `--display` for reprojected overlays.
- Saves both multi-hand (pickle) and single-hand (NumPy) trajectories under `output/tracking/`.
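The linear core of multi-view triangulation can be sketched as a direct linear transform (DLT). The actual script refines this with bundle adjustment and outlier rejection; `triangulate_point` is an illustrative helper, not a function from the repository.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Direct linear transform (DLT) triangulation of one landmark.

    proj_mats: (3, 4) projection matrices, one per view (K_i @ [R_i | T_i])
    points_2d: matching (u, v) observations, one per view
    Returns the (3,) point in the world (camera-0) frame.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])   # each view contributes two
        rows.append(v * P[2] - P[1])   # linear constraints on X
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                         # null-space vector, homogeneous
    return X[:3] / X[3]
```

Unlike `cv2.triangulatePoints`, which handles the two-view case, the DLT above accepts all four cameras at once — one reason a multi-view pipeline can tolerate occlusion in any single view.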
- Reprojects tracked 3D landmarks into all cameras to visually validate alignment.
- Video layout is `Cam0 | Cam1` over `Cam2 | Cam3`; green landmarks mark the first hand, magenta the second.
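Conceptually, the reprojection check reduces to projecting each 3D landmark through every camera's `K`, `R`, `T` and measuring the pixel distance to the 2D detections. A distortion-free sketch follows (the real evaluator can rely on `cv2.projectPoints`, which also applies the distortion coefficients); both function names are illustrative.

```python
import numpy as np

def reproject(points_3d, K, R, T):
    """Project (N, 3) world-frame (camera-0) landmarks into one camera.
    Lens distortion is omitted here for brevity."""
    cam = points_3d @ R.T + T.reshape(1, 3)   # world -> camera frame
    uv = cam[:, :2] / cam[:, 2:3]             # perspective divide
    return uv @ K[:2, :2].T + K[:2, 2]        # focal lengths + principal point

def mean_reprojection_error(points_3d, detected_2d, K, R, T):
    """Mean pixel distance between reprojected and detected keypoints."""
    diff = reproject(points_3d, K, R, T) - detected_2d
    return float(np.linalg.norm(diff, axis=1).mean())
```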
- Aggregates statistics such as per-frame hand counts, wrist trajectories, and inter-hand distances to spot ID swaps or dropouts.
- `multi_camera_calib.npz`

  Load with `np.load`; access `K*`, `D*`, `R*`, `T*`, `E*`, `F*`. Rotations/translations map camera-0 coordinates into the other camera frames (`P_i = R_i @ P_0 + T_i`).
- `hand_3d_positions_multi.pkl`

  ```python
  import pickle

  with open("output/tracking/hand_3d_positions_multi.pkl", "rb") as f:
      frames = pickle.load(f)
  # frames[frame_idx][hand_idx][landmark_idx] -> (x, y, z) in meters
  ```

  Landmarks follow MediaPipe ordering (0 wrist, 4 thumb tip, 8 index tip, 12 middle tip, 16 ring tip, 20 pinky tip).
- `hand_3d_positions.npy`

  Shape `(num_frames, 21, 3)`; NaN rows indicate frames without a detected hand.
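Given the single-hand array and the landmark ordering above, a derived signal such as pinch distance takes only a few lines; `pinch_distance` is an illustrative helper, not part of the repository.

```python
import numpy as np

THUMB_TIP, INDEX_TIP = 4, 8   # MediaPipe landmark indices

def pinch_distance(frames):
    """Per-frame thumb-tip/index-tip distance in metres from a
    (num_frames, 21, 3) trajectory; NaN where no hand was detected."""
    return np.linalg.norm(frames[:, THUMB_TIP] - frames[:, INDEX_TIP], axis=1)

# Usage against the exported file:
#   frames = np.load("output/tracking/hand_3d_positions.npy")
#   d = pinch_distance(frames)    # shape (num_frames,)
#   valid = d[~np.isnan(d)]       # drop dropout frames
```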
- Chessboard not detected: improve lighting, slow down board motion, confirm `chessboard_size`/`square_size` match the physical board.
- High calibration error: capture more diverse poses (cover corners and tilt angles), ensure cameras remain fixed, clean lenses.
- Hands appear gray or unmatched: check synchronization, lighting balance, and detection thresholds in `hand_inference.py` (`--det-score-thr`, `--pose-score-thr`).
- Jittery trajectories: recalibrate, verify camera mounts, or apply temporal smoothing to the exported data.
- Large reprojection error: re-run `checkerboard_eval.py` to locate problematic frames/cameras; recalibrate if mean error exceeds a few pixels.
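For the temporal-smoothing suggestion above, a NaN-aware moving average is a reasonable starting point. `smooth_trajectory` is an illustrative helper that averages only detected neighbours and leaves dropout frames as NaN rather than interpolating them.

```python
import numpy as np

def smooth_trajectory(frames, window=5):
    """NaN-aware moving average over a (num_frames, 21, 3) trajectory."""
    half = window // 2
    out = np.full_like(frames, np.nan)
    for i in range(len(frames)):
        if np.isnan(frames[i]).all():
            continue                    # keep dropout frames as NaN
        chunk = frames[max(0, i - half): i + half + 1]
        keep = ~np.isnan(chunk).all(axis=(1, 2))
        out[i] = chunk[keep].mean(axis=0)
    return out

# Usage:
#   frames = np.load("output/tracking/hand_3d_positions.npy")
#   smoothed = smooth_trajectory(frames, window=5)
```

A more principled alternative is a per-landmark Kalman or One Euro filter, but the moving average is often enough to tame triangulation jitter.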
- Python 3.8+ with PyTorch, OpenCV, NumPy, and Matplotlib (see `requirements.txt`).
- For MMPose: requires the MMPose, MMDetection, MMEngine, and MMCV packages plus model checkpoints.
- For MediaPipe: only requires the `mediapipe` package (installed via `requirements.txt`).
- Typical run times on a modern GPU laptop: calibration ≈1 min (25 frames), tracking 5–10 FPS for 4 cameras, evaluation ≈30 FPS for rendering. CPU-only inference works but is significantly slower.
- Use the pickle output for gesture recognition, biomechanics analysis, or downstream machine learning.
- Tune detection settings via `hand_inference.py` flags (score thresholds, max hands) and triangulation heuristics via `hand_triangulation.py` (`--max-hands-total`, `--reproj-rejection`) to balance robustness and speed.
- Extend the evaluator or diagnostics scripts to suit your application (e.g., export CSV summaries or integrate temporal filters).