Multi-Modality-Tracking/QuadTrack

Towards General Multimodal Visual Tracking

Motivation

Existing multimodal tracking studies focus on bimodal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although leveraging complementary cues from two sources yields promising tracking performance, tracking remains challenging in complex scenes because of the inherent limitations of bimodal settings. In this work, we introduce a general multimodal visual tracking task that fully exploits the advantages of four modalities (RGB, thermal infrared, event, and language) for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark.

About QuadTrack600 benchmark

QuadTrack600 comprises 600 video sequences, which total over 384.7K high-resolution (640×480) frame groups. It is the first dataset to integrate four distinct modalities: RGB, thermal infrared, event, and language. In each frame group, all visual modalities are spatially aligned and meticulously annotated with bounding boxes. Additionally, each sequence includes a language description of the target, and 21 sequence-level challenge attributes are provided for detailed performance analysis. QuadTrack600 is highly diverse, featuring 41 different object categories and capturing a wide array of challenging scenarios, including extreme lighting, occlusion, and adverse weather conditions.

Comparison of QuadTrack600 with existing bimodal tracking benchmarks

[Figure: QuadTrack600 compared with existing bimodal tracking benchmarks]

Data collection and alignment

To collect video sequences from all three visual modalities simultaneously, we construct a handheld imaging system. The system consists of two parts: a Hikvision binocular thermal camera for acquiring paired visible and thermal infrared video sequences, and a DVS (Dynamic Vision Sensor) for acquiring event stream data. By manually adjusting the optical axes of the devices, we obtain a common field of view for all three modalities.

Due to differences in imaging hardware among sensors, we must register the multimodal video sequences both temporally and spatially.

For temporal registration, since the visible and thermal infrared sequences are pre-calibrated by the imaging hardware, we can synchronize the three modalities by adjusting only the event modality data. Specifically, we set a fixed time window for the event stream data, based on the recording frame rate (25 Hz) of the binocular thermal camera, and map the event points in each window to a common plane for alignment with the other modalities. By manually processing each video sequence individually, we achieve accurate temporal registration among the three modalities.

For spatial registration, we convert the temporally registered three-modality video into image frames and select the thermal infrared image, which has the lowest resolution (640 × 480), as the registration target. We then use professional image editing software to crop and scale the visible and event images so that they align precisely with the thermal infrared images.
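The fixed-window step of the temporal registration can be illustrated with a minimal sketch. It is not the authors' tooling. It assumes events arrive as `(t_us, x, y, polarity)` tuples with microsecond timestamps, that the window length equals one frame period (1/25 s), and that `t0_us` stands in for the manually chosen start offset. The function name `bin_events` and the signed-count accumulation are illustrative choices.

```python
from collections import defaultdict

FRAME_RATE_HZ = 25                       # recording frame rate stated in the text
WINDOW_US = 1_000_000 // FRAME_RATE_HZ   # assumed window: one frame period (40,000 µs)


def bin_events(events, width, height, t0_us=0):
    """Accumulate events into one 2D plane per fixed time window.

    events: iterable of (t_us, x, y, polarity) tuples (assumed layout).
    t0_us:  hypothetical start offset, standing in for the manual adjustment.
    Returns {frame_index: height x width list of signed event counts}.
    """
    frames = defaultdict(lambda: [[0] * width for _ in range(height)])
    for t_us, x, y, polarity in events:
        # Skip events before the chosen start or outside the sensor plane.
        if t_us < t0_us or not (0 <= x < width and 0 <= y < height):
            continue
        index = (t_us - t0_us) // WINDOW_US
        frames[index][y][x] += 1 if polarity > 0 else -1
    return dict(frames)


# Usage: three events, the first two fall into window 0, the last into window 1.
demo = bin_events([(0, 1, 1, 1), (39_999, 1, 1, 1), (40_000, 0, 0, 0)], width=4, height=3)
```

Each resulting plane corresponds to one visible/thermal frame index, which is what makes a frame-by-frame pairing with the other two modalities possible.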

Attributes

Attr Description
01. PO Partial Occlusion - the target object is partially occluded
02. TO Total Occlusion - the target object is totally occluded
03. HO Hyaline Occlusion - the target is occluded by a hyaline (transparent) object
04. OV Out-of-View - the target leaves the camera field of view
05. VC Viewpoint Change - the viewpoint of the target changes
06. CM Camera Motion - the target object is captured by moving camera
07. BC Background Clutter - the background around the target is cluttered
08. SA Similar Appearance - there are objects of similar appearance near the target
09. LI Low Illumination - the illumination in the target region is low
10. OE Over Exposure - the target object is in an overexposed environment
11. IV Illumination Variations - the illumination in the background of the target object varies
12. LR Low Resolution - low resolution of target objects in images
13. DEF Deformation - non-rigid object deformation
14. TC Thermal Crossover - the target object has the same temperature as its surroundings or other objects
15. FL Frame Lost - some thermal infrared frames are lost
16. FM Fast Motion - the motion of the ground truth between two adjacent frames is larger than 20 pixels
17. NM No Motion - the target object is stationary
18. MB Motion Blur - motion of the target object causes blurring of the picture
19. SV Scale Variation - the ratio of the first bounding box and the current bounding box is out of the range [0.5, 2]
20. ARC Aspect Ratio Change - the ratio of bounding box aspect is outside the range [0.5, 2]
21. BOM Background Object Motion - influence of background object motion on the event camera
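Three of the attributes above (FM, SV, ARC) are defined by thresholds on the ground-truth boxes, so they can be checked automatically. The sketch below is an illustration under stated assumptions, not the benchmark's annotation tool: boxes are assumed to be `(x, y, w, h)` tuples, FM is measured on the box center, SV compares box areas, and ARC compares the width/height ratio of the first box with that of the current box. The helper names are hypothetical.

```python
import math


def is_valid(box):
    """A box is usable only if it has positive width and height."""
    return box[2] > 0 and box[3] > 0


def center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def out_of_range(ratio, lo=0.5, hi=2.0):
    return ratio < lo or ratio > hi


def sequence_attributes(boxes, fm_threshold=20.0):
    """Flag FM, SV and ARC for one sequence of (x, y, w, h) boxes.

    Assumes the first box is valid; frames with empty boxes are skipped.
    """
    first = boxes[0]
    first_area = first[2] * first[3]
    first_aspect = first[2] / first[3]
    flags = {"FM": False, "SV": False, "ARC": False}
    for prev, cur in zip(boxes, boxes[1:]):
        if not is_valid(cur):
            continue
        if is_valid(prev):
            (px, py), (cx, cy) = center(prev), center(cur)
            # FM: center displacement between adjacent frames (assumed measure).
            if math.hypot(cx - px, cy - py) > fm_threshold:
                flags["FM"] = True
        # SV: area ratio of first box to current box (assumed measure).
        if out_of_range(first_area / (cur[2] * cur[3])):
            flags["SV"] = True
        # ARC: ratio of first aspect ratio to current aspect ratio (assumed measure).
        if out_of_range(first_aspect / (cur[2] / cur[3])):
            flags["ARC"] = True
    return flags


# Usage: a jump of 30 pixels followed by a 3x enlargement of a square box.
demo_flags = sequence_attributes([(0, 0, 10, 10), (30, 0, 10, 10), (30, 0, 30, 30)])
```

The remaining attributes (occlusion, illumination, thermal crossover, and so on) depend on scene content and are not derivable from the boxes alone.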

Dataset file structure

sequence
├─event
│────000001.jpg
│────000002.jpg
│────000003.jpg

├─infrared
│────000001.jpg
│────000002.jpg
│────000003.jpg

├─visible
│────000001.jpg
│────000002.jpg
│────000003.jpg

├─event.txt
├─infrared.txt
├─init.txt
├─query.txt
├─target_i.jpg
├─target_v.jpg
└─visible.txt
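Given the layout above, a frame group can be assembled by matching file names across the three image folders. The following is a minimal sketch that relies only on the folder and file names shown in the tree. The function name `list_frame_groups` is hypothetical, and the contents of the `.txt` annotation files are deliberately not parsed, because their format is not described here.

```python
from pathlib import Path

# Folder names taken from the directory tree above.
MODALITIES = ("visible", "infrared", "event")


def list_frame_groups(sequence_dir):
    """Return one dict per frame that exists in all three modality folders.

    Each dict maps a modality name to the path of that frame's image.
    """
    root = Path(sequence_dir)
    common = None
    for modality in MODALITIES:
        found = {p.name for p in (root / modality).glob("*.jpg")}
        common = found if common is None else common & found
    return [
        {modality: root / modality / name for modality in MODALITIES}
        for name in sorted(common)
    ]
```

Taking the intersection of file names means a frame missing from any one modality is dropped rather than paired with the wrong image.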

Dataset

  • Download QuadTrack600 from BaiduNetdisk (Password: Quad)
  • Toolkit for evaluation on QuadTrack600: BaiduNetdisk (Password: Quad)
