Existing multimodal tracking studies focus on bimodal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although leveraging complementary cues from two sources yields promising tracking performance, tracking in complex scenes remains challenging because of the inherent limitations of bimodal settings. In this work, we introduce a general multimodal visual tracking task that fully exploits the advantages of four modalities (RGB, thermal infrared, event, and language) for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark.
QuadTrack600 comprises 600 video sequences, which total over 384.7K high-resolution (640×480) frame groups. It is the first dataset to integrate four distinct modalities: RGB, thermal infrared, event, and language. In each frame group, all visual modalities are spatially aligned and meticulously annotated with bounding boxes. Additionally, each sequence includes a language description of the target, and 21 sequence-level challenge attributes are provided for detailed performance analysis. QuadTrack600 is highly diverse, featuring 41 different object categories and capturing a wide array of challenging scenarios, including extreme lighting, occlusion, and adverse weather conditions.
To simultaneously collect video sequences from all three visual modalities, we construct a handheld imaging system. The system consists of two parts: a Hikvision binocular thermal camera for acquiring paired visible and thermal infrared video sequences, and a DVS (Dynamic Vision Sensor) for acquiring event stream data. By manually adjusting the optical axes of the devices, we obtain a common field of view for all three modalities.
Due to differences in imaging hardware among sensors, we must register the multimodal video sequences both temporally and spatially. For temporal registration, since the visible and thermal infrared sequences are pre-calibrated by the imaging hardware, we can synchronize the three modalities by adjusting only the event modality data. Specifically, we set a fixed time window for the event stream data, based on the recording frame rate (25 Hz) of the binocular thermal camera, and map the event points in each window to a common plane for alignment with the other modalities. By manually processing each video sequence individually, we achieve accurate temporal registration among the three modalities. For spatial registration, we convert the temporally registered three-modality video into image frames and select the thermal infrared image, which has the lowest resolution (640 × 480), as the registration target. We then use professional image editing software to crop and scale the visible and event images to ensure precise alignment with the thermal infrared images.
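
The fixed-window step of the temporal registration can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used to build the dataset: events are assumed to arrive as `(t_us, x, y, polarity)` tuples with microsecond timestamps, and the names `events_to_frames`, `WINDOW_US`, and `t0_us` are hypothetical. Only the 25 Hz frame rate comes from the description above; at that rate each window spans 40 ms.

```python
from collections import defaultdict

FRAME_RATE_HZ = 25                      # recording rate of the thermal camera
WINDOW_US = 1_000_000 // FRAME_RATE_HZ  # 40,000 microseconds per window


def events_to_frames(events, width, height, t0_us=0):
    """Accumulate events into one count image per fixed time window.

    `events` is an iterable of (t_us, x, y, polarity) tuples. This layout
    is an assumption for illustration, not the dataset's raw format.
    `t0_us` is a hypothetical start offset standing in for the manual
    per-sequence adjustment of the event data.
    """
    windows = defaultdict(list)
    for t_us, x, y, _polarity in events:
        if t_us < t0_us:
            continue  # drop events recorded before the chosen start time
        windows[(t_us - t0_us) // WINDOW_US].append((x, y))

    frames = {}
    for index, points in windows.items():
        image = [[0] * width for _ in range(height)]
        for x, y in points:
            if 0 <= x < width and 0 <= y < height:
                image[y][x] += 1  # number of events seen at this pixel
        frames[index] = image
    return frames


# Two events fall in the first 40 ms window, one in the second.
events = [(1_000, 2, 1, 1), (39_999, 2, 1, 0), (40_000, 0, 0, 1)]
frames = events_to_frames(events, width=4, height=3)
```

Counting events per pixel is only one way to map events onto a common plane; polarity is ignored here for brevity.
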
| Attr | Description |
|---|---|
| 01. PO | Partial Occlusion - the target object is partially occluded |
| 02. TO | Total Occlusion - the target object is totally occluded |
| 03. HO | Hyaline Occlusion - the target is occluded by a hyaline (transparent) object |
| 04. OV | Out-of-View - the target leaves the camera field of view |
| 05. VC | Viewpoint Change - changes of viewpoint of the target |
| 06. CM | Camera Motion - the target object is captured by a moving camera |
| 07. BC | Background Clutter - the background around the target is cluttered |
| 08. SA | Similar Appearance - there are objects of similar appearance near the target |
| 09. LI | Low Illumination - the illumination in the target region is low |
| 10. OE | Over Exposure - the target object is in an overexposed environment |
| 11. IV | Illumination Variations - the illumination in the background of the target object varies |
| 12. LR | Low Resolution - low resolution of target objects in images |
| 13. DEF | Deformation - non-rigid object deformation |
| 14. TC | Thermal Crossover - the target object has the same temperature as its surroundings or other objects |
| 15. FL | Frame Lost - some thermal infrared frames are lost |
| 16. FM | Fast Motion - the motion of the ground truth between two adjacent frames is larger than 20 pixels |
| 17. NM | No Motion - the target object is in a no motion state |
| 18. MB | Motion Blur - motion of the target object causes blurring of the picture |
| 19. SV | Scale Variation - the ratio of the first bounding box and the current bounding box is out of the range [0.5,2] |
| 20. ARC | Aspect Ratio Change - the ratio of bounding box aspect is outside the range [0.5,2] |
| 21. BOM | Background Object Motion - background object motion affects the event camera |
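
Three of the attributes above (FM, SV, ARC) are defined by numeric thresholds, so they can be checked directly from the bounding boxes. The sketch below is an illustration only, not the annotation procedure behind the benchmark. It assumes boxes are `(x, y, w, h)` tuples, measures motion as the displacement of the box centre, interprets the SV ratio as an area ratio, and compares the aspect ratio of each box against the first one; the table does not specify any of these choices, and all function names are hypothetical.

```python
import math


def center(box):
    """Centre point of an (x, y, w, h) box (assumed layout)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def _out_of_range(ratio, low=0.5, high=2.0):
    return ratio < low or ratio > high


def has_fast_motion(boxes, threshold=20.0):
    """FM: centre moves more than `threshold` pixels between adjacent frames."""
    for prev, curr in zip(boxes, boxes[1:]):
        (px, py), (cx, cy) = center(prev), center(curr)
        if math.hypot(cx - px, cy - py) > threshold:
            return True
    return False


def has_scale_variation(boxes):
    """SV: area ratio of the first box to a later box leaves [0.5, 2].

    The first box is assumed to have positive width and height;
    later boxes with zero size are skipped.
    """
    _, _, w0, h0 = boxes[0]
    return any(
        w > 0 and h > 0 and _out_of_range((w0 * h0) / (w * h))
        for _, _, w, h in boxes[1:]
    )


def has_aspect_ratio_change(boxes):
    """ARC: aspect ratio of the first box over a later box leaves [0.5, 2]."""
    _, _, w0, h0 = boxes[0]
    first = w0 / h0
    return any(
        w > 0 and h > 0 and _out_of_range(first / (w / h))
        for _, _, w, h in boxes[1:]
    )
```

For example, a target whose centre jumps 35 pixels between two consecutive frames would be flagged by `has_fast_motion`, while its unchanged box size would leave `has_scale_variation` and `has_aspect_ratio_change` false.
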
sequence
├─event
│────000001.jpg
│────000002.jpg
│────000003.jpg
│
├─infrared
│────000001.jpg
│────000002.jpg
│────000003.jpg
│
├─visible
│────000001.jpg
│────000002.jpg
│────000003.jpg
│
├─event.txt
├─infrared.txt
├─init.txt
├─query.txt
├─target_i.jpg
├─target_v.jpg
└─visible.txt
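
A sequence laid out as above could be read with a small loader like the one below. This is a hedged sketch, not the official toolkit: it assumes each `<modality>.txt` file stores one bounding box per line as comma- or whitespace-separated numbers, and that `query.txt` holds the language description. Neither file format is documented here, and `load_sequence` and `read_boxes` are hypothetical names. `init.txt`, `target_i.jpg`, and `target_v.jpg` are left untouched because their contents are not described.

```python
import re
from pathlib import Path

MODALITIES = ("visible", "infrared", "event")


def read_boxes(path):
    """Parse one box per line; the separator format is an assumption."""
    boxes = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            boxes.append(tuple(float(v) for v in re.split(r"[,\s]+", line)))
    return boxes


def load_sequence(root):
    """Collect frame paths, boxes, and the language query of one sequence."""
    root = Path(root)
    # Zero-padded file names sort into frame order.
    frames = {m: sorted((root / m).glob("*.jpg")) for m in MODALITIES}
    boxes = {m: read_boxes(root / f"{m}.txt") for m in MODALITIES}
    query = (root / "query.txt").read_text().strip()
    return {"frames": frames, "boxes": boxes, "query": query}
```

Calling `load_sequence("path/to/sequence")` returns a dictionary whose `frames` and `boxes` entries are keyed by modality, so the i-th frame group is the i-th element of each list.
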
- Download QuadTrack600 from BaiduNetdisk (Password: Quad)
- Toolkit for evaluation on QuadTrack600: BaiduNetdisk (Password: Quad)