Towards General Multimodal Visual Tracking

Motivation

Existing multimodal tracking studies focus on bi-modal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although leveraging complementary cues from two sources yields promising tracking performance, tracking remains challenging in complex scenes because a bi-modal setup cannot cover every failure case. In this work, we introduce a general multimodal visual tracking task that exploits the advantages of four modalities (RGB, thermal infrared, event, and language) for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark.

About QuadTrack600 benchmark

QuadTrack600 comprises 600 video sequences, which total over 384.7K high-resolution (640×480) frame groups. It is the first dataset to integrate four distinct modalities: RGB, thermal infrared, event, and language. In each frame group, all visual modalities are spatially aligned and meticulously annotated with bounding boxes. Additionally, each sequence includes a language description of the target, and 21 sequence-level challenge attributes are provided for detailed performance analysis. QuadTrack600 is highly diverse, featuring 41 different object categories and capturing a wide array of challenging scenarios, including extreme lighting, occlusion, and adverse weather conditions.

Comparison of QuadTrack600 with existing bi-modal tracking benchmarks

Data collection and alignment

To simultaneously collect video sequences from all three visual modalities, we construct a handheld imaging system. The system consists of two parts: a Hikvision binocular thermal camera for acquiring paired visible and thermal infrared video sequences, and a DVS (Dynamic Vision Sensor) for acquiring event stream data. By manually adjusting the optical axes of the devices, we obtain a common field of view for all three modalities.

Due to differences in imaging hardware among sensors, we must register the multimodal video sequences both temporally and spatially. For temporal registration, since the visible and thermal infrared sequences are pre-calibrated by the imaging hardware, we can synchronize the three modalities by adjusting only the event data. Specifically, we set a fixed time window for the event stream based on the recording frame rate (25 Hz) of the binocular thermal camera, and map the event points in each window to a common plane for alignment with the other modalities. By manually processing each video sequence individually, we achieve accurate temporal registration among the three modalities. For spatial registration, we convert the temporally registered three-modality video into image frames and select the thermal infrared image, which has the lowest resolution (640 × 480), as the registration target. We then use professional image editing software to crop and scale the visible and event images to ensure precise alignment with the thermal infrared images.
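The temporal registration step described above can be illustrated with a minimal sketch that groups events into fixed windows matching the 25 Hz frame rate and accumulates each window onto the image plane. This is not the authors' tool: the event tuple layout `(t_us, x, y, polarity)`, the function name `accumulate_events`, and the `t0_us` offset (standing in for the manual per-sequence adjustment) are all assumptions made for illustration.

```python
from collections import defaultdict

FRAME_RATE_HZ = 25                       # recording rate of the thermal camera
WINDOW_US = 1_000_000 // FRAME_RATE_HZ   # 40,000 microseconds per window


def accumulate_events(events, width=640, height=480, t0_us=0):
    """Group events into fixed time windows and build a per-pixel count map.

    events: iterable of (t_us, x, y, polarity) tuples (hypothetical layout).
    t0_us:  assumed start offset, standing in for the manual adjustment.
    Returns {frame_index: {(x, y): net_polarity_count}}.
    """
    frames = defaultdict(lambda: defaultdict(int))
    for t_us, x, y, polarity in events:
        # Skip events before the chosen start time or outside the image plane.
        if t_us < t0_us or not (0 <= x < width and 0 <= y < height):
            continue
        frame_index = (t_us - t0_us) // WINDOW_US
        frames[frame_index][(x, y)] += 1 if polarity > 0 else -1
    return {i: dict(pixels) for i, pixels in frames.items()}
```

Each returned frame index then corresponds to one visible/thermal frame; rendering the count map to an image is left out of the sketch.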

Attributes

Attr	Description
01. PO	Partial Occlusion - the target object is partially occluded
02. TO	Total Occlusion - the target object is totally occluded
03. HO	Hyaline Occlusion - the target is occluded by a hyaline (transparent) object
04. OV	Out-of-View - the target leaves the camera field of view
05. VC	Viewpoint Change - changes of viewpoint of the target
06. CM	Camera Motion - the target object is captured by a moving camera
07. BC	Background Clutter - the background around the target is cluttered
08. SA	Similar Appearance - there are objects of similar appearance near the target
09. LI	Low Illumination - the illumination in the target region is low
10. OE	Over Exposure - the target object is in an overexposed environment
11. IV	Illumination Variations - the illumination in the background of the target object varies
12. LR	Low Resolution - low resolution of target objects in images
13. DEF	Deformation - non-rigid object deformation
14. TC	Thermal Crossover - the target object has the same temperature as its surroundings or other objects
15. FL	Frame Lost - some thermal infrared frames are lost
16. FM	Fast Motion - the motion of the ground truth between two adjacent frames is larger than 20 pixels
17. NM	No Motion - the target object is in a no motion state
18. MB	Motion Blur - motion of the target object causes blurring of the picture
19. SV	Scale Variation - the ratio of the first bounding box to the current bounding box is outside the range [0.5, 2]
20. ARC	Aspect Ratio Change - the ratio of the bounding box aspect ratios is outside the range [0.5, 2]
21. BOM	Background Object Motion - background object motion affects the event camera
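The three attributes defined by numeric thresholds (FM, SV, ARC) can be checked directly from ground-truth boxes. The sketch below is one possible reading of those definitions, not the benchmark's annotation procedure. It assumes boxes are `(x, y, w, h)` tuples, measures motion as the displacement of the box center, and compares the first and current boxes by area for SV and by width/height aspect ratio for ARC. All function names are hypothetical.

```python
import math


def _center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def _valid(box):
    # Boxes with zero width or height (e.g. target absent) are skipped.
    return box[2] > 0 and box[3] > 0


def has_fast_motion(boxes, threshold=20.0):
    """FM: center displacement between adjacent frames exceeds the threshold."""
    for prev, cur in zip(boxes, boxes[1:]):
        if not (_valid(prev) and _valid(cur)):
            continue
        (px, py), (cx, cy) = _center(prev), _center(cur)
        if math.hypot(cx - px, cy - py) > threshold:
            return True
    return False


def has_scale_variation(boxes, low=0.5, high=2.0):
    """SV: area ratio first/current leaves [low, high] (area is an assumption)."""
    first_area = boxes[0][2] * boxes[0][3]
    return any(
        not (low <= first_area / (b[2] * b[3]) <= high)
        for b in boxes[1:] if _valid(b)
    )


def has_aspect_ratio_change(boxes, low=0.5, high=2.0):
    """ARC: ratio of first to current aspect ratio (w/h) leaves [low, high]."""
    first_aspect = boxes[0][2] / boxes[0][3]
    return any(
        not (low <= first_aspect / (b[2] / b[3]) <= high)
        for b in boxes[1:] if _valid(b)
    )
```

All three functions expect the first box in the list to be valid, since it serves as the reference for the SV and ARC ratios.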

Dataset file structure

sequence
├─event
│────000001.jpg
│────000002.jpg
│────000003.jpg
│
├─infrared
│────000001.jpg
│────000002.jpg
│────000003.jpg
│
├─visible
│────000001.jpg
│────000002.jpg
│────000003.jpg
│
├─event.txt
├─infrared.txt
├─init.txt
├─query.txt
├─target_i.jpg
├─target_v.jpg
└─visible.txt
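Given the layout above, a sequence can be walked by pairing identically named frames across the three image folders. The following is a minimal sketch using only the folder and file names shown in the tree. The helper names `list_frame_groups` and `read_query` are hypothetical, and reading `query.txt` as the language description is an assumption based on its name. The format of the annotation `.txt` files is not documented here, so the sketch does not parse them.

```python
from pathlib import Path

MODALITIES = ("visible", "infrared", "event")


def list_frame_groups(sequence_dir):
    """Return one dict per frame group, mapping modality name to image path.

    Only frames present in all three modality folders are kept.
    """
    seq = Path(sequence_dir)
    common = None
    for modality in MODALITIES:
        names = {p.name for p in (seq / modality).glob("*.jpg")}
        common = names if common is None else common & names
    return [{m: seq / m / name for m in MODALITIES} for name in sorted(common)]


def read_query(sequence_dir):
    """Read query.txt, assumed here to hold the language description."""
    return (Path(sequence_dir) / "query.txt").read_text(encoding="utf-8").strip()
```

For example, `list_frame_groups("sequence")[0]["infrared"]` would point at `sequence/infrared/000001.jpg` for the tree shown above.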

Dataset

Download QuadTrack600 from BaiduNetdisk (Password: Quad)
Toolkit for evaluation on QuadTrack600: BaiduNetdisk (Password: Quad)
