Among all attribute conditions, our method achieves state-of-the-art performance on 16 attributes. Specifically, it significantly outperforms existing state-of-the-art algorithms on the PO, HO, BC, SA, and FM attribute subsets, with improvements of 3.5%/3.7%/2.5%, 4.3%/3.6%/2.4%, 4.4%/1.1%/3.4%, 5.5%/5.4%/4.4%, and 3.2%/3.7%/3.5% in the PR/NPR/SR metrics, respectively. We attribute the improvements in these challenging scenarios to the effective quality-aware multimodal fusion and the successful exploitation of spatio-temporal information. However, our method does not perform as well as other RGBT trackers in the OV attribute scenario.
OV indicates that the tracked target leaves the camera's field of view, meaning that the target disappears and is no longer present within the ground-truth bounding box.
Since the proposed QTWM measures token quality based on the correlation between the template and the search region, it may attend to the most similar object when the target is absent from the search area. Nevertheless, by leveraging spatio-temporal information, the tracker can quickly recover and resume tracking the correct target once it reappears.
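
The OV failure mode described above can be illustrated with a toy sketch. This is not the QTWM implementation: the function `token_quality`, the cosine-similarity scoring, the softmax temperature, and the three-dimensional feature vectors are all assumptions chosen only to show why a correlation-based weighting may favor a distractor when the target is absent.

```python
import numpy as np

def token_quality(template, search, temperature=0.1):
    """Hypothetical quality score: softmax over the cosine similarity between
    each search token and the mean template token (an assumed stand-in for a
    template-search correlation, not the paper's QTWM)."""
    t = template.mean(axis=0)
    t = t / np.linalg.norm(t)
    s = search / np.linalg.norm(search, axis=1, keepdims=True)
    logits = (s @ t) / temperature
    logits = logits - logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()                      # weights sum to 1 over search tokens

# Toy features (assumed): the distractor resembles the target, the background does not.
target = np.array([1.0, 0.0, 0.0])
distractor = np.array([0.8, 0.6, 0.0])
background = np.array([0.0, 0.0, 1.0])

template = np.stack([target, target])

# Target visible: search tokens are [background, target, distractor].
search_present = np.stack([background, target, distractor])
# Target out of view: search tokens are [background, distractor, background].
search_absent = np.stack([background, distractor, background])

w_present = token_quality(template, search_present)
w_absent = token_quality(template, search_absent)

print("target present, top token index:", int(w_present.argmax()))
print("target absent,  top token index:", int(w_absent.argmax()))
print("weight on distractor when target is absent:", float(w_absent[1]))
```

Because the weights are normalized over the search tokens, the most similar token always receives a large share of the weight, even when no token corresponds to the true target. In this toy setting, the distractor dominates once the target leaves the view, which is consistent with the behavior discussed above.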