Quality-aware Spatio-temporal Transformer Network for RGBT Tracking

Zhaodong Ding, Chenglong Li, Tao Wang, Futian Wang
Anhui University
Corresponding Author

Abstract

Transformer-based RGBT tracking has attracted much attention due to the strong modeling capacity of self-attention and cross-attention mechanisms. These attention mechanisms exploit the correlations among tokens to construct powerful feature representations, but are easily affected by low-quality tokens. To address this issue, we propose a novel Quality-aware Spatio-temporal Transformer Network (QSTNet) for robust RGBT tracking, which computes quality weights for search-region tokens based on their correlation with multimodal template tokens, suppressing the negative effects of low-quality tokens in spatio-temporal feature representations. In particular, we argue that the correlation between the search tokens of one modality and the multimodal template tokens reflects the quality of those search tokens, and thus design the Quality-aware Token Weighting Module (QTWM), built on the correlation matrix between search and template tokens, to suppress the negative effects of low-quality tokens. Specifically, we compute a difference matrix from the attention matrices between the search tokens of each modality and the multimodal template tokens, and then assign a quality weight to each search token based on this difference matrix, which reflects the relative correlation of search tokens from different modalities to the multimodal template tokens. In addition, we propose the Prompt-based Spatio-temporal Encoder Module (PSEM) to exploit spatio-temporal multimodal information while alleviating the impact of low-quality spatio-temporal features. Extensive experiments on four RGBT benchmark datasets demonstrate that the proposed QSTNet outperforms other state-of-the-art tracking methods.
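To make the token-weighting idea concrete, below is a minimal PyTorch sketch of QTWM-style quality weighting, not the exact formulation used in QSTNet: it assumes flattened token embeddings, uses scaled dot-product attention from each modality's search tokens to the shared multimodal template tokens, and turns the per-token difference in peak correlation into complementary weights via a sigmoid. The function name `qtwm_weights` and the sigmoid/peak aggregation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def qtwm_weights(search_rgb, search_tir, template_mm):
    """Sketch of quality-aware token weighting (illustrative, not the paper's code).

    search_rgb, search_tir: (B, Ns, C) search-region tokens of each modality.
    template_mm:            (B, Nt, C) multimodal template tokens.
    Returns complementary per-token weights of shape (B, Ns).
    """
    scale = search_rgb.shape[-1] ** -0.5
    # Attention matrices of each modality's search tokens over the
    # multimodal template tokens: (B, Ns, Nt).
    attn_rgb = F.softmax(search_rgb @ template_mm.transpose(-2, -1) * scale, dim=-1)
    attn_tir = F.softmax(search_tir @ template_mm.transpose(-2, -1) * scale, dim=-1)
    # Per-token peak correlation to the template, compared across modalities;
    # this scalar difference stands in for the difference matrix in the abstract.
    diff = attn_rgb.amax(dim=-1) - attn_tir.amax(dim=-1)  # (B, Ns)
    w_rgb = torch.sigmoid(diff)   # high where RGB correlates better
    w_tir = 1.0 - w_rgb           # high where TIR correlates better
    return w_rgb, w_tir

# Usage: suppress low-quality tokens before fusing the two modalities.
B, Ns, Nt, C = 2, 256, 64, 768
s_rgb, s_tir = torch.randn(B, Ns, C), torch.randn(B, Ns, C)
t_mm = torch.randn(B, Nt, C)
w_rgb, w_tir = qtwm_weights(s_rgb, s_tir, t_mm)
fused = w_rgb.unsqueeze(-1) * s_rgb + w_tir.unsqueeze(-1) * s_tir
```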

Framework


The overall framework of our method. The dynamic modal fusion encoder is constructed by integrating the QTWM into the encoder layers. The Prompt-based Spatio-temporal Encoder Module (PSEM) is built by inserting the spatio-temporal prompt generator into the 12th encoder layer. During tracking, the PSEM mines multimodal spatio-temporal cues to enhance target features, and the resulting multimodal spatio-temporal tokens are propagated to subsequent frames, as sketched below.
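As a rough illustration of the PSEM design described above, here is a PyTorch sketch of a spatio-temporal prompt generator; the number of prompt tokens, the additive conditioning on the previous frame's tokens, and the single cross-attention layer are all our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalPromptGenerator(nn.Module):
    """Illustrative sketch of a PSEM-style prompt generator (assumed internals)."""

    def __init__(self, dim=768, num_prompts=8, num_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens, prev_st_tokens=None):
        # frame_tokens:   (B, N, C) tokens from the 12th encoder layer.
        # prev_st_tokens: (B, P, C) spatio-temporal tokens from earlier frames.
        q = self.prompts.expand(frame_tokens.size(0), -1, -1)
        if prev_st_tokens is not None:
            q = q + prev_st_tokens  # condition prompts on temporal history
        # Cross-attend the prompt queries to the current frame's tokens to
        # distill multimodal spatio-temporal cues into a few tokens.
        st_tokens, _ = self.attn(q, frame_tokens, frame_tokens)
        return self.norm(st_tokens)

# Tracking loop: the output tokens are carried to the next frame.
gen = SpatioTemporalPromptGenerator()
st = None
for _ in range(3):  # stand-in for the video frames
    frame_tokens = torch.randn(1, 320, 768)  # encoder-layer output
    st = gen(frame_tokens, st)
```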

Main Results


Attribute-based Performance


Across all attribute conditions, our method achieves state-of-the-art performance on 16 attributes. Specifically, it significantly outperforms existing state-of-the-art algorithms on the PO, HO, BC, SA, and FM attribute subsets, with improvements of 3.5%/3.7%/2.5%, 4.3%/3.6%/2.4%, 4.4%/1.1%/3.4%, 5.5%/5.4%/4.4%, and 3.2%/3.7%/3.5% in the PR/NPR/SR metrics, respectively. We attribute these gains in challenging scenarios to the effective quality-aware multimodal fusion and the successful exploitation of spatio-temporal information. However, our method does not perform as well as other RGBT trackers under the OV attribute. OV indicates that the tracked target leaves the camera's field of view, meaning the target disappears and is no longer present within the ground-truth bounding box. Since the proposed QTWM measures token quality based on the correlation between the template and the search region, it may attend to the most similar object when the target is absent from the search area. Nevertheless, by leveraging spatio-temporal information, the tracker can quickly recover and resume tracking the correct target once it reappears.

Visualization


This page was built using the Academic Project Page Template which was adopted from the Nerfies project page.
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.