TSVT: Token Sparsification Vision Transformer for robust RGB-D salient object detection
| Published in: | Pattern Recognition, Vol. 148, Article 110190 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Ltd, 01.04.2024 |
| Subjects: | |
| ISSN: | 0031-3203 |
| Online access: | Full text |
| Abstract: | Visual transformer-based salient object detection (SOD) models have attracted increasing research attention. However, existing transformer-based RGB-D SOD models usually operate on the full token sequences of RGB-D images and apply the same tokenization process to the appearance and depth modalities, which leads to limited feature richness and inefficiency. To address these limitations, we present a novel token sparsification vision transformer architecture for RGB-D SOD, named TSVT, that explicitly extracts global-local multi-modality features with sparse tokens. TSVT is an asymmetric encoder–decoder architecture comprising a dynamic sparse token encoder that adaptively selects and operates on sparse tokens, along with a multiple cascade aggregation decoder (MCAD) that predicts saliency results. Furthermore, we investigate in depth the differences and similarities between the appearance and depth modalities and develop an interactive diversity fusion module (IDFM) to integrate each pair of multi-modality tokens at different stages. Finally, to comprehensively evaluate the performance of the proposed model, we conduct extensive experiments on seven standard RGB-D SOD benchmarks in terms of five evaluation metrics. The experimental results reveal that the proposed model is more robust and effective than fifteen existing RGB-D SOD models. Moreover, with the sparsification module the complexity of our model is less than half that of the variant without the dynamic sparse token module (DSTM). |
|---|---|
| Highlights: | •An asymmetric encoder-decoder visual transformer network, called TSVT, is proposed. •TSVT adaptively sparsifies tokens to effectively explore global context. •An IDFM is designed to fuse the differences and consistency of multi-modality tokens. •TSVT achieves more robust and effective saliency detection performance. |
| DOI: | 10.1016/j.patcog.2023.110190 |
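The dynamic sparse token module (DSTM) described in the abstract adaptively retains only the most informative tokens rather than processing the full sequence. As a rough, generic illustration of this idea (not the paper's actual method), token sparsification can be sketched as keeping the top-scoring fraction of a token sequence; the magnitude-based scoring function below is a hypothetical stand-in for a learned importance predictor:

```python
def sparsify_tokens(tokens, keep_ratio=0.5):
    """Keep the top-scoring fraction of a token sequence (illustrative sketch).

    tokens: list of token embeddings (each a list of floats).
    keep_ratio: fraction of tokens to retain.
    Returns the kept tokens (in original order) and their indices.
    """
    # Score each token by a simple magnitude proxy (sum of squares);
    # a real model would use a learned importance predictor instead.
    scores = [sum(x * x for x in t) for t in tokens]
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the n_keep highest-scoring tokens, restored to original order
    # so that positional information is preserved for downstream attention.
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:n_keep])
    return [tokens[i] for i in keep], keep

tokens = [[0.1, 0.2], [3.0, 1.0], [0.0, 0.1],
          [2.0, 2.0], [0.5, 0.5], [1.5, 0.0]]
kept, idx = sparsify_tokens(tokens, keep_ratio=0.5)
print(idx)  # the three highest-magnitude tokens: [1, 3, 5]
```

Halving the token count this way roughly quarters the cost of self-attention (quadratic in sequence length), which is consistent with the abstract's reported complexity reduction of more than a factor of two.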