TSVT: Token Sparsification Vision Transformer for robust RGB-D salient object detection
| Published in: | Pattern Recognition, Vol. 148, Article 110190 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Ltd, 01.04.2024 |
| Subjects: | |
| ISSN: | 0031-3203 |
| Online access: | Full text |
| Abstract: | Visual transformer-based salient object detection (SOD) models have attracted increasing research attention. However, existing transformer-based RGB-D SOD models usually operate on the full token sequences of RGB-D images and apply the same tokenization process to the appearance and depth modalities, which leads to limited feature richness and inefficiency. To address these limitations, we present a novel token sparsification vision transformer architecture for RGB-D SOD, named TSVT, that explicitly extracts global-local multi-modality features with sparse tokens. TSVT is an asymmetric encoder–decoder architecture comprising a dynamic sparse token encoder that adaptively selects and operates on sparse tokens, along with a multiple cascade aggregation decoder (MCAD) that predicts saliency results. Furthermore, we investigate in depth the differences and similarities between the appearance and depth modalities and develop an interactive diversity fusion module (IDFM) to integrate each pair of multi-modality tokens at different stages. Finally, to comprehensively evaluate the performance of the proposed model, we conduct extensive experiments on seven standard RGB-D SOD benchmarks in terms of five evaluation metrics. The experimental results reveal that the proposed model is more robust and effective than fifteen existing RGB-D SOD models. Moreover, with the sparsification module the complexity of our model is less than half that of the variant without the dynamic sparse token module (DSTM). |
|---|---|
| Highlights: | •An asymmetric encoder-decoder visual transformer network, called TSVT, is proposed. •TSVT adaptively sparsifies tokens to effectively explore global context. •An IDFM is designed to fuse the differences and consistency of multi-modality tokens. •TSVT achieves more robust and effective saliency detection performance. |
| DOI: | 10.1016/j.patcog.2023.110190 |
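The dynamic sparse token module (DSTM) described in the abstract adaptively retains only the most informative tokens rather than processing the full sequence. As a rough, generic illustration of this idea (not the paper's actual method), token sparsification can be sketched as keeping the top-scoring fraction of a token sequence; the magnitude-based scoring function below is a hypothetical stand-in for a learned importance predictor:

```python
def sparsify_tokens(tokens, keep_ratio=0.5):
    """Keep the top-scoring fraction of a token sequence (illustrative sketch).

    tokens: list of token embeddings (each a list of floats).
    keep_ratio: fraction of tokens to retain.
    Returns the kept tokens (in original order) and their indices.
    """
    # Score each token by a simple magnitude proxy (sum of squares);
    # a real model would use a learned importance predictor instead.
    scores = [sum(x * x for x in t) for t in tokens]
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the n_keep highest-scoring tokens, restored to original order
    # so that positional information is preserved for downstream attention.
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:n_keep])
    return [tokens[i] for i in keep], keep

tokens = [[0.1, 0.2], [3.0, 1.0], [0.0, 0.1],
          [2.0, 2.0], [0.5, 0.5], [1.5, 0.0]]
kept, idx = sparsify_tokens(tokens, keep_ratio=0.5)
print(idx)  # the three highest-magnitude tokens: [1, 3, 5]
```

Halving the token count this way roughly quarters the cost of self-attention (quadratic in sequence length), which is consistent with the abstract's reported complexity reduction of more than a factor of two.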