MHS-VIT: Mamba hybrid self-attention vision transformers for traffic image detection

Bibliographic details
Published in: PloS one, Vol. 20, No. 6, p. e0325962
Main authors: Zhang, Xude; Ou, Weihua; Wu, Xiaoping; Zhang, Changzhen
Format: Journal Article
Language: English
Published: United States: Public Library of Science (PLoS), 30 June 2025
ISSN: 1932-6203
Description
Summary: With the rapid development of intelligent transportation systems, the introduction of the transformer architecture has greatly improved model performance, especially in traffic image detection tasks. However, traditional transformer models incur high computational costs during training and deployment because of the quadratic complexity of their self-attention mechanism, which limits their application in resource-constrained environments. To overcome this limitation, this paper proposes a novel hybrid architecture, Mamba Hybrid Self-Attention Vision Transformers (MHS-VIT), which combines the advantages of the Mamba state-space model (SSM) and the transformer to improve modeling efficiency and accuracy on visual tasks, particularly when processing traffic images. Mamba, an SSM with linear time complexity, can effectively reduce the computational burden without sacrificing performance. The transformer's self-attention mechanism is good at capturing long-distance spatial dependencies in images, which is crucial for understanding complex traffic scenes. Experimental results showed that MHS-VIT performed excellently in traffic image detection tasks: in vehicle detection, pedestrian detection, and traffic sign recognition, the model identified target objects accurately and quickly. Compared with backbone networks of the same scale, MHS-VIT achieved significant improvements in accuracy and in parameter count.
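The complexity contrast drawn in the summary can be illustrated with a minimal sketch. This is not the MHS-VIT implementation: the function names, the identity projections in the attention routine, and the fixed scalars `a`, `b`, `c` are assumptions made for illustration. Mamba itself uses input-dependent (selective) parameters, which this sketch leaves out.

```python
import math

def self_attention(x):
    """Single-head self-attention with identity projections (illustrative only).

    x is a list of n token vectors, each a list of d floats. Every token is
    scored against every other token, so the cost grows as O(n^2 * d).
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Scaled dot-product scores of token i against all n tokens.
        scores = [sum(x[i][k] * x[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(n)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] * x[j][k] for j in range(n)) / total
                    for k in range(d)])
    return out

def linear_ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """Simplified diagonal state-space recurrence with fixed (assumed) scalars.

    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t, one state per channel.
    A single pass over the sequence, so the cost grows as O(n * d).
    """
    d = len(x[0])
    h = [0.0] * d
    out = []
    for token in x:
        h = [a * h[k] + b * token[k] for k in range(d)]
        out.append([c * v for v in h])
    return out

def pairwise_interactions(n):
    """Token-level work per layer: n*n attention scores versus n scan steps."""
    return {"attention": n * n, "ssm_scan": n}
```

For a hypothetical sequence of 196 image patches, `pairwise_interactions(196)` reports 38,416 attention scores against 196 recurrence steps, which is the gap the summary refers to when it contrasts quadratic self-attention with a linear-time SSM.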
Competing Interests: The authors have declared that no competing interests exist.
DOI: 10.1371/journal.pone.0325962