Shuffle window transformer DeepLabV3+: a lightweight convolutional neural network and transformer based hybrid semantic segmentation network

Detailed bibliography
Published in:	Machine learning: science and technology Vol. 6; no. 2; pp. 25039 - 25056
Main authors:	Li, Yane, Chen, Zhichao, Qi, Hongxia, Fan, Ming, Li, Lihua
Format:	Journal Article
Language:	English
Published:	Bristol IOP Publishing 30.06.2025
Subjects:	Accuracy; Artificial neural networks; Complexity; Computer vision; Computing costs; convolutional neural network; DeepLabV3; Neural networks; Semantic segmentation; Semantics; shuffle window transformer; Strip
ISSN:	2632-2153

Description
Summary:	Semantic segmentation is a critical task in computer vision. Constructing semantic segmentation models that combine high accuracy, low spatial occupancy, and low computational complexity remains a challenge. To address this, this paper proposes a semantic segmentation network based on a hybrid convolutional neural network and Transformer architecture, named shuffle window transformer DeepLabV3+ (SWT-DeepLabV3+). The network introduces a new module, called the SWT. With a fixed window size, the SWT integrates window attention (WA) and shuffle WA mechanisms to achieve cross-window global context modeling with linear computational complexity. Additionally, we enhance the atrous spatial pyramid pooling (ASPP) by incorporating strip pooling to construct a strip ASPP, which extracts both regular and irregular multi-scale (MS) features. The network also adopts adaptive spatial feature fusion in the shallow layers, where dynamic adjustment of MS feature weights improves the backbone network’s ability to capture shallow discriminative features. Experimental results on three public datasets (PASCAL VOC 2012, Cityscapes, and CamVid) demonstrate that SWT-DeepLabV3+ delivers outstanding segmentation performance with a lower parameter count and computational cost, validating the model’s capability to achieve efficient processing while maintaining high accuracy.
Bibliography:	MLST-102705.R3
DOI:	10.1088/2632-2153/add853
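
Illustrative note on the summary's linear-complexity claim: this record does not describe the SWT module's internals, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes non-overlapping m x m windows and a ShuffleNet-style strided regrouping as a stand-in for "shuffle WA"; all function names are hypothetical. With a fixed window size, the number of attention pairs grows in proportion to the token count N (N * m^2) rather than N^2, while the shuffled groups mix tokens drawn from several local windows.

```python
def window_of(token, w, m):
    """Return the (row, col) index of the local m x m window holding a token."""
    return ((token // w) // m, (token % w) // m)


def window_partition(h, w, m):
    """Group token indices of an h x w grid into non-overlapping m x m windows."""
    assert h % m == 0 and w % m == 0
    windows = []
    for wi in range(h // m):
        for wj in range(w // m):
            windows.append([(wi * m + i) * w + (wj * m + j)
                            for i in range(m) for j in range(m)])
    return windows


def shuffled_partition(h, w, m):
    """Strided regrouping (an assumed stand-in for the paper's shuffle step).

    Each group still holds m * m tokens, but they are sampled with a stride
    across the grid, so a group spans several local windows.
    """
    assert h % m == 0 and w % m == 0
    gh, gw = h // m, w // m  # number of local windows along each axis
    groups = []
    for oi in range(gh):
        for oj in range(gw):
            groups.append([(i * gh + oi) * w + (j * gw + oj)
                           for i in range(m) for j in range(m)])
    return groups


def attention_pairs(groups):
    """Count query-key pairs when attention is computed inside each group."""
    return sum(len(g) ** 2 for g in groups)


if __name__ == "__main__":
    m = 4  # fixed window size
    for side in (8, 16, 32):
        n = side * side
        windowed = attention_pairs(window_partition(side, side, m))
        print(f"tokens={n} windowed_pairs={windowed} global_pairs={n * n}")
```

Alternating the two partitions keeps the per-layer cost at N * m^2 pairs while letting information cross window boundaries, which is one plausible reading of how window attention and a shuffle variant can be combined.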