Multi-unit stacked architecture: An urban scene segmentation network based on UNet and ShuffleNetv2

Bibliographic Details
Published in: Applied Soft Computing, Vol. 165; p. 112065
Main Authors: Liu, Dian; Du, Jianchao; Li, Chuhan; Yu, Chenglong; Zhang, Mingjin
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.11.2024
ISSN: 1568-4946
Description
Summary: Classic high-accuracy semantic segmentation models typically have a large number of parameters, which makes them unsuitable for deployment on driverless platforms with limited computational power. To balance accuracy against a limited computational budget and enable the use of the classic segmentation model UNet in unmanned driving scenarios, this paper proposes MSA-Net, a multi-unit stacked architecture (MSA) based on UNet and ShuffleNetv2. First, MSA-Net replaces the convolution blocks in the UNet encoder and decoder with stacked basic ShuffleNetv2 units, which greatly reduces computational cost while maintaining high segmentation accuracy. Second, MSA-Net introduces enhanced skip connections that use pointwise convolution and the convolutional block attention module (CBAM) to help the decoder select more relevant and valuable information. Third, MSA-Net adds multi-scale internal connections that extend the receptive fields of the encoder and decoder with little increase in model parameters. Comprehensive experiments show that MSA-Net achieves an optimal balance between accuracy and model complexity on the Cityscapes dataset, with strong generalization on the enhanced PASCAL VOC 2012 dataset. MSA-Net achieves a mean intersection over union (mIoU) of 73.6% and an inference speed of 31.0 frames per second (FPS) on the Cityscapes test dataset. We also propose two other MSA-Net models of different sizes, providing more options for resource-constrained inference.
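The mIoU figure reported in the summary is conventionally computed per class from a confusion matrix and then averaged over classes. The following is a minimal Python sketch of that standard definition, not the authors' evaluation code. The function names, the toy labels, and the ignore value of 255 (a common convention for unlabeled pixels) are assumptions introduced for illustration.

```python
def confusion_matrix(y_true, y_pred, num_classes, ignore_label=255):
    """Build a num_classes x num_classes matrix (rows: ground truth, cols: prediction).

    ignore_label is an assumed convention for pixels excluded from scoring.
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        if t == ignore_label:
            continue  # unlabeled pixel: skipped entirely
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average IoU = TP / (TP + FP + FN) over classes that appear at least once."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # class-c pixels predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp  # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:  # classes absent from both labels and predictions are skipped
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with flattened pixel labels (hypothetical data, 3 classes).
y_true = [0, 0, 1, 1, 2, 255]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
score = mean_iou(cm)  # per-class IoU: 1/2, 2/3, 1 -> mean 13/18
print(f"mIoU = {score:.4f}")
```

In practice the confusion matrix would be accumulated over every image in the evaluation set before the per-class IoU values are averaged.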
•We propose a new lightweight network (MSA-Net) for semantic segmentation of urban scenes.
•MSA-Net is the first to combine UNet and ShuffleNetv2, creating a deeper and lighter encoder–decoder architecture.
•MSA-Net designs enhanced skip connections using pointwise convolution and CBAM.
•MSA-Net proposes multi-scale internal connections to extend the receptive fields of encoder and decoder.
•Despite having only 25% of the parameters of UNet, MSA-Net shows a remarkable accuracy improvement of 10.3% and achieves state-of-the-art results.
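To give a rough sense of why pointwise and depthwise convolutions, the building blocks of ShuffleNetv2-style units, reduce parameter counts, the sketch below counts the weights of a standard 3x3 convolution against a depthwise 3x3 plus pointwise 1x1 pair. This is generic arithmetic rather than a reproduction of MSA-Net: the record does not give layer widths, so the channel count of 64 and the helper name `conv_params` are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1, bias=False):
    """Weight count of a 2D convolution: (c_in / groups) * c_out * k * k, plus optional bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


C = 64  # hypothetical channel width, chosen only for illustration

standard = conv_params(C, C, k=3)             # dense 3x3 convolution: 36864 weights
pointwise = conv_params(C, C, k=1)            # 1x1 convolution: 4096 weights
depthwise = conv_params(C, C, k=3, groups=C)  # one 3x3 filter per channel: 576 weights
separable = depthwise + pointwise             # 4672 weights in total

print(f"standard 3x3:          {standard}")
print(f"depthwise + pointwise: {separable}")
print(f"ratio:                 {separable / standard:.3f}")
```

The ratio depends on the channel width and kernel size, so it should not be read as the source of the 25% figure quoted in the highlights, which refers to the whole network.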
DOI: 10.1016/j.asoc.2024.112065