Multi-unit stacked architecture: An urban scene segmentation network based on UNet and ShuffleNetv2
| Published in: | Applied soft computing Vol. 165; p. 112065 |
|---|---|
| Main Authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 01.11.2024 |
| ISSN: | 1568-4946 |
| Summary: | Classic high-accuracy semantic segmentation models typically come with a large number of parameters, making them unsuitable for deployment on driverless platforms with limited computational power. To strike a balance between accuracy and limited computational budget, and enable the use of the classic segmentation model UNet in unmanned driving scenarios, this paper proposes a multi-unit stacked architecture (MSA), namely, MSA-Net, based on UNet and ShuffleNetv2. First, MSA-Net replaces the convolution blocks in the UNet encoder and decoder with stacked basic ShuffleNetv2 units, which greatly reduces computational cost while maintaining high segmentation accuracy. Second, MSA-Net designs enhanced skip connections using pointwise convolution and convolutional block attention (CBAM) to aid the decoder in selecting more relevant and valuable information. Third, MSA-Net proposes multi-scale internal connections to extend the receptive fields of encoder and decoder with little increase in model parameters. The comprehensive experiments show MSA-Net achieves an optimal balance on the Cityscapes dataset between accuracy and model complexity, with strong generalization on the enhanced PASCAL VOC 2012 dataset. MSA-Net achieves a mean intersection over union (mIoU) of 73.6% and an inference speed of 31.0 frames per second (FPS) on the Cityscapes test dataset. We also propose two other MSA-Net models of different sizes, providing more options for resource-constrained inference.
• We propose a new lightweight network (MSA-Net) for semantic segmentation of urban scenes.
• MSA-Net is the first to combine UNet and ShuffleNetv2, creating a deeper and lighter encoder–decoder architecture.
• MSA-Net designs enhanced skip connections using pointwise convolution and CBAM.
• MSA-Net proposes multi-scale internal connections to extend the receptive fields of encoder and decoder.
• Despite having only 25% of the parameters of UNet, MSA-Net shows a remarkable accuracy improvement of 10.3% and achieves state-of-the-art results. |
|---|---|
| DOI: | 10.1016/j.asoc.2024.112065 |