Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation

Detailed bibliography
Published in: Sensors (Basel, Switzerland), Volume 25, Issue 1, p. 80
Main authors: Yang, Wei-Jong; Wu, Chih-Chen; Yang, Jar-Ferr
Format: Journal Article
Language: English
Publication details: Switzerland: MDPI AG, 01.01.2025
ISSN: 1424-8220
Description
Summary: Precision depth estimation plays a key role in many applications, including 3D scene reconstruction, virtual reality, autonomous driving and human–computer interaction. Through recent advancements in deep learning technologies, monocular depth estimation, with its simplicity, has surpassed traditional stereo camera systems, bringing new possibilities in 3D sensing. In this paper, using a single camera, we propose an end-to-end supervised monocular depth estimation autoencoder, which contains an encoder that mixes a convolutional neural network with vision transformers and an effective adaptive fusion decoder to obtain high-precision depth maps. In the encoder, we construct a multi-scale feature extractor by mixing residual configurations of vision transformers to enhance both local and global information. In the adaptive fusion decoder, we introduce adaptive fusion modules to effectively merge the features of the encoder and the decoder. Lastly, the model is trained using a loss function that aligns with human perception, enabling it to focus on the depth values of foreground objects. The experimental results demonstrate the effective prediction of the depth map from a single-view color image by the proposed autoencoder, which increases the first accuracy rate by about 28% and reduces the root mean square error by about 27% compared to an existing method on the NYU dataset.
DOI: 10.3390/s25010080
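
The abstract above describes a hybrid CNN/vision-transformer encoder paired with a decoder whose adaptive fusion modules merge encoder skip features with decoder features. The following is a minimal, hypothetical PyTorch sketch of that general kind of design; the module names (AdaptiveFusion, DepthAutoencoder), layer sizes, and the sigmoid-gated blending scheme are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' code) of an encoder-decoder depth network with
# adaptive skip fusion. All module names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Merge an encoder skip feature with a decoder feature via a learned gate."""
    def __init__(self, channels):
        super().__init__()
        # The gate predicts per-pixel blending weights from the concatenated features.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, skip, dec):
        w = self.gate(torch.cat([skip, dec], dim=1))     # adaptive weights in [0, 1]
        return self.refine(w * skip + (1.0 - w) * dec)   # weighted blend, then refine

class DepthAutoencoder(nn.Module):
    """Toy CNN encoder + transformer bottleneck + decoder with adaptive fusion."""
    def __init__(self, width=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU())
        # A transformer layer over the coarsest feature map adds global context.
        self.attn = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = AdaptiveFusion(width)
        self.head = nn.Conv2d(width, 1, kernel_size=3, padding=1)

    def forward(self, x):
        s1 = self.enc1(x)                                 # 1/2-resolution skip feature
        s2 = self.enc2(s1)                                # 1/4-resolution feature
        b, c, h, w = s2.shape
        tokens = s2.flatten(2).transpose(1, 2)            # (B, HW, C) tokens for attention
        s2 = self.attn(tokens).transpose(1, 2).reshape(b, c, h, w)
        d1 = self.fuse(s1, self.up(s2))                   # adaptive encoder/decoder fusion
        return F.relu(self.head(self.up(d1)))             # non-negative depth map

if __name__ == "__main__":
    net = DepthAutoencoder()
    print(net(torch.randn(1, 3, 64, 64)).shape)           # torch.Size([1, 1, 64, 64])
```

The sketch keeps only two encoder scales and a single fusion stage to stay short; the paper's multi-scale residual vision-transformer extractor and its perception-aligned training loss are not reproduced here.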