Enhanced encoder–decoder architecture for accurate monocular depth estimation
Estimating depth from a single 2D image is a challenging task due to the lack of stereo or multi-view data, which are typically required for depth perception. In state-of-the-art architectures, the main challenge is to efficiently capture complex objects and fine-grained details, which are often dif...
Uloženo v:
| Vydáno v: | The Visual computer Ročník 41; číslo 12; s. 9487 - 9508 |
|---|---|
| Hlavní autoři: | , , |
| Médium: | Journal Article |
| Jazyk: | angličtina |
| Vydáno: |
Heidelberg
Springer Nature B.V
01.09.2025
|
| Témata: | |
| ISSN: | 0178-2789, 1432-2315 |
| On-line přístup: | Získat plný text |
| Tagy: |
Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
|
| Shrnutí: | Estimating depth from a single 2D image is a challenging task due to the lack of stereo or multi-view data, which are typically required for depth perception. In state-of-the-art architectures, the main challenge is to efficiently capture complex objects and fine-grained details, which are often difficult to predict. This paper introduces a novel deep-learning-based approach using an enhanced encoder–decoder architecture, where the Inception-ResNet-v2 model serves as the encoder. This is the first instance of utilizing Inception-ResNet-v2 as an encoder for monocular depth estimation, demonstrating improved performance over previous models. It incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances. We propose a composite loss function comprising depth loss, gradient edge loss, and Structural Similarity Index Measure loss, with fine-tuned weights to optimize the weighted sum, ensuring a balance across different aspects of depth estimation. Experimental results on the KITTI dataset show that our model achieves a significantly faster inference time of 0.019 s, outperforming vision transformers in efficiency while maintaining good accuracy. On the NYU-Depth V2 dataset, the model establishes state-of-the-art performance, with an absolute relative error of 0.064, a root-mean-square error of 0.228, and an accuracy of 89.3% for δ < 1.25. These metrics demonstrate that our model can accurately and efficiently predict depth even in challenging scenarios, providing a practical solution for real-time applications. |
|---|---|
| Bibliografie: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
| ISSN: | 0178-2789 1432-2315 |
| DOI: | 10.1007/s00371-025-03972-z |