Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-Decoder Networks

Bibliographic details
Published in:	IEEE Robotics and Automation Letters, Vol. 4, No. 2, pp. 445-452
Main authors:	Lu, Chenyang; van de Molengraft, Marinus Jacobus Gerardus; Dubbelman, Gijs
Format:	Journal Article
Language:	English
Publication details:	Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 1 April 2019
Subjects:	Cameras; Cartesian coordinates; Coders; Computer vision for transportation; Feature extraction; Ground truth; Image segmentation; Learning; Mapping; Measurement; Neural networks; Object detection; Occupancy; Segmentation and categorization; Semantic scene understanding; Semantics; Training
ISSN:	2377-3766

Description
Summary:	In this letter, we study and evaluate end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. The network learns to predict four classes as well as a camera-to-bird's-eye-view mapping. At its core, it uses a variational encoder-decoder network that encodes the front-view visual information of the driving scene and then decodes it into a two-dimensional top-view Cartesian coordinate system. Evaluations on Cityscapes show that end-to-end learning of semantic-metric occupancy grids outperforms the deterministic mapping approach with a flat-plane assumption by more than 12% mean intersection-over-union. Furthermore, we show that variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data. Our network achieves real-time inference rates of approximately 35 Hz for an input image with a resolution of 256 × 512 pixels and an output map with 64 × 64 occupancy grid cells on a Titan V GPU.
DOI:	10.1109/LRA.2019.2891028
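
Note on the evaluation metric: the summary reports results as mean intersection-over-union over four classes on a 64 × 64 grid. As a rough illustration of how such a score can be computed, the sketch below compares two integer-labelled grids. It is not the authors' evaluation code. The function name `mean_iou`, the class labels 0 to 3, and the choice to skip classes absent from both grids are all assumptions made for the example, and the article may treat unknown or unobserved cells differently.

```python
import numpy as np


def mean_iou(pred, target, num_classes=4):
    """Mean intersection-over-union between two integer-labelled grids.

    Hypothetical helper: classes that appear in neither grid are skipped
    rather than counted as zero (an assumption, not taken from the article).
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both grids
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy 64 x 64 grids with hypothetical class labels 0..3.
rng = np.random.default_rng(0)
target = rng.integers(0, 4, size=(64, 64))
pred = target.copy()
pred[:8, :] = (pred[:8, :] + 1) % 4  # mislabel every cell in the top 8 rows
print(mean_iou(pred, target))
```

A prediction identical to the target scores 1.0, and mislabelled cells lower the score for both the class they were assigned and the class they should have had.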