Self-Supervised Audio-Visual Feature Learning for Single-modal Incremental Terrain Type Clustering

Detailed bibliography
Published in:	IEEE Access, Vol. 9, p. 1
Main authors:	Ishikawa, Reina; Hachiuma, Ryo; Saito, Hideo
Format:	Journal Article
Language:	English
Published:	Piscataway: IEEE, 01.01.2021; Institute of Electrical and Electronics Engineers (IEEE); The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Ablation; Algorithms; Audio data; Cameras; Clustering; Data mining; Electrical engineering. Electronics. Nuclear engineering; Electronic devices; Feature extraction; Machine learning; Microphones; Modal data; Multi-modal learning; Probabilistic models; Robotics; Robots; Self-supervised; Sensors; Terrain; Terrain type clustering; Testing; TK1-9971; Training; Visualization
ISSN:	2169-3536

Description
Summary:	The key to an accurate understanding of terrain is extracting informative features from multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones are used to collect this multi-modal data. Many studies have explored ways to use them, especially in the robotics field, and some papers have successfully introduced single-modal or multi-modal methods. In practice, however, robots can face extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and the Gaussian mixture model clustering algorithm to image and audio data for terrain type clustering by forcing the features to be closer together in the feature space. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We compared the clustering accuracy of our method with that of a conventional multi-modal terrain type clustering method, and we conducted ablation studies to show the effectiveness of our approach.
DOI:	10.1109/ACCESS.2021.3075582