Action matters for object representation learning from sequences

Bibliographic Details
Title: Action matters for object representation learning from sequences
Authors: Sene, Mohamed, Massamba, Quinton, Jean-Charles, Armetta, Frédéric, Lefort, Mathieu
Contributors: Statistique pour le Vivant et l’Homme (SVH), Laboratoire Jean Kuntzmann (LJK), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA), Systèmes Cognitifs et Systèmes Multi-Agents (SyCoSMA), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), Université de Lyon-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS), ANR-23-CE23-0021,MeSMRise,Apprentissage profond de représentations sensorimotrices multimodales(2023)
Source: European Conference On Artificial Intelligence (ECAI), Artificial Intelligence and Cognition workshop (AIC) ; https://hal.science/hal-05324845 ; European Conference On Artificial Intelligence (ECAI), Artificial Intelligence and Cognition workshop (AIC), Oct 2025, Bologna, Italy ; https://ecai2025.org/
Publisher Information: CCSD
Publication Year: 2025
Collection: Portail HAL de l'Université Lumière Lyon 2
Subject Terms: Tetris, Sequence modeling, Self-supervised learning, Representation learning, Sensorimotor theory, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], [SCCO.COMP]Cognitive science/Computer science
Subject Geographic: Bologna, Italy
Description: International audience ; According to the sensorimotor contingencies theory, which builds on developments in cognitive science, action, and especially the rules governing the changes in sensation caused by action, is essential to construct meaningful representations of the environment. However, most current approaches to self-supervised learning learn static representations from large datasets. In this article, we study how state-of-the-art deep learning methods can learn sensorimotor representations from sequences of interactions in a simplified, deterministic, Tetris-like environment that is only partially observable by the model. In particular, we emphasize the role of action in facilitating the emergence of representations related to the objects present in the environment (here, manipulable Tetris shapes). To that end, several model configurations were compared: a multi-task Transformer, a multi-task State Space Model, and ablated variants omitting action from the input and/or the pretext tasks. All models were trained in a self-supervised setting using predictive tasks over a masked subpart of the input sequence. Results show that explicitly including actions in the input and in the pretext tasks significantly improves the quality of the learned representations, particularly in ambiguous situations. These findings support the relevance of the sensorimotor framework for structuring representation learning.
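Note: the record does not include code; the sketch below is only a minimal illustration (PyTorch assumed) of the kind of pretext task described in the abstract, in which a small Transformer encoder receives an interleaved sequence of partial observations and actions, part of the sequence is masked, and the model is trained to predict the masked observation and action tokens. All dimensions, the grid size, and the action set are hypothetical placeholders, not the paper's actual configuration.

```python
# Minimal sketch of masked prediction over (observation, action) sequences,
# assuming PyTorch. Grid size, action set, and architecture sizes are
# hypothetical illustrations, not the configuration used in the paper.
import torch
import torch.nn as nn

OBS_DIM = 4 * 4      # hypothetical: flattened 4x4 partial view of the board
N_ACTIONS = 4        # hypothetical: e.g. left, right, rotate, drop
SEQ_LEN = 16         # number of (observation, action) steps per sequence
D_MODEL = 64

class SensorimotorMaskedModel(nn.Module):
    """Transformer encoder over interleaved observation/action tokens,
    trained to reconstruct masked observations and masked actions."""
    def __init__(self):
        super().__init__()
        self.obs_in = nn.Linear(OBS_DIM, D_MODEL)
        self.act_in = nn.Embedding(N_ACTIONS + 1, D_MODEL)  # extra id used as [MASK] action
        self.mask_obs = nn.Parameter(torch.zeros(D_MODEL))  # learned [MASK] observation token
        self.pos = nn.Parameter(torch.zeros(2 * SEQ_LEN, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.obs_head = nn.Linear(D_MODEL, OBS_DIM)    # pretext task 1: predict masked observation
        self.act_head = nn.Linear(D_MODEL, N_ACTIONS)  # pretext task 2: predict masked action

    def forward(self, obs, act, obs_mask, act_mask):
        # obs: (B, T, OBS_DIM), act: (B, T), masks: (B, T) booleans marking hidden steps
        o = self.obs_in(obs)
        o = torch.where(obs_mask.unsqueeze(-1), self.mask_obs.expand_as(o), o)
        a = self.act_in(torch.where(act_mask, torch.full_like(act, N_ACTIONS), act))
        # Interleave tokens as o_1, a_1, o_2, a_2, ...
        x = torch.stack([o, a], dim=2).reshape(obs.size(0), 2 * obs.size(1), D_MODEL) + self.pos
        h = self.encoder(x)
        h_obs, h_act = h[:, 0::2], h[:, 1::2]
        return self.obs_head(h_obs), self.act_head(h_act)

# Toy training step on random data, standing in for Tetris interaction sequences.
model = SensorimotorMaskedModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs = torch.rand(8, SEQ_LEN, OBS_DIM)
act = torch.randint(0, N_ACTIONS, (8, SEQ_LEN))
obs_mask = torch.rand(8, SEQ_LEN) < 0.25
act_mask = torch.rand(8, SEQ_LEN) < 0.25
pred_obs, pred_act = model(obs, act, obs_mask, act_mask)
loss = nn.functional.mse_loss(pred_obs[obs_mask], obs[obs_mask]) \
     + nn.functional.cross_entropy(pred_act[act_mask], act[act_mask])
loss.backward()
opt.step()
```

Under these assumptions, the ablations described in the abstract would correspond to dropping the action tokens from the input and/or removing the action-prediction head from the pretext losses.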
Document Type: conference object
Language: English
Availability: https://hal.science/hal-05324845
https://hal.science/hal-05324845v1/document
https://hal.science/hal-05324845v1/file/Article_ECAI-AIC.pdf
Rights: http://creativecommons.org/licenses/by/ ; info:eu-repo/semantics/OpenAccess
Accession Number: edsbas.5A1B3E59
Database: BASE