ImageInThat: Manipulating Images to Convey User Instructions to Robots

Bibliographic Details
Published in: 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 757–766
Main Authors: Mahadevan, Karthik, Lewis, Blaine, Li, Jiannan, Mutlu, Bilge, Tang, Anthony, Grossman, Tovi
Format: Conference Proceeding
Language: English
Published: IEEE, 04.03.2025
Subjects:
Online Access: Get full text
Abstract Foundation models are rapidly improving the capability of robots to perform everyday tasks autonomously, such as meal preparation, yet robots will still need to be instructed by humans due to limits in model performance, the difficulty of capturing user preferences, and the need for user agency. Robots can be instructed using various methods: natural language conveys immediate instructions but can be abstract or ambiguous, whereas end-user programming supports longer-horizon tasks but its interfaces struggle to capture user intent. In this work, we propose direct manipulation of images as an alternative paradigm for instructing robots, and introduce a specific instantiation called ImageInThat, which allows users to directly manipulate images in a timeline-style interface to generate robot instructions. Through a user study, we demonstrate the efficacy of ImageInThat for instructing robots in kitchen manipulation tasks, comparing it to a text-based natural language instruction method. The results show that participants were faster with ImageInThat and preferred to use it over the text-based method. Supplementary material including code can be found at: https://image-in-that.github.io/.
Author_xml – sequence: 1
  givenname: Karthik
  surname: Mahadevan
  fullname: Mahadevan, Karthik
  email: karthikm@dgp.toronto.edu
  organization: University of Toronto, Department of Computer Science, Toronto, Canada
– sequence: 2
  givenname: Blaine
  surname: Lewis
  fullname: Lewis, Blaine
  email: blaine@dgp.toronto.edu
  organization: University of Toronto, Department of Computer Science, Toronto, Canada
– sequence: 3
  givenname: Jiannan
  surname: Li
  fullname: Li, Jiannan
  email: jiannanli@smu.edu.sg
  organization: School of Computing & Information Systems, Singapore Management University, Singapore, Singapore
– sequence: 4
  givenname: Bilge
  surname: Mutlu
  fullname: Mutlu, Bilge
  email: bilge@cs.wisc.edu
  organization: University of Wisconsin-Madison, Department of Computer Sciences, Madison, USA
– sequence: 5
  givenname: Anthony
  surname: Tang
  fullname: Tang, Anthony
  email: tonyt@smu.edu.sg
  organization: School of Computing & Information Systems, Singapore Management University, Singapore, Singapore
– sequence: 6
  givenname: Tovi
  surname: Grossman
  fullname: Grossman, Tovi
  email: tovi@dgp.toronto.edu
  organization: University of Toronto, Department of Computer Science, Toronto, Canada
ContentType Conference Proceeding
DOI 10.1109/HRI61500.2025.10974179
EISBN 9798350378931
EndPage 766
ExternalDocumentID 10974179
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation
  grantid: IIS-1925043
  funderid: 10.13039/100000001
IsPeerReviewed false
IsScholarly false
PageCount 10
PublicationCentury 2000
PublicationDate 2025-March-4
PublicationDateYYYYMMDD 2025-03-04
PublicationDecade 2020
PublicationTitle 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI)
PublicationTitleAbbrev HRI
PublicationYear 2025
Publisher IEEE
StartPage 757
SubjectTerms Codes
direct manipulation
end-user robot programming
Faces
Foundation models
Human-robot interaction
Natural languages
Prototypes
robot instruction following
Robot programming
Robots
URI https://ieeexplore.ieee.org/document/10974179