ImageInThat: Manipulating Images to Convey User Instructions to Robots

Bibliographic Details
Published in:	2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 757-766
Main Authors:	Mahadevan, Karthik; Lewis, Blaine; Li, Jiannan; Mutlu, Bilge; Tang, Anthony; Grossman, Tovi
Format:	Conference Proceeding
Language:	English
Published:	IEEE, 4 March 2025
Subjects:	Codes; direct manipulation; end-user robot programming; Faces; Foundation models; Human-robot interaction; Natural languages; Prototypes; robot instruction following; Robot programming; Robots

Abstract	Foundation models are rapidly improving the capability of robots to perform everyday tasks, such as meal preparation, autonomously, yet robots will still need to be instructed by humans due to model performance, the difficulty of capturing user preferences, and the need for user agency. Robots can be instructed using various methods: natural language conveys immediate instructions but can be abstract or ambiguous, whereas end-user programming supports longer-horizon tasks, but its interfaces face difficulties in capturing user intent. In this work, we propose using direct manipulation of images as an alternative paradigm to instruct robots, and introduce a specific instantiation called ImageInThat, which allows users to perform direct manipulation on images in a timeline-style interface to generate robot instructions. Through a user study, we demonstrate the efficacy of ImageInThat to instruct robots in kitchen manipulation tasks, comparing it to a text-based natural language instruction method. The results show that participants were faster with ImageInThat and preferred to use it over the text-based method. Supplementary material including code can be found at: https://image-in-that.github.io/.
Author	Tang, Anthony; Mahadevan, Karthik; Lewis, Blaine; Mutlu, Bilge; Grossman, Tovi; Li, Jiannan
Author_xml	– sequence: 1 givenname: Karthik surname: Mahadevan fullname: Mahadevan, Karthik email: karthikm@dgp.toronto.edu organization: University of Toronto, Department of Computer Science, Toronto, Canada
	– sequence: 2 givenname: Blaine surname: Lewis fullname: Lewis, Blaine email: blaine@dgp.toronto.edu organization: University of Toronto, Department of Computer Science, Toronto, Canada
	– sequence: 3 givenname: Jiannan surname: Li fullname: Li, Jiannan email: jiannanli@smu.edu.sg organization: School of Computing & Information Systems, Singapore Management University, Singapore, Singapore
	– sequence: 4 givenname: Bilge surname: Mutlu fullname: Mutlu, Bilge email: bilge@cs.wisc.edu organization: University of Wisconsin-Madison, Department of Computer Sciences, Madison, USA
	– sequence: 5 givenname: Anthony surname: Tang fullname: Tang, Anthony email: tonyt@smu.edu.sg organization: School of Computing & Information Systems, Singapore Management University, Singapore, Singapore
	– sequence: 6 givenname: Tovi surname: Grossman fullname: Grossman, Tovi email: tovi@dgp.toronto.edu organization: University of Toronto, Department of Computer Science, Toronto, Canada
ContentType	Conference Proceeding
DOI	10.1109/HRI61500.2025.10974179
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present
EISBN	9798350378931
EndPage	766
ExternalDocumentID	10974179
Genre	orig-research
GrantInformation_xml	– fundername: National Science Foundation grantid: IIS-1925043 funderid: 10.13039/100000001
ISICitedReferencesCount	0
IngestDate	Thu May 29 05:57:37 EDT 2025
IsPeerReviewed	false
IsScholarly	false
Language	English
PageCount	10
ParticipantIDs	ieee_primary_10974179
PublicationCentury	2000
PublicationDate	2025-March-4
PublicationDateYYYYMMDD	2025-03-04
PublicationDate_xml	– month: 03 year: 2025 text: 2025-March-4 day: 04
PublicationDecade	2020
PublicationTitle	2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI)
PublicationTitleAbbrev	HRI
PublicationYear	2025
Publisher	IEEE
Publisher_xml	– name: IEEE
SourceID	ieee
SourceType	Publisher
StartPage	757
Title	ImageInThat: Manipulating Images to Convey User Instructions to Robots
URI	https://ieeexplore.ieee.org/document/10974179