MotIF: Motion Instruction Fine-Tuning

Bibliographic Details
Published in: IEEE Robotics and Automation Letters, Vol. 10, No. 3, pp. 2287-2294
Main Authors: Hwang, Minyoung; Hejna, Joey; Sadigh, Dorsa; Bisk, Yonatan
Format: Journal Article
Language: English
Published: Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.03.2025
ISSN: 2377-3766
Abstract While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs often use single frames, and thus cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an input of multiple frames, they still fail to correctly detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information such as the path the robot takes by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories with motion descriptions. MotIF assesses the success of robot motion given task and motion instructions. Our model significantly outperforms state-of-the-art API-based single-frame VLMs and video LMs by at least twice in F1 score with high precision and recall, generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in ranking trajectories on how they align with task and motion descriptions.
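To illustrate the key idea described in the abstract (overlaying a keypoint trajectory on the final image so that a single frame carries trajectory-level information), the following is a minimal sketch in Python using OpenCV. It is not the authors' implementation: the function name overlay_trajectory, the synthetic trajectory, and the blue-to-red temporal color coding are assumptions made purely for illustration.

# Minimal sketch (not the authors' code): draw a tracked keypoint's path on the
# final frame, colored from blue (start) to red (end) to encode temporal order.
# Assumes `frames` is a list of HxWx3 uint8 BGR images and `keypoints` is a
# (T, 2) array of per-frame (x, y) positions of one tracked point.
import numpy as np
import cv2

def overlay_trajectory(frames, keypoints):
    canvas = frames[-1].copy()
    pts = np.asarray(keypoints, dtype=float)
    T = len(pts)
    for t in range(1, T):
        alpha = t / (T - 1)                                     # 0 at start, 1 at end
        color = (int(255 * (1 - alpha)), 0, int(255 * alpha))   # BGR: blue -> red
        p0 = (int(pts[t - 1, 0]), int(pts[t - 1, 1]))
        p1 = (int(pts[t, 0]), int(pts[t, 1]))
        cv2.line(canvas, p0, p1, color, thickness=2)
    end = (int(pts[-1, 0]), int(pts[-1, 1]))
    cv2.circle(canvas, end, radius=4, color=(0, 0, 255), thickness=-1)  # mark endpoint
    return canvas

# Usage with synthetic data: a sinusoidal sweep across a blank 320x240 image.
if __name__ == "__main__":
    frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(30)]
    xs = np.linspace(40, 280, 30)
    ys = 120 + 60 * np.sin(np.linspace(0, 2 * np.pi, 30))
    cv2.imwrite("trajectory_overlay.png",
                overlay_trajectory(frames, np.stack([xs, ys], axis=1)))

An overlay of this kind gives a single-frame VLM access to trajectory-level information, in the spirit of the abstract representation described in the abstract.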
Authors – Hwang, Minyoung; ORCID: 0000-0002-6548-0071; email: myhwang@mit.edu; Massachusetts Institute of Technology, Cambridge, MA, USA
– Hejna, Joey; ORCID: 0009-0008-6339-3426; email: jhejna@stanford.edu; Stanford University, Stanford, CA, USA
– Sadigh, Dorsa; ORCID: 0000-0002-7802-9183; email: dorsa@cs.stanford.edu; Google Deepmind, Mountain View, CA, USA
– Bisk, Yonatan; email: ybisk@andrew.cmu.edu; Carnegie Mellon University, Pittsburgh, PA, USA
CODEN IRALC6
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2025
DOI 10.1109/LRA.2025.3527290
Discipline Engineering
EISSN 2377-3766
EndPage 2294
ExternalDocumentID 10_1109_LRA_2025_3527290
10833748
Genre orig-research
GrantInformation_xml – fundername: Other Transaction Award
  grantid: HR00112490375
– fundername: Friction for Accountability in Conversational Transactions (FACT) Program
– fundername: Defense Advanced Research Projects Agency
  funderid: 10.13039/100000185
ISSN 2377-3766
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
PageCount 8
PublicationDate 2025-03-01
PublicationPlace Piscataway
PublicationTitle IEEE Robotics and Automation Letters
PublicationTitleAbbrev LRA
PublicationYear 2025
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
StartPage 2287
SubjectTerms data sets for robot learning
Descriptions
Detectors
Frames (data processing)
Grounding
Human motion
Image analysis
Intention recognition
Optical flow
Representations
Robot dynamics
Robot learning
Robot motion
Robotics
Robots
semantic scene understanding
Semantics
Success
Tracking
Trajectories
Trajectory
Visualization
Title MotIF: Motion Instruction Fine-Tuning
URI https://ieeexplore.ieee.org/document/10833748
https://www.proquest.com/docview/3161372513
Volume 10