MotIF: Motion Instruction Fine-Tuning
| Published in: | IEEE Robotics and Automation Letters, Vol. 10, No. 3, pp. 2287-2294 |
|---|---|
| Main Authors: | Hwang, Minyoung; Hejna, Joey; Sadigh, Dorsa; Bisk, Yonatan |
| Format: | Journal Article |
| Language: | English |
| Published: | Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.03.2025 |
| Subjects: | Robot learning; Robot motion; data sets for robot learning; semantic scene understanding |
| ISSN: | 2377-3766 |
| Abstract | While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs often use single frames, and thus cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an input of multiple frames, they still fail to correctly detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information such as the path the robot takes by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories with motion descriptions. MotIF assesses the success of robot motion given task and motion instructions. Our model significantly outperforms state-of-the-art API-based single-frame VLMs and video LMs by at least twice in F1 score with high precision and recall, generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in ranking trajectories on how they align with task and motion descriptions. |
|---|---|
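The technical core of the abstract is the representation fed to the VLM: trajectory-level information is compressed into a single image by overlaying the robot's keypoint trajectory on the final frame, and that image is then judged against the task and motion instructions. The snippet below is a minimal sketch of how such an overlay could be rendered; it is not the authors' implementation, and the `overlay_trajectory` helper, its color scheme, and the use of OpenCV are assumptions made here for illustration. It assumes 2D pixel keypoints have already been tracked (e.g., with an off-the-shelf point tracker).

```python
# Illustrative sketch (not the paper's code): render a single-image
# "abstract representation" by overlaying a tracked keypoint trajectory
# on the final frame of an episode.
# Assumes a T x 2 array of pixel keypoints is already available.
import numpy as np
import cv2


def overlay_trajectory(final_frame: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Draw a time-ordered keypoint path onto the final frame.

    final_frame: H x W x 3 uint8 image (last frame of the trajectory).
    keypoints:   T x 2 array of (x, y) pixel positions over T timesteps.
    """
    canvas = final_frame.copy()
    pts = keypoints.astype(int)
    for t in range(1, len(pts)):
        # Grade the color from blue (start) to red (end) so the direction
        # of motion is readable from a single image.
        alpha = t / (len(pts) - 1)
        color = (int(255 * (1 - alpha)), 0, int(255 * alpha))  # BGR
        p0 = (int(pts[t - 1][0]), int(pts[t - 1][1]))
        p1 = (int(pts[t][0]), int(pts[t][1]))
        cv2.line(canvas, p0, p1, color, thickness=2)
    cv2.circle(canvas, (int(pts[0][0]), int(pts[0][1])), 5, (255, 0, 0), -1)    # start marker
    cv2.circle(canvas, (int(pts[-1][0]), int(pts[-1][1])), 5, (0, 0, 255), -1)  # end marker
    return canvas


if __name__ == "__main__":
    frame = np.zeros((240, 320, 3), dtype=np.uint8)  # placeholder final frame
    xs = np.linspace(40, 280, 30)
    ys = 120 + 60 * np.sin(np.linspace(0, 3, 30))
    cv2.imwrite("overlay.png", overlay_trajectory(frame, np.stack([xs, ys], axis=1)))
```

For the trajectory-ranking application mentioned at the end of the abstract, each candidate trajectory would be rendered this way, the fine-tuned VLM would be asked whether the overlaid motion matches the task and motion instructions, and trajectories would be ordered by that score.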
| Author | Hwang, Minyoung; Bisk, Yonatan; Hejna, Joey; Sadigh, Dorsa |
| Author_xml | 1. Minyoung Hwang (myhwang@mit.edu), Massachusetts Institute of Technology, Cambridge, MA, USA; ORCID 0000-0002-6548-0071. 2. Joey Hejna (jhejna@stanford.edu), Stanford University, Stanford, CA, USA; ORCID 0009-0008-6339-3426. 3. Dorsa Sadigh (dorsa@cs.stanford.edu), Google DeepMind, Mountain View, CA, USA; ORCID 0000-0002-7802-9183. 4. Yonatan Bisk (ybisk@andrew.cmu.edu), Carnegie Mellon University, Pittsburgh, PA, USA |
| CODEN | IRALC6 |
| CitedBy_id | 10.1016/j.inffus.2025.103575; 10.1080/01691864.2025.2532610 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2025 |
| DOI | 10.1109/LRA.2025.3527290 |
| Discipline | Engineering |
| EISSN | 2377-3766 |
| EndPage | 2294 |
| ExternalDocumentID | 10_1109_LRA_2025_3527290 10833748 |
| Genre | orig-research |
| GrantInformation_xml | Other Transaction Award, grant HR00112490375; Friction for Accountability in Conversational Transactions (FACT) Program; U.S. Defense Advanced Research Projects Agency (DARPA), funder ID 10.13039/100000185 |
| ISICitedReferencesCount | 2 |
| ISSN | 2377-3766 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 3 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| ORCID | 0000-0002-7802-9183 0000-0002-6548-0071 0009-0008-6339-3426 |
| PQID | 3161372513 |
| PQPubID | 4437225 |
| PageCount | 8 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-03-01 |
| PublicationDecade | 2020 |
| PublicationPlace | Piscataway |
| PublicationTitle | IEEE robotics and automation letters |
| PublicationTitleAbbrev | LRA |
| PublicationYear | 2025 |
| Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 2287 |
| SubjectTerms | data sets for robot learning; Descriptions; Detectors; Frames (data processing); Grounding; Human motion; Image analysis; Intention recognition; Optical flow; Representations; Robot dynamics; Robot learning; Robot motion; Robotics; Robots; semantic scene understanding; Semantics; Success; Tracking; Trajectories; Trajectory; Visualization |
| Title | MotIF: Motion Instruction Fine-Tuning |
| URI | https://ieeexplore.ieee.org/document/10833748 https://www.proquest.com/docview/3161372513 |
| Volume | 10 |