Exploring promptable foundation models for high-resolution video eye tracking in the lab

Bibliographic Details
Title: Exploring promptable foundation models for high-resolution video eye tracking in the lab
Authors: Niehorster, Diederick C., Maquiling, Virmarie, Byrne, Sean, Kasneci, Enkelejda, Nyström, Marcus
Contributors: Lund University, Joint Faculties of Humanities and Theology, Units, Lund University Humanities Lab (Originator); Lund University, Faculty of Social Sciences, Departments of Administrative, Economic and Social Sciences, Department of Psychology (Originator); Lund University, Profile areas and other strong research environments, Lund University Profile areas, LU Profile Area: Natural and Artificial Cognition (Originator); Lund University, Profile areas and other strong research environments, Strategic research areas (SRA), eSSENCE: The e-Science Collaboration (Originator)
Source: ETRA '25, pp. 1-8
Subject Terms: Natural Sciences, Computer and Information Sciences, Human-Computer Interaction
Description: We explore whether SAM2, a vision foundation model, can be used for accurate localization of the eye-image features used in lab-based eye tracking: corneal reflections (CRs), the pupil, and the iris. We prompted SAM2 via a typical hand-annotation process, clicking on the pupil, CR, iris, and sclera in only one image per participant. For the pupil, SAM2 supported better spatial precision in the resulting gaze signals (>44% lower RMS-S2S) than traditional image-processing methods or two state-of-the-art deep-learning tools, but not for the CR and iris. Providing more prompted frames to initialize SAM2 did not improve performance. We conclude that SAM2’s powerful zero-shot segmentation capabilities provide an interesting new avenue to explore in high-resolution lab-based eye tracking. We provide our adaptation of SAM2’s codebase, which allows segmenting videos of arbitrary duration and prepending arbitrary prompting frames. (Illustrative sketches of the prompting workflow and the RMS-S2S measure follow the record below.)
Access URL: https://doi.org/10.1145/3715669.3723118
Database: SwePub
DOI: 10.1145/3715669.3723118
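
The record does not include the authors' code, but the prompting workflow described in the abstract maps naturally onto the publicly released SAM2 video predictor. The sketch below is a minimal illustration under that assumption only; the checkpoint and config paths, frame directory, object IDs, and click coordinates are hypothetical, and treating the sclera click as a negative prompt for the iris is a guess rather than the authors' documented procedure.

    # Minimal sketch: point-prompt SAM2 on one annotated frame, then track
    # pupil / corneal-reflection / iris masks through the rest of the video.
    # All paths, object IDs, and coordinates are hypothetical.
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    CHECKPOINT = "checkpoints/sam2.1_hiera_large.pt"  # hypothetical path
    MODEL_CFG = "configs/sam2.1/sam2.1_hiera_l.yaml"  # hypothetical path
    FRAME_DIR = "participant01_frames/"               # directory of extracted video frames

    predictor = build_sam2_video_predictor(
        MODEL_CFG, CHECKPOINT, device="cuda" if torch.cuda.is_available() else "cpu")
    state = predictor.init_state(video_path=FRAME_DIR)

    # One hand-annotated frame per participant: a positive click (label 1) on each
    # feature; here the sclera click is used as a negative prompt (label 0) for the iris.
    clicks = {
        1: (np.array([[312.0, 240.0]], dtype=np.float32), np.array([1], dtype=np.int32)),  # pupil
        2: (np.array([[355.0, 228.0]], dtype=np.float32), np.array([1], dtype=np.int32)),  # corneal reflection
        3: (np.array([[285.0, 238.0], [430.0, 255.0]], dtype=np.float32),
            np.array([1, 0], dtype=np.int32)),                                             # iris (+ sclera as negative)
    }
    for obj_id, (points, labels) in clicks.items():
        predictor.add_new_points_or_box(
            inference_state=state, frame_idx=0, obj_id=obj_id, points=points, labels=labels)

    # Propagate the prompts through the video and keep each feature's mask centroid.
    centers = {obj_id: [] for obj_id in clicks}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        for i, obj_id in enumerate(obj_ids):
            mask = (mask_logits[i] > 0.0).squeeze().cpu().numpy()
            ys, xs = np.nonzero(mask)
            centers[obj_id].append((xs.mean(), ys.mean()) if xs.size else (np.nan, np.nan))

The precision measure named in the abstract, RMS-S2S, is the root mean square of the Euclidean distances between successive samples of a position signal; lower values mean a less noisy, more precise signal. A small helper, reusing the hypothetical centers dictionary from the sketch above:

    # RMS-S2S: square root of the mean squared sample-to-sample displacement.
    def rms_s2s(xy):
        xy = np.asarray(xy, dtype=float)  # N x 2 array of (x, y) positions
        d = np.diff(xy, axis=0)           # successive sample-to-sample steps
        return np.sqrt(np.nanmean(np.sum(d ** 2, axis=1)))

    pupil_precision = rms_s2s(centers[1])  # e.g., precision of the tracked pupil centers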