The Cross-Linguistic Coordination of Overt Attention and Speech Production as Evidence for a Language of Vision.

Saved in:
Bibliographic Details
Title: The Cross-Linguistic Coordination of Overt Attention and Speech Production as Evidence for a Language of Vision.
Authors: Coco MI (Department of Psychology, Sapienza University of Rome; IRCCS Fondazione Santa Lucia); Fernandes EG (School of Psychology, University of Minho); Arai M (Faculty of Economics, Seijo University); Keller F (School of Informatics, University of Edinburgh)
Source: Cognitive science [Cogn Sci] 2026 Feb; Vol. 50 (2), pp. e70185.
Publication Type: Journal Article
Language: English
Journal Info: Publisher: Wiley-Blackwell; Country of Publication: United States; NLM ID: 7708195; Publication Model: Print; Cited Medium: Internet; ISSN: 1551-6709 (Electronic); Linking ISSN: 0364-0213; NLM ISO Abbreviation: Cogn Sci; Subsets: MEDLINE
Imprint Name(s): Publication: 2009-: Hoboken, N.J. : Wiley-Blackwell
Original Publication: Norwood, N. J., Ablex Pub. Corp.
MeSH Terms: Attention*/physiology; Speech*/physiology; Visual Perception*/physiology; Language*; Linguistics*; Eye Movements/physiology; Humans; Male; Female; Semantics; Adult; Young Adult; Eye-Tracking Technology
Abstract: A central question in cognition is how representations are integrated across different modalities, such as language and vision. One prominent hypothesis posits the existence of an abstract, prelinguistic "language of vision" as a representational system that organizes meaning compositionally, enabling cross-modal integration. This hypothesis predicts that the language of vision operates universally, independent of linguistic surface features such as word order. We conducted eye-tracking experiments where participants described visual scenes in English, Portuguese, and Japanese. By analyzing spoken descriptions alongside eye-movement sequences divided into planning and articulation phases, we demonstrate that semantic similarity between sentences strongly predicts the similarity of associated scan patterns in all three languages, even across scenes and between sentences in different languages. In contrast, the effect of syntactic constraints was secondary and transient: it was restricted to within-language and within-scene comparisons, and temporally confined to the early planning phase of the utterance. Our findings support an interactive account of cross-modal coordination in which a universal language of vision provides stable semantic scaffolding, while syntax serves as a local constraint, primarily active during message linearization.
(© 2026 The Author(s). Cognitive Science published by Wiley Periodicals LLC on behalf of Cognitive Science Society (CSS).)
References: Altmann, G. T., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73, 247–264.
Altmann, G. T., & Kamide, Y. (2007). The real‐time mediation of visual attention by language and world knowledge: Linking anticipatory (and other) eye movements to linguistic processing. Journal of Memory and Language, 57(4), 502–518.
Arai, M., & Keller, F. (2013). The use of verb‐specific information for prediction in sentence processing. Language and Cognitive Processes, 28(4), 525–560.
Bainbridge, W. A., Hall, E. H., & Baker, C. I. (2019). Drawings of real‐world scenes during free recall reveal detailed object and spatial information in memory. Nature Communications, 10(1).
Bar, M. (2004). Visual objects in context. Nature Reviews Neuroscience, 5, 617–629.
Barker, M., Rehrig, G., & Ferreira, F. (2023). Speakers prioritise affordance‐based object semantics in scene descriptions. Language, Cognition and Neuroscience, 38(8), 1045–1067.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.
Bates, D., Mächler, M., Bolker, B. M., & Walker, S. C. (2015). Fitting linear mixed‐effects models using lme4. Journal of Statistical Software, 67, 1–48.
Bock, K., Irwin, D. E., Davidson, D. J., & Levelt, W. J. (2003). Minding the clock. Journal of Memory and Language, 48, 653–685.
Bowers, J. S., Malhotra, G., Dujmović, M., Montero, M. L., Tsvetkov, C., Biscione, V., Puebla, G., Adolfi, F., Hummel, J. E., Heaton, R. F., Evans, B. D., Mitchell, J., & Blything, R. (2023). Deep problems with neural network models of human vision. Behavioral and Brain Sciences, 46, e385.
Brockmole, J. R., & Henderson, J. M. (2006). Using real‐world scenes as contextual cues for search. Visual Cognition, 13, 99–108.
Brown‐Schmidt, S., & Konopka, A. E. (2008). Little houses and casas pequeñas: Message formulation and syntactic form in unscripted speech with speakers of English and Spanish. Cognition, 109(2), 274–280.
Brown‐Schmidt, S., & Tanenhaus, M. K. (2006). Watching the eyes when talking about size: An investigation of message formulation and utterance planning. Journal of Memory and Language, 54(4), 592–609.
Castner, N., Kuebler, T. C., Scheiter, K., Richter, J., Eder, T., Hüttig, F., Keutel, C., & Kasneci, E. (2020). Deep semantic gaze embedding and scanpath comparison for expertise classification during OPT viewing. In ACM Symposium on Eye Tracking Research and Applications (pp. 1–10).
Cavanagh, P. (2021). The language of vision. Perception, 50, 195–215.
Cer, D., Yang, Y., Kong, S.‐y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo‐Cespedes, M., Yuan, S., Tar, C., Sung, Y.‐H., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Cer, D., Yang, Y., Kong, S.‐y., Hua, N., Limtiaco, N., st. John, R., Constant, N., Guajardo‐Cespedes, M., Yuan, S., Tar, C., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 169–174). Brussels, Belgium: Association for Computational Linguistics.
Chen, X., Jiang, M., & Zhao, Q. (2024). Beyond average: Individualized visual scanpath prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 25420–25431).
Chen, X., Qiu, X., & Wang, S. (2025). The role of working memory in structural priming during language comprehension: Evidence from a visual‐world paradigm. Psychonomic Bulletin & Review, 32, 2375–2388.
Cichy, R. M., & Kaiser, D. (2019). Deep neural networks as scientific models. Trends in Cognitive Sciences, 23(4), 305–317.
Clark, H. H., & Krych, M. A. (2004). Speaking while monitoring addressees for understanding. Journal of Memory and Language, 50, 62–81.
Coco, M. I. (2022). Encyclopedia of Behavioral Neuroscience (2nd ed.), volume 1. Elsevier.
Coco, M. I., Araujo, S., & Petersson, K. M. (2017). Disentangling stimulus plausibility and contextual congruency: Electro‐physiological evidence for differential cognitive dynamics. Neuropsychologia, 96, 150–163.
Coco, M. I., Dale, R., & Keller, F. (2018). Performance in a collaborative search task: The role of feedback and alignment. Topics in Cognitive Science, 10, 55–79.
Coco, M. I., & Duran, N. D. (2016). When expectancies collide: Action dynamics reveal the interaction between stimulus plausibility and congruency. Psychonomic Bulletin & Review, 23, 1920–1931.
Coco, M. I., & Keller, F. (2012). Scan patterns predict sentence production in the cross‐modal processing of visual scenes. Cognitive Science, 36, 1204–1223.
Coco, M. I., & Keller, F. (2015). Integrating mechanisms of visual guidance in naturalistic language production. Cognitive Processing, 16, 131–150.
Coco, M. I., Keller, F., & Malcolm, G. L. (2016). Anticipation in real‐world scenes: The role of visual context and visual memory. Cognitive Science, 40, 1995–2024.
Coco, M. I., Nuthmann, A., & Dimigen, O. (2020). Fixation‐related brain potentials during semantic integration of object–scene information. Journal of Cognitive Neuroscience, 32, 571–589.
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language. A new methodology for the real‐time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6, 84–107.
Corballis, M. C. (2017). Language evolution: A changing perspective. Trends in Cognitive Sciences, 21, 229–236.
Coventry, K. R., Gudde, H. B., Diessel, H., Collier, J., Guijarro‐Fuentes, P., Vulchanova, M., Vulchanov, V., Todisco, E., Reile, M., Breunesse, M., Plado, H., Bohnemeyer, J., Bsili, R., Caldano, M., Dekova, R., Donelson, K., Forker, D., Park, Y., Pathak, L. S., Peeters, D., Pizzuto, G., Serhan, B., Apse, L., Hesse, F., Hoang, L., Hoang, P., Igari, Y., Kapiley, K., Haupt‐Khutsishvili, T., Kolding, S., Priiki, K., Mačiukaitytė, I., Mohite, V., Nahkola, T., Tsoi, S. Y., Williams, S., Yasuda, S., Cangelosi, A., Duñabeitia, J. A., Mishra, R. K., Rocca, R., Šķilters, J., Wallentin, M., Žilinskaitė‐Šinkūnienė, E., & Incel, O. D. (2023). Spatial communication systems across languages reflect universal action constraints. Nature Human Behaviour, 7(12), 2099–2110.
Davenport, J. L., & Potter, M. C. (2004). Scene consistency in object and background perception. Psychological Science, 15, 559–564.
Davidoff, J., Davies, I., & Roberson, D. (1999). Colour categories in a stone‐age tribe. Nature, 398(6724), 203–204.
de Marneffe, M.‐C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal dependencies. Computational Linguistics, 47(2), 255–308.
Dell, G. S., Burger, L. K., & Svec, W. R. (1997). Language production and serial order: A functional analysis and a model. Psychological Review, 104(1), 123.
Dimigen, O., Kliegl, R., & Sommer, W. (2012). Trans‐saccadic parafoveal preview benefits in fluent reading: A study with fixation‐related brain potentials. Neuroimage, 62(1), 381–393.
Doerig, A., Sommers, R. P., Seeliger, K., Richards, B., Ismael, J., Lindsay, G. W., Kording, K. P., Konkle, T., van Gerven, M. A., Kriegeskorte, N., & Kietzmann, T. C. (2023). The neuroconnectionist research programme. Nature Reviews Neuroscience, 24, 431–450.
Draschkow, D. & Võ, M. L.‐H. (2017). Scene grammar shapes the way we interact with objects, strengthens memories, and speeds search. Scientific Reports, 7, 16471.
Elliott, D., & de Vries, A. (2015). Describing images using inferred visual dependency representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 42–52).
Elliott, D., & Keller, F. (2013). Image description using visual dependency representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1292–1302).
Evans, K. K., & Baddeley, A. (2018). Intention, attention and long‐term memory for visual scenes: It all depends on the scenes. Cognition, 180, 24–37.
Fodor, J. A. (1975). The language of thought, volume 5. Harvard University Press.
Foulsham, T., Walker, E., & Kingstone, A. (2011). The where, what and when of gaze allocation in the lab and the natural environment. Vision Research, 51, 1920–1931.
Friederici, A. D., & Weissenborn, J. (2007). Mapping sentence form onto meaning: The syntax–semantic interface. Brain Research, 1146, 50–58.
Galati, A., Dale, R., Alviar, C., & Coco, M. I. (2026). Task goals constrain the alignment in eye‐movements and speech during interpersonal coordination. Journal of Memory and Language, 146, 104691.
Gennari, S. P., Sloman, S. A., Malt, B. C., & Fitch, W. T. (2002). Motion events in language and cognition. Cognition, 83(1), 49–79.
Gentner, D. (2003). Language in mind: Advances in the study of language and thought. Cambridge, MA: MIT Press.
Gleitman, L. R., January, D., Nappa, R., & Trueswell, J. C. (2007). On the give and take between event apprehension and utterance formulation. Journal of Memory and Language, 57, 544–569.
Green, P., & MacLeod, C. J. (2016). Simr: An R package for power analysis of generalized linear mixed models by simulation. Methods in Ecology and Evolution, 7(4), 493–498.
Greene, M. R., & Oliva, A. (2009). The briefest of glances: The time course of natural scene understanding. Psychological Science, 20(4), 464–472.
Gregory, R. L. (1974). Concepts and mechanisms of perception. Charles Scribner's Sons.
Griffin, Z. M., & Bock, K. (2000). What the eyes say about speaking. Psychological Science, 11(4), 274–279.
Gusfield, D. (1997). Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press.
Hackl, M. (2013). The syntax–semantics interface. Lingua, 130, 66–87.
Hafri, A., Gleitman, L. R., Landau, B., & Trueswell, J. C. (2022). Where word and world meet: Language and vision share an abstract representation of symmetry. Journal of Experimental Psychology: General, 152, 509–527.
Hafri, A., & Trueswell, J. C. (2025). Apprehending relational events: The visual world paradigm and the interplay of event perception and language. Brain Research, 1869, 150000.
Hahn, M., & Keller, F. (2023). Modeling task effects in human reading with neural network‐based attention. Cognition, 230(1), 1–25.
Haspelmath, M., Dryer, M. S., Gil, D., & Comrie, B. (2008). The world atlas of language structures online. Max Planck Digital Library Munich.
Hayhoe, M. M., McKinney, T., Chajka, K., & Pelz, J. B. (2012). Predictive eye movements in natural vision. Experimental Brain Research, 217, 125–136.
Heinen, R., Bierbrauer, A., Wolf, O. T., & Axmacher, N. (2024). Representational formats of human memory traces. Brain Structure and Function, 229(3), 513–529.
Hespos, S. J., & Spelke, E. S. (2004). Conceptual precursors to language. Nature, 430(6998), 453–456.
Hessels, R. S., Teunisse, M. K., Niehorster, D. C., Nyström, M., Benjamins, J. S., Senju, A., & Hooge, I. T. (2023). Task‐related gaze behaviour in face‐to‐face dyadic collaboration: Toward an interactive theory? Visual Cognition, 31, 291–313.
Honnibal, M., & Johnson, M. (2015). An improved non‐monotonic transition system for dependency parsing. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 (pp. 1373–1378). Association for Computational Linguistics (ACL).
Huettig, F., Olivers, C. N., & Hartsuiker, R. J. (2011). Looking, language, and memory: Bridging research from the visual world and visual search paradigms. Acta Psychologica, 137(2), 138–150.
Jackendoff, R. (1992). Semantic structures, volume 18. MIT Press.
Jackendoff, R. (2025). The parallel architecture in language and elsewhere. Topics in Cognitive Science, 17(4), 822–831.
Kaiser, D., Quek, G. L., Cichy, R. M., & Peelen, M. V. (2019). Object vision in a structured world. Trends in Cognitive Sciences, 23, 672–685.
Kamide, Y., Altmann, G. T., & Haywood, S. L. (2003). The time‐course of prediction in incremental sentence processing: Evidence from anticipatory eye movements. Journal of Memory and Language, 49(1), 133–156.
Kamide, Y., Scheepers, C., & Altmann, G. T. (2003). Integration of syntactic and semantic information in predictive processing: Cross‐linguistic evidence from German and English. Journal of Psycholinguistic Research, 32, 37–55.
Kaunitz, L. N., Kamienkowski, J. E., Varatharajah, A., Sigman, M., Quiroga, R. Q., & Ison, M. J. (2014). Looking for a face in the crowd: Fixation‐related potentials in an eye‐movement visual search task. NeuroImage, 89, 297–305.
Kay, P. & Kempton, W. (1984). What is the Sapir–Whorf hypothesis? American Anthropologist, 86(1), 65–79.
Kazanina, N., & Poeppel, D. (2023). The neural ingredients for a language of thought are available. Trends in Cognitive Sciences, 27, 996–1007.
Kempen, G., Olsthoorn, N., & Sprenger, S. (2012). Grammatical workspace sharing during language production and language comprehension: Evidence from grammatical multitasking. Language and Cognitive Processes, 27(3), 345–380.
Khatin‐Zadeh, O., Hu, J., Eskandari, Z., Banaruee, H., Yanjiao, Z., Farsani, D., & He, J. (2024). Embodiment and gestural realization of ergative verbs. Psychological Research, 88(3), 762–772.
Knoeferle, P., & Crocker, M. W. (2006). The coordinated interplay of scene, utterance, and world knowledge: Evidence from eye tracking. Cognitive Science, 30, 481–529.
Koh, J. Y., Fried, D., & Salakhutdinov, R. R. (2023). Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36, 21487–21506.
Koring, L., Mak, P., & Reuland, E. (2012). The time course of argument reactivation revealed: Using the visual world paradigm. Cognition, 123(3), 361–379.
Kriegeskorte, N. (2015). Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1(1), 417–446.
Kuchinsky, S. E., Bock, K., & Irwin, D. E. (2011). Reversing the hands of time: Changing the mapping from seeing to saying. Journal of Experimental Psychology. Learning, Memory, and Cognition, 37, 748–756.
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmertest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26.
Land, M. F. (2012). The operation of the visual system in relation to action. Current Biology, 22(18), R811–R817.
Land, M., Mennie, N., & Rusted, J. (1999). The roles of vision and eye movements in the control of activities of daily living. Perception, 28, 1311–1328.
Landau, B., & Jackendoff, R. (1993). ‘What’ and ‘where’ in spatial language and spatial cognition. Behavioral and Brain Sciences, 16, 217–265.
Lashley, K. S. (1951). The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral mechanisms in behavior (pp. 112–136). New York: Wiley.
Lauer, T., Cornelissen, T. H., Draschkow, D., Willenbockel, V., & Võ, M. L. H. (2018). The role of scene summary statistics in object recognition. Scientific Reports, 8(1), 14666.
Levelt, W. J., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22(1), 1–38.
Levinson, S. C. (1997). From outer to inner space: Linguistic categories and non‐linguistic thinking. Language and Conceptualization, 1, 13–45.
Li, G., Duan, N., Fang, Y., Gong, M., & Jiang, D. (2020). Unicoder‐VL: A universal encoder for vision and language by cross‐modal pre‐training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34 (pp. 11336–11344).
Louwerse, M. M., Dale, R., Bard, E. G., & Jeuniaux, P. (2012). Behavior matching in multimodal communication is synchronized. Cognitive Science, 36, 1404–1426.
Luke, S. G. (2017). Evaluating significance in linear mixed‐effects models in R. Behavior Research Methods, 49, 1494–1502.
Lupyan, G., Rahman, R. A., Boroditsky, L., & Clark, A. (2020). Effects of language on visual perception. Trends in Cognitive Sciences, 24(11), 930–944.
Ma, X., Liu, Y., Clariana, R., Gu, C., & Li, P. (2023). From eye movements to scanpath networks: A method for studying individual differences in expository text reading. Behavior Research Methods, 55(2), 730–750.
Majid, A., Bowerman, M., Kita, S., Haun, D. B., & Levinson, S. C. (2004). Can language restructure cognition? The case for space. Trends in Cognitive Sciences, 8(3), 108–114.
Mani, K., & Johnson‐Laird, P. N. (1982). The mental representation of spatial descriptions. Memory & Cognition, 10(2), 181–187.
Maran, M., Friederici, A. D., & Zaccarella, E. (2022). Syntax through the looking glass: A review on two‐word linguistic processing across behavioral, neuroimaging and neurostimulation studies. Neuroscience & Biobehavioral Reviews, 142, 104881.
McClelland, J. L., Rumelhart, D. E., & PDP Research Group. (1987). Parallel distributed processing, Vol. 2: Explorations in the microstructure of cognition: Psychological and biological models. MIT Press.
Merkx, D., Frank, S., & Ernestus, M. (2022). Seeing the advantage: Visually grounding word embeddings to better capture human semantic knowledge. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics (pp. 1–11). Dublin, Ireland: Association for Computational Linguistics.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
Morgan, E. U., van der Meer, A., Vulchanova, M., Blasi, D. E., & Baggio, G. (2020). Meaning before grammar: A review of ERP experiments on the neurodevelopmental origins of semantic processing.
Moschitti, A., Pighin, D., & Basili, R. (2008). Tree kernels for semantic role labeling. Computational Linguistics, 34(2), 193–224.
Mudrik, L., Lamy, D., & Deouell, L. Y. (2010). ERP evidence for context congruity effects during simultaneous object‐scene processing. Neuropsychologia, 48, 507–517.
Mudrik, L., Shalgi, S., Lamy, D., & Deouell, L. Y. (2014). Synchronous contextual irregularities affect early scene processing: Replication and extension. Neuropsychologia, 56, 447–458.
Murlidaran, S., & Eckstein, M. P. (2025). Eye movements during free viewing to maximize scene understanding. Nature Communications, 17(1). https://doi.org/10.1038/s41467-025-67673-w
Myachykov, A., Posner, M., & Tomlin, R. (2007). A parallel interface for language and cognition: Theory, method, and experimental evidence. Linguistic Review, 24(4), 457–474.
Myachykov, A., Thompson, D., Scheepers, C., & Garrod, S. (2011). Visual attention and structural choice in sentence production across languages. Language and Linguistics Compass, 5(2), 95–107.
Nivre, J. (2005). Pseudo‐projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005.
Pagnotta, M., Laland, K. N., & Coco, M. I. (2020). Attentional coordination in demonstrator‐observer dyads facilitates learning and predicts performance in a novel manual task. Cognition, 201, 104314.
Papafragou, A., Hulbert, J., & Trueswell, J. (2008). Does language guide event perception? Evidence from eye movements. Cognition, 108, 155–184.
Piantadosi, S. T. (2023). Modern language models refute Chomsky's approach to language. In From fieldwork to linguistic theory: A tribute to Dan Everett, 15, 353–414.
Pickering, M. J., & Gambi, C. (2018). Predicting while comprehending language: A theory and review. Psychological Bulletin, 144(10), 1002.
Poletiek, F. H., Monaghan, P., van de Velde, M., & Bocanegra, B. R. (2021). The semantics–syntax interface: Learning grammatical categories and hierarchical syntactic structure through semantics. Journal of Experimental Psychology: Learning, Memory, and Cognition, 47(7), 1141.
Quilty‐Dunn, J., Porot, N., & Mandelbaum, E. (2023). The best game in town: The reemergence of the language‐of‐thought hypothesis across the cognitive sciences. Behavioral and Brain Sciences, 46, e261.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High‐resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10674–10685). Los Alamitos, CA: IEEE Computer Society.
Snedeker, J., & Trueswell, J. C. (2004). The developing constraints on parsing decisions: The role of lexical‐biases and referential scenes in child and adult sentence processing. Cognitive Psychology, 49(3), 238–299.
Souza, A. S., & Skóra, Z. (2017). The interplay of language and visual perception in working memory. Cognition, 166, 277–297.
Spivey, M. J., Tanenhaus, M. K., Eberhard, K. M., & Sedivy, J. C. (2002). Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45(4), 447–481.
Tachihara, K., Barker, M., Cotter, B., Hayes, T., Henderson, J., Zhou, A., & Ferreira, F. (2026). Planning to be incremental: Scene descriptions reveal meaningful clustering in language production. Cognition, 266, 106330.
Talmy, L. (1975). Figure and ground in complex sentences. In Proceedings of the 1st Annual Meeting of the Berkeley Linguistics Society (pp. 419–430).
Tanenhaus, M. K., Spivey‐Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634.
Tatler, B. W. (2007). The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14), 4.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Torralba, A., Oliva, A., Castelhano, M. S., & Henderson, J. M. (2006). Contextual guidance of eye movements and attention in real‐world scenes: The role of global features in object search. Psychological Review, 113, 766–786.
Ünal, E., Mamus, E., & Özyürek, A. (2024). Multimodal encoding of motion events in speech, gesture and cognition. Language and Cognition, 16(4), 785–804.
Ünal, E., Wilson, F., Trueswell, J., & Papafragou, A. (2024). Asymmetries in encoding event roles: Evidence from language and cognition. Cognition, 250, 105868.
Võ, M. L.‐H. (2021). The meaning and structure of scenes. Vision Research, 181, 10–20.
Võ, M. L. H., Boettcher, S. E., & Draschkow, D. (2019). Reading scenes: How scene grammar guides attention and aids perception in real‐world environments. Current Opinion in Psychology, 29, 205–210.
Võ, M. L.‐H., & Wolfe, J. M. (2013). Differential electrophysiological signatures of semantic and syntactic scene processing. Psychological Science, 24, 1816–1823.
Webb, A., Knott, A., & MacAskill, M. R. (2010). Eye movements during transitive action observation have sequential structure. Acta Psychologica, 133(1), 51–56.
Wilson, V. A., Zuberbühler, K., & Bickel, B. (2022). The evolutionary origins of syntax: Event cognition in nonhuman primates. Science Advances, 8(25), eabn8464.
Wolfe, J. M., Alvarez, G. A., Rosenholtz, R., Kuzmova, Y. I., & Sherman, A. M. (2011). Visual search for arbitrary objects in real scenes. Attention, Perception, and Psychophysics, 73, 1650–1671.
Wolff, P., & Holmes, K. J. (2011). Linguistic relativity. Wiley Interdisciplinary Reviews: Cognitive Science, 2(3), 253–265.
Wu, J., Gan, W., Chen, Z., Wan, S., & Philip, S. Y. (2023). Multimodal large language models: A survey. In 2023 IEEE International Conference on Big Data (BigData) (pp. 2247–2256). IEEE.
Xue, R., Xu, J., Mondal, S., Le, H., Zelinsky, G., Hoai, M., & Samaras, D. (2025). Few‐shot personalized scanpath prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 13497–13507).
Zacks, J. M., & Tversky, B. (2001). Event structure in perception and conception. Psychological Bulletin, 127(1), 3.
Grant Information: 2022APAFFN - CUP: B53D2301448 0001 European Union; 203427 Synchronous Linguistic and Visual Processing
Contributed Indexing: Keywords: Cross‐linguistic differences; Cross‐modal coordination; Eye‐tracking; Language of vision; Scene description; Scene grammar; Semantic and syntactic similarity
Entry Date(s): Date Created: 20260224 Date Completed: 20260224 Latest Revision: 20260226
Update Code: 20260226
PubMed Central ID: PMC12930141
DOI: 10.1111/cogs.70185
PMID: 41732037
Database: MEDLINE