A Benchmark for Evaluating Cognitive Reasoning in Modern Language Models.

Bibliographic Details
Title: A Benchmark for Evaluating Cognitive Reasoning in Modern Language Models.
Authors: Piętka, Kinga; Bereta, Michał
Source: Applied Sciences (2076-3417); Feb2026, Vol. 16 Issue 4, p1918, 43p
Subject Terms: LANGUAGE models, COGNITIVE ability, BENCHMARK problems (Computer science), SEMANTICS (Philosophy), INFORMATION technology, EXECUTIVE function, LOGIC, COMPUTATIONAL linguistics
Abstract: With the growth of large language models (LLMs), there are increasing calls to interpret their behavior through the lens of analogies to human cognitive mechanisms. At the same time, the scientific literature points to fundamental limitations of these systems, describing them, among other things, as models that generate a superficial simulation of reasoning without real access to semantic meaning ("stochastic parrots" or the "illusion of reasoning"). This paper proposes a novel, modular benchmark for assessing the cognitive competence of LLMs, integrating three complementary dimensions of language processing: factual, syntactic, and logical. Eight language models (Llama 3.2, Mistral 7B, Llama 3:8B, Gemini 2.5 Flash, ChatGPT-3, ChatGPT-4o mini, ChatGPT-4, and ChatGPT-5) were tested using a uniform procedure with a context reset after each interaction and a three-point scoring scheme (0/0.5/1). The results showed a clear advantage for the largest models on tasks based on general knowledge and on formal transformations familiar from training, with a significant drop in effectiveness, regardless of model size, on tasks requiring conjunctive reasoning from new, local premises alone. Importantly, unstable but measurable corrective abilities were also observed in some models after feedback, suggesting the presence of reactive mechanisms, though these were insufficient to qualify the models as systems capable of cognitive self-reflection. The combined analysis indicates that LLMs effectively simulate syntactic and logical rules when a task matches recognizable formal patterns, but fail in situations requiring the construction of new, coherent chains of beliefs and symbolic inferences, which undermines the thesis of their cognitive "understanding". The results justify the need for more complex and semantically restrictive evaluation frameworks that can distinguish statistical fit from systematic, multi-stage formal reasoning. The proposed benchmark is a step towards a more multidimensional and diagnostic evaluation of LLMs, shifting the focus from "will the model respond correctly?" to "why and under what conditions is the model able to reason?" [ABSTRACT FROM AUTHOR]
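
Note: the following is a minimal sketch, not the authors' harness, of the evaluation protocol the abstract describes: each task is submitted in a fresh, stateless context (the "context reset after each interaction") and graded on the 0/0.5/1 scheme. The names query_model, grade, and run_benchmark, and the exact grading rule, are hypothetical; the abstract does not specify the rubric's criteria.

    # Minimal sketch of the abstract's protocol. All names and the
    # grading rule are hypothetical placeholders, not the paper's code.
    from typing import Callable

    def query_model(prompt: str) -> str:
        """Stand-in for a fresh, stateless call to an LLM. A real
        harness would open a brand-new session for every call, which
        is what per-interaction context reset amounts to."""
        return "model answer to: " + prompt

    def grade(answer: str, reference: str) -> float:
        """Toy 0/0.5/1 rule: 1 for an exact match, 0.5 for a partial
        overlap, 0 otherwise. Purely illustrative of the three-point
        scheme; the paper's actual rubric is not given in the abstract."""
        if answer.strip().lower() == reference.strip().lower():
            return 1.0
        return 0.5 if reference.lower() in answer.lower() else 0.0

    def run_benchmark(tasks: list[tuple[str, str]],
                      model: Callable[[str], str] = query_model) -> float:
        """Evaluates each (prompt, reference) pair independently and
        returns the mean score over all tasks."""
        scores = [grade(model(prompt), ref) for prompt, ref in tasks]
        return sum(scores) / len(scores) if scores else 0.0

    if __name__ == "__main__":
        demo_tasks = [("What is 2 + 2?", "4")]
        print(f"mean score: {run_benchmark(demo_tasks):.2f}")
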
Copyright of Applied Sciences (2076-3417) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
ISSN: 2076-3417
DOI: 10.3390/app16041918