SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

Detailed Bibliography
Title: SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
Authors: Imani, Shima; Moon, Seungwhan; Ahmadyan, Adel; Zhang, Lu; Ahmed, Kirmani; Damavandi, Babak
Publication year: 2025
Collection: ArXiv.org (Cornell University Library)
Topics: Artificial Intelligence
Description: We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
Document type: text
Language: unknown
Relation: http://arxiv.org/abs/2512.05954
Availability: http://arxiv.org/abs/2512.05954
Accession number: edsbas.84F3A2E3
Database: BASE
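The abstract's central design point, that each problem is parameterized and paired with executable Python code producing the ground truth for any parameter set, can be illustrated with a minimal sketch. The problem template (projectile range), the parameter ranges, and all function names below are hypothetical and not taken from the SymPyBench release; the accuracy and failure-rate computations are simplified stand-ins for the metrics named in the abstract, and the Consistency and Confusion scores are not reproduced here.

```python
import math
import random

# Hypothetical parameterized problem: range of a projectile on flat ground.
# The executable solver plays the role of the benchmark's ground-truth code.

def ground_truth_range(v0: float, theta_deg: float, g: float = 9.81) -> float:
    """Closed-form ground truth for one parameter configuration."""
    theta = math.radians(theta_deg)
    return v0 ** 2 * math.sin(2 * theta) / g

def sample_parameters(rng: random.Random) -> dict:
    """Draw one variant of the problem from its (assumed) parameter space."""
    return {"v0": rng.uniform(5.0, 50.0), "theta_deg": rng.uniform(10.0, 80.0)}

def evaluate(model_answer_fn, n_variants: int = 100, rel_tol: float = 1e-2) -> dict:
    """Toy variant-level scoring: accuracy over sampled variants plus a
    failure rate (fraction of variants with no usable answer). This is a
    simplified illustration, not the paper's metric definitions."""
    rng = random.Random(0)
    correct = failures = 0
    for _ in range(n_variants):
        params = sample_parameters(rng)
        truth = ground_truth_range(**params)
        try:
            pred = model_answer_fn(**params)
        except Exception:
            failures += 1
            continue
        if abs(pred - truth) <= rel_tol * abs(truth):
            correct += 1
    return {"accuracy": correct / n_variants, "failure_rate": failures / n_variants}

if __name__ == "__main__":
    # A stand-in "model" that happens to know the closed-form answer.
    print(evaluate(lambda v0, theta_deg: ground_truth_range(v0, theta_deg)))
```

Because every variant's ground truth is computed by code rather than stored as a fixed answer, the same template can be re-scored under fresh parameter draws, which is what makes variant-level statistics such as a consistency score or failure rate well defined.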