Detailed bibliography
| Title: | DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code |
| Authors: | Agrawal, Shriyansh; Lau, Aidan; Shah, Sanyam; R, Ahan M; Zhu, Kevin; Dev, Sunishchal; Sharma, Vasu |
| Year of publication: | 2025 |
| Collection: | ArXiv.org (Cornell University Library) |
| Subjects: | Computation and Language; Artificial Intelligence; Information Retrieval; Machine Learning |
| Description: | The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, which predominantly use zero-shot methods such as Fast-DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often trading one for the other, leaving room for improvement. To address these gaps, we propose fine-tuning encoder-only Small Language Models (SLMs), in particular the pre-trained RoBERTa and CodeBERTa models, on specialized datasets of source code and natural-language text, showing that for the task of binary classification, SLMs outperform LLMs by a substantial margin while using a fraction of the compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 of $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included. Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025): 4th Workshop on Deep Learning for Code. |
| Document type: | text |
| Language: | unknown |
| Relation: | http://arxiv.org/abs/2510.18904; Neural Information Processing Systems (NeurIPS 2025) |
| Availability: | http://arxiv.org/abs/2510.18904 |
| Accession number: | edsbas.A111B6E2 |
| Database: | BASE |
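The abstract reports detector quality as AUROC, i.e. the probability that a randomly chosen machine-generated sample receives a higher detector score than a randomly chosen human-written one. As a minimal illustrative sketch (not code from the paper), this can be computed directly from scores and binary labels via the rank-based Mann-Whitney formulation:

```python
def auroc(scores, labels):
    """AUROC: probability that a randomly chosen positive example
    outscores a randomly chosen negative one (ties count as half).
    `scores` are detector scores, `labels` are 0/1 ground truth."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A detector that perfectly separates the two classes scores 1.0:
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
```

In practice a library routine such as scikit-learn's `roc_auc_score` would be used; the pairwise loop above is only to make the metric's meaning concrete.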