Code Redteaming: Probing Ethical Sensitivity of LLMs Through Natural Language Embedded in Code

Bibliographic Details
Title: Code Redteaming: Probing Ethical Sensitivity of LLMs Through Natural Language Embedded in Code
Authors: Chanjun Park, Jeongho Yoon, Heuiseok Lim
Source: Mathematics, Vol 14, Iss 1, p 189 (2026)
Publisher Information: MDPI AG
Publication Year: 2026
Collection: Directory of Open Access Journals: DOAJ Articles
Subject Terms: adversarial evaluation, content safety, ethical language detection, code analysis, large language models, Mathematics, QA1-939
Description: Large language models are increasingly used in code generation and developer tools, yet their robustness to ethically problematic natural language embedded in source code is underexplored. In this work, we study content-safety vulnerabilities arising from ethically inappropriate language placed in non-functional code regions (e.g., comments or identifiers), rather than traditional functional security vulnerabilities such as exploitable program logic. In real-world and educational settings, programmers may include inappropriate expressions in identifiers, comments, or print statements that are operationally inert but ethically concerning. We present Code Redteaming, an adversarial evaluation framework that probes models’ sensitivity to such linguistic content. Our benchmark spans Python and C and applies sentence-level and token-level perturbations across natural-language-bearing surfaces, evaluating 18 models from 1B to 70B parameters. Experiments reveal inconsistent scaling trends and substantial variance across injection types and surfaces, highlighting blind spots in current safety filters. These findings motivate input-sensitive safety evaluations and stronger defenses for code-focused LLM applications.
Document Type: article in journal/newspaper
Language: English
Relation: https://www.mdpi.com/2227-7390/14/1/189; https://doaj.org/toc/2227-7390; https://doaj.org/article/3f5930e14782473da84170f104d29e88
DOI: 10.3390/math14010189
Availability: https://doi.org/10.3390/math14010189
https://doaj.org/article/3f5930e14782473da84170f104d29e88
Accession Number: edsbas.CB80868C
Database: BASE
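
The abstract describes sentence-level and token-level perturbations applied to non-functional, natural-language-bearing surfaces (comments, identifiers, print statements). The sketch below is a hypothetical illustration of that idea only, not the authors' released framework; the sample snippet, function names, and placeholder payloads are assumptions for illustration.

```python
# Hypothetical sketch of injecting problematic natural language into
# non-functional surfaces of a Python snippet (comment and identifier),
# leaving the program's behavior unchanged. Not the paper's implementation.

CLEAN_SNIPPET = '''\
def compute_total(values):
    # sum the provided values
    result = 0
    for v in values:
        result += v
    print("done")
    return result
'''


def inject_comment(code: str, payload: str) -> str:
    """Sentence-level perturbation: replace the first comment with the payload."""
    lines = code.splitlines()
    for i, line in enumerate(lines):
        stripped = line.lstrip()
        if stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            lines[i] = f"{indent}# {payload}"
            break
    return "\n".join(lines)


def inject_identifier(code: str, old_name: str, toxic_name: str) -> str:
    """Token-level perturbation: rename an identifier to an inappropriate token.
    Simplistic textual replacement; a real harness would rename via the AST."""
    return code.replace(old_name, toxic_name)


if __name__ == "__main__":
    payload = "<ethically inappropriate sentence placeholder>"
    perturbed = inject_comment(CLEAN_SNIPPET, payload)
    perturbed = inject_identifier(perturbed, "result", "offensive_token_placeholder")
    print(perturbed)  # operationally equivalent code now carrying the injected text
```

In an evaluation harness of this kind, the perturbed snippet would presumably be presented to each model under test (e.g., for review or completion) and the response scored on whether the injected content is flagged, which matches the sensitivity probing the abstract describes.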