Generation of a Multi-Class IoT Malware Dataset for Cybersecurity.

Uloženo v:
Podrobná bibliografie
Název: Generation of a Multi-Class IoT Malware Dataset for Cybersecurity.
Autoři: Maghanaki, Mazdak, Keramati, Soraya, Chen, F. Frank, Shahin, Mohammad
Zdroj: Electronics (2079-9292); Nov2025, Vol. 14 Issue 21, p4196, 38p
Témata: MALWARE, INTERNET of things, FEATURE selection, MACHINE learning, ACQUISITION of data, INTERNET security, EVALUATION methodology, CLASSIFICATION
Abstrakt: This study introduces a modular, behaviorally curated malware dataset suite consisting of eight independent sets, each specifically designed to represent a single malware class: Trojan, Mirai (botnet), ransomware, rootkit, worm, spyware, keylogger, and virus. In contrast to earlier approaches that aggregate all malware into large, monolithic collections, this work emphasizes the selection of features unique to each malware type. Feature selection was guided by established domain knowledge and detailed behavioral telemetry obtained through sandbox execution and a subsequent report analysis on the AnyRun platform. The datasets were compiled from two primary sources: (i) the AnyRun platform, which hosts more than two million samples and provides controlled, instrumented sandbox execution for malware, and (ii) publicly available GitHub repositories. To ensure data integrity and prevent cross-contamination of behavioral logs, each sample was executed in complete isolation, allowing for the precise capture of both static attributes and dynamic runtime behavior. Feature construction was informed by operational signatures characteristic of each malware category, ensuring that the datasets accurately represent the tactics, techniques, and procedures distinguishing one class from another. This targeted design enabled the identification of subtle but significant behavioral markers that are frequently overlooked in aggregated datasets. Each dataset was balanced to include benign, suspicious, and malicious samples, thereby supporting the training and evaluation of machine learning models while minimizing bias from disproportionate class representation. Across the full suite, 10,000 samples and 171 carefully curated features were included. This constitutes one of the first dataset collections intentionally developed to capture the behavioral diversity of multiple malware categories within the context of Internet of Things (IoT) security, representing a deliberate effort to bridge the gap between generalized malware corpora and class-specific behavioral modeling. [ABSTRACT FROM AUTHOR]
Copyright of Electronics (2079-9292) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Databáze: Complementary Index
Popis
Abstrakt:This study introduces a modular, behaviorally curated malware dataset suite consisting of eight independent sets, each specifically designed to represent a single malware class: Trojan, Mirai (botnet), ransomware, rootkit, worm, spyware, keylogger, and virus. In contrast to earlier approaches that aggregate all malware into large, monolithic collections, this work emphasizes the selection of features unique to each malware type. Feature selection was guided by established domain knowledge and detailed behavioral telemetry obtained through sandbox execution and a subsequent report analysis on the AnyRun platform. The datasets were compiled from two primary sources: (i) the AnyRun platform, which hosts more than two million samples and provides controlled, instrumented sandbox execution for malware, and (ii) publicly available GitHub repositories. To ensure data integrity and prevent cross-contamination of behavioral logs, each sample was executed in complete isolation, allowing for the precise capture of both static attributes and dynamic runtime behavior. Feature construction was informed by operational signatures characteristic of each malware category, ensuring that the datasets accurately represent the tactics, techniques, and procedures distinguishing one class from another. This targeted design enabled the identification of subtle but significant behavioral markers that are frequently overlooked in aggregated datasets. Each dataset was balanced to include benign, suspicious, and malicious samples, thereby supporting the training and evaluation of machine learning models while minimizing bias from disproportionate class representation. Across the full suite, 10,000 samples and 171 carefully curated features were included. This constitutes one of the first dataset collections intentionally developed to capture the behavioral diversity of multiple malware categories within the context of Internet of Things (IoT) security, representing a deliberate effort to bridge the gap between generalized malware corpora and class-specific behavioral modeling. [ABSTRACT FROM AUTHOR]
ISSN:20799292
DOI:10.3390/electronics14214196