Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning.

Bibliographic Details
Title: Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning.
Authors: Joseph, D. Paul; Perumal, Viswanathan
Source: PeerJ Computer Science; Mar 2025, p1-27, 27p
Subject Terms: FORENSIC sciences, CLASSIFICATION, CONCEPTUAL models, INFORMATION storage & retrieval systems, TEXT mining, INFORMATION retrieval, MACHINE learning
Abstract: In forensic topical modelling, the α parameter controls the distribution of topics in documents. However, low, high, or incorrect values of α lead to topic sparsity, model overfitting, and suboptimal topic distribution. To control the word distribution across topics, the β parameter is introduced. However, low, high, or inappropriate β values lead to sparse distribution, disjointed topics, and abundant highly probable words. The βj parameter, in conjunction with seed-guided words based on Term Frequency and Inverse Document Frequency, is introduced to address the issues. Nevertheless, the data often suffers from skewness or noise due to frequent co-occurrences of unrelated polysemic word pairs generated using Pointwise Mutual Information. By integrating α, β, and βj into file classification systems, classification models converge to local optima with O(n log n · |V|) time complexity. To combat these challenges, this research proposes the SDOT Forensic Classification System (SFCS) with a functional parameter βk that identifies seed words by evaluating semantic and contextual similarity of word vectors. As a result, the topic distribution (Θd) is compelled to model the curated seed words within the distribution, generating pertinent topics. Incorporating βk into SFCS allowed the proposed model to remove 278 k irrelevant files from the corpus and identify 5.6 k suspicious files by extracting 700 blacklisted keywords. Furthermore, this research implemented hyperparameter optimization and hyperplane maximization, resulting in a file classification accuracy of 94.6%, 94.4% precision and 96.8% recall within O(n log n) complexity. [ABSTRACT FROM AUTHOR]
Copyright of PeerJ Computer Science is the property of PeerJ Inc. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.2608
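
To make the seed-word curation idea in the abstract concrete, the sketch below ranks vocabulary terms by TF-IDF and retains only those whose word vectors are semantically close to a blacklisted keyword, which is the role the βk parameter plays in SFCS as described above. This is a minimal illustration, not the authors' implementation; the function names, similarity threshold, and the source of the word embeddings are assumptions made here for demonstration.

```python
# Minimal illustrative sketch (not the SFCS implementation from the paper).
# Assumptions introduced here: `embeddings` maps words to dense vectors from
# some pre-trained model, `corpus` is a list of document strings, and
# `blacklist` holds investigator-supplied keywords. Names and thresholds are
# hypothetical.
from typing import Dict, List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def select_seed_words(corpus: List[str],
                      blacklist: List[str],
                      embeddings: Dict[str, np.ndarray],
                      top_k: int = 20,
                      sim_threshold: float = 0.6) -> List[str]:
    """Curate seed words: rank vocabulary by TF-IDF and keep terms whose
    vectors are semantically close to a blacklisted keyword."""
    # Rank the corpus vocabulary by its maximum TF-IDF weight.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)
    vocab = np.array(vectorizer.get_feature_names_out())
    weights = np.asarray(tfidf.max(axis=0).todense()).ravel()
    candidates = vocab[np.argsort(weights)[::-1]]

    # Keep candidates that are semantically similar to a blacklisted keyword,
    # rather than merely co-occurring with one (the PMI pitfall noted in the
    # abstract).
    seeds: List[str] = []
    for word in candidates:
        if word not in embeddings:
            continue
        best = max((cosine(embeddings[word], embeddings[b])
                    for b in blacklist if b in embeddings), default=0.0)
        if best >= sim_threshold:
            seeds.append(word)
        if len(seeds) == top_k:
            break
    return seeds
```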