Decoding Complexity: CHPDA – Intelligent Pattern Exploration with a Context-Aware Hybrid Pattern Detection Algorithm
Saved in:
| Title: | Decoding Complexity: CHPDA – Intelligent Pattern Exploration with a Context-Aware Hybrid Pattern Detection Algorithm |
|---|---|
| Authors: | Lokesh Koli, Shubham Kalra, Karanpreet Singh |
| Publisher Information: | Qeios Ltd, 2025. |
| Publication Year: | 2025 |
| Description: | Detecting sensitive data such as Personally Identifiable Information (PII) and Protected Health Information (PHI) is critical for data security platforms.[1] This study evaluates regex-based pattern matching algorithms and exact-match search techniques to optimize detection speed, accuracy, and scalability. Our benchmarking results indicate that Google RE2 provides the best balance of speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among regex engines, outperforming PCRE while maintaining broader hardware compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated superior performance (8 ms/MB) and scalability for large datasets. Performance analysis revealed that regex processing time scales linearly with dataset size and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 score (91. 6%) by improving recall and minimizing false positives. Device benchmarking confirmed that our solution maintains efficient CPU and memory usage on both high-performance and mid-range systems. Despite its effectiveness, challenges remain, such as limited multilingual support and the need for regular pattern updates.[2] Future work should focus on expanding language coverage, integrating data security and privacy management (DSPM) with data loss prevention (DLP) tools, and enhancing regulatory compliance for broader global adoption. |
| Document Type: | Article |
| DOI: | 10.32388/mt33ka |
| Rights: | CC BY |
| Accession Number: | edsair.doi...........4cd7a6e66a59f5b7a9ba61c02459a1f8 |
| Database: | OpenAIRE |
| Abstract: | Detecting sensitive data such as Personally Identifiable Information (PII) and Protected Health Information (PHI) is critical for data security platforms.[1] This study evaluates regex-based pattern matching algorithms and exact-match search techniques to optimize detection speed, accuracy, and scalability. Our benchmarking results indicate that Google RE2 provides the best balance of speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among regex engines, outperforming PCRE while maintaining broader hardware compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated superior performance (8 ms/MB) and scalability for large datasets. Performance analysis revealed that regex processing time scales linearly with dataset size and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 score (91. 6%) by improving recall and minimizing false positives. Device benchmarking confirmed that our solution maintains efficient CPU and memory usage on both high-performance and mid-range systems. Despite its effectiveness, challenges remain, such as limited multilingual support and the need for regular pattern updates.[2] Future work should focus on expanding language coverage, integrating data security and privacy management (DSPM) with data loss prevention (DLP) tools, and enhancing regulatory compliance for broader global adoption. |
|---|---|
| DOI: | 10.32388/mt33ka |
Nájsť tento článok vo Web of Science