Space-efficient computation of parallel approximate string matching

Approximate string matching (ASM) has a number of applications in many disciplines, ranging from information retrieval to gene matching. Conventional solution to this problem is based on the dynamic programming-based strategy having quadratic space and time complexity. The complexity of the conventi...

Full description

Saved in:
Bibliographic Details
Published in:The Journal of supercomputing Vol. 79; no. 8; pp. 9093 - 9126
Main Authors: Sadiq, Muhammad Umair, Yousaf, Muhammad Murtaza
Format: Journal Article
Language:English
Published: New York Springer US 01.05.2023
Springer Nature B.V
Subjects:
ISSN:0920-8542, 1573-0484
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Approximate string matching (ASM) has a number of applications in many disciplines, ranging from information retrieval to gene matching. Conventional solution to this problem is based on the dynamic programming-based strategy having quadratic space and time complexity. The complexity of the conventional solution makes it impractical to search queries from the huge sequences having billions of characters. Therefore, many studies have been proposed that improves on the space and time requirement of the basic solution which includes heuristic, filtration, and index-based solutions. These existing solutions obtain the better performance by compromising on the completeness of the search. In this paper, we proposed the linear space algorithm for the approximate string matching problem while retaining the time complexity of conventional solution. The proposed method works in linear space without omitting any regions in the given text; hence, it finds all the possible matches. Conventional dynamic programming solution is modified in such a way that storage of complete trace back table is avoided by keeping only running count of each edit operation in the memory. A variety of laws and facts are discovered in classical dynamic programming table in that regard. We also presented the parallel approach to the proffered algorithm to improve the running time of the algorithm. The algorithm is evaluated on the CUDA-enabled GPUs. DNA sequences of sizes between 250 and 970 MBP are used for evaluation. Moreover, experiments are also performed by using natural language text to highlight the broader applicability of the proposed algorithm. Results show the substantial superiority of the algorithm in terms of performance and scalability compared to the state-of-the-art algorithms.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:0920-8542
1573-0484
DOI:10.1007/s11227-022-05038-6