Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences


Bibliographic details
Published in: Computers in biology and medicine, vol. 131, p. 104292
Main authors: Karcioglu, Abdullah Ammar; Bulut, Hasan
Format: Journal Article
Language: English
Published: United States, Elsevier Ltd, 1 April 2021
ISSN: 0010-4825, 1879-0534
Description
Summary: Exact string matching algorithms find all occurrences of a pattern P in a text T. These algorithms have been extensively studied in computer science, primarily because of their applications in fields such as text search and computational biology. The main goal of exact string matching algorithms is to find all pattern matches correctly in the shortest possible time. Although hash-based string matching algorithms run fast, they have shortcomings, such as hash collisions. In this study, we propose a novel hash function that eliminates hash collisions for DNA sequences. It provides perfect hashing and produces hash values in a time-efficient manner. We propose two exact string matching algorithms based on this hash function. In the first approach, we replace the traditional Hash-q algorithm's hash function with the proposed one. In the second approach, we improve the first approach by using the shift size indicated at the (m−1)th entry in the good suffix shift table when an exact match is found. In both approaches, we eliminate the need to compare the last q characters of the pattern and text. We include six algorithms from the literature in our evaluations. The E. coli and Human Chromosome 1 datasets from the literature and a randomly generated synthetic dataset are used for comparisons. The results show that the proposed approaches perform better in terms of the average runtime, the average number of character comparisons, and the average number of hash comparisons.
•A novel collision-free hash function is proposed for DNA sequences.
•Based on the proposed hash function, we propose the Hash-q Algorithm with Unique FNG as a first improvement to the traditional Hash-q algorithm.
•Based on the proposed hash function, we propose the Hash-q Boyer-Moore Algorithm with UniqueFNG as a second improvement to the traditional Hash-q algorithm.
•The approaches are compared on the E. coli, Human Chromosome 1, and synthetic datasets.
•Significant improvements are achieved in the average runtime, the average number of character comparisons, and the average number of hash comparisons.
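
The summary does not spell out the proposed hash function, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a plain 2-bit (base-4) encoding of each q-gram as a stand-in collision-free hash, and a Hash-q-style shift table indexed by that hash. The names `CODE`, `qgram_hash`, `build_shift_table`, and `hash_q_search` are hypothetical. Because the hash is collision-free, a shift value of 0 means the last q characters of the window already match the pattern, so only the first m−q characters are compared.

```python
# Hypothetical 2-bit code per nucleotide (an assumption, not the paper's hash).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def qgram_hash(s, start, q):
    """Map s[start:start+q] to a unique integer in [0, 4**q)."""
    h = 0
    for ch in s[start:start + q]:
        h = h * 4 + CODE[ch]  # raises KeyError on non-ACGT symbols
    return h


def build_shift_table(pattern, q):
    """Return (shift, sh1) for a Hash-q-style search.

    shift[h] is how far the window may move when its last q-gram hashes
    to h; sh1 is the shift to apply after the last q-gram has matched.
    """
    m = len(pattern)
    shift = [m - q + 1] * (4 ** q)  # default: q-gram absent from pattern
    for i in range(m - q):  # every q-gram except the last one
        shift[qgram_hash(pattern, i, q)] = m - q - i
    h_last = qgram_hash(pattern, m - q, q)
    sh1 = shift[h_last]  # based on an earlier occurrence, if any
    shift[h_last] = 0  # 0 marks "last q characters match"
    return shift, sh1


def hash_q_search(text, pattern, q=3):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m < q or n < m:
        return []
    shift, sh1 = build_shift_table(pattern, q)
    matches = []
    j = m - q  # start of the last q-gram of the current window
    while j <= n - q:
        sh = shift[qgram_hash(text, j, q)]
        if sh == 0:
            start = j - (m - q)
            # The hash is perfect, so the last q characters are known to
            # match; only the first m - q characters need checking.
            if text[start:start + m - q] == pattern[:m - q]:
                matches.append(start)
            j += sh1
        else:
            j += sh
    return matches
```

For example, `hash_q_search("ACGTACGTAC", "ACGT", q=2)` returns `[0, 4]`. The table has 4**q entries, so this sketch only suits small q, and it assumes the input contains only the symbols A, C, G, and T.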
DOI: 10.1016/j.compbiomed.2021.104292