Efficient Encoding/Decoding of GC-Balanced Codes Correcting Tandem Duplications

Bibliographic Details
Published in: IEEE Transactions on Information Theory, Vol. 66, no. 8, pp. 4892-4903
Main Authors: Chee, Yeow Meng; Chrisnata, Johan; Kiah, Han Mao; Nguyen, Tuan Thanh
Format: Journal Article
Language:English
Published: New York IEEE 01.08.2020
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN:0018-9448, 1557-9654
Description
Summary: Tandem duplication is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain et al. (2017) proposed the study of codes that correct tandem duplications. All code constructions are based on irreducible words. Such code constructions are almost optimal to combat tandem duplications of length at most $k$, where $k \leq 3$. However, the problem of designing efficient encoders/decoders for such codes has not been investigated. In addition, the method cannot be extended to deal with the case of arbitrary $k$, where $k \geq 4$. In this work, we study efficient encoding/decoding methods for irreducible words over a general $q$-ary alphabet. Our methods provide the first known efficient encoder/decoder for $q$-ary codes correcting tandem duplications of length at most $k$, where $k \leq 3$. In particular, we describe an $(\ell,m)$-finite state encoder and show that when $m = \Theta(1/\epsilon)$ and $\ell = \Theta(1/\epsilon)$, the encoder achieves a rate that is $\epsilon$ away from the optimal rate.
We also provide ranking/unranking algorithms for irreducible words and modify the algorithms to reduce the space requirements for the finite state encoder. Over the DNA alphabet (or quaternary alphabet), we also impose a weight constraint on the codewords. In particular, a quaternary word is $\mathtt{GC}$-balanced if exactly half of its symbols are either $\mathtt{C}$ or $\mathtt{G}$. Via a modification of Knuth's balancing technique, we provide an efficient method that translates quaternary messages into $\mathtt{GC}$-balanced codewords, and the resulting codebook is able to correct tandem duplications of length at most $k$, where $k \leq 3$. In addition, we provide the first known construction of codes to combat tandem duplications of length at most $k$, where $k \geq 4$. Such codes can correct duplication errors in linear time and they are almost optimal in terms of rate.
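The definitions in the summary can be made concrete with a small Python sketch. It is illustrative only and is not the encoder or decoder from the article. It assumes that a word is irreducible for duplication length at most $k$ when it contains no adjacent repeat $uu$ with $1 \leq |u| \leq k$. The balancing step is the classical Knuth prefix-flipping idea applied to the GC indicator, not the modified version the summary refers to. The function names and the `SWAP` mapping (A<->C, T<->G) are hypothetical choices made for this example.

```python
from typing import Tuple

GC = {"G", "C"}
# Hypothetical symbol swap that toggles GC membership (assumption for this sketch).
SWAP = {"A": "C", "C": "A", "T": "G", "G": "T"}


def tandem_duplicate(word: str, start: int, length: int) -> str:
    """Insert a copy of word[start:start+length] directly after that segment."""
    if length < 1 or start < 0 or start + length > len(word):
        raise ValueError("segment out of range")
    end = start + length
    return word[:end] + word[start:end] + word[end:]


def is_irreducible(word: str, k: int) -> bool:
    """True if word contains no adjacent repeat uu with 1 <= len(u) <= k."""
    for length in range(1, k + 1):
        for i in range(len(word) - 2 * length + 1):
            if word[i:i + length] == word[i + length:i + 2 * length]:
                return False
    return True


def is_gc_balanced(word: str) -> bool:
    """True if exactly half of the symbols are G or C."""
    return 2 * sum(ch in GC for ch in word) == len(word)


def knuth_balance(word: str) -> Tuple[str, int]:
    """Classical Knuth balancing on the GC indicator.

    Swaps the first i symbols for the smallest i that yields a GC-balanced
    word and returns (balanced_word, i). The index i must be stored
    separately for the step to be reversible.
    """
    if len(word) % 2:
        raise ValueError("even length required")
    for i in range(len(word) + 1):
        candidate = "".join(SWAP[ch] for ch in word[:i]) + word[i:]
        if is_gc_balanced(candidate):
            return candidate, i
    # Each extra swap changes the GC count by exactly one, so a balanced
    # prefix length always exists for even-length words.
    raise AssertionError("unreachable")
```

For instance, `tandem_duplicate("ACGT", 1, 2)` yields `"ACGCGT"`, which is no longer irreducible for $k = 3$. Plain prefix flipping can itself introduce repeats: `knuth_balance("AATT")` returns `"CCTT"`, which is GC-balanced but not irreducible. This is one reason a modified balancing technique is needed when both constraints must hold.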
DOI:10.1109/TIT.2020.2981069