Lecture Notes in Computer Science, 2006, Volume 4216/2006, 42-51, DOI: 10.1007/11875741_5

An Efficient Algorithm for Finding Long Conserved Regions Between Genes

Tak-Man Ma, Yuh-Dauh Lyuu and Yen-Wu Ti

View Related Documents

Abstract

We study the problem of approximate non-tandem repeat (conserved regions) extraction among strings (genes). Basically, given a string S and thresholds L and D over a finite alphabet, extracting approximate repeats is to find pairs (β, β′) of substrings of S under some constraints such that β and β′ have edit-distance at most D and their respective lengths are at least L. Previous works mainly focus on the case that D is small, so they are not appropriate for extracting approximate repeats with relatively large D. In contrast, this paper focuses on extracting long approximate repeats with large D and it is more efficient than previous works. We also show that our algorithm is optimal in time when D is a constant.
In this paper, given an input string S and thresholds L and D, we would like to extract all (D, L)-supermaximal approximate repeats (β, β′) of S. One useful application of extracting all (D, L)-supermaximal approximate repeats (β, β′) is to find all longest possible substrings β of S such that there exist some other substring β′ of S where β and β′ have edit-distance at most D and their respective lengths are at least L. This algorithm can be easily applied to the case where there are multiple input strings S 1,S 2,...,S n if we first concatenate the input strings into one long subject string S with a special symbol $``\sharp"$``\sharp" for separation: S1\sharp S2\sharp¼\sharp SnS_1\sharp S_2\sharp\ldots\sharp S_n. The running time complexity of our algorithm is O(DN 2) where N=|S 1|+|S 2|+⋯+|S n |.

Fulltext Preview

Image of the first page of the fulltext document