Lecture Notes in Computer Science, 2004, Volume 3240/2004, 206-217, DOI: 10.1007/978-3-540-30219-3_18

ATDD: An Algorithmic Tool for Domain Discovery in Protein Sequences

Stanislav Angelov, Sanjeev Khanna, Li Li and Fernando Pereira

Abstract

Identifying sequence domains is essential for understanding protein function. Most current methods for protein domain identification rely on prior knowledge of homologous domains and on the construction of high-quality multiple sequence alignments. With the rapid accumulation of data from genome sequencing, it is important to be able to determine domain regions automatically from a set of proteins, based solely on sequence information.
We describe a new algorithm for automatic protein domain detection that does not require multiple sequence alignment. It differs from alignment-based methods by allowing arbitrary rearrangements of the domains, both in relative ordering and in distance, within the set of proteins under study. Moreover, our algorithm extracts domains purely through a comparative analysis of the given set of sequences; no auxiliary information is required. The method views protein sequences as collections of overlapping fixed-length blocks. A pair of blocks within a sequence gets a “vote of confidence” to be part of a domain if several other sequences have similar pairs of blocks at roughly the same distance from each other. Candidate domains are then identified by discovering regions in each protein sequence where most block pairs get strong votes of confidence. We applied our method to several test data sets with a fixed choice of parameters. To evaluate the results, we computed sensitivity and specificity measures using SMART-derived domain annotations as a reference.
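
The block-pair voting idea described above can be illustrated with a small Python sketch. It is not the ATDD implementation: the abstract does not specify how block similarity is measured, how votes are thresholded, or how regions are extracted, so the sketch assumes exact block matches, a distance tolerance of one position, and a simple coverage rule for regions. The function names (pair_votes, candidate_regions) and all parameters (k, max_gap, tol, min_votes, min_len) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pair_votes(seqs, k=3, max_gap=12, tol=1):
    """For each sequence, count how many OTHER sequences contain the same
    ordered pair of length-k blocks at roughly the same distance.

    Assumption: "similar" blocks are taken to mean identical blocks.
    """
    # occurrences[(block_a, block_b)][distance] = ids of sequences having that pair
    occurrences = defaultdict(lambda: defaultdict(set))
    for sid, seq in enumerate(seqs):
        n = len(seq) - k + 1  # number of overlapping blocks
        for i in range(n):
            for j in range(i + 1, min(n, i + max_gap + 1)):
                occurrences[(seq[i:i + k], seq[j:j + k])][j - i].add(sid)

    all_votes = []
    for sid, seq in enumerate(seqs):
        n = len(seq) - k + 1
        votes = {}
        for i in range(n):
            for j in range(i + 1, min(n, i + max_gap + 1)):
                by_dist = occurrences[(seq[i:i + k], seq[j:j + k])]
                supporters = set()
                # Accept distances within +/- tol of the observed one.
                for d in range(j - i - tol, j - i + tol + 1):
                    supporters |= by_dist.get(d, set())
                supporters.discard(sid)  # a sequence does not vote for itself
                votes[(i, j)] = len(supporters)
        all_votes.append(votes)
    return all_votes


def candidate_regions(seq_len, votes, k=3, min_votes=2, min_len=5):
    """Return half-open (start, end) residue ranges covered by strongly
    supported block pairs.

    Assumption: a pair with at least min_votes votes marks every residue
    from the start of its first block to the end of its second block.
    """
    covered = [False] * seq_len
    for (i, j), v in votes.items():
        if v >= min_votes:
            for p in range(i, j + k):
                covered[p] = True

    regions, start = [], None
    for p, flag in enumerate(covered + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = p
        elif not flag and start is not None:
            if p - start >= min_len:
                regions.append((start, p))
            start = None
    return regions
```

Because votes depend only on pairs of blocks and their relative distance, a shared region can be recovered even when it sits at a different offset in each sequence, which is the property the abstract contrasts with alignment-based methods. A realistic version would need a tolerant block-similarity measure and a density criterion ("most block pairs") rather than simple coverage.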
