Identifying domains in protein sequences is essential for understanding protein function. Most current methods for protein
domain identification rely on prior knowledge of homologous domains and on the construction of high-quality multiple sequence
alignments. With the rapid accumulation of data from genome sequencing, it is important to be able to determine domain
regions automatically from a set of proteins, based solely on sequence information.
We describe a new algorithm for automatic protein domain detection that does not require multiple sequence alignment and differs
from alignment-based methods by allowing arbitrary rearrangements (both in relative ordering and distance) of the domains
within the set of proteins under study. Moreover, our algorithm extracts domains by simply performing a comparative analysis
of a given set of sequences, and no auxiliary information is required. The method views protein sequences as collections of
overlapping fixed-length blocks. A pair of blocks within a sequence receives a “vote of confidence” as part of a domain if
several other sequences have similar pairs of blocks at roughly the same distance from each other. Candidate domains are then
identified by discovering regions in each protein sequence where most block pairs receive strong votes of confidence. We applied
our method to several test data sets with a fixed choice of parameters. To evaluate the results, we computed sensitivity and
specificity measures using SMART-derived domain annotations as a reference.
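
To make the voting idea concrete, the following Python sketch illustrates one possible reading of it. It is not the implementation evaluated here, and it makes several simplifying assumptions: block similarity is reduced to exact string identity, a region is reported wherever at least one block pair is strongly supported (rather than where most pairs are), and the function names and the parameters k, max_dist, tol and min_votes are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict


def block_pairs(seq, k, max_dist):
    """Yield (i, j, block_i, block_j) for block starts i < j with j - i <= max_dist."""
    n = len(seq) - k + 1  # number of overlapping length-k blocks
    for i in range(n):
        for j in range(i + 1, min(i + max_dist, n - 1) + 1):
            yield i, j, seq[i:i + k], seq[j:j + k]


def candidate_regions(seqs, k=3, max_dist=8, tol=1, min_votes=2):
    """Return, for each sequence, a list of half-open (start, end) candidate regions.

    Assumption: two blocks are "similar" only if they are identical strings.
    """
    # Index every block pair by (first block, second block, distance),
    # remembering which sequences contain it.
    index = defaultdict(set)
    for s, seq in enumerate(seqs):
        for i, j, a, b in block_pairs(seq, k, max_dist):
            index[(a, b, j - i)].add(s)

    regions = []
    for s, seq in enumerate(seqs):
        covered = [False] * len(seq)
        for i, j, a, b in block_pairs(seq, k, max_dist):
            d = j - i
            # Collect other sequences holding the same pair at roughly the same distance.
            voters = set()
            for dd in range(d - tol, d + tol + 1):
                voters |= index.get((a, b, dd), set())
            voters.discard(s)  # a sequence does not vote for itself
            if len(voters) >= min_votes:
                # Mark every residue spanned by the supported pair.
                for p in range(i, j + k):
                    covered[p] = True

        # Merge marked residues into maximal contiguous runs.
        runs, start = [], None
        for p, flag in enumerate(covered):
            if flag and start is None:
                start = p
            elif not flag and start is not None:
                runs.append((start, p))
                start = None
        if start is not None:
            runs.append((start, len(seq)))
        regions.append(runs)
    return regions
```

For three toy sequences that share the segment MKVLQT at different offsets and have unrelated flanking residues, this sketch with its default parameters reports exactly that segment in each sequence. Because pairs are indexed by block content and distance rather than by absolute position, the shared segment is found regardless of where it occurs, which is the property that distinguishes the approach from alignment-based methods.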