Motif Statistics
Abstract
Pierre Nicodème5, Bruno Salvy6 and Philippe Flajolet6
| (5) |
DKFZ Theoretische Bioinformatik, Germany |
| (6) |
Algorithms Project. Inria Rocquencourt, France |
Abstract
We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text.
This covers “motifs” widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics to compute asymptotic properties from generating functions; (iii) computer algebra to determine generating functions explicitly, analyse generating functions and extract coefficients efficiently.
We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces:
multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients which yields
exact values of the moments with typical application to random texts of size 30,000; precise asymptotic formulæ that allow
predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences
of motifs against the 7 megabytes aminoacid database Prodom. We handled more than 88% of the standard collection of Prosite
motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently
than theoretically predicted.
This work has been partially supported by the Long Term Research Project Alcom-IT (#20244) of the European Union.
An extended version of this abstract is available as INRIA Research Report 3606.
References secured to subscribers.