Welcome!
To use the personalized features of this site, please log in or register.
If you have forgotten your username or password, we can help.
My Menu
Saved Items

Motif Statistics
Abstract

Pierre Nicodème5, Bruno Salvy6 and Philippe Flajolet6

(5)  DKFZ Theoretische Bioinformatik, Germany
(6)  Algorithms Project. Inria Rocquencourt, France
Abstract
We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text. This covers “motifs” widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics to compute asymptotic properties from generating functions; (iii) computer algebra to determine generating functions explicitly, analyse generating functions and extract coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients which yields exact values of the moments with typical application to random texts of size 30,000; precise asymptotic formulæ that allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7 megabytes aminoacid database Prodom. We handled more than 88% of the standard collection of Prosite motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
This work has been partially supported by the Long Term Research Project Alcom-IT (#20244) of the European Union.
An extended version of this abstract is available as INRIA Research Report 3606.

Fulltext Preview (Small, Large)
Image of the first page of the fulltext

References secured to subscribers.



Export this chapter
Export this chapter as RIS | Text
 
Referenced by
2 newer articles

  1. Louxin Zhang (2007) . IEEE/ACM Transactions on Computational Biology and Bioinformatics 4(3)
    [CrossRef]
  2. Flajolet, Philippe (2001) D⋅E⋅K=(1000)8. Random Structures and Algorithms 19(3-4)
    [CrossRef]
Remote Address: 38.107.191.106 • Server: mpweb19
HTTP User Agent: CCBot/1.0 (+http://www.commoncrawl.org/bot.html)