Measures relating word frequencies and expectations have been of constant interest in bioinformatics studies. With sequence
data becoming massively available, exhaustive enumeration of such measures has become conceivable, yet it poses a significant
computational burden even when limited to words of bounded maximum length. In addition, the display of the huge tables possibly
resulting from these counts poses practical problems of visualization and inference.
Verbumculus is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences.
The inner core of Verbumculus rests on subtly interwoven properties of statistics, pattern matching and combinatorics on words that enable one to limit
drastically and a priori the set of over- or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible
both to detect and visualize such words in a fast and practically useful way. This paper first describes
the facility and then reports experimental results, ranging from simulations on synthetic data to the discovery
of regulatory elements in the upstream regions of a set of yeast genes.
The software Verbumculus is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/ or http://wwwdbl.dei.unipd.it/Verbumculus/
Keywords Verbumculus - unusual words - subword statistics - pattern discovery - regulatory elements - suffix trees
Supported in part by the NSF of U.S.A. (Grant No. CCR-9700276), Purdue Research Foundation (Grant No. 690-1398-3145), the
Italian Ministry of University and Research, and the Research Program of the University of Padova.
Supported by Purdue Research Foundation (Grant No. 690-1398-3145), the Italian Ministry of University and Research, the Research
Program of the University of Padova and Bourns College of Engineering, University of California, Riverside.
Alberto Apostolico (Dr. Eng., 1973, Univ. Naples) is a professor of computer engineering at Univ. Padova and professor of computer sciences
at Purdue University. He was a Fulbright scholar in 1974–75 at CMU, has held visiting and permanent positions in the U.S. (UIUC,
Rensselaer, Purdue, IBM) and Europe (U. of Salerno, U. of L'Aquila, LASI, U. of Paris, U. of London, King's Zif-Bielefeld,
Renyi-Hungarian Acad of Science), and has been a full prof. in Italy since 1987, at DEI since 1992. His research interests are algorithmic
analysis and design, with emphasis on pattern matching, on which subject he has authored more than 100 papers, and co-authored/edited
7 volumes. He serves on the Editorial Boards of Theor. Comp. Sci., Par. Proc. Let., J. of Comp. Biol., Chaos Th. and Appl.,
Springer Lecture Notes in Bioinformatics, and Algorithmica (g.e.). He has been a keynote speaker at over 60 and a PC member for over 50 international conferences.
He has been a reviewer for NSF, Canadian SERC, NATO, HSFP, Finland Acad Sci, Hong Kong and Israel Science Councils. He is
a current or past member of ACM, AICA, EATCS, IEEE. He has been (co-)recipient of U.S. (NSF, AFOSR, NIH), French, British,
Italian (CNR, MURST, MIUR) and international (Fulbright, NATO, ESPRIT) grants, and of an IBM Faculty Award in 2002.
Fang-Cheng Gong (Ph.D., 1995, Dept. Plant Sciences, Univ. Arizona) has held positions of graduate research associate at the Dept. Plant Sciences,
Univ. Arizona, Postdoc Fellow at the Dept. Plant Pathology of Univ. Arizona, and Postdoc Fellow at the Dept. Biological Sciences,
Purdue University, before joining the research staff of Celera Genomics. His research interests include plant molecular biology,
microbial molecular genetics, and plant cell biology.
Stefano Lonardi (Ph.D., 2001, Purdue University) is an assistant professor of computer science & engineering at the Univ. California, Riverside.
His research is currently focused on bioinformatics, data compression, information hiding, and data mining. He received his
“Laurea” degree from Univ. Pisa in 1994 and his Ph.D. degree in computer science from Purdue University. He also holds a
Research Doctorate from the Univ. Padua (1999). He is a member of ACM, IEEE, the Upsilon Pi Epsilon and Phi Kappa Phi honor societies,
and the International Society for Computational Biology.