Volume 5, Number 2, 99-105, DOI: 10.1007/s00799-003-0050-z

Searchable words on the Web

Hugh E. Williams and Justin Zobel

From the issue entitled "Special section on The Semantic Web and Science Data Interoperation"

Abstract

In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
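As a rough illustration of vocabulary accumulation, the following minimal Python sketch tracks how many distinct words have been seen after each document. The word definition used here (a maximal run of ASCII letters, lowercased), the function name vocabulary_growth, and the sample documents are assumptions made for illustration only; the definitions of a word proposed in the paper are not given in this abstract.

```python
import re

# Hypothetical word definition: a maximal run of ASCII letters, lowercased.
# This is an assumption; the paper's own definitions are not stated here.
WORD_RE = re.compile(r"[A-Za-z]+")


def vocabulary_growth(documents):
    """Return a list of (word occurrences so far, distinct words so far),
    with one entry per document processed."""
    vocabulary = set()   # in-memory vocabulary accumulated so far
    occurrences = 0      # total word occurrences inspected
    growth = []
    for text in documents:
        for match in WORD_RE.finditer(text):
            occurrences += 1
            vocabulary.add(match.group().lower())
        growth.append((occurrences, len(vocabulary)))
    return growth


# Illustrative input only, not data from the paper.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
growth = vocabulary_growth(docs)
for seen, distinct in growth:
    print(f"{seen} occurrences, {distinct} distinct words")
```

The ratio of distinct words to occurrences at the end of the run gives the overall new-word rate. For the figures reported above, about 9.74 million distinct words in around two billion occurrences is a rate of roughly 0.005, which matches the stated 1 word in 200.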

Keywords: Web search, Terms, Word occurrences, Indexing
