We present a novel model for automated composite text digests, the Pyramidal Digest. The model integrates traditional text summarization
and text classification: the digest not only serves as a “summary” but can also classify text segments of any
given size and answer queries relative to a context.
“Pyramidal” refers to the fact that the digest is created in at least three dimensions: scope, granularity, and scale. The
Pyramidal Digest is defined recursively as a structure of extracted and abstracted features that are obtained gradually,
from specific to general and from large to small text segment size, through a combination of shallow parsing and machine
learning algorithms. Three notable threads of learning take place: learning of characteristic relations, rhetorical
relations, and lexical relations.
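
The recursive definition can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the model's actual method: the names `DigestNode`, `top_terms`, `SPLITTERS`, and `build_digest` are hypothetical, segments are found with plain regular expressions, and a simple term-frequency count stands in for the features that shallow parsing and machine learning would extract.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class DigestNode:
    """One level of the digest: a text segment, its features, its sub-segments."""
    text: str
    features: List[str]
    children: List["DigestNode"] = field(default_factory=list)


def top_terms(text: str, k: int = 3) -> List[str]:
    """Stand-in feature extractor (assumption): the k most frequent longer words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical segment boundaries, from large to small: paragraphs, then sentences.
SPLITTERS = [r"\n\s*\n", r"(?<=[.!?])\s+"]


def build_digest(text: str, level: int = 0, k: int = 3) -> DigestNode:
    """Recursively digest a segment, then each of its smaller sub-segments."""
    node = DigestNode(text=text, features=top_terms(text, k))
    for lvl in range(level, len(SPLITTERS)):
        parts = [p.strip() for p in re.split(SPLITTERS[lvl], text) if p.strip()]
        if len(parts) > 1:
            node.children = [build_digest(p, lvl + 1, k) for p in parts]
            break  # stop at the first granularity that actually divides the text
    return node
```

Calling `build_digest` on a two-paragraph text returns a root node with one child per paragraph, each of which has one child per sentence. Every node carries features for its own segment, which is what would let a digest of this shape describe a segment of any size.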
Our model provides a principle for efficiently digesting large quantities of text: progressive learning digests text by
abstracting its significant features. This approach scales, with complexity bounded by O(n log n), where n is the size of the text. The model offers a standard and systematic way of collecting as many of the semantic features
reachable by shallow parsing as possible, and it enables readers to query beyond keyword matches.