Many Internet information-management applications (e.g., information integration systems) require a library of wrappers, specialized
information extraction procedures that translate a source's native format into a structured representation suitable for further
application-specific processing. Maintaining wrappers is tedious and error-prone, because the formatting regularities on which
wrappers rely change frequently on the decentralized and dynamic Internet. The wrapper verification problem is to determine
whether a wrapper is operating correctly. Standard regression testing approaches are inappropriate, because both the formatting
regularities on which wrappers rely and the source's underlying content may change. We introduce RAPTURE, a fully implemented,
domain-independent wrapper verification algorithm. RAPTURE computes a probabilistic similarity measure between a wrapper's
expected and observed output, where similarity is defined in terms of simple numeric features (e.g., the length or the fraction
of punctuation characters) of the extracted strings. Experiments with numerous actual Internet sources demonstrate that RAPTURE
performs substantially better than standard regression testing.
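
The idea of comparing expected and observed output through simple numeric features can be illustrated with a minimal sketch. This is not the RAPTURE algorithm itself: the function names (`features`, `fit`, `similarity`), the normal model of each feature, and the independence assumption used to combine features are all illustrative assumptions, and only the two features named above (length and fraction of punctuation characters) are used.

```python
import math
import string
from statistics import mean, stdev


def features(s):
    """Simple numeric features of one extracted string."""
    punct = sum(ch in string.punctuation for ch in s)
    return {
        "length": float(len(s)),
        "punct_fraction": punct / len(s) if s else 0.0,
    }


def fit(examples):
    """Estimate each feature's mean and standard deviation from output
    that is known to be correct (needs at least two examples)."""
    rows = [features(s) for s in examples]
    model = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        # A small floor avoids division by zero for constant features.
        model[name] = (mean(values), max(stdev(values), 1e-6))
    return model


def similarity(model, observed):
    """Score in [0, 1]: how plausible the observed strings are under the
    model. Assumption: features are normally distributed and independent,
    so the two-sided tail probabilities are simply multiplied."""
    rows = [features(s) for s in observed]
    score = 1.0
    for name, (mu, sigma) in model.items():
        obs_mean = mean(r[name] for r in rows)
        z = abs(obs_mean - mu) / sigma
        score *= math.erfc(z / math.sqrt(2))
    return score


# Hypothetical data: company names extracted while the wrapper worked.
expected = ["Acme Corp.", "Widget Inc.", "Foo Ltd.", "Bar & Co."]
model = fit(expected)

# New content in the same format versus HTML debris after a layout change.
still_ok = ["Baz Corp.", "Qux Inc."]
broken = ["<td><a href='/x?id=1'>", "<tr><td class='n'>&nbsp;</td>"]

ok_score = similarity(model, still_ok)
broken_score = similarity(model, broken)
```

In this sketch the strings in `still_ok` differ from the training strings in content but resemble them in length and punctuation, so they score much higher than the markup fragments in `broken`. A strict regression test would flag both as changed, which is the shortcoming the abstract describes.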
This revised version was published online in August 2006 with corrections to the Cover Date.