Evaluation Methods for Focused Crawling
Andrea Passerini (2), Paolo Frasconi (2), and Giovanni Soda (2)
(2) DSI, University of Florence, Italy
Abstract
The exponential growth of documents available on the World Wide Web makes it increasingly difficult to discover relevant information
on a specific topic. In this context, growing interest is emerging in focused crawling, a technique that dynamically browses the Internet by choosing directions that maximize the probability of discovering relevant
pages, given a specific topic. Predicting the relevance of a document before seeing its contents (i.e., relying on the parent
pages only) is one of the central problems in focused crawling because it can save significant bandwidth resources. In this
paper, we study three different evaluation functions for predicting the relevance of a hyperlink with respect to the target
topic. We show that classification based on the anchor text is more accurate than classification based on the whole page.
Moreover, we introduce a method that combines both the anchor and the whole parent document, using a Bayesian representation
of the Web graph structure. The latter method obtains further accuracy improvements.
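
To make the idea of an evaluation function concrete, the following minimal sketch scores each hyperlink from its anchor text alone and visits links in best-first order. It is an illustration under stated assumptions, not the method evaluated in the paper: the toy naive Bayes classifier, the class name AnchorTextScorer, the function crawl_order, the training phrases, and the example.org URLs are all hypothetical, and the combined anchor-plus-parent-page model mentioned above is not reproduced here.

```python
import heapq
import math
from collections import Counter


class AnchorTextScorer:
    """Toy multinomial naive Bayes over anchor-text tokens (illustrative only)."""

    def __init__(self, relevant_anchors, irrelevant_anchors):
        # Token counts per class, taken from hypothetical labeled anchor texts.
        self.counts = {
            True: Counter(w for a in relevant_anchors for w in a.lower().split()),
            False: Counter(w for a in irrelevant_anchors for w in a.lower().split()),
        }
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        n_rel, n_irr = len(relevant_anchors), len(irrelevant_anchors)
        self.log_prior = {
            True: math.log(n_rel / (n_rel + n_irr)),
            False: math.log(n_irr / (n_rel + n_irr)),
        }

    def _log_likelihood(self, tokens, label):
        counts = self.counts[label]
        total = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
        # Tokens never seen in training are ignored for both classes.
        return sum(
            math.log((counts[t] + 1) / total) for t in tokens if t in self.vocab
        )

    def relevance(self, anchor_text):
        """Estimated probability that the link target is on topic."""
        tokens = anchor_text.lower().split()
        log_rel = self.log_prior[True] + self._log_likelihood(tokens, True)
        log_irr = self.log_prior[False] + self._log_likelihood(tokens, False)
        # Normalize in log space to avoid underflow.
        m = max(log_rel, log_irr)
        p_rel = math.exp(log_rel - m)
        p_irr = math.exp(log_irr - m)
        return p_rel / (p_rel + p_irr)


def crawl_order(links, scorer):
    """Return URLs in best-first order of predicted relevance.

    `links` is a list of (url, anchor_text) pairs found on parent pages.
    """
    frontier = []
    for i, (url, anchor) in enumerate(links):
        # heapq is a min-heap, so the score is negated; `i` breaks ties.
        heapq.heappush(frontier, (-scorer.relevance(anchor), i, url))
    order = []
    while frontier:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
    return order


# Hypothetical training data and links, for illustration only.
scorer = AnchorTextScorer(
    relevant_anchors=["machine learning course", "neural networks tutorial"],
    irrelevant_anchors=["cheap flights", "contact us"],
)
links = [
    ("http://example.org/a", "cheap flights to rome"),
    ("http://example.org/b", "machine learning tutorial"),
]
print(crawl_order(links, scorer))
```

In this sketch the link with the on-topic anchor text is dequeued first, so the crawler would spend bandwidth on it before the off-topic one. A real focused crawler would also fetch each page, extract new links, and push them onto the same frontier.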