
Evaluation Methods for Focused Crawling

Andrea Passerini, Paolo Frasconi and Giovanni Soda

DSI, University of Florence, Italy
Abstract
The exponential growth of documents available on the World Wide Web makes it increasingly difficult to discover relevant information on a specific topic. In this context, growing interest is emerging in focused crawling, a technique that dynamically browses the Internet by choosing directions that maximize the probability of discovering relevant pages, given a specific topic. Predicting the relevance of a document before seeing its contents (i.e., relying on the parent pages only) is one of the central problems in focused crawling because it can save significant bandwidth resources. In this paper, we study three different evaluation functions for predicting the relevance of a hyperlink with respect to the target topic. We show that classification based on the anchor text is more accurate than classification based on the whole page. Moreover, we introduce a method that combines both the anchor and the whole parent document, using a Bayesian representation of the Web graph structure. The latter method obtains further accuracy improvements.
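
To make the idea of scoring a hyperlink before fetching its target more concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a two-class multinomial naive Bayes text classifier with Laplace smoothing, and it combines anchor-text and parent-page evidence by assuming the two are conditionally independent given the relevance label, a simplification that stands in for the Bayesian representation of the Web graph mentioned above. The class `NaiveBayes`, the function `combined_log_odds`, and the toy training data are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


class NaiveBayes:
    """Two-class multinomial naive Bayes with Laplace smoothing (hypothetical helper)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_prior_odds(self):
        # Smoothed log odds of "relevant" versus "irrelevant" before seeing any text.
        total = self.doc_counts[True] + self.doc_counts[False]
        p_rel = (self.doc_counts[True] + 1) / (total + 2)
        return math.log(p_rel / (1.0 - p_rel))

    def log_likelihood_ratio(self, text):
        # Sum over known tokens of log P(token | relevant) / P(token | irrelevant).
        vocab_size = len(self.vocab)
        denom_rel = sum(self.word_counts[True].values()) + vocab_size
        denom_irr = sum(self.word_counts[False].values()) + vocab_size
        llr = 0.0
        for tok in tokenize(text):
            if tok not in self.vocab:
                continue  # unseen tokens carry no evidence in this sketch
            p_rel = (self.word_counts[True][tok] + 1) / denom_rel
            p_irr = (self.word_counts[False][tok] + 1) / denom_irr
            llr += math.log(p_rel / p_irr)
        return llr


def combined_log_odds(anchor_model, page_model, anchor_text, parent_text):
    # Assumes anchor and parent page are independent given the label,
    # so their log-likelihood ratios add on top of a single prior.
    return (anchor_model.log_prior_odds()
            + anchor_model.log_likelihood_ratio(anchor_text)
            + page_model.log_likelihood_ratio(parent_text))


# Toy training data, invented for illustration only.
labels = [True, True, False, False]
anchors = ["neural network tutorial", "machine learning course",
           "cheap flights deals", "football match results"]
pages = ["research on neural network models", "lecture notes on machine learning",
         "travel deals and cheap flights", "sports news and football results"]

anchor_model = NaiveBayes().fit(anchors, labels)
page_model = NaiveBayes().fit(pages, labels)
```

In a crawler built along these lines, a link whose combined log odds is positive would be queued ahead of one whose score is negative; how the frontier is actually prioritized is left open here.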

Andrea Passerini
Email: passerini@dsi.ing.unifi.it

Paolo Frasconi
Email: paolo@dsi.ing.unifi.it

Giovanni Soda
Email: giovanni@dsi.ing.unifi.it