
Topic Distillation and Spectral Filtering

Soumen Chakrabarti, Byron E. Dom, David Gibson, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins (1)

(1) IBM Research Division, Almaden Research Center, 650 Harry Rd., San Jose, CA 95120-6099, USA

Abstract  This paper discusses topic distillation, an information retrieval problem that is emerging as a critical task for the WWW. Algorithms for this problem must distill a small number of high-quality documents addressing a broad topic from a large set of candidates. We give a review of the literature, and compare the problem with related tasks such as classification, clustering, and indexing. We then describe a general approach to topic distillation with applications to searching and partitioning, based on the algebraic properties of matrices derived from particular documents within the corpus. Our method, which we call spectral filtering, combines the use of terms, hyperlinks and anchor-text to improve retrieval performance. We give results for broad-topic queries on the WWW, and also give some anecdotal results applying the same techniques to US Supreme Court law cases, US patents, and a set of Wall Street Journal newspaper articles.

hypertext - information filtering - information retrieval - resource discovery - spectral methods - world wide web - www
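
The abstract does not spell out the algorithm, so the following is only a rough illustration of what a method "based on the algebraic properties of matrices derived from documents" might look like. It is a minimal sketch of a hub/authority-style power iteration over a weighted link graph, which converges toward the principal eigenvectors of the link matrix products. The function names, the edge-weight convention (a hypothetical stand-in for term or anchor-text relevance), and the iteration count are all assumptions made for illustration and are not taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def spectral_scores(links, n, iterations=50):
    """Power iteration for hub and authority scores.

    links: list of (source, target, weight) triples; the weight is a
           hypothetical relevance score, e.g. derived from anchor text.
    n:     number of documents, indexed 0..n-1.
    Returns (hubs, authorities), each a unit-length list of n scores.
    """
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # A document's authority is the weighted sum of the hub scores
        # of the documents that link to it.
        new_auths = [0.0] * n
        for src, dst, w in links:
            new_auths[dst] += w * hubs[src]
        auths = normalize(new_auths)

        # A document's hub score is the weighted sum of the authority
        # scores of the documents it links to.
        new_hubs = [0.0] * n
        for src, dst, w in links:
            new_hubs[src] += w * auths[dst]
        hubs = normalize(new_hubs)
    return hubs, auths


if __name__ == "__main__":
    # Toy graph: documents 0 and 1 both cite 2; document 0 also cites 3.
    toy_links = [(0, 2, 1.0), (1, 2, 1.0), (0, 3, 1.0)]
    hubs, auths = spectral_scores(toy_links, n=4)
    print("hubs:", [round(h, 3) for h in hubs])
    print("authorities:", [round(a, 3) for a in auths])
```

In this toy graph, document 2 ends up with the highest authority score because it is cited by both linking documents, and document 0 ends up as the strongest hub. A distillation step in this style would then keep only the few top-scoring documents.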


Soumen Chakrabarti
Email: soumen@almaden.ibm.com

Byron E. Dom
Email: dom@almaden.ibm.com

David Gibson
Email: gibson@almaden.ibm.com

Ravi Kumar
Email: ravi@almaden.ibm.com

Prabhakar Raghavan
Email: pragh@almaden.ibm.com

Sridhar Rajagopalan
Email: sridhar@almaden.ibm.com