Efficiently Detecting Webpage Updates Using Samples
| Book Series | Lecture Notes in Computer Science |
| Publisher | Springer Berlin / Heidelberg |
| ISSN | 0302-9743 (Print) 1611-3349 (Online) |
| Volume | Volume 4607/2007 |
| Book | Web Engineering |
| DOI | 10.1007/978-3-540-73597-7 |
| Copyright | 2007 |
| ISBN | 978-3-540-73596-0 |
| DOI | 10.1007/978-3-540-73597-7_23 |
| Pages | 285-300 |
| Subject Collection | Computer Science |
| SpringerLink Date | Monday, August 13, 2007 |
Qingzhao Tan (1), Ziming Zhuang (2), Prasenjit Mitra (1, 2) and C. Lee Giles (1, 2)
(1) Computer Science and Engineering,
(2) Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802, USA
Abstract
Due to resource constraints, Web archiving systems and search engines usually have difficulties keeping the local repository
completely synchronized with the Web. To address this problem, sampling-based techniques periodically poll a subset of webpages
in the local repository to detect changes on the Web, and update the local copies accordingly. The goal of such an approach
is to discover as many changed webpages as possible within the limits of the available resources. In this paper we advance
the state of the art of sampling-based techniques by answering a challenging question: Given a sampled webpage that has been updated, which other webpages are also likely to have changed? We propose a set of sampling policies with various downloading granularities, taking into account the link structure, the
directory structure, and the content-based features. We also investigate the update history and the popularity of the webpages
to adaptively model the download probability. We ran extensive experiments on a real web data set of about 300,000 distinct
URLs distributed among 210 websites. The results showed that our sampling-based algorithm can detect about three times as
many changed webpages as the baseline algorithm. They also showed that changed webpages are most likely to be found in the
same directory as the changed sample and in its upper directories. By applying a clustering algorithm to all the webpages, pages
with similar change patterns are grouped together so that updated webpages can be found in the same cluster as the changed
sample. Moreover, our adaptive downloading strategies significantly outperform the static ones in detecting changes for the
popular webpages.
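
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the directory-based idea it describes: poll one sample per directory and, if the sample has changed, spend the remaining download budget on the other pages in that directory. The names directory_of, sample_and_download, has_changed and budget are hypothetical and are not taken from the paper; the sketch also leaves out upper directories, link structure, content-based clustering, and the adaptive download probabilities.

```python
import posixpath
import random
from collections import defaultdict
from urllib.parse import urlsplit


def directory_of(url):
    """Return host plus directory path, e.g. 'example.org/news'."""
    parts = urlsplit(url)
    return parts.netloc + posixpath.dirname(parts.path)


def sample_and_download(urls, has_changed, budget, rng=None):
    """Poll one sample per directory; check the rest only if the sample changed.

    has_changed(url) is a hypothetical callback that downloads a page and
    compares it with the local copy. Each call costs one unit of budget.
    Returns the list of URLs found to have changed.
    """
    rng = rng or random.Random(0)

    # Group the local repository by directory.
    groups = defaultdict(list)
    for url in urls:
        groups[directory_of(url)].append(url)

    changed = []
    for members in groups.values():
        if budget <= 0:
            break
        # Poll a single sampled page from this directory.
        sample = rng.choice(members)
        budget -= 1
        if not has_changed(sample):
            continue  # sample unchanged: skip the rest of the directory
        changed.append(sample)
        # Sample changed: check its directory siblings while budget remains.
        for url in members:
            if url == sample:
                continue
            if budget <= 0:
                break
            budget -= 1
            if has_changed(url):
                changed.append(url)
    return changed
```

A static policy like this treats every directory alike. An adaptive variant in the spirit of the abstract would additionally weight the choice of directories and pages by update history and popularity.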