Classifying and mining noise-free web pages will improve the accuracy of search results as well as search speed, and may benefit
web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications).
Noise on a web page is content that is irrelevant to the main content of the page being mined, and includes advertisements,
navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks with exactly matching
contents but are weak at detecting near-duplicate blocks, of which navigation bars are a typical example.
This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages in order to improve the accuracy
and efficiency of web content mining. A vision-based technique is employed to extract blocks from web pages. Relevant
web page blocks are then identified as those with a high importance level, determined by analyzing physical features of the
blocks such as block location, the percentage of web links in the block, and the level of similarity of the block's contents to
other blocks. Important blocks are exported for use in web content mining with Naive Bayes text classification. Experiments show
that WebPageCleaner leads to more accurate and efficient web page classification than comparable existing approaches.
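
As a rough illustration of the block-importance idea described above, the following Python sketch scores a block from three features: its location, its share of link text, and its similarity to other blocks. It is not the WebPageCleaner implementation. The `Block` fields, the position labels, the weights, the threshold, the use of word-set Jaccard similarity, and the reading of "other blocks" as blocks taken from other pages are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str        # visible text of the block
    link_chars: int  # number of characters that sit inside hyperlinks
    position: str    # hypothetical label, e.g. "center", "top", "left"


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def link_ratio(block: Block) -> float:
    """Fraction of the block's text that is link text (1.0 if the block is empty)."""
    if not block.text:
        return 1.0
    return min(1.0, block.link_chars / len(block.text))


def importance(block: Block, other_blocks, weights=(0.4, 0.3, 0.3)) -> float:
    """Combine the three features into a score in [0, 1] (assumed weighting)."""
    w_loc, w_link, w_sim = weights
    # Assumption: centrally placed blocks are more likely to be main content.
    location = 1.0 if block.position == "center" else 0.0
    # Assumption: link-heavy blocks (e.g. navigation bars) are less important.
    non_link = 1.0 - link_ratio(block)
    # Assumption: a block that nearly duplicates a block seen elsewhere is noise.
    max_sim = max((jaccard(block.text, o.text) for o in other_blocks), default=0.0)
    novelty = 1.0 - max_sim
    return w_loc * location + w_link * non_link + w_sim * novelty


def clean(blocks, other_blocks, threshold=0.5):
    """Keep only the blocks whose importance reaches the (assumed) threshold."""
    return [b for b in blocks if importance(b, other_blocks) >= threshold]
```

In this sketch a navigation bar that shares most of its words with a navigation bar on another page scores low on all three features and is dropped, even though the two bars do not match exactly. The surviving blocks would then be passed on to the text classifier.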
Keywords Web Page Cleaning - Noise Block - Web Content Mining - Classification - Near-Duplicate - Text Similarity
This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under an Operating grant
(OGP-0194134) and a University of Windsor grant.