Lecture Notes in Computer Science, 2004, Volume 3238/2004, 256-269, DOI: 10.1007/978-3-540-30221-6_20

Genre Classification of Web Pages
User Study and Feasibility Analysis

Sven Meyer zu Eissen and Benno Stein

Abstract

Genre classification means discriminating between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents.
While most existing investigations of automated genre classification are based on corpora of news articles, the idea is applied here to arbitrary Web pages. We see genre classification as a powerful instrument to bring Web-based search services closer to a user’s information need. This objective raises two questions:
What are useful genres when searching the WWW?
Can these genres be reliably identified?
This paper presents results from a user study on Web genre usefulness as well as results from the construction of a genre classifier using discriminant analysis, neural network learning, and support vector machines. Particular attention is paid to the classifier’s underlying feature set: aside from the standard feature types, we introduce new features that are based on word frequency classes and that can be computed with minimal computational effort. They allow us to construct compact feature sets with few elements, with which a satisfactory genre discrimination is achieved. About 70% of the Web documents are assigned to their true genre; note in this connection that no genre classification benchmark for Web pages has been published so far.
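The abstract does not spell out how the word frequency class features are computed. As a rough illustration only, the sketch below assumes the commonly used definition of a word's frequency class, floor(log2(f(w*) / f(w))), where f(w) is the word's count in a reference corpus and w* is the most frequent word of that corpus. The function names, the reference counts, and the three summary features (mean class, maximum class, share of unknown words) are hypothetical choices for this example and are not taken from the paper.

```python
import math


def word_frequency_class(word, ref_counts):
    """Frequency class of `word` under the assumed definition
    floor(log2(f(w*) / f(w))); returns None for words missing
    from the reference counts."""
    count = ref_counts.get(word, 0)
    if count == 0:
        return None
    top = max(ref_counts.values())  # count of the most frequent word w*
    return math.floor(math.log2(top / count))


def frequency_class_features(tokens, ref_counts):
    """Summarize a tokenized page by a few frequency-class statistics.
    These three features are illustrative, not the paper's feature set."""
    classes = [word_frequency_class(t, ref_counts) for t in tokens]
    known = [c for c in classes if c is not None]
    unknown_ratio = (len(classes) - len(known)) / len(classes) if classes else 0.0
    if not known:
        return {"mean_class": 0.0, "max_class": 0, "unknown_ratio": unknown_ratio}
    return {
        "mean_class": sum(known) / len(known),
        "max_class": max(known),
        "unknown_ratio": unknown_ratio,
    }
```

Each token costs one dictionary lookup and one logarithm, which is consistent with the claim that such features are cheap to compute. Common words fall into low classes and rare, specialized words into high ones, so the distribution of classes on a page reflects its vocabulary style rather than its topic.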

Keywords: Genre Classification - Machine Learning - User Study - Information Need - Information Retrieval - WWW
