Genre classification means discriminating between documents by means of their form, their style, or their targeted audience.
Put another way, genre classification is orthogonal to a classification based on the documents’ contents.
While most existing investigations of automated genre classification are based on corpora of news articles, here the idea
is applied to arbitrary Web pages. We see genre classification as a powerful instrument to bring Web-based search services
closer to a user’s information need. This objective raises two questions:
1. What are useful genres when searching the WWW?
2. Can these genres be reliably identified?
This paper presents results from a user study on the usefulness of Web genres as well as results from the construction of a
genre classifier using discriminant analysis, neural network learning, and support vector machines. Particular attention is
paid to the classifier’s underlying feature set: aside from the standard feature types, we introduce new features that are
based on word frequency classes and that can be computed with minimal computational effort. They allow us to construct compact
feature sets with few elements, with which a satisfactory genre differentiation is achieved. About 70% of the Web documents
are assigned to their true genre; note in this connection that no genre classification benchmark for Web pages has been published
so far.
Keywords: Genre Classification - Machine Learning - User Study - Information Need - Information Retrieval - WWW
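
To illustrate why features based on word frequency classes are cheap to compute, the following is a minimal sketch and not the paper's implementation. It assumes the common definition in which the class of a word w is floor(log2(f(w*) / f(w))), where f is the frequency in a reference corpus and w* is the most frequent word. The abstract does not state this formula. The function names and the choice of the mean class as a document feature are hypothetical.

```python
import math
from collections import Counter


def word_frequency_classes(corpus_tokens):
    """Map each word of a reference corpus to its frequency class.

    Assumed definition (not given in the abstract):
    class(w) = floor(log2(f(w*) / f(w))), with w* the most frequent word.
    Very common words get class 0; rarer words get higher classes.
    """
    counts = Counter(corpus_tokens)
    max_count = max(counts.values())
    return {
        word: int(math.floor(math.log2(max_count / count)))
        for word, count in counts.items()
    }


def mean_frequency_class(doc_tokens, classes):
    """Hypothetical document feature: average class of the known tokens.

    Tokens missing from the reference corpus are skipped; a document
    without any known token yields 0.0.
    """
    known = [classes[token] for token in doc_tokens if token in classes]
    return sum(known) / len(known) if known else 0.0


# Toy reference corpus: "the" occurs 8 times, "web" twice, "genre" once.
reference = ["the"] * 8 + ["web"] * 2 + ["genre"]
classes = word_frequency_classes(reference)
feature = mean_frequency_class(["the", "web", "unseen"], classes)
```

Once the class table has been built from a reference corpus, the feature for a new document needs only one dictionary lookup per token, which fits the low computational effort claimed for these features.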