This paper reports a system that hierarchically classifies Chinese Web documents without dictionary support or a word-segmentation
procedure. In our classifier, Web documents are represented by N-grams (N≤4), which are easy to extract. A boosting machine
learning approach is applied to classify Chinese Web documents that share a topic hierarchy. The open, modular system
architecture makes our classifier extensible. Experimental results show that our system classifies Chinese Web documents
effectively and efficiently.
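To illustrate the dictionary-free representation, the following is a minimal sketch of N-gram extraction. It assumes that N-grams are overlapping character sequences of length 1 to 4 and that whitespace is discarded. The function name `char_ngrams` and the sample string are hypothetical and are not taken from the system itself.

```python
from collections import Counter


def char_ngrams(text, max_n=4):
    """Count every overlapping character N-gram of length 1..max_n."""
    # Assumption: whitespace carries no meaning here, so it is dropped
    # before N-grams are formed.
    chars = [c for c in text if not c.isspace()]
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the character sequence.
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    return counts


# Hypothetical sample text; no dictionary or word segmenter is involved.
features = char_ngrams("中文网页分类")
```

The resulting counts could serve as the feature vector passed to a learner. How the actual system weights or selects N-grams is not specified here.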
This work is supported by the 973 High-Tech Projects Foundation of China and partially supported by a grant (No. 69933010)
from NSFC.