Abstract

Automatic segmentation of text strings, in particular entity names, into structured records is often needed for efficient information retrieval, analysis, mining, and integration. The Hidden Markov Model (HMM) has been shown to be the state of the art for this task. However, previous work did not take into account the synonymy of words and their abbreviations, or the possibility of misspellings. In this paper, we propose a fuzzy synset-based HMM for text segmentation, based on a semantic relation and an edit distance between words. The model is also designed to handle texts written in a language like Vietnamese, where a meaningful word can be composed of more than one syllable. Experiments on Vietnamese company names are presented to demonstrate the performance of the model.
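
As a rough illustration of how an edit distance can make synset matching tolerant of misspellings, the following Python sketch computes a graded (fuzzy) degree to which a word belongs to a set of synonyms and abbreviations. It is not the model from the paper: the function names `edit_distance`, `similarity`, and `synset_membership`, the length-normalized similarity, and the English example synset are all assumptions made for illustration only.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def synset_membership(word: str, synset: Iterable[str]) -> float:
    """Hypothetical fuzzy membership: best similarity to any synset member."""
    return max((similarity(word.lower(), m.lower()) for m in synset), default=0.0)


# Hypothetical synset of a word and its abbreviation.
company_synset = ["company", "co"]
exact = synset_membership("Company", company_synset)     # exact match, ignoring case
misspelt = synset_membership("compnay", company_synset)  # two letters swapped
```

In a sketch like this, a misspelt word still receives a high membership degree in the intended synset, so it could contribute to the same HMM observation as the correctly spelt word, weighted by that degree.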
