
Selected Papers from the 11th Conference of the Spanish Association for Artificial Intelligence (CAEPIA 2005)

Tokenising, Stemming and Stopword Removal on Anti-spam Filtering Domain

J. R. Méndez, E. L. Iglesias, F. Fdez-Riverola, F. Díaz and J. M. Corchado

(1)  Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
(2)  Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Plaza Santa Eulalia, 9-11, 40005, Segovia, Spain
(3)  Dept. Informática y Automática, University of Salamanca, Plaza de la Merced s/n, 37008, Salamanca, Spain
Abstract
Junk e-mail detection and filtering can be considered a cost-sensitive classification problem. However, the preprocessing methods and noise reduction strategies used to improve computational efficiency in text classification may not be as effective in e-mail filtering. This is demonstrated here through a comparative study of stopword removal, stemming and different tokenising schemes. The final goal is to preprocess the training e-mail corpora of several content-based spam filtering techniques (machine learning approaches and case-based systems). Sound conclusions are drawn from experiments carried out under different scenarios.
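
To make the three preprocessing steps named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. The stopword list, the suffix list, the two tokenising schemes and every function name are assumptions chosen for illustration; in particular, the stemmer is a naive suffix stripper, not the Porter algorithm or any stemmer used in the paper.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "for", "your", "you"}

# Hypothetical suffixes for a naive stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenise(text, keep_punctuation=False):
    """Split text into lowercase tokens under one of two illustrative schemes."""
    if keep_punctuation:
        # Scheme A: split on whitespace only, so "guaranteed!!!" stays one token.
        return text.lower().split()
    # Scheme B: keep only alphanumeric runs, discarding punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, keep_punctuation=False, use_stopwords=True, use_stemming=True):
    """Run the selected preprocessing steps and return the resulting tokens."""
    tokens = tokenise(text, keep_punctuation)
    if use_stopwords:
        tokens = remove_stopwords(tokens)
    if use_stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


message = "Claim your FREE prizes now, winning is guaranteed!!!"
full = preprocess(message)
raw = preprocess(message, keep_punctuation=True,
                 use_stopwords=False, use_stemming=False)
```

Comparing `full` with `raw` shows the trade-off the abstract points at: aggressive preprocessing shrinks the vocabulary, but it also discards cues such as repeated exclamation marks that may help separate spam from legitimate mail.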

Contact Information J. R. Méndez
Email: moncho.mendez@uvigo.es

Contact Information E. L. Iglesias
Email: eva@uvigo.es

Contact Information F. Fdez-Riverola
Email: riverola@uvigo.es

Contact Information F. Díaz
Email: fdiaz@infor.uva.es

Contact Information J. M. Corchado
Email: corchado@usal.es