Contribution au classement statistique mutualisé de messages électroniques (spam) [Contribution to the shared statistical classification of electronic messages (spam)]

Abstract : Since the 1990s, various machine learning methods have been investigated and applied to the email classification problem (spam filtering), with very good but not perfect results. These methods have long been considered well suited to filtering the messages of a single user, but not those of a large set of users, such as a community. Our approach was first to seek a better understanding of the data being handled, using a corpus of real messages, before studying new algorithms. Using a logistic regression classifier with online active learning, we show empirically that a simple classification algorithm, coupled with a learning strategy well adapted to the real context, can achieve results as good as those obtained with more complex algorithms. We also show empirically, using messages from a small group of users, that the loss of efficiency is small when the classifier is shared by a group of users.
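
To make the idea of "logistic regression with online active learning" concrete, here is a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the features, the learning rate, or the rule for deciding when to request a label, so the bag-of-words tokens, the `learning_rate` and `margin` values, and the `ask_label` callback are all assumptions chosen for illustration.

```python
import math


class OnlineLogisticFilter:
    """Illustrative online logistic regression over word-presence features."""

    def __init__(self, learning_rate=0.1, margin=0.3):
        self.weights = {}                   # token -> weight (assumed sparse model)
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value, not from the thesis
        self.margin = margin                # query a label when |p - 0.5| < margin

    @staticmethod
    def features(text):
        # Assumed feature extraction: the set of lowercase whitespace tokens.
        return set(text.lower().split())

    def predict_proba(self, text):
        """Return the estimated probability that the message is spam."""
        z = self.bias + sum(self.weights.get(t, 0.0) for t in self.features(text))
        z = max(-30.0, min(30.0, z))        # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, text, is_spam):
        """One stochastic gradient step on the logistic loss."""
        error = (1.0 if is_spam else 0.0) - self.predict_proba(text)
        self.bias += self.learning_rate * error
        for t in self.features(text):
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error

    def process(self, text, ask_label):
        """Score a message; ask for its label only when the score is uncertain."""
        p = self.predict_proba(text)
        queried = abs(p - 0.5) < self.margin
        if queried:
            self.update(text, ask_label(text))
        return p, queried
```

In this sketch the classifier learns one message at a time and asks a user (through the hypothetical `ask_label` callback) only about messages whose score falls close to 0.5, which is one common form of uncertainty-based active learning. A shared filter would simply route the messages of several users through the same instance.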
Document type :
Theses

https://pastel.archives-ouvertes.fr/pastel-00637173
Contributor : Bibliothèque Mines Paristech
Submitted on : Monday, October 31, 2011 - 9:36:24 AM
Last modification on : Monday, November 12, 2018 - 10:53:18 AM
Long-term archiving on: Wednesday, February 1, 2012 - 2:20:46 AM

Identifiers

  • HAL Id : pastel-00637173, version 1

Citation

José Márcio Martins da Cruz. Contribution au classement statistique mutualisé de messages électroniques (spam). Autre [cs.OH]. École Nationale Supérieure des Mines de Paris, 2011. Français. ⟨NNT : 2011ENMP0027⟩. ⟨pastel-00637173⟩
