Semantic-aware unsolicited mail filtering with reduction of labelling efforts
- Laorden Gómez, Carlos
- Gonzalo Álvarez Marañón, Co-director
- Pablo García Bringas, Co-director
University of defense: Universidad de Deusto
Date of defense: 5 July 2012
- Coral Calero Muñoz, Chair
- Rebeca Cortázar, Secretary
- Emilio Santiago Corchado Rodríguez, Committee member
- Leocadio González Casado, Committee member
- Giuseppe Psaila, Committee member
Type: Thesis
Abstract
Electronic mail is a powerful communication channel. Nevertheless, as happens with all useful media, it is prone to misuse. Spam has become a significant problem for e-mail users over the past decade; an enormous amount of spam arrives in people's mailboxes every day. Spam is also a major computer security problem: it is a medium for phishing (i.e., attacks that seek to acquire sensitive information from end-users) and for spreading malicious software (e.g., computer viruses, Trojan horses, spyware and Internet worms). The research community has made a great effort to solve this problem, with good results in text categorization. Thus, spam filtering systems have adopted different machine-learning techniques that evaluate the content of e-mails satisfactorily. These techniques model e-mails using the Vector Space Model (VSM), an algebraic approach to Information Filtering, Information Retrieval, indexing and ranking. This model represents natural-language documents mathematically as vectors in a multidimensional space, with good results. Still, the VSM assumes that all terms are independent, which, at least from a linguistic point of view, is not entirely correct. Therefore, it cannot capture the linguistic phenomena found in natural languages. In a similar vein, the VSM is also affected by other characteristics of the text, such as word sense ambiguity. Indeed, today's attacks against Bayesian spam filters attempt to keep the content of spam e-mails visible to humans but obscured to filters. This can lead to legitimate e-mails being misclassified and to spammers evading filtering. Furthermore, junk e-mails evolve at an incredible pace to adapt to the most effective classifiers and get past the filters, which limits how long spam collections and classifiers remain valid.
That is why obtaining properly labelled datasets for the training phase required by the machine-learning methods employed by anti-spam filters becomes a very complex task. In light of this background, we propose to study the application of new techniques capable of overcoming the semantic limitations of current spam filtering systems. To this end, we propose (i) the application of a new representation based on the VSM, the enhanced Topic-based Vector Space Model (eTVSM), and (ii) a disambiguation pre-process that enhances the filtering capabilities of anti-spam systems. Moreover, we propose to reduce the labelling effort necessary for the proper performance of machine-learning methods by applying (i) collective classification, which looks for connections between the different documents to optimise the classification, and (ii) anomaly detection, which generates a model based solely on one class and detects deviations from this model.
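
As a rough illustration of two ideas mentioned in the abstract, the sketch below builds plain VSM term-frequency vectors and applies a very simple one-class anomaly detector: the model is the summed vector of legitimate messages only, and a message is flagged when its cosine similarity to that model falls below a cut-off. This is not the method used in the thesis; the function names, the toy messages and the threshold value are all assumptions made for this example. The last line shows the term-independence limitation: two synonymous phrases share no dimension, so their similarity is zero.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a message as a bag-of-words term-frequency vector (plain VSM)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_model(vectors):
    """Sum the vectors of the single known class (scale does not affect cosine)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def is_anomalous(text, model, threshold):
    """Flag a message whose similarity to the one-class model is below the threshold."""
    return cosine(tf_vector(text), model) < threshold


# Toy data, invented for illustration only.
legitimate = [
    "meeting agenda for the project review",
    "please review the project report before the meeting",
    "agenda and report for the next project meeting",
]
model = build_model(tf_vector(m) for m in legitimate)
THRESHOLD = 0.3  # hypothetical cut-off, not a value from the thesis

print(is_anomalous("project meeting report", model, THRESHOLD))  # close to the model
print(is_anomalous("cheap pills buy now", model, THRESHOLD))     # no shared terms
# Term independence: synonyms occupy unrelated dimensions.
print(cosine(tf_vector("cheap watches"), tf_vector("inexpensive timepieces")))
```

Only legitimate messages are needed to build the model, which is the sense in which a one-class approach can reduce labelling effort; a practical system would need a far richer representation and a tuned threshold.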