Spam Filtering by Semantics-based Text Classification
Spam has been a serious and annoying problem for decades. Although many solutions have been put forward, there is still much room for improving the efficiency of spam filtering. A major problem in spam filtering, as in text classification in natural language processing generally, is the huge size of the vector space caused by the large number of feature terms, which typically leads to extensive computation and slow classification. Extracting semantic meanings from the text content and using them as feature terms to build the vector space, instead of using words as feature terms in the traditional way, can effectively reduce the dimensionality of the vectors while improving classification. In this paper, a novel Chinese spam filtering approach based on semantics-based text classification is proposed, in which the feature terms are selected from the semantic meanings of the text content. Both the extraction of semantic meanings and the selection of feature terms are implemented by attaching annotations to the texts layer by layer. The filter performed well in experiments on a public Chinese spam corpus.
In traditional natural language processing, text is segmented into small terms based on words. For each term, the Term Frequency (TF) is calculated, and all the terms together with their TF values form a vector that represents the text.

This method is usually described as representing the text as a bag of words. Obviously, when a large number of words or terms is involved, the resulting vector space can be incredibly large.
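The bag-of-words representation above can be sketched in a few lines. This is a minimal illustration with an invented English toy corpus (the paper targets Chinese text, where word segmentation would be needed first); the function and document names are ours, not the paper's.

```python
from collections import Counter

def tf_vector(text, vocabulary):
    """Represent `text` as a vector of term frequencies over `vocabulary`."""
    counts = Counter(text.split())
    total = sum(counts.values())
    # TF of a term = occurrences of the term / total number of terms in the text
    return [counts[term] / total for term in vocabulary]

docs = ["buy cheap pills now", "meeting notes for the project meeting"]
# The vocabulary (and hence the vector dimension) grows with every new word seen.
vocab = sorted(set(w for d in docs for w in d.split()))
vectors = [tf_vector(d, vocab) for d in docs]
```

Note that every distinct word adds one dimension, which is exactly why realistic corpora produce the huge vector spaces the text describes.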
This representation is still common in natural language processing today, but many term selection techniques have been developed to reduce the size of the vector space. For example, the Inverse Document Frequency (IDF) has been proposed to be combined with TF in order to remove terms that barely influence the classification.

As a purely statistical method, TF-IDF may not correctly reflect the importance of each term for classification, and it reflects hardly any information about the structure of the terms in the text, which is an obvious disadvantage: in natural language, the flexible and ingenious combination of words is unpredictable.
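The TF-IDF weighting discussed above can be sketched as follows. The corpus and terms are invented for illustration; this is the standard tf × log(N/df) formulation, not code from the paper.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        total = len(doc)
        # IDF = log(N / df) down-weights terms that occur in many documents.
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["free offer today", "free meeting today", "project meeting notes"]
w = tfidf(docs)
# "today" appears in 2 of 3 documents, so its weight in the first document
# is lower than that of "offer", which appears in only one.
```

This also makes the stated limitation concrete: the weights are computed per term in isolation, so no information about how the terms combine in the text survives.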
Latent Semantic Analysis (LSA) was then proposed on the basis of the TF-IDF method. Originally designed to distinguish synonyms in information retrieval, it was later applied to semantic recognition and text classification, since LSA can evaluate the similarity between concepts, sentences, and texts.
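The core of LSA, not spelled out above, is a truncated singular value decomposition of the term-document matrix. In the standard formulation, the $m \times n$ TF-IDF matrix $A$ (terms by documents) is approximated by keeping only the $k$ largest singular values:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}
```

Documents are then compared by cosine similarity of their rows in $V_k \Sigma_k$, so two documents can be judged similar even when they share few literal terms, which is what lets LSA handle synonyms.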
Spam filtering determines whether an email is spam or not. In our experiment, the class of personal letters was the only class defined as ham.

For a specific user, however, recruitment messages could also be important and useful; consequently, emails in the recruitment class should be identified as ham for that user.

With multiple classification, a personalized spam filtering system is therefore much easier to develop.
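The personalization idea above can be sketched simply: once every email carries a multi-class label, each user only needs to declare which classes count as ham. The function and user names are hypothetical; the class names follow the categories mentioned in the text.

```python
def is_ham(email_class, ham_classes):
    """An email is ham for a user iff its class is in that user's ham set."""
    return email_class in ham_classes

# Default policy from the experiment: only personal letters are ham.
default_ham = {"personal letter"}
# A hypothetical job-seeking user additionally keeps recruitment emails.
job_seeker_ham = {"personal letter", "recruitment"}

print(is_ham("recruitment", default_ham))     # spam under the default policy
print(is_ham("recruitment", job_seeker_ham))  # ham for this particular user
```

The classifier itself never changes; only the per-user mapping from class to ham/spam does, which is the advantage of multi-class labels over a binary spam/ham decision.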
In the proposed system we implemented the LDA and SVM algorithms. An email is treated as spam if it has one of the following attributes: invoicing, training, recruiting, eroticism, website, selling, letter, defrauding, etc.

We created a web application similar to Gmail. Using the LDA and SVM algorithms, spam mails are filtered based on their category. The accuracy of both algorithms is calculated, and the results show that LDA achieves higher accuracy than SVM.
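The accuracy comparison reported above reduces to the fraction of emails each classifier labels correctly. The sketch below shows that computation only; the label lists are invented placeholders, not the paper's experimental data.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical ground truth and per-classifier predictions.
actual   = ["spam", "ham", "spam", "ham", "spam"]
lda_pred = ["spam", "ham", "spam", "ham", "ham"]
svm_pred = ["spam", "spam", "spam", "ham", "ham"]

print(accuracy(lda_pred, actual))  # 0.8
print(accuracy(svm_pred, actual))  # 0.6
```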
LDA – Latent Dirichlet Allocation
SVM – Support Vector Machine
System : Pentium IV 2.4 GHz
Hard Disk : 40 GB
Floppy Drive : 1.44 MB
Monitor : 15" VGA Colour
Mouse : Logitech
RAM : 512 MB
Operating System : Windows XP/7
Coding Language : Java/J2EE
IDE : NetBeans 7.4
Database : MySQL