Document Density Density-based spam detector (Japanese)

The volume of mass unsolicited electronic mail, often known as spam, has recently increased enormously and has become a serious threat to not only the Internet but also to society. A new spam detection method which uses document space density information is proposed. Although it requires extensive e-mail traffic to acquire the necessary information, an unsupervised learning engine with a short white list can achieve a 98% recall rate and 100% precision. A direct-mapped cache method contributes handling of over 13,000 e-mails per second. Experimental results, which were conducted using over 50 million actual e-mails, are also reported.

Document space density and Direct-mapped cache method
Although most of the conventional spam filters use vector representation for the basic representation of data, we use document space density as the key information to distinguish spam from other e-mails (See Page top Figure). More precisely, we just count the number of similar e-mails. By counting the number of similar mails, we can easily distinguish spam from other e-mails.

A high speed text search engine is an important component of our method. Since it has to handle over 10,000 e-mails per second and it needs to check a million previous e-mails to handle the current single e-mail, naive implementation requires over 10 billion similarity checks. To solve this problem, we have developed a direct-mapped cache architecture to speed up processing.

The hash data base in the Figure stores the hash values of each e-mail and the number of similar e-mails. The direct-mapped cache copies the n% of the hash values. It also stores the pointer to the e-mail's entry in the hash data base. To check a single e-mail, in order to find similar previous e-mail which share S% of the same hash values, the algorithm first checks the direct-mapped cache. The direct-mapped cache is a simple hash table, and the algorithm can find the entries of e-mails which have the same hash value through this cache.

The contents in the direct-mapped cache are simply overwritten if the hash values are over-wrapped. When all the hash values in the direct-mapped cache are overwritten by other e-mails, the algorithm deletes the entry in the hash data base for the overwritten e-mail so that it can reuse the memory space of the hash data base.

Although this architecture does not have an explicit LRU cache mechanism to control entries in the hash data base, the overwrite mechanism of direct-mapped cache controls the entry in the hash data base as if it were controlled by a LRU mechanism.

Conclusion
The characteristics of the density-based spam detector are:
  • With a single small desk-top computer, this method can handle over 13,000 e-mails per second (1.25 billion e-mails per day). The direct-mapped cache method contributes to the efficient analysis of document space density.
  • An unsupervised learning engine used in the proposed method can automatically update its data-base, and does not require maintenance. It also protects user's privacy inherently.
  • Experimental results, which used over 50 million actual e-mail traffic, prove the accuracy, i.e. 98% recall rate and 100% precision:
References
  • K. Yoshida etal, Density-based spam detector, KDD2004, pages 486-493
  • K. Yoshida, et.al, Memory management of density-based spam detector, SAINT2005