I'm a software engineer working with big data.
I found a really interesting dataset, the SMS Spam Collection: roughly 5,500 phone text messages, each labeled as either spam or ham. It is actually an aggregation of about four smaller SMS spam/ham datasets.
Because of the short nature of text messages and the unconventional spelling commonly used (leet speak, ‘how r u’, etc.), classifying SMS messages can be tricky.
I was able to build a classifier that is ~97% accurate with a kappa of ~0.86. My features consisted of unigram, bigram, and trigram word tokens (the words were stemmed). I was able to reduce my attributes down to just 44.
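A minimal sketch of that feature extraction, assuming NLTK's PorterStemmer and scikit-learn's CountVectorizer (the post doesn't name the tools actually used, so treat these as stand-ins):

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stemmed_tokens(text):
    # Lowercase, split out word-like tokens (keeping & and < since they
    # show up as features), and stem each one.
    return [stemmer.stem(t) for t in re.findall(r"[\w&<']+", text.lower())]

# Unigram, bigram, and trigram counts over the stemmed tokens.
vectorizer = CountVectorizer(tokenizer=stemmed_tokens,
                             ngram_range=(1, 3),
                             token_pattern=None)

sample = ["Claim your free prize now! Txt WIN to 80082",
          "ok how r u, call me when you get home"]
X = vectorizer.fit_transform(sample)
```

`X` is a sparse document-term count matrix; on the real corpus this vocabulary would be huge, which is why the feature-selection step below matters.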
The really interesting part of all this is which n-gram features are the strongest indicators of spam vs. ham. Here is the list of features, ordered by chi-squared score:
cal, txt, fre, claim, mobil, priz, www, servic, &, your, uk, to, text, stop, rep, award, i, or, now, cash, from, win, per, new, 18, me, but, <, that, how, ok, he, ü, oh, work, way, anyth, &, nic, eat, pa, 'at hom', 'you guy', rea,
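The ranking step can be sketched with scikit-learn's chi-squared feature selection; this is my assumption about the mechanics, shown on a toy document-term matrix rather than the real one:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy document-term matrix: 6 messages x 4 features; labels: 1 = spam, 0 = ham.
X = np.array([[3, 0, 1, 0],
              [2, 0, 0, 1],
              [4, 1, 0, 0],
              [0, 2, 1, 0],
              [0, 3, 0, 1],
              [1, 2, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

# Keep the k features with the highest chi-squared scores
# (the post keeps k = 44 on the full n-gram vocabulary).
selector = SelectKBest(chi2, k=2)
X_sel = selector.fit_transform(X, y)
```

`selector.scores_` holds the per-feature chi-squared statistics, which is what the ordered list above reflects.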
Because of stemming, some of the words look a little mangled, but this isn’t a problem.
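To see where the mangling comes from, here is what a Porter stemmer does to a few ordinary words (I'm using NLTK's PorterStemmer for illustration; the post doesn't say which stemmer was used):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stems are truncated word prefixes, not dictionary words, so related
# forms like "texts"/"texting" collapse to the same feature.
for word in ["mobile", "service", "claimed", "texting"]:
    print(word, "->", stemmer.stem(word))
```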
The first 6 features make a lot of sense: SMS spam usually contains some sort of ‘call now’, ‘text this number’, or ‘claim your prize!’ hook.
Interestingly, almost half of all ham SMS messages contained the word ‘I’, whereas spam rarely did. The same pattern held for several words: ‘I’, ‘me’, ‘but’, ‘that’, ‘how’, ‘ok’, ‘he’, ‘oh’, ‘way’, and ‘anything’. These features were very strong indicators that a message is ham.
Only two bigrams were of any value, and none of the trigrams made the cut. I suspect this is due to the very short length of SMS messages.
Using this simple set of features, Naive Bayes achieves a very impressive 97% accuracy with 0.86 kappa on a held-out test set. Best of all, this sort of system is simple and fast.
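Putting the pieces together, an end-to-end sketch with multinomial Naive Bayes, accuracy, and Cohen's kappa might look like the following. The tiny inline corpus is purely illustrative, and the library choices are my assumptions; don't expect its scores to mean anything:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the SMS Spam Collection; 1 = spam, 0 = ham.
texts = ["Claim your free prize now, txt WIN to 80082",
         "URGENT! You have won cash, call now",
         "Free entry, text STOP to claim award",
         "Win a new mobile, reply per instructions",
         "ok see you at home",
         "i'll call you later, how was work",
         "oh he said that way is fine",
         "but me and you can eat anything"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Count n-gram features, fit on training data only.
vec = CountVectorizer(ngram_range=(1, 3))
clf = MultinomialNB()
clf.fit(vec.fit_transform(X_train), y_train)

pred = clf.predict(vec.transform(X_test))
print("accuracy:", accuracy_score(y_test, pred))
print("kappa:", cohen_kappa_score(y_test, pred))
```

Naive Bayes pairs well with these sparse count features: training is a single pass over the counts, and prediction is a handful of log-probability sums, which is why the whole system stays fast.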