Spam Detection with the Generalized Linear Model in R

December 01, 2012

My last post looked at using Naive Bayes for spam detection with the spam
dataset. The features come from n-grams (I am not sure of the value of n for
this dataset), which are used as a bag of words alongside user-provided
spam/not-spam classifications. The results were OK, but we can improve on them!

I have written (and am still adding to) a collection of R functions to make it
easier to perform simple data mining tasks. You can find these scripts on GitHub
here. I will be using these scripts in this post.

You may need to install some packages first:
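The exact package list depends on the helper scripts, which are not reproduced here. As a rough sketch, assuming the workflow relies on caret (with klaR for Naive Bayes and e1071 for some of caret's summaries) and takes the `spam` data from kernlab, something like this would install whatever is missing. The function name `missing_packages` is made up for illustration.

```r
# Assumed package list (not taken from the helper scripts):
#   caret   - data splitting, cross-validation, confusion matrices
#   kernlab - ships a `spam` data frame
#   klaR    - Naive Bayes implementation behind caret's "nb" method
#   e1071   - assumed to be needed by caret for some of its summaries
pkgs <- c("caret", "kernlab", "klaR", "e1071")

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Only install from an interactive session, so a batch run never
# triggers a download by surprise.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```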

Now we have to load up my R script.

Now we can build a Naive Bayes classifier.
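The helper scripts are not shown here, so as a rough sketch of the equivalent steps, the following uses the caret package directly. It assumes the data is the `spam` data frame from kernlab, with the class label in the `type` column. The seed and the variable names are arbitrary choices for illustration, not part of the original scripts, so the exact numbers you get may differ.

```r
library(caret)
library(kernlab)  # assumed source of the `spam` data frame

data(spam)
set.seed(42)  # arbitrary seed, only so the split is repeatable

# Stratified 70% / 30% split on the class label.
in_train  <- createDataPartition(spam$type, p = 0.7, list = FALSE)
train_set <- spam[in_train, ]
test_set  <- spam[-in_train, ]

# 10-fold cross-validation on the training set.
ctrl <- trainControl(method = "cv", number = 10)

# caret's "nb" method wraps klaR::NaiveBayes; it can emit many warnings
# about zero probabilities on this data.
nb_fit <- train(type ~ ., data = train_set, method = "nb", trControl = ctrl)

# Evaluate on the held-out 30% and print accuracy, kappa, and the
# confusion matrix.
nb_pred <- predict(nb_fit, newdata = test_set)
print(confusionMatrix(nb_pred, test_set$type))
```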

This will split the dataset 70% / 30% into train and test sets, perform 10-fold
cross-validation, and then print out confusion matrix information. You should
get ~60% accuracy with a kappa of ~0.269. Good… but not great. We can improve
upon this by using the Generalized Linear Model instead of Naive Bayes.
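Again as a sketch under assumptions (caret, kernlab's `spam` data frame, an arbitrary seed, illustrative variable names), switching to a GLM could look like the following. With a two-class outcome and a binomial family this is a logistic regression.

```r
library(caret)
library(kernlab)  # assumed source of the `spam` data frame

data(spam)
set.seed(42)  # arbitrary seed, only so the split is repeatable

# Same 70% / 30% stratified split and 10-fold cross-validation as before.
in_train  <- createDataPartition(spam$type, p = 0.7, list = FALSE)
train_set <- spam[in_train, ]
test_set  <- spam[-in_train, ]
ctrl <- trainControl(method = "cv", number = 10)

# Only the method changes: "glm" with a binomial family.
# glm may warn that some fitted probabilities are numerically 0 or 1.
glm_fit <- train(type ~ ., data = train_set, method = "glm",
                 family = "binomial", trControl = ctrl)

# Evaluate on the held-out 30%.
glm_pred <- predict(glm_fit, newdata = test_set)
print(confusionMatrix(glm_pred, test_set$type))
```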

This should give you much better results. You should get ~93% accuracy with a
kappa of 0.845.

Now we can actually build a model on the whole dataset.
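A minimal sketch of that step, under the same assumptions (caret, kernlab's `spam` data frame, illustrative variable names): fit the GLM on every row, then predict on those same rows. The resulting numbers are optimistic because the model has already seen the data it is scored on.

```r
library(caret)
library(kernlab)  # assumed source of the `spam` data frame

data(spam)
set.seed(42)  # arbitrary seed for the cross-validation folds

# Fit on the whole dataset; cross-validation still gives an internal
# estimate of out-of-sample accuracy.
ctrl <- trainControl(method = "cv", number = 10)
full_fit <- train(type ~ ., data = spam, method = "glm",
                  family = "binomial", trControl = ctrl)

# Predicting on the training data itself, purely for illustration.
full_pred <- predict(full_fit, newdata = spam)
print(confusionMatrix(full_pred, spam$type))
```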

You don’t want to test on data you trained on; this is just an example.