Josh Walters

I'm a software engineer working with big data.


Spam Detection with the Generalized Linear Model in R

December 01, 2012

My last post looked at using Naive Bayes for spam detection with the spam dataset. Each email is reduced to a bag of words: n-gram frequencies (I am not sure of the value of n for this dataset) paired with a user-provided spam/not-spam classification. The results were OK, but we can improve on them!

I have written (and am still adding to) a collection of R functions that make simple data mining tasks easier. You can find these scripts on GitHub here, and I will be using them in this post.

You may need to install some packages first:

    install.packages("klaR")
    install.packages("caret")
    install.packages('ElemStatLearn') # Has the spam dataset
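If you rerun this often, you may prefer to install only what is missing. The following is a minimal sketch, not part of my scripts: `missingPackages` is a hypothetical helper name, and the package list is simply copied from the commands above.

```r
# Packages this post relies on (taken from the install commands above).
needed <- c("klaR", "caret", "ElemStatLearn")

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded
# from the current library.
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

toInstall <- missingPackages(needed)

# Only install in an interactive session, so scripted runs never hit the network.
if (interactive() && length(toInstall) > 0) {
  install.packages(toInstall)
}
```

The `interactive()` guard is a design choice: batch jobs fail loudly later on a missing package instead of silently downloading one.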

Now we have to load up my R script.

    setwd('~/Desktop/DataMiningTools/src/')  
    source('classifier.r')

    library('ElemStatLearn')

Now we can build a Naive Bayes classifier.

    buildClassifierAndReport(spam,'nb')

This splits the dataset 70% / 30% into train and test sets, performs 10-fold cross-validation, and then prints confusion matrix information. You should get ~60% accuracy with a kappa of ~0.269. Good… but not great. We can improve upon this by using the Generalized Linear Model instead of Naive Bayes.

    buildClassifierAndReport(spam,'glm')

This should give you much better results: ~93% accuracy with a kappa of ~0.845.
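To make those numbers less of a black box, here is a rough sketch of the same steps in base R: a 70% / 30% split, a model fit, and a confusion matrix with accuracy and kappa. It assumes that 'glm' for a two-class problem amounts to binomial logistic regression, which may not match exactly what my script does internally. The data is synthetic (the `toy` data frame and its columns are made up for illustration), so the printed numbers will not match the spam results above.

```r
set.seed(42)

# Synthetic stand-in for the spam data (NOT the real dataset): two numeric
# features and a two-level class label.
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
isSpam <- rbinom(n, 1, plogis(2 * x1 - x2))
toy <- data.frame(x1, x2,
                  class = factor(ifelse(isSpam == 1, "spam", "email"),
                                 levels = c("email", "spam")))

# 70% / 30% train/test split.
trainIdx <- sample(n, size = round(0.7 * n))
train <- toy[trainIdx, ]
test  <- toy[-trainIdx, ]

# Logistic regression: with a factor response, glm treats the first level
# ("email") as failure and models the probability of the other level ("spam").
fit <- glm(class ~ ., data = train, family = binomial)

# Predict probabilities on the held-out rows and threshold at 0.5.
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "spam", "email"), levels = levels(toy$class))

# Confusion matrix, accuracy, and Cohen's kappa.
cm <- table(predicted = pred, actual = test$class)
accuracy <- sum(diag(cm)) / sum(cm)
# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappaStat <- (accuracy - expected) / (1 - expected)

print(cm)
cat("accuracy:", round(accuracy, 3), " kappa:", round(kappaStat, 3), "\n")
```

Kappa rescales accuracy against the agreement you would expect by chance given the class balance, which is why a classifier can post a respectable-looking accuracy and still have a low kappa.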

Now we can actually build a model on the whole dataset.

    model <- buildClassifier(spam,'glm')

Testing on the data you trained on overstates accuracy, so you don’t want to do this in practice; it is just for example.

    pred <- predictClassWithClassifier(spam, model)
    matches <- pred == spam[, 58]  # column 58 holds the spam/not-spam label
    mean(matches)                  # fraction of correct predictions

This gives ~93% accuracy again.
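For a more honest estimate than scoring the training data, k-fold cross-validation holds out each row exactly once. Below is a minimal base R sketch of the idea, again on synthetic data (the `toy` data frame is made up) and again assuming logistic regression via `glm`. It is not how my scripts implement cross-validation, just an illustration of the technique.

```r
set.seed(1)

# Synthetic two-class data (a stand-in, not the spam dataset).
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
label <- rbinom(n, 1, plogis(2 * toy$x1 - toy$x2))
toy$class <- factor(ifelse(label == 1, "spam", "email"),
                    levels = c("email", "spam"))

# Assign every row to one of k folds at random.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
foldAccuracy <- vapply(seq_len(k), function(i) {
  fit <- glm(class ~ ., data = toy[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  pred <- ifelse(prob > 0.5, "spam", "email")
  mean(pred == toy$class[fold == i])
}, numeric(1))

# Resubstitution accuracy (train and test on the same rows) for comparison.
fullFit <- glm(class ~ ., data = toy, family = binomial)
fullPred <- ifelse(predict(fullFit, type = "response") > 0.5, "spam", "email")
resubAccuracy <- mean(fullPred == toy$class)

cat("cross-validated accuracy:", round(mean(foldAccuracy), 3), "\n")
cat("resubstitution accuracy: ", round(resubAccuracy, 3), "\n")
```

With a model this simple the two numbers tend to land close together; the gap usually grows as the model gets more flexible relative to the amount of data.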