I'm a software engineer working with big data.
My last post looked at using Naive Bayes for spam detection with the spam dataset. That approach extracts n-grams (I am not sure what value of n was used for this dataset) and treats them as a bag of words, paired with user-provided spam/not-spam labels. The results were OK, but we can improve on them!
I have written (and am still adding to) a collection of R functions that make simple data mining tasks easier. You can find these scripts on GitHub here. I will be using them in this post.
You may need to install some packages first:
Now we have to load up my R script.
Now we can build a Naive Bayes classifier.
This will split the dataset 70% / 30% into train and test sets, perform 10-fold cross-validation, and then print out confusion matrix information. You should get ~60% accuracy with a kappa of ~0.269. Good… but not great. We can improve upon this by using a generalized linear model (GLM) instead of Naive Bayes.
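The helper function itself is not shown here, so the following is only a minimal sketch of the same steps (70/30 split, 10-fold cross-validation, confusion matrix) written directly against caret. It assumes the spam data is the `spam` data frame shipped with kernlab and that klaR is installed for caret's `"nb"` method. Both are assumptions, and the exact numbers may differ from the ones quoted above. Expect a number of warnings from the Naive Bayes fit.

```r
library(caret)    # createDataPartition, trainControl, train, confusionMatrix
library(kernlab)  # assumed source of the `spam` data frame

data(spam)        # factor column `type` holds the spam / nonspam label
set.seed(42)      # arbitrary seed so the split and folds are reproducible

# 70% / 30% split, stratified on the class label
in_train  <- createDataPartition(spam$type, p = 0.7, list = FALSE)
train_set <- spam[in_train, ]
test_set  <- spam[-in_train, ]

# 10-fold cross-validation on the training portion
ctrl <- trainControl(method = "cv", number = 10)

# Naive Bayes through caret; method = "nb" relies on the klaR package
nb_fit <- train(type ~ ., data = train_set, method = "nb", trControl = ctrl)

# Accuracy, kappa and the confusion matrix on the held-out 30%
nb_pred <- predict(nb_fit, newdata = test_set)
print(confusionMatrix(nb_pred, test_set$type))
```

The printed summary includes both the accuracy and the kappa statistic, which is the easiest way to compare this run against the figures in the post.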
This should give you much better results: ~93% accuracy with a kappa of 0.845.
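For readers without the helper scripts, here is a rough equivalent using caret's `"glm"` method, which fits a binomial (logistic) GLM when the outcome is a two-level factor. As before, the use of kernlab's `spam` data frame and the seed are assumptions for illustration, so treat the output as indicative rather than a reproduction of the numbers above.

```r
library(caret)    # createDataPartition, trainControl, train, confusionMatrix
library(kernlab)  # assumed source of the `spam` data frame

data(spam)
set.seed(42)      # arbitrary seed, for reproducibility only

# Same 70% / 30% stratified split as for Naive Bayes
in_train  <- createDataPartition(spam$type, p = 0.7, list = FALSE)
train_set <- spam[in_train, ]
test_set  <- spam[-in_train, ]

# 10-fold cross-validation on the training portion
ctrl <- trainControl(method = "cv", number = 10)

# Logistic regression: caret's "glm" uses a binomial family for a 2-level factor
glm_fit <- train(type ~ ., data = train_set, method = "glm", trControl = ctrl)

# Evaluate on the held-out 30%
glm_pred <- predict(glm_fit, newdata = test_set)
glm_cm   <- confusionMatrix(glm_pred, test_set$type)
print(glm_cm)
```

R will likely warn that some fitted probabilities are numerically 0 or 1. That is common with this many predictors and does not stop the fit.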
Now we can actually build a model on the whole dataset.
You don’t want to test on data you trained on; this is just an example.
That gives ~93% accuracy again.
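To make the "train and test on the same data" step concrete, here is a small sketch in base R that fits a logistic regression on every row and then scores those same rows. It again assumes kernlab's `spam` data frame, and the 0.5 probability cutoff is my own choice for illustration. The resulting accuracy is a resubstitution estimate, so it is optimistic by construction.

```r
library(kernlab)  # assumed source of the `spam` data frame

data(spam)

# Fit a logistic regression on the whole dataset (nothing held out)
full_fit <- glm(type ~ ., data = spam, family = binomial)

# glm models the probability of the second factor level, here "spam"
p_spam <- predict(full_fit, newdata = spam, type = "response")

# Turn probabilities into class labels with an illustrative 0.5 cutoff
pred <- factor(ifelse(p_spam > 0.5, "spam", "nonspam"),
               levels = levels(spam$type))

# Resubstitution accuracy: the model has already seen every one of these rows
resub_acc <- mean(pred == spam$type)
print(table(predicted = pred, actual = spam$type))
print(resub_acc)
```

Because no rows are held out, the cross-validated figure from the 70/30 run is the more trustworthy estimate of how the model behaves on new mail.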