I'm a software engineer working with big data.
Naive Bayes is one of the simplest classification algorithms, but it can also be very accurate. At a high level, Naive Bayes classifies an instance using probabilities estimated from previously seen instances, assuming the attributes are independent of one another given the class. Oddly enough, this usually works out to give you a good classifier.
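In symbols, the decision rule can be sketched as follows. The notation is introduced here for illustration only: c is a candidate class and x_1, ..., x_n are the attribute values of the instance being classified. The independence assumption is what lets the likelihood factor into a simple product.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```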
Now, I am going to show you how to do Naive Bayes classification in R. First, you need to install a few packages, so time to boot up your R instance!
Caret is a very nice data mining package for R with tons of awesome features. The klaR package contains our Naive Bayes classifier.
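A minimal sketch of the setup, assuming you install from CRAN with the default repository. The installation only needs to happen once; the library calls are needed in every new session.

```r
# Install once per machine; dependencies = TRUE also installs suggested packages
install.packages("caret", dependencies = TRUE)
install.packages("klaR")

# Load the packages in every new R session
library(caret)
library(klaR)
```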
Everyone does the iris dataset first, so I won't break that trend. Later, I will show you a much more interesting dataset. Load up the iris dataset and separate the labels from the attributes.
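One way to do the split looks like this. The iris data frame ships with R, and its fifth column, Species, holds the labels.

```r
# iris is part of the datasets package that ships with R
data(iris)

x <- iris[, -5]      # the four numeric measurements (attributes)
y <- iris$Species    # factor holding the three species labels
```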
Now x has all the attributes and y has all the labels, so we can train our model.
A single call to caret's train function will generate a Naive Bayes model, using 10-fold cross-validation. From above, x is the attributes and y is the labels. The 'nb' tells the trainer to use Naive Bayes. The trControl argument, built with trainControl, tells the trainer to use cross-validation ('cv') with 10 folds.
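The training call described here might look like the sketch below. The setup lines repeat the earlier steps so the snippet runs on its own, and the set.seed call is an addition of mine to make the fold assignment reproducible.

```r
library(caret)
library(klaR)

data(iris)
x <- iris[, -5]
y <- iris$Species

set.seed(1)  # assumption: fixed seed so the random folds are reproducible

# Naive Bayes ('nb') evaluated with 10-fold cross-validation
model <- train(x, y, method = "nb",
               trControl = trainControl(method = "cv", number = 10))
```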
You can then print out the model:
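A sketch of printing the model, with the training step repeated so the snippet is self-contained. The exact numbers depend on the random folds, so none are shown here.

```r
library(caret)
library(klaR)

data(iris)
model <- train(iris[, -5], iris$Species, method = "nb",
               trControl = trainControl(method = "cv", number = 10))

# Summary of the resampling results, including accuracy and kappa
print(model)

# The same numbers as a data frame, one row per parameter combination tried
model$results
```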
Awesome! We have a 94% kappa, life is good! One of the really cool things about caret's train function is that it will tune the parameters of your model (to a certain extent).
Now that we have generated a classification model, how can we use it for prediction? Easy!
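A minimal sketch of the prediction step, assuming the fitted klaR model is taken from the finalModel element of the object returned by train. The setup is repeated so the snippet runs on its own.

```r
library(caret)
library(klaR)

data(iris)
x <- iris[, -5]
y <- iris$Species
model <- train(x, y, method = "nb",
               trControl = trainControl(method = "cv", number = 10))

# Returns a list with the predicted classes and the posterior probabilities
pred <- predict(model$finalModel, x)
pred
```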
This will print out a bunch of lines. Near the top you can see the predicted classes, and in the bottom half you will see the posterior probabilities. As we are only interested in the class predictions, we can grab only those with the following line.
Let's build a confusion matrix so that we can visualize the classification errors.
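Pulling out just the classes could look like this. The class element is part of what klaR's predict method returns; the variable name pred.class is my own.

```r
library(caret)
library(klaR)

data(iris)
x <- iris[, -5]
y <- iris$Species
model <- train(x, y, method = "nb",
               trControl = trainControl(method = "cv", number = 10))

# Keep only the predicted class labels, dropping the posteriors
pred.class <- predict(model$finalModel, x)$class
head(pred.class)
```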
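A simple way to sketch this is with base R's table function, which cross-tabulates predicted against actual labels. The variable name conf.mat is my own.

```r
library(caret)
library(klaR)

data(iris)
x <- iris[, -5]
y <- iris$Species
model <- train(x, y, method = "nb",
               trControl = trainControl(method = "cv", number = 10))

# Rows are the predicted classes, columns are the actual classes
conf.mat <- table(predict(model$finalModel, x)$class, y)
conf.mat
```

Counts on the diagonal are correct predictions; everything off the diagonal is a misclassification.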
This will generate a confusion matrix of the predictions of your Naive Bayes model versus the actual classification of the data instances.
Now, what I have done here is actually a terrible idea. You never want to use the same data you trained on for testing, but this is only an example. I will provide a better example later on.
That is basically how you do Naive Bayes classification in R with cross-validation. Now, let's try this on a more interesting dataset: spam emails.
Here we take 90% of the dataset to train on, and then we test on the remaining 10%.
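The 90/10 split could be sketched as follows. I am assuming the spam data frame from the kernlab package, whose type column holds the spam/nonspam label; the variable names are my own. Expect a number of warnings from predict, and note that an unlucky split could in principle leave an attribute with zero variance within a class, which NaiveBayes rejects.

```r
library(klaR)      # NaiveBayes()
library(caret)     # confusionMatrix()
library(kernlab)   # assumption: source of the spam dataset

data(spam)

# Randomly pick 90% of the rows for training; the rest form the test set
train.ind <- sample(seq_len(nrow(spam)), ceiling(nrow(spam) * 0.9))

# Train on the 90%, predict on the held-out 10%
nb.res  <- NaiveBayes(type ~ ., data = spam[train.ind, ])
nb.pred <- predict(nb.res, spam[-train.ind, ])

# Compare predictions with the true labels of the held-out rows
confusionMatrix(nb.pred$class, spam[-train.ind, "type"])
```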
These results will be different on each run, as sample draws a new random subset every time (call set.seed first if you want reproducible results). The results aren't perfect, but for such a simple classifier they are really good!