I'm a software engineer working with big data.
As part of my journey of learning Haskell, I figured I would implement some statistics / machine learning algorithms. As an added benefit, I can write some posts about the code and describe the algorithms as I go.
First up, the easiest regression method out there (drum roll)… Simple Linear Regression!
First, I will show my full implementation, then I will break down the code and walk you through it. You may want to read the Wikipedia page on simple linear regression first, so that you know what we are trying to do.
Mean calculates the arithmetic mean of a list: the sum of the elements divided by how many there are.
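As a minimal sketch of such a function, assuming the data is a plain list of Double values (the name and type signature here are illustrative assumptions):

```haskell
-- Arithmetic mean of a list of Doubles.
-- Assumption: the list is non-empty; an empty list gives NaN (0 / 0).
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
```

Note that this walks the list twice (once for sum, once for length), which is fine for small data sets.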
Mean of points applies mean to a list of points. It gets the mean for all the x’s and all the y’s.
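A sketch of how this could look, assuming a point is represented as a pair of Doubles (the Point alias and the function names are assumptions for illustration):

```haskell
-- Assumed representation of a data point: (x, y).
type Point = (Double, Double)

-- Arithmetic mean; assumes a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Mean of the x coordinates paired with the mean of the y coordinates.
meanOfPoints :: [Point] -> Point
meanOfPoints ps = (mean (map fst ps), mean (map snd ps))
```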
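Correlation measures how strongly the x and y variables are linearly related. A correlation of 1 means that the points lie exactly on a line with positive slope (it does not mean that x and y are equal), and a correlation of -1 means that they lie exactly on a line with negative slope. A correlation of 0 means that x and y have no linear relationship. If the correlation is close to 1 or -1, we should be able to get a good model with simple linear regression.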
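One common way to compute this is the Pearson correlation coefficient: the sum of the products of the x and y deviations from their means, divided by the square root of the product of the summed squared deviations. The sketch below assumes that definition and assumes points are pairs of Doubles; all names are illustrative.

```haskell
-- Arithmetic mean; assumes a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Pearson correlation coefficient of a list of (x, y) points.
-- Assumption: neither the x values nor the y values are all identical,
-- otherwise the denominator is zero and the result is NaN.
correlation :: [(Double, Double)] -> Double
correlation ps = covXY / sqrt (ssX * ssY)
  where
    mx = mean (map fst ps)
    my = mean (map snd ps)
    sq d = d * d
    covXY = sum [(x - mx) * (y - my) | (x, y) <- ps]
    ssX = sum [sq (x - mx) | (x, _) <- ps]
    ssY = sum [sq (y - my) | (_, y) <- ps]
```

Points that lie exactly on a rising line should give a value of 1 (up to floating-point rounding), and points on a falling line should give -1.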
The standard deviation function calculates the standard deviation of a list, which measures how far the values typically spread around their mean.
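Here is a minimal sketch, assuming the sample standard deviation (dividing by n - 1); the population version would divide by n instead. The function names are assumptions for illustration.

```haskell
-- Arithmetic mean; assumes a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Sample standard deviation (divides by n - 1).
-- Assumption: the list has at least two elements.
stdDev :: [Double] -> Double
stdDev xs = sqrt (sum [(x - m) * (x - m) | x <- xs] / fromIntegral (length xs - 1))
  where
    m = mean xs
```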
Standard deviation of points just calculates standard deviation for both x and y.
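This could be sketched the same way as the mean of points, again assuming points are pairs of Doubles and a sample standard deviation (names are illustrative):

```haskell
-- Arithmetic mean; assumes a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Sample standard deviation; assumes at least two elements.
stdDev :: [Double] -> Double
stdDev xs = sqrt (sum [(x - m) * (x - m) | x <- xs] / fromIntegral (length xs - 1))
  where
    m = mean xs

-- Standard deviation of the x values paired with that of the y values.
stdDevOfPoints :: [(Double, Double)] -> (Double, Double)
stdDevOfPoints ps = (stdDev (map fst ps), stdDev (map snd ps))
```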
Finally, the simple linear function calculates the simple linear regression line for the given data. It should be very easy to understand what this function does, as it is just composed of the previous functions.
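One standard way to compose those pieces: the slope is the correlation times the ratio of the y standard deviation to the x standard deviation, and the intercept is chosen so that the line passes through the point of means. The sketch below assumes that formulation; the names and the (slope, intercept) representation are assumptions for illustration.

```haskell
-- Arithmetic mean; assumes a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Sample standard deviation; assumes at least two elements.
stdDev :: [Double] -> Double
stdDev xs = sqrt (sum [(x - m) * (x - m) | x <- xs] / fromIntegral (length xs - 1))
  where
    m = mean xs

-- Pearson correlation coefficient.
correlation :: [(Double, Double)] -> Double
correlation ps = covXY / sqrt (ssX * ssY)
  where
    mx = mean (map fst ps)
    my = mean (map snd ps)
    covXY = sum [(x - mx) * (y - my) | (x, y) <- ps]
    ssX = sum [(x - mx) * (x - mx) | (x, _) <- ps]
    ssY = sum [(y - my) * (y - my) | (_, y) <- ps]

-- Fit y = slope * x + intercept; returns (slope, intercept).
-- Assumption: at least two points with distinct x values.
simpleLinear :: [(Double, Double)] -> (Double, Double)
simpleLinear ps = (slope, intercept)
  where
    xs = map fst ps
    ys = map snd ps
    slope = correlation ps * stdDev ys / stdDev xs
    intercept = mean ys - slope * mean xs
```

For the points (0, 1), (1, 3), (2, 5), which lie exactly on y = 2x + 1, this should recover a slope of 2 and an intercept of 1 up to rounding.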
The predict function will predict a y given an x. It is very simple: it uses the slope-intercept formula for a line, y = slope * x + intercept.
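A minimal sketch, assuming the fitted line is represented as a (slope, intercept) pair (that representation and the name are assumptions):

```haskell
-- Predict y for a given x from a line given as (slope, intercept).
predict :: (Double, Double) -> Double -> Double
predict (slope, intercept) x = slope * x + intercept
```

Because the line comes first, partially applying it (for example, predict (2, 1)) gives a plain function from x to y that can be mapped over a list.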
The RMSE (root mean squared error) function is a way to measure how accurate our predictions are. The lower the better. List comprehensions are awesome.
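A sketch of RMSE using a list comprehension, assuming the model is passed in as a function from x to the predicted y (the name and signature are assumptions for illustration):

```haskell
-- Root mean squared error of a model over a list of (x, y) points.
-- Assumption: the list of points is non-empty.
rmse :: (Double -> Double) -> [(Double, Double)] -> Double
rmse model ps = sqrt (sum squaredErrors / fromIntegral (length ps))
  where
    -- Squared difference between each observed y and the prediction for its x.
    squaredErrors = [(y - model x) * (y - model x) | (x, y) <- ps]
```

A model that predicts every point exactly has an RMSE of 0.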
Now, let's do something cool with this!
I am a huge fan of the Dresden Files books, and I love that the author, Jim Butcher, pumps them out like clockwork. But, I am a very impatient fan, and I want the books now!
So, how many days does it take from book release to book release? So far, 15 books have been published, so we have 14 day-count deltas. Is there a trend? Are books published at a constant rate? Is it taking longer for books to come out? Let's see.
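Computing the deltas from a list of release dates could look something like this sketch, which uses Data.Time.Calendar from the time package. The function name is an assumption, and the dates are expected in chronological order.

```haskell
import Data.Time.Calendar (Day, diffDays, fromGregorian)

-- Days between each consecutive pair of dates.
-- A list of n dates yields n - 1 deltas (so 15 releases give 14 deltas).
releaseDeltas :: [Day] -> [Integer]
releaseDeltas days = zipWith diffDays (drop 1 days) days

-- Placeholder dates for illustration only; these are NOT real release dates.
exampleDates :: [Day]
exampleDates = [fromGregorian 2020 1 1, fromGregorian 2020 1 31, fromGregorian 2020 3 1]
```

Pairing each delta with its position, for example zip [1 ..] (releaseDeltas exampleDates), gives (x, y) points that can be fed to a regression.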
I am using EasyPlot, a wrapper around gnuplot. It is a very simple way to get a plot up and running in Haskell.