A few months ago I had a chance to visit a technology summit organized by Hortonworks that was devoted to the evolution of Hadoop. There were a great number of amazing sessions held by different speakers who described their experience of working with Big Data. One very important thing I noticed there was that different projects and frameworks at some point always intersected with another class of problems called Machine Learning. This is probably one of the most exciting and interesting categories of tasks solved with Big Data. That is why I decided to start covering this topic in my articles, and today I’m going to tell you about one basic Machine Learning algorithm called Linear Regression.
Machine Learning covers a number of different algorithms and approaches that allow you to find patterns inside your data. Usually you can split your base information into two pieces: an input part and a result part. For example, you can have characteristics of houses and their target prices, or a set of emails labeled as spam or not spam. Using Machine Learning algorithms you can identify common features in the input part and then work out how they correlate with the result part. Once you find such relations, you can apply them to the same features in new instances of input data in order to predict their potential result values.
From my personal experience, in order to start feeling comfortable in this field you first of all need to acquire a certain level of understanding of the mathematical background of Machine Learning algorithms. For this purpose I would highly recommend taking the relevant Coursera course created by Andrew Ng. It is indeed remarkable learning material which gives you solid information about the basics of Machine Learning algorithms.
Besides, you need to pick a set of tools and frameworks which allow you to work with the data and apply different Machine Learning techniques to it. There is a big variety of resources and libraries for this, so I would suggest that you:
- Choose a suitable language for data analysis which has a sufficient set of libraries targeting machine learning. Currently the most commonly used options are Python, Scala and R.
- Install a notebook or IDE tool like IPython, RStudio, Jupyter or Zeppelin so that you can start writing your code.
Besides, I would recommend looking through some courses on Pluralsight which can give you a better idea of the whole ecosystem in this field:
- Understanding Machine Learning
- Understanding Machine Learning with Python
- Machine Learning Algorithms
Prediction using Linear Regression
Linear Regression is one of the basic Machine Learning algorithms which allows you to identify the relationship between sets of variables. For example, it could be the dependency between a city’s population and its total profit, or between age and average salary for some position. The idea of the algorithm is to find the best parameters for a linear function which represents the relationship between the variables and the result values as closely as possible. In the simplest case we can use a basic function y = a*x + b which reflects a linear dependency between parameter x and result y.
The task of our machine learning algorithm is to pick the a and b values for this function that minimize the difference between its output and the existing y values for the existing x values. To achieve this goal the Linear Regression algorithm can use different techniques. One of them is called Gradient Descent. It automatically calculates the values of the function’s parameters that minimize the total difference between the predicted and the existing y values. This difference is called the Cost Function or Loss Function.
In the ideal case the result of our cost function equals 0, which would mean that our function maps every existing x value exactly to its y value, and we could expect it to give accurate y values for new x instances as well.
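One common choice for this cost function, assumed here because it is the form used in the course mentioned above, is half the mean squared error over the m existing pairs (x_i, y_i):

```latex
J(a, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( a x_i + b - y_i \right)^2
```

With this definition J(a, b) = 0 only when every existing point lies exactly on the line y = a*x + b.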
Here are the steps for implementing the Linear Regression algorithm:
- Load the data into a data model so that it has a set of variables and their result values, and scale them properly to fit the algorithm
- Pick the function which we want to use to describe the dependency
- Apply Gradient Descent in a loop for some number of iterations in order to find the optimal parameter values of our function. In each iteration we check that the value of our cost function is decreasing, which means that we are improving the accuracy of our formula
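The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code Spark runs: the sample data is made up, the cost is assumed to be half the mean squared error, and the names `scale`, `cost` and `gradient_descent` are hypothetical helpers introduced only for this example.

```python
def scale(values):
    """Standardize values to zero mean and unit variance (step 1)."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


def cost(a, b, xs, ys):
    """Half the mean squared error of y = a*x + b (assumed cost function)."""
    m = len(xs)
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, step=0.1, iterations=500):
    """Fit y = a*x + b (step 2) by repeated gradient updates (step 3)."""
    a, b = 0.0, 0.0
    m = len(xs)
    history = [cost(a, b, xs, ys)]  # cost before and after every iteration
    for _ in range(iterations):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/da
        grad_b = sum(errors) / m                             # dJ/db
        a -= step * grad_a
        b -= step * grad_b
        history.append(cost(a, b, xs, ys))
    return a, b, history


# Made-up sample that follows y = 2*x + 1 exactly
xs_raw = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

xs = scale(xs_raw)
a, b, history = gradient_descent(xs, ys)
print("a = %.4f, b = %.4f, final cost = %.2e" % (a, b, history[-1]))
```

Because the input is scaled, the learned a and b describe the line in terms of the scaled x values. New inputs have to be scaled with the same mean and standard deviation before the function is applied to them.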
I know that it is impossible to cover the details of this algorithm in one article, and probably some of the things I mentioned here do not make much sense to you at this stage. But I really encourage you to enroll in the Machine Learning course by Andrew Ng referenced earlier, as it describes the details of this technique at a lower level in a very clear manner.
Implementing Linear Regression in Spark
The good thing is that the steps I described above have already been implemented in most Machine Learning libraries. Spark has a package called MLlib which encapsulates different learning algorithms, including Linear Regression. In my example I’m going to show you how you can create a prediction system which calculates the profit of a city depending on its population. Those of you who have taken the Machine Learning course recommended earlier may notice that this task appears in one of its practical exercises. I’m going to use the dataset from this task and place it into the /tmp directory of HDFS. Now I’ll show how to implement this task in Python through a Zeppelin notebook, using the AngularJS extension with the D3.js chart library.
Part 1: Loading the data into the notebook:
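```python
%pyspark
# Load the data: each row is split into its comma-separated columns
data = sc.textFile("/tmp/data.txt").map(lambda row: row.split(','))
# Collect (population, profit) pairs to the driver for visualization
fullSet = data.map(lambda line: (float(line[0]), float(line[1]))).collect()
```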
A Zeppelin solution can usually be divided into two parts, backend and frontend. The first performs the calculations of the business logic, the second presents the results. To tie them together you can use a technique called binding. It implements a publisher-subscriber model through the intermediate Zeppelin context, which sends notifications on every update of the bound variables.
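To make the publisher-subscriber idea more concrete, here is a minimal sketch in plain Python. It does not use or reproduce Zeppelin’s API: the `BindingContext` class and its `watch` and `bind` methods are hypothetical names that only imitate the role of the intermediate context.

```python
class BindingContext:
    """Hypothetical stand-in for the intermediate context (not Zeppelin's API)."""

    def __init__(self):
        self._values = {}
        self._subscribers = {}

    def watch(self, name, callback):
        """Frontend side: subscribe to updates of a named variable."""
        self._subscribers.setdefault(name, []).append(callback)

    def bind(self, name, value):
        """Backend side: publish a new value and notify all subscribers."""
        self._values[name] = value
        for callback in self._subscribers.get(name, []):
            callback(value)


received = []
context = BindingContext()
# The "frontend" records a snapshot of the data each time it is notified
context.watch("totalData", lambda value: received.append(list(value)))

points = []
for point in [(1.0, 2.0), (3.0, 4.0)]:
    points.append(point)
    context.bind("totalData", points)  # every bind triggers a notification

print(received)
```

In this sketch the subscriber is called once per `bind`, which mirrors how re-binding a variable after each update lets a chart redraw itself step by step.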
Now the data is in place and we can start playing with it. First we will create a variable totalDataVisual which will be used to display all the data in our dataset. Let’s make it dynamic by adding time.sleep between every update of the collection:
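```python
%pyspark
import time

# Bind an empty list first, then re-bind after every append
# so that the chart grows point by point
totalDataVisual = []
z.z.angularBind("totalData", totalDataVisual)
for i in fullSet:
    totalDataVisual.append(i)
    z.z.angularBind("totalData", totalDataVisual)
    time.sleep(0.1)
```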
The next three variables, trainingDataVisual, testingDataVisual and predictedDataVisual, will be used for displaying the work of the algorithm. We will split our original data into two pieces. The first one is our training set, which we will use to create a prediction model with the Linear Regression algorithm. Then we will apply this model to the second piece, called the test set, and compare the values calculated by the predictor with the real values we had at the start in order to observe its accuracy, that is, the cost of our function.
```python
%pyspark
import time
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# Reset the lists bound to the charts
totalDataVisual = []
trainingDataVisual = []
testingDataVisual = []
predictedDataVisual = []
z.z.angularBind("totalData", totalDataVisual)
z.z.angularBind("trainingData", trainingDataVisual)
z.z.angularBind("testingData", testingDataVisual)
z.z.angularBind("predictedData", predictedDataVisual)

# Build labeled points: label = profit (column 1), feature = population (column 0)
dataFullInverse = data.map(lambda line: LabeledPoint(float(line[1]), [float(line[0])]))

# Split into 70% training and 30% testing data
trainingData, testingData = dataFullInverse.randomSplit([.7, .3], seed=1234)
z.z.angularBind("trainingData", trainingData.map(lambda line: (float(line.features[0]), line.label)).collect())

# Train the model: 200 iterations with step size 0.2
linearModel = LinearRegressionWithSGD.train(trainingData, 200, .2)

# Predict each test point and publish the real and predicted values one by one
for i in testingData.map(lambda line: (float(line.features[0]), line.label)).collect():
    result = linearModel.predict([i[0]])
    predictedDataVisual.append((i[0], result))
    testingDataVisual.append((i[0], i[1]))
    z.z.angularBind("predictedData", predictedDataVisual)
    z.z.angularBind("testingData", testingDataVisual)
    time.sleep(1)
```
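Besides comparing the dots visually, the difference between real and predicted values on the test set can be condensed into a single number. The sketch below is a plain Python illustration with made-up numbers; `mean_squared_error` is a hypothetical helper, and in the notebook the pairs would have to be built from the test set and the model’s predictions.

```python
def mean_squared_error(pairs):
    """Average squared difference over (actual, predicted) pairs."""
    pairs = list(pairs)
    return sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs)


# Made-up (actual, predicted) values, for illustration only
pairs = [(3.0, 2.5), (5.0, 5.5), (7.0, 7.0)]
mse = mean_squared_error(pairs)
print("Mean squared error: %.4f" % mse)
```

The lower this value, the closer the predictions are to the real values of the test set.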
In order to visualize these calculations we will implement two methods in our Angular-based visualization paragraph:
- host: this method binds the values of our variables to the visualization function, so that every update on the Python side calls the rendering method
- displaydata: this method renders the plot chart using the D3 library
Once you run the code, you will first see the chart with all values of the dataset. Then you will see another chart which shows the values of the training set as blue dots, the original values of the testing set as red dots, and the values predicted for the testing set by the Linear Regression model as green dots.
The source notebook and data are available on GitHub.