Few month ago I had a chance to visit one technological summit driven by HortonWorks which was related to the evolution of Hadoop. There was a great number of amazing sessions held by different speakers who were describing their experience in working with Big Data. But one very important thing which I’ve noticed there was the fact that different projects and frameworks at some point were always somehow intersecting with another scope of problems called Machine Learning. This is probably one of the most exciting and interesting category of tasks solved by Big Data. That is why I decided to start covering this topic in my articles and today I’m going to tell you about one basic Machine Learning algorithm called Linear Regression.
The world of Big Data provides a wide variety of tools for organizing large amounts of data within a single place. But after we’ve pushed everything in place, usually we start seeking for the options of getting different benefits from this information. We involve statisticians and data experts for the investigation of our data and they start applying certain techniques to detect different features of such information. As a result these features will allow us to build some useful data models which will help us to improve the quality of our products. For example we can build the recommendations for our customers about which products they could like or dislike or even more we can identify the best characteristics for our new products relying on the current demands of our clients. Hadoop provides a set of tools for doing such operations and at this moment Spark is probably one of the best options which fits the demands of data specialists who work within Big Data realm. In this article I want to give the overview of this product and to show how to use it in HortonWorks Hadoop distribution sandbox.