Apache Spark – first steps in getting familiar with your data

The world of Big Data offers a wide variety of tools for organizing large amounts of data in a single place. But once everything is in place, we usually start looking for ways to derive value from this information. We bring in statisticians and data experts to investigate the data, and they apply various techniques to detect its distinctive features. These features then allow us to build useful data models that help us improve the quality of our products. For example, we can build recommendations for our customers about which products they are likely to like or dislike, or go further and identify the best characteristics for new products based on our clients' current demands. Hadoop provides a set of tools for this kind of work, and at the moment Spark is probably the option that best fits the needs of data specialists working in the Big Data realm. In this article I want to give an overview of this product and show how to use it in the Hortonworks Hadoop distribution sandbox.
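To give a flavor of what those first steps look like, here is a minimal sketch of exploring a dataset with Spark's DataFrame API. It assumes Spark 2.x; the HDFS path, the file name, and the `FirstLook` object are placeholders of my own, not taken from the article:

```scala
import org.apache.spark.sql.SparkSession

object FirstLook {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("first-look")
      .getOrCreate()

    // Load a CSV into a DataFrame, letting Spark infer the column types.
    // The path is a placeholder; in the sandbox it would point to a file
    // you have uploaded to HDFS.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///tmp/products.csv")

    df.printSchema()     // inspect the inferred structure
    df.show(10)          // peek at the first ten rows
    df.describe().show() // basic summary statistics per numeric column

    spark.stop()
  }
}
```

In the sandbox you could run the same statements interactively in `spark-shell`, where a ready-made `spark` session is already in scope, instead of packaging them into an application.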
