Spark receiver as a key part of your streaming application

I continue writing about my journey into the world of Big Data, and today I want to talk about one of the most popular Hadoop frameworks: Spark. Though Spark has one of the largest armies of contributors and is de facto the most popular tool for processing large volumes of data in Hadoop, it can still be quite challenging to find detailed information about certain parts of this product. Over the last few months I had to dive deep into one of the key components of a Spark streaming application, the Receiver. It is the starting point of any streaming flow implemented with this framework, which is why it is very important to know how it works and to understand how to configure and tune it properly. In this article I'd like to explain how the Spark receiver works and to share some knowledge about its configuration and tuning.
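As a taste of the tuning the full article covers, a receiver's ingestion rate and fault tolerance are controlled through a handful of Spark properties. A minimal `spark-defaults.conf` fragment (the values are illustrative placeholders, not recommendations):

```
# Let Spark adapt the receiving rate to the actual processing speed
spark.streaming.backpressure.enabled           true
# Hard upper bound on records per second for each receiver
spark.streaming.receiver.maxRate               1000
# How often received data is chopped into blocks (future RDD partitions)
spark.streaming.blockInterval                  200ms
# Persist received data to a write-ahead log for recovery after failures
spark.streaming.receiver.writeAheadLog.enable  true
```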

Continue reading “Spark receiver as a key part of your streaming application”

Streaming data to Hive using Spark

Real-time ingestion of data into a data store is probably one of the most common categories of scenarios that big data engineers meet while building their solutions. Fortunately, the Hadoop ecosystem provides a number of options for achieving this goal and designing efficient, scalable streaming applications. In my previous articles I have already described one way of implementing such solutions using the Hadoop framework Storm. Today I would like to tell you about an alternative approach that has become very popular among Big Data developers due to its simplicity and high efficiency: Spark Streaming.
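The core idea behind Spark Streaming is micro-batching: the incoming stream is sliced into small batches, and each batch is processed like an ordinary Spark job. A toy illustration of that model in plain Python (no Spark involved; `micro_batches` and `process_batch` are made-up names for this sketch only):

```python
def micro_batches(stream, batch_size):
    """Slice an iterable stream into fixed-size micro-batches,
    analogous to how Spark Streaming slices input by time interval."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial batch
        yield batch

def process_batch(batch):
    """Stand-in for the per-batch work (e.g. writing rows to Hive)."""
    return len(batch)

# A simulated stream of 10 events, processed 4 at a time
counts = [process_batch(b) for b in micro_batches(range(10), 4)]
# counts == [4, 4, 2]
```

In real Spark Streaming the slicing is done by time (the batch interval) rather than by record count, but the processing model per batch is the same.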

Continue reading “Streaming data to Hive using Spark”

Linear regression through Apache Zeppelin: how to visualize your PySpark backend using Angular and D3

A few months ago I had a chance to attend a technology summit run by HortonWorks devoted to the evolution of Hadoop. There were a great number of amazing sessions held by different speakers who described their experience working with Big Data. One very important thing I noticed there was that the various projects and frameworks, at some point, always intersected with another set of problems: Machine Learning. This is probably one of the most exciting and interesting categories of tasks solved with Big Data, which is why I decided to start covering this topic in my articles. Today I'm going to tell you about one basic Machine Learning algorithm: linear regression.
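Since the post is about linear regression, here is a minimal sketch of the method itself, ordinary least squares on a single feature, written in plain Python so it stands independent of Zeppelin and PySpark:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
# slope == 2.0, intercept == 1.0
```

In the article the same fit is computed at scale with PySpark, and the resulting line is what gets visualized through Angular and D3.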

Continue reading “Linear regression through Apache Zeppelin: how to visualize your PySpark backend using Angular and D3”

Apache Spark – first steps in getting familiar with your data

The world of Big Data provides a wide variety of tools for organizing large amounts of data in a single place. But after we've put everything in place, we usually start looking for ways to derive value from this information. We involve statisticians and data experts to investigate our data, and they apply certain techniques to detect different features of that information. These features then allow us to build useful data models that help us improve the quality of our products. For example, we can build recommendations about which products our customers might like or dislike, or go further and identify the best characteristics for new products based on the current demands of our clients. Hadoop provides a set of tools for such work, and at this moment Spark is probably the option that best fits the needs of data specialists working in the Big Data realm. In this article I want to give an overview of this product and show how to use it in the HortonWorks Hadoop distribution sandbox.
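The "first look" at a dataset usually starts with simple summary statistics, which Spark exposes through `describe()` on a DataFrame. The same idea in plain stdlib Python, just to show what those summaries contain (the sample values are invented):

```python
import math

def describe(values):
    """Count, mean, sample stddev, min and max for a numeric column,
    the same summary Spark's DataFrame.describe() prints."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {
        "count": n,
        "mean": mean,
        "stddev": math.sqrt(variance),
        "min": min(values),
        "max": max(values),
    }

stats = describe([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])
# stats["count"] == 6, stats["mean"] == 18.0
```

The difference with Spark is, of course, that there the same summary is computed in a distributed fashion over data that may not fit on one machine.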

Continue reading “Apache Spark – first steps in getting familiar with your data”