Spark receiver as a key part of your streaming application

I continue writing about my journey into the world of Big Data, and today I want to talk about one of the most popular frameworks in the Hadoop ecosystem, called Spark. Although Spark has one of the largest armies of contributors and is currently the de facto most popular tool for processing large volumes of data in Hadoop, it can still be quite challenging to find information about certain parts of this product. Over the last few months I had to dive deep into one of the key components of a Spark application, called the Receiver. It is the starting point of any streaming flow implemented using this framework, which is why it is very important to know how it works and how to configure and tune it properly. In this article I’d like to explain how the Spark receiver works and share what I have learned about its configuration and tuning.

Streaming data to Hive using Spark

Real-time processing of data into a data store is probably one of the most widespread categories of scenarios that big data engineers meet while building their solutions. Fortunately, the Hadoop ecosystem provides a number of options for achieving this goal and for designing efficient and scalable streaming applications. In my previous articles I have already described one way of implementing such solutions, using the Hadoop framework called Storm. Today I would like to tell you about an alternative approach that has become very popular among Big Data developers due to its simplicity and high efficiency: Spark Streaming.

Scheduling jobs in Hadoop through Oozie

One of the common problems that software engineers meet at different stages of application development is the scheduling of jobs and processes on a periodic basis. For this purpose the Windows OS family provides a special component called Task Scheduler. The Linux world offers its own alternative: a built-in daemon called Cron. The distributed Hadoop ecosystem, which runs on top of these underlying operating systems, introduces a set of scheduling challenges that differ from the typical tasks. Here we need to deal with jobs that run across a number of physical machines and flow between Hadoop services. To simplify the implementation of such workflows, the Big Data world offers a special component called Oozie. In this article I would like to give you an overview of this product.

Linear regression through Apache Zeppelin: how to visualize your PySpark backend using Angular and D3

A few months ago I had a chance to attend a technology summit organized by HortonWorks and devoted to the evolution of Hadoop. There were a great number of amazing sessions held by different speakers who described their experience of working with Big Data. But one very important thing I noticed there was that different projects and frameworks at some point always intersected with another scope of problems called Machine Learning. This is probably one of the most exciting and interesting categories of tasks solved by Big Data. That is why I decided to start covering this topic in my articles, and today I’m going to tell you about one basic Machine Learning algorithm called Linear Regression.

Accessing Kerberized Hadoop cluster using Ranger security policies and native APIs

Security has always played an important role in every information system. For each particular solution, before the actual implementation, we first need to carefully design its protection layers by means of techniques such as authorization, authentication, and encryption. But sometimes, early on, developers don’t pay much attention to this topic and concentrate their efforts on the functional aspects of their applications. This is what happened to Hadoop.

In my previous article about security in the world of Big Data I gave a high-level overview of this model. Now I want to share some experience of working with Hadoop services at a low level, straight from the source code. We will create a new principal in the Hadoop environment, grant it permissions in Ranger, and then use it from a Java application to access the services remotely.

Apache Spark – first steps in getting familiar with your data

The world of Big Data provides a wide variety of tools for organizing large amounts of data in a single place. But once we have put everything in place, we usually start looking for ways to benefit from this information. We involve statisticians and data experts to investigate our data, and they apply certain techniques to detect its different features. These features then allow us to build useful data models that help us improve the quality of our products. For example, we can build recommendations for our customers about which products they might like or dislike, or, going further, we can identify the best characteristics for our new products based on the current demands of our clients. Hadoop provides a set of tools for such operations, and at the moment Spark is probably one of the best options for data specialists who work in the Big Data realm. In this article I want to give an overview of this product and show how to use it in the HortonWorks Hadoop distribution sandbox.

Getting data into HBase and consuming it through native shell and Hive

If you have followed my previous articles, at this stage you probably have a general understanding of the primary components of the Hadoop ecosystem and the basics of distributed computation. But in order to implement performant processing, we first need to prepare a strong data foundation for it. Hadoop provides a number of solutions for this purpose, and the HBase data store is one of the best products in this ecosystem, allowing you to organize large amounts of data in a single place. In this article I want to describe some techniques of working with HBase: how to import the data, how to read it through the native API, and how to simplify its consumption through another Hadoop product called Hive.

Coordination of distributed applications through Zookeeper

When I started to work in the field of Big Data, one of the most challenging things for me was to understand the distributed nature of Hadoop applications. Most software developers usually think about their products in terms of single program components represented by standalone applications. Every program is an independent monolithic unit that is located on its own machine, runs in its own process, and is responsible for its own range of tasks. Big Data introduces another level of composition, where every single program can be distributed across the nodes of the cluster as a set of standalone services driven by some master service. This model creates new challenges related to synchronizing these services and maintaining a consistent state of the overall application. The problem is common to every distributed system, and instead of reinventing a custom solution for each particular product of the Hadoop family, the community created a universal tool called Zookeeper. In this article I want to give you an overview of this application and show some examples of working with it.

Consuming OpenCV through Hadoop Storm DRPC Server from .NET

In the previous article I gave a basic overview of the Hadoop Storm framework and showed how to use it to perform some simple operations, such as word counting and persisting information in real time. But the Storm service can cover a much wider scope of tasks. The scalable and dynamic nature of this product makes it possible to wrap even the most complicated algorithms and distribute their handling among different machines of the Hadoop cluster. Computer vision is a good candidate for this type of functionality. The term covers a large range of tasks related to processing graphical data and performing operations on it such as object detection, motion tracking, or face recognition. As you can imagine, these procedures can be quite expensive, and wrapping them in a scalable parallel processing model could significantly increase the capacity of the resulting solutions. You can find a good example of such an application, called Amazon Rekognition, in the list of official Amazon services. In this article I want to show you how to build similar products using the Hadoop Storm framework and open-source computer vision libraries.

Perfect Storm – real-time data streaming from .NET through Kafka to HBase, HDFS and Hive

In my previous articles I tried to give an overview of the primary Hadoop services responsible for storing data. With their help we can organize information into common structures and perform operations on them through tools such as MapReduce jobs or higher-level abstractions like Hive SQL or the HBase querying language. But before doing this we need to somehow put our data inside the cluster. The simplest way would be to copy the information between the environments and then either run a set of commands from the different service-related CLIs to do the import, or launch some bash scripts that partly automate this work. But it would be great to have a common tool that allows us to define different workflows for such processes, so that single units of information could be imported, transformed, aggregated, or passed through some algorithm before they are actually stored. A framework of this type should certainly be scalable and should follow the general requirements of a distributed environment. In Hadoop we have such a tool, called Storm, and from my point of view this product is probably one of the most interesting and exciting parts of the Big Data ecosystem. In this article I want to give you an overview of it and share my experience of using it.
