When I started working with Big Data, one of the most challenging things for me was understanding the distributed nature of Hadoop applications. Most software developers think of their products as single components represented by standalone applications. Each program is an independent, monolithic unit that lives on its own machine, runs in its own process and is responsible for its own range of tasks. Big Data introduces another level of composition, where a single program can be distributed across the nodes of a cluster as a set of standalone services driven by a master service. This model creates new challenges related to synchronizing these services and keeping the state of the overall application consistent. The problem is common to every distributed system, and instead of reinventing a custom solution for each product of the Hadoop family, the community created a universal tool called ZooKeeper. In this article I want to give you an overview of this application and show some examples of working with it.
In the previous article I gave a basic overview of the Apache Storm framework and showed how to use it for simple operations such as counting words and persisting information in real time. But Storm can cover a much wider range of tasks. The scalable and dynamic nature of this product makes it possible to wrap even the most complicated algorithms and distribute their processing among different machines of the Hadoop cluster. Computer vision is a good candidate for this kind of functionality. The term covers a large range of tasks related to processing graphical data, such as object detection, motion tracking or face recognition. As you can imagine, these procedures can be quite expensive, and wrapping them in a scalable parallel processing model could significantly increase the capacity of the resulting solutions. A good example of such an application, Amazon Rekognition, can be found in the list of official Amazon services. In this article I want to show you how to build similar products using the Storm framework and open-source computer vision libraries.
In my previous articles I tried to give an overview of the primary Hadoop services responsible for storing data. With their help we can organize information into common structures and operate on them through tools like MapReduce jobs or higher-level abstractions like Hive SQL or HBase queries. But before doing this we need to get our data into the cluster somehow. The simplest way would be to copy the information between environments and run a set of commands from the different service-specific CLIs to do the import, or to launch some bash scripts that partly automate this work. It would be much better, though, to have a common tool that allows us to define workflows for such processes, so that single units of information can be imported, transformed, aggregated or passed through some algorithm before they are actually persisted. Such a framework should certainly be scalable and should meet the general requirements of a distributed environment. The Hadoop ecosystem offers such a tool, called Storm, and from my point of view it is probably one of the most interesting and exciting parts of the Big Data ecosystem. In this article I want to give you an overview of it and share my experience of using it.