If you followed my previous articles, probably at this stage you should have common understanding of primary components of Hadoop ecosystem and basics of distributed calculations. But in order implement performant processing we first need to prepare a strong data foundation for it. Hadoop provides a number of solutions for this purpose and HBase data store is one of the best products in this ecosystem which allows you to organize big amounts of data in a single place. In this article I want to tell about some techniques of working with HBase – how to import the data, how to read it through native API and how to simplify its consumption through another Hadoop product called Hive.
In my previous articles I tried to give the overview of primary Hadoop services responsible for storing the data. With their help we can organize information into some common structures and perform operations upon them through such tools like MapReduce jobs or more high-level abstractions like Hive SQL or HBase querying language. But before doing this we certainly need to somehow put our data inside the cluster. The simplest way would be a copying of the information between the environments and performing a set of commands from different service-related CLIs to do the import or launching some bash scripts which can partly automate this work. But it would be great if we could have some common tool which would allow us to define different workflows for such processes so that single units of information could be imported, transformed, aggregated or passed through some algorithm before actual preservation. Such type of framework certainly should be scalable and should follow the general requirements of distributed environment. In Hadoop we have such tool called Storm and from my point of view this product is probably one of the most interesting and exciting parts of Big Data ecosystem. In this article I want to give you its overview and to share my experience of using it.
“Three admins once went to noSql bar, but a little while walked away from there as they could not find a table” – says one popular joke. The statement is arguable from my point of view, but it undercover the general idea – differences of approaches of treating the data. Most of developers usually start their career from working with some RDBMS products like Oracle or MSSQL server. These systems keep the data in some normalized format represented by a sets of tables and relations between them. In many cases this model ideally suits to accomplish the data-related requirements of many products. But sometimes such structure brings certain overheads which negatively effects the end performance of the application. In order to overcome these restrictions an alternative NoSQL approach for keeping and managing the data was invented. There are a lot of different implementations of this technique and each of them targets to resolve some individual tasks, but from Hadoop perspective HBase is probably one of the most commonly used NoSQL database which provides a strong and reliable mechanism for managing huge amounts of data across distributed environment. In this article I want to describe the general idea of this product and show some examples of working with it.
At this stage you probably have a general idea of what Hadoop is in technical scene. But why do we really need such a huge and complicated platform for doing such simple things like searching, counting or sorting our data. According to the research provided by Cisco last year annual global IP traffic will reach 2.3 zettabytes per year by 2020. Another research forecast performed by International Data Corporation few years ago stated that up to 2020 people will have to operate with 44 zettabytes of data. Can we really handle such capacities with our current hardware and algorithms? Hadoop is probably the best attempt to handle that problem at this time.
There is quite an interesting competition which exists in the world of Big Data called Terasort. It appeared in 2008 with the general idea to generate, sort and validate 1TB of data. At that period the result was 3 minute 48 seconds on Hadoop cluster of 910 nodes. By the time the amount of data increased to 100TB and just few month ago we got a new record of sorting 100TB of data for 98.8 seconds in the cluster of 512 nodes. The actual results are available Sort Benchmark Home page.