The world of Big Data provides a wide variety of tools for organizing large amounts of data within a single place. But after we’ve pushed everything in place, usually we start seeking for the options of getting different benefits from this information. We involve statisticians and data experts for the investigation of our data and they start applying certain techniques to detect different features of such information. As a result these features will allow us to build some useful data models which will help us to improve the quality of our products. For example we can build the recommendations for our customers about which products they could like or dislike or even more we can identify the best characteristics for our new products relying on the current demands of our clients. Hadoop provides a set of tools for doing such operations and at this moment Spark is probably one of the best options which fits the demands of data specialists who work within Big Data realm. In this article I want to give the overview of this product and to show how to use it in HortonWorks Hadoop distribution sandbox.
When I started to work in scope of Big Data, one of the most challenging things for me was to understand the distributed nature of Hadoop applications. Usually most part of software developers think about their products in terms of single program components which are represented by standalone applications. Every program is an independent monolith unit which is located in own machine, runs in its own process and responsible for custom range of tasks. Big data introduces another level of composition, where every single program can be distributed across the nodes of the cluster as set of standalone services driven by some master service. The development of such model creates new challenges related to the synchronization these services and handling the consistent state of the overall application. This problem is common for every distributed system and instead of reinventing custom solution for each particular product of Hadoop family, community created a universal tool called Zookeeper. In this article I want to give you overview of this application and show some examples of working with it.
“Three admins once went to noSql bar, but a little while walked away from there as they could not find a table” – says one popular joke. The statement is arguable from my point of view, but it undercover the general idea – differences of approaches of treating the data. Most of developers usually start their career from working with some RDBMS products like Oracle or MSSQL server. These systems keep the data in some normalized format represented by a sets of tables and relations between them. In many cases this model ideally suits to accomplish the data-related requirements of many products. But sometimes such structure brings certain overheads which negatively effects the end performance of the application. In order to overcome these restrictions an alternative NoSQL approach for keeping and managing the data was invented. There are a lot of different implementations of this technique and each of them targets to resolve some individual tasks, but from Hadoop perspective HBase is probably one of the most commonly used NoSQL database which provides a strong and reliable mechanism for managing huge amounts of data across distributed environment. In this article I want to describe the general idea of this product and show some examples of working with it.
If you ask me what is the post complicated part of Hadoop configuration, I will say that it is security. From early start of development of this product the main efforts were focused on making a stable distributed framework and security was not the priority of that time. The base assumption was that system would work as a part of some trusted network environment and simple security model would be sufficient to cover the requirements of that period. But by the time Hadoop evolved and the problems of more complicated security challenges started to play more and more important role. Especially is became a sharp question once Big Data started to drive into the side cloud computing. So the integration of Kerberos protocol became the first serious step made in this direction. After authentication part logically community started to solve the problems related to authorization. According to basic security model most part of the services worked with custom Access Control Lists (ACL) and the general idea was to localize their management in a single place. Cloudera invented Senrty product and HortonWorks proposed alternative in view of Ranger application. Later on security components were improved with other features like support of encryption, protection of RESTful endpoints, integration with Active Directory and other. In this article I want to give the general overview of primary parts of Hadoop security model.
HDFS is a core and fundamental component of Hadoop. This file system is oriented on handling huge amounts of data. From first glance you may not notice much differences from usual Linux file system as it follows lots of POSIX specifications. But behind the scene HDFS does a lot of extra work to provide stable and quick access to the data which is stored across different machines of distributed cluster. It is indeed a great data management mechanism which takes all responsibility for building most optimal data-flows according to the network topology of the cluster and which performs automatic handling of critical situations related to the breaks in hardware. In this article I’ll try to give a general overview of Hadoop file system and show some common techniques of working with the data inside it.
At this stage you probably have a general idea of what Hadoop is in technical scene. But why do we really need such a huge and complicated platform for doing such simple things like searching, counting or sorting our data. According to the research provided by Cisco last year annual global IP traffic will reach 2.3 zettabytes per year by 2020. Another research forecast performed by International Data Corporation few years ago stated that up to 2020 people will have to operate with 44 zettabytes of data. Can we really handle such capacities with our current hardware and algorithms? Hadoop is probably the best attempt to handle that problem at this time.
There is quite an interesting competition which exists in the world of Big Data called Terasort. It appeared in 2008 with the general idea to generate, sort and validate 1TB of data. At that period the result was 3 minute 48 seconds on Hadoop cluster of 910 nodes. By the time the amount of data increased to 100TB and just few month ago we got a new record of sorting 100TB of data for 98.8 seconds in the cluster of 512 nodes. The actual results are available Sort Benchmark Home page.
If you ask me what is the most complicated part of Hadoop, I will tell you that it is configuration. It’s really a nightmare to keep in sync all these parts and their dependencies. You have to know and properly configure hundreds of different properties per each Hadoop daemon. At some stage you start to update of one part of your cluster and it breaks another. You fix it and this fix breaks something else. As a result instead of working with your data and writing your code you spend days in searching correct patches and configurations for your daemons.