Accessing Kerberized Hadoop cluster using Ranger security policies and native APIs

Security has always played an important role in every information system. Before implementing any particular solution we first need to carefully design its protection layers by means of techniques such as authentication, authorization and encryption. But at the early stages developers sometimes don’t pay much attention to this topic and concentrate their efforts on the functional aspects of their applications. This is what happened to Hadoop.

In my previous article about security in the world of Big Data I gave a high-level overview of this model. Now I want to share some experience of working with Hadoop services at a low level, straight from the source code. We will create a new principal in the Hadoop environment, grant it permissions in Ranger, and then use it from a Java application to access the services remotely.
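To give a taste of what the article covers, here is a minimal sketch of logging in with a keytab from Java and listing an HDFS directory. The principal name, realm, keytab path and NameNode address are placeholders I made up, and in a real setup the remaining Kerberos properties usually come from the cluster’s core-site.xml and hdfs-site.xml; the Ranger policies themselves are evaluated on the server side, so they need no extra client code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");        // assumed NameNode address
        conf.set("hadoop.security.authentication", "kerberos");       // enable Kerberos authentication

        // Authenticate with the keytab generated for the new principal (names are placeholders).
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "demo-user@EXAMPLE.COM", "/etc/security/keytabs/demo-user.keytab");

        // If the Ranger policy grants this principal read access to the path, the listing succeeds;
        // otherwise the NameNode rejects the call with an authorization error.
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/protected/data"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```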

Continue reading “Accessing Kerberized Hadoop cluster using Ranger security policies and native APIs”

Perfect Storm – real-time data streaming from .NET through Kafka to HBase, HDFS and Hive

In my previous articles I tried to give an overview of the primary Hadoop services responsible for storing data. With their help we can organize information into common structures and operate on it through tools like MapReduce jobs or higher-level abstractions like Hive SQL or the HBase query language. But before doing this we certainly need to get our data into the cluster somehow. The simplest way would be to copy the information between environments and run a set of commands from the different service-related CLIs to do the import, or to launch bash scripts that partly automate this work. But it would be great to have a common tool that lets us define workflows for such processes, so that single units of information can be imported, transformed, aggregated or passed through some algorithm before they are actually persisted. Such a framework certainly needs to be scalable and to meet the general requirements of a distributed environment. In the Hadoop world we have such a tool, called Storm, and from my point of view it is probably one of the most interesting and exciting parts of the Big Data ecosystem. In this article I want to give you an overview of it and share my experience of using it.
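As a preview, here is a rough sketch of such a topology: a Kafka spout feeding a simple bolt. It assumes the storm-core and storm-kafka-client 1.x dependencies are on the classpath, and the broker address, topic and component names are placeholders of mine; the pipeline described in the article additionally attaches HBase, HDFS and Hive writers downstream:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StreamingTopologySketch {

    // A trivial bolt that stands in for real parsing: it uppercases the raw Kafka message value.
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String raw = input.getStringByField("value");   // "value" is emitted by the default KafkaSpout translator
            collector.emit(new Values(raw.toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("parsed"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder broker address and topic name.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("broker-host:9092", "events-topic").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
        builder.setBolt("parse-bolt", new ParseBolt(), 2).shuffleGrouping("kafka-spout");
        // HBase, HDFS and Hive bolts would be attached the same way, each subscribed to "parse-bolt".

        // Run locally for testing; on a real cluster the topology would go through StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("streaming-demo", new Config(), builder.createTopology());
    }
}
```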

Continue reading “Perfect Storm – real-time data streaming from .NET through Kafka to HBase, HDFS and Hive”

Exploring Hadoop file system

HDFS is a core and fundamental component of Hadoop. This file system is designed to handle huge amounts of data. At first glance you may not notice many differences from a usual Linux file system, as it follows much of the POSIX specification. But behind the scenes HDFS does a lot of extra work to provide stable and fast access to data that is stored across the different machines of a distributed cluster. It is indeed a great data management mechanism: it takes responsibility for building optimal data flows according to the network topology of the cluster and automatically handles critical situations caused by hardware failures. In this article I’ll give a general overview of the Hadoop file system and show some common techniques for working with the data inside it.
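As a preview of those techniques, here is a small sketch based on the Hadoop FileSystem Java API (the NameNode address and paths are placeholders I picked): it writes a file, lists the directory and reads the file back, mirroring what hdfs dfs -put, -ls and -cat do from the command line:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTour {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // assumed NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo");
            Path file = new Path(dir, "hello.txt");

            // Write a small file; replication and block size come from the cluster defaults.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // List the directory contents, similar to `hdfs dfs -ls /user/demo`.
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
            }

            // Read the file back, similar to `hdfs dfs -cat`.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```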

Continue reading “Exploring Hadoop file system”

Counting and sorting words in Hadoop

At this stage you probably have a general idea of what Hadoop is from a technical point of view. But why do we really need such a huge and complicated platform for doing such simple things as searching, counting or sorting our data? According to research published by Cisco last year, annual global IP traffic will reach 2.3 zettabytes by 2020. Another forecast, made by International Data Corporation a few years ago, stated that by 2020 people will have to operate with 44 zettabytes of data. Can we really handle such volumes with our current hardware and algorithms? Hadoop is probably the best attempt to handle this problem at the moment.

There is quite an interesting competition in the world of Big Data called TeraSort. It appeared in 2008 with the general idea of generating, sorting and validating 1 TB of data. At that time the result was 3 minutes 48 seconds on a Hadoop cluster of 910 nodes. Since then the amount of data has grown to 100 TB, and just a few months ago we got a new record: 100 TB of data sorted in 98.8 seconds on a cluster of 512 nodes. The current results are available on the Sort Benchmark home page.
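For readers who want to jump straight to the code, below is a minimal sketch of the classic WordCount MapReduce job this article builds on (the paths and class names are illustrative, and the full version in the article differs in details): the mapper emits (word, 1) pairs, the shuffle phase sorts them by key, and the reducer sums the counts:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Splits each input line into tokens and emits (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Receives all counts for a given word (already sorted by key) and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));      // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));   // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```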

Continue reading “Counting and sorting words in Hadoop”

First dive into Hadoop

In my first article I want to share my experience and the steps I took to start working in the world of Big Data. As a guy from the .NET stack, it was quite challenging for me to understand which technologies I should start studying to get into the world of Hadoop, and to identify the first practical steps needed to start working in this realm.

Continue reading “First dive into Hadoop”