If you ask me what is the most complicated part of Hadoop, I will tell you that it is configuration. It’s really a nightmare to keep in sync all these parts and their dependencies. You have to know and properly configure hundreds of different properties per each Hadoop daemon. At some stage you start to update of one part of your cluster and it breaks another. You fix it and this fix breaks something else. As a result instead of working with your data and writing your code you spend days in searching correct patches and configurations for your daemons.
And this is where you start working with Hadoop vendors. In common case they are founders and main contributors of Big Data ecosystem. They sell Hadoop as a product, fully configured and ready for the real work. Also they continuously improve distributions with new features and patches, make regular stable releases which could be applied into the real production environment without any breaks. Besides they provide full real-time and non-real-time support of their products. You can say that Hadoop is an open-source product and we can build our own cluster from different source controls of the ecosystem ourselves, but I’m just curious how much money, people and efforts would it take to create some independent Hadoop cluster of 50 machines and continuously support it for few years with proper updates and fixes. Probably very expensive.
From year to year the number of Hadoop vendors is growing. I will not give you their deep comparison as it could be the topic of separate article and would require more advanced background of knowledge of Hadoop from the side of the reader (some more information can be found on Pluralsite course Big Data – Big Picture but it is a bit obsolete). I will just mention theses few primary vendors how came into my scope at some stages of working with Big Data. They share the most part of the market:
- Cloudera – first and probably main player in the world of Big Data. The company was founded in 2008 by the engineers from Facebook, Google and Yahoo who brought primary concepts of distributed computations and built the foundation of the ecosystem.
- Hortonworks – founded in 2011 by another groups of co-founders of Hadoop. This company also became important player on Big Data market as it provides well-balanced distribution for storing the data and processing it. Besides the engineers of Hortonworks are leading contributors of many Hadoop solutions like YARN and Storm. They are following full open-sourced model comparing to Cloudera which has some non-public parts in their distribution.
- MapR – was founded in 2008. The company contributes to such Big Data projects like HBase, Pig (programming language), Apache Hive, and Apache ZooKeeper. Besides it provided a special version of HDFS called MapR FS with some enriched API and data storage functionality.
- IBM BigInsights – Hadoop distribution provided by IBM which targets specific analytical purposes. Their cluster is oriented on fast data processing using such Hadoop tools as Spark.
From this description we can underline two categories of problems which Hadoop resolves – data storage and data processing and all vendors are really balancing between two of them, spending more efforts either to find a better way to deal with first or second or to make an overall approach to cover them both.
In order to continue our diving into the realm of Big Data we need to start working with the valid cluster. By word valid I mean that it should have a number of things installed and properly configured. But it could take months to pass through this individually without having some solid background. And the reason why I started to talk about the vendors at the beginning of this article is that actually most of them provide such demonstration cluster fully configured and ready for work in a view of sandbox. It is a virtual instance, same as we did before in previous article using VirtualBox and Ubuntu OS but with a number of installations of different packages with compatible dependencies and various configurations. Most part of my practical experience is related to Hortonworks version of Hadoop cluster, so I will continue the article basing on this their distribution. But I’m sure that other versions of the cluster provided by different vendors would have a lot of common features and workarounds. I guess examples which I want to show and problems which I want to touch will be actual for them also.
Downloading and investigating sandbox
Now lets go to the Hortonworks official site and download our sandbox. But before doing this make sure that your machine satisfies hardware requirements. Especially it is related to the virtual memory. Ideally you need to have at least 16 GB of RAM on your local machine. Taking into account that your main OS will eat about 4-5 GB you’ll be able to operate with almost 11 GB of memory for your sandbox. Potentially things could work with 8GB of RAM (4-5 GB used by Sandbox), but keep in mind that once you’ll start working with the data, memory consumption will start growing and at some stage you can just get OutOfMemory message from your SandBox. This can corrupt some of your daemons so you’ll have to either just restart them or to perform some recovery steps to recreate valid state which is quite time-consuming procedure. Unfortunately lower bounds of hardware capacities would not work.
The official Hortonworks sandbox can be downloaded from download section of their site. There are options available for Docker and VMware but I will use VirtualBox instance. After downloading the image you should import it using Import Virtual Appliance => Import of the VirtualBox application.
Before moving on I want to mention the list of some useful and commonly used operations which you will use in Linux from the command line:
- ll – shows the content of the catalog with some detail information
- pwd – shows current location
- cd – change current directory
- cp – copy resource from one location into another
- mv – move resource from one location into another
- cat – shows the content of the file from beginning
- tail – shows the content of the file from the end
- chmod – edit the the permissions to access the file by the owner, owning group group and other users
- chown – change the owner and owning group of the file
- vim – edit the content of the file using vim editor
Besides you will use Hadoop-related commands which are the part of Hadoop distribution binaries:
- hdfs dfs – command for working with HDFS. Supports many native Linux commands for working with the resources of the file system. The full list for Hadoop 2.6.0 version (installed in previous article) is available on the official site
- yarn -jar – schedules new MapReduce job from java jar file
Now lets start the our virtual instance. Hortonworks uses CentOS version of Linux for the sandbox. RedHat and Oracle are another versions of Linux used by this vendor for hosting Hadoop in real production environment. When you launch the instance, CentOS starts the boot stage and launches all necessary daemons and their dependencies. After successful boot-up you should see such screen in you VirtualBox application desktop:
Now we are ready to work with Hadoop environment. Pay attention on the initial link to local web resource proposed by the session greeting message. Following this link you can get to information resource where you can find all necessary credentials for working with the cluster.
There are two common approaches to start investigation and work with our cluster:
Working through Ambari
High level approach – in order to obtain general information about the cluster and perform certain configuration changes of standalone daemons Hortonworks introduces custom open-source tool called Ambari (Ambari is a Hortonworks specific tool so you will not find it in distributions of other vendors but you can manually install it there). If you remember we had two different UI applications in previous example – one for HDFS and one for YARN. Many other Hadoop tools also have their UI applications. Ambari is an attempt to generalize them into a single place, so that administrator can login and perform required steps for maintaining the cluster. By default Ambari in sandbox is accessible from http://127.0.0.1:8080/#/login with raj_ops – raj_ops (admin-admin in old versions of sandbox) username and password:
Ambai Dashboard provides overall information about the cluster. Under the hood it uses distributed monitoring system, usually Ganglia, but as an option Nagois can be applied for this. Shows the statistical values of working of the services like memory consumption, disc usage, average up-time and other. Flexible for different type of custom configurations.
Ambari Services allow to perform configuration changes on the standalone daemons across all nodes of the cluster. Its really a great feature especially when you are working with multiple nodes and you have to keep versioning of these changes. Besides users can easily start-restart-stop services after some updates and monitor the state of their instances on different machines.
Ambari Hosts – as different services can be hosted on different nodes, especially in large clusters sometimes it is very easy to loose such information from your scope. Ambari hosts helps you with this by showing a list of components per each individual host. Besides it provides user with the information about available resources of the hosts, so that before installing some new component you can properly evaluate the capacities and pick up correct background for the new functionality
Ambari Alerts – Hadoop administrators can configure notification policy through this tool. If something goes wrong with some type of service, they can automatically get information about this by subscribing to the Ambari API Alerts section. For example such request will return all critical alerts in the cluster:
Ambari Views – Ambari allows to implement specific service-based tools called Views:
Depending on the type of View user can access HDFS directly from UI, perform queries through query-based services like Pig and Hive, submit more complicated Hadoop workflows through such services like Storm and do other interesting things. The main benefit is that can be done from the single place without any direct interaction with Linux environment. Views are configurable and allow another great piece of flexibility in managing resources.
Investigating sandbox through shell
Low level approach – in most cases when you work with Ambari UI you get only general information about the cluster. In real world very often people have to create ssh connection directly to Linux hosts an operate with Hadoop services using command shell.
Lets try to investigate our sandbox using this approach. Open your SSH client and connect to your local Big Data host at the address 127.0.0.1:2222 using root username and hadoop password. On the first connect to sandbox usually you have to change root password. After this lets examine the root of our environment using following command:
dr-xr-xr-x 2 root root 4096 Oct 25 07:56 bin
drwxr-xr-x 3 root root 4096 Oct 25 07:19 boot
drwxr-xr-x 3 root root 4096 Oct 25 07:38 cgroups_test
drwxr-xr-x 14 root root 2760 Jan 29 19:33 dev
drwxr-xr-x 1 root root 4096 Jan 29 19:52 etc
drwxr-xr-x 9 root root 4096 Jan 29 19:17 hadoop
drwxr-xr-x 44 root root 4096 Oct 25 08:04 home
drwxr-xr-x 4 kafka hadoop 4096 Oct 25 08:14 kafka-logs
dr-xr-xr-x 8 root root 4096 Oct 25 07:38 lib
dr-xr-xr-x 8 root root 12288 Oct 25 07:56 lib64
drwx—— 2 root root 4096 Jun 2 2016 lost+found
drwxr-xr-x 2 root root 4096 Sep 23 2011 media
drwxr-xr-x 2 root root 4096 Sep 23 2011 mnt
-rw——- 1 root root 538 Jan 29 19:34 nohup.out
drwxr-xr-x 4 root root 4096 Oct 25 08:16 opt
drwxr-xr-x 2 root root 4096 Oct 25 07:19 packer-files
dr-xr-xr-x 214 root root 0 Jan 29 19:33 proc
dr-xr-x— 1 root root 4096 Jan 29 19:52 root
dr-xr-xr-x 2 root root 4096 Oct 25 07:56 sbin
drwxr-xr-x 2 root root 4096 Jun 2 2016 selinux
drwxr-xr-x 2 root root 4096 Sep 23 2011 srv
dr-xr-xr-x 13 root root 0 Jan 29 19:33 sys
drwxrwxrwt 1 root root 4096 Jan 29 20:04 tmp
drwxr-xr-x 1 root root 4096 Oct 25 07:28 usr
drwxr-xr-x 5 root root 4096 Oct 25 07:21 vagrant
drwxr-xr-x 1 root root 4096 Oct 25 07:57 var
With a special font I marked out the catalogues which are probably well know to the people with some experience of working on Linux. But for these of you who are not familiar with it I want to give their brief explanation:
- bin – contains primary binaries of the command used in operation system like ls, cat, tail
- sbin – contains other category of binaries of commands available to superusers
- etc – contains configuration files of the operation system. Each new application installed into the system should keep references to its configuration in this directory
- home – contains directories of the individual users
- lib (lib64) – contains libraries consumed by different applications around the environment
- tmp – used by different applications to keep non-sensitive temporary information
- usr – contains all programs installed by the users. Our actual Hadoop daemons are sitting there. Usually biggest catalog of the OS.
- var – contains the variable components of the applications of the system. Usually it is log files, cached information or journals
The reason why I gave this description is that Hadoop services also use these directories and follow same hierarchy in organizing their own content. You can find links to the configurations of the applications in the /etc catalog for example or logs in /var/log catalog. But the actual Hadoop files are stored in the /usr/hdp/current location. “current” is actually a symbolic link to the location of certain version of Hadoop. Such approach simplifies the maintains and next updates of the cluster where all you need to do is to switch the symbolic link from old catalog to the new one:
lrwxrwxrwx 1 root root 28 Oct 25 07:46 hadoop-client -> /usr/hdp/184.108.40.206-1245/hadoop
lrwxrwxrwx 1 root root 27 Oct 25 07:46 hbase-client -> /usr/hdp/220.127.116.11-1245/hbase
lrwxrwxrwx 1 root root 27 Oct 25 07:46 hbase-master -> /usr/hdp/18.104.22.168-1245/hbase
lrwxrwxrwx 1 root root 26 Oct 25 07:46 hive-client -> /usr/hdp/22.214.171.124-1245/hive
lrwxrwxrwx 1 root root 26 Oct 25 07:46 hive-metastore -> /usr/hdp/126.96.36.199-1245/hive
lrwxrwxrwx 1 root root 26 Oct 25 07:46 hive-server2 -> /usr/hdp/188.8.131.52-1245/hive
lrwxrwxrwx 1 root root 27 Oct 25 07:46 kafka-broker -> /usr/hdp/184.108.40.206-1245/kafka
lrwxrwxrwx 1 root root 26 Oct 25 07:46 knox-server -> /usr/hdp/220.127.116.11-1245/knox
lrwxrwxrwx 1 root root 26 Oct 25 07:46 livy-client -> /usr/hdp/18.104.22.168-1245/livy
lrwxrwxrwx 1 root root 26 Oct 25 07:46 livy-server -> /usr/hdp/22.214.171.124-1245/livy
lrwxrwxrwx 1 root root 27 Oct 25 07:46 oozie-client -> /usr/hdp/126.96.36.199-1245/oozie
lrwxrwxrwx 1 root root 27 Oct 25 07:46 oozie-server -> /usr/hdp/188.8.131.52-1245/oozie
lrwxrwxrwx 1 root root 25 Oct 25 07:46 pig-client -> /usr/hdp/184.108.40.206-1245/pig
I listed only a part of directories at this location but keep in mind that this is the core of Hadoop, where each catalog represents certain service. Lets take a look into the content of one of them:
drwxr-xr-x 2 root root 4096 Oct 25 07:29 bin
drwxr-xr-x 3 root root 4096 Oct 25 07:29 etc
-rw-r–r– 1 root root 8517435 Aug 26 01:19 hadoop-hdfs-220.127.116.11.5.0.0-1245.jar
-rw-r–r– 1 root root 3584744 Aug 26 01:19 hadoop-hdfs-18.104.22.168.5.0.0-1245-tests.jar
lrwxrwxrwx 1 root root 34 Oct 25 07:29 hadoop-hdfs.jar -> hadoop-hdfs-22.214.171.124.5.0.0-1245.jar
-rw-r–r– 1 root root 102480 Aug 26 01:19 hadoop-hdfs-nfs-126.96.36.199.5.0.0-1245.jar
lrwxrwxrwx 1 root root 38 Oct 25 07:29 hadoop-hdfs-nfs.jar -> hadoop-hdfs-nfs-188.8.131.52.5.0.0-1245.jar
lrwxrwxrwx 1 root root 40 Oct 25 07:29 hadoop-hdfs-tests.jar -> hadoop-hdfs-184.108.40.206.5.0.0-1245-tests.jar
drwxr-xr-x 2 root root 4096 Oct 25 07:29 lib
drwxr-xr-x 2 root root 4096 Oct 25 07:29 sbin
drwxr-xr-x 8 root root 4096 Oct 25 07:29 webapps
You can see same bin, etc, sbin and lib catalogs described earlier. Such structure is typical for every Hadoop daemon per each host – configuration part, public binary part, custom binary part and some libraries. For example in /usr/hdp/current/hadoop-hdfs-namenode/bin you can find the actual hdfs command which you use for working with the file system. Now lets try to find yarn program. If you are expecting difficulties with this actually there is another Linux command which can help you to do this:
find / -type f -name yarn
This is the actual location of our binary command which schedules MapReduce jobs from jar files.
Logging piece of Hadoop generally configured to use log4j (advanced settings are available through etc/hadoop/log4j.properties file) framework. Log files can be found in /var/log location:
drwxr-xr-x 3 falcon falcon 4096 Oct 25 08:17 falcon
drwxr-xr-x 2 flume hadoop 4096 Oct 25 07:32 flume
drwxrwxr-x 1 root hadoop 4096 Jan 29 19:33 hadoop
drwxr-xr-x 1 mapred hadoop 4096 Oct 25 07:38 hadoop-mapreduce
drwxr-xr-x 1 yarn hadoop 4096 Oct 25 07:38 hadoop-yarn
drwxr-xr-x 3 hbase hbase 4096 Oct 25 08:17 hbase
drwxr-xr-x 1 hive hadoop 4096 Jan 29 21:02 hive
drwxr-xr-x 2 hive hive 4096 Oct 25 07:35 hive2
drwxr-xr-x 2 hive hive 4096 Oct 25 07:34 hive-hcatalog
drwxr-xr-x 2 root root 4096 Oct 25 08:17 hst
drwx—— 2 root root 4096 Oct 25 07:57 httpd
drwxr-xr-x 2 hue hue 4096 Oct 25 08:17 hue
drwxr-xr-x 2 kafka hadoop 4096 Oct 25 08:17 kafka
drwxr-xr-x 2 knox knox 4096 Oct 25 08:17 knox
According to best practices each node of the cluster has a group called Hadoop and a collection of users who are the members of this group. The /etc catalog has some system configuration files which contain this information:
cat /etc/group hadoop:x:505:hive,storm,zookeeper,infrasolr,atlas,ams,zeppelin,livy,spark,flume,kafka,hdfs,sqoop,yarn,mapred,hbase,knox,hcat,hue,admin
Each Hadoop daemon is running under appropriate user which is configured to access certain resources according to internal Access Control List of individual service. Such model is called simple or basic security model. More advanced approach uses Kerberos security for authentication purposes and Ranger as a tool for single authorization to all services all over the cluster.
In the following article I tried to show you how to deploy standalone Hadoop cluster from a sandbox provided by Hortonworks. This is indeed a great platform for studying proposes as this product contains exact same content which you can meet in real production environment. You can play around with configurations, deploy your own packages or external products, test the logic of your software locally without any fears of breaking something and affecting someone. In my next articles I will keep using this platform and try to make a deeper dive into MapReduce and to show you how such Hadoop tools like Hive and Pig simplify the processing of the data using this paradigm.