Hadoop security overview

If you ask me what the most complicated part of Hadoop configuration is, I will say that it is security. From the early days of the project the main efforts were focused on building a stable distributed framework, and security was not a priority at that time. The base assumption was that the system would run inside a trusted network environment, so a simple security model was sufficient to cover the requirements of that period. But as Hadoop evolved, more complicated security challenges started to play an increasingly important role, and the question became especially sharp once Big Data started moving towards cloud computing. The integration of the Kerberos protocol became the first serious step made in this direction. After the authentication part, the community logically started to solve the problems related to authorization. In the basic security model most services worked with custom Access Control Lists (ACLs), and the general idea was to localize their management in a single place: Cloudera introduced the Sentry product, while Hortonworks proposed an alternative in the form of the Ranger application. Later on the security components were improved with other features such as support for encryption, protection of RESTful endpoints, integration with Active Directory and others. In this article I want to give a general overview of the primary parts of the Hadoop security model.

Step 1. Basic security model

By default a Hadoop cluster works in non-secured mode. That means that all services trust each other and do not perform actual authentication. They all act on behalf of their hosting operating system accounts and simply follow the security rules of the file system. In Linux this security model is based on three categories of ownership for each object (the owning user, the owning group and all other users) and three types of permissions per category (read, write and execute). Each resource in the file system has these attributes defined, and every time somebody wants to perform an operation upon an entity, Linux validates whether the caller has a sufficient level of access. Let's take a look at how this works in our cluster. First connect to your sandbox and list the content of some directory, /tmp for example:

ll /tmp

drwxr-xr-x  2 oozie         hadoop             4096 Feb  1 11:48 oozie-oozi910497584142374123.dir
drwxr-xr-x  2 root          root               4096 Jan 31 11:29 oozie-root1470142604609041610.dir
-r--r--r--  1 root          root          169803311 Oct  6  2015 oozie-sharelib.tar.gz

The output of the ll command provides information about all resources within the requested location. It shows the following:

  • permissions of the resource: the first symbol is “d” for a directory, “-” for a file and “l” for a link (the equivalent of a shortcut); the next nine characters, in three sets of three, indicate permissions for the owner, the group and others – “r” read access, “w” write access and “x” execute access;
  • number of links to this resource;
  • user owner;
  • group owner;
  • resource size;
  • modification date;
  • resource name.

For our first resource we see that it is a directory with read-write-execute access for the oozie user, read-execute access for members of the hadoop group and read-execute access for all other users. The third resource is a file with read-only access for everybody. These permissions can easily be managed by two Linux commands:

  • chown – changes the owning user and owning group of a resource
  • chmod – changes the access attributes within the applied ownership rule

These two commands allow you to manage the permissions policy over the whole file system. Let's create a file from our root user account, change its ownership to the hdfs user and the hadoop group, and then restrict its access policy to read-write for these owners and no access for other users:

su root

touch test.txt

ll test.txt

-rw-r--r-- 1 root root 0 Feb  9 08:46 test.txt

chown hdfs:hadoop test.txt

chmod 660 test.txt

ll test.txt

-rw-rw---- 1 hdfs hadoop 0 Feb  9 08:46 test.txt

As you can see, we changed the ownership of our file and it is now accessible only to the hdfs user and the members of the hadoop group.

The same security model was copied to the HDFS file system, and the Hadoop superuser can manage exactly the same privileges from the command line using the hdfs utility. HDFS supports the ls command to output the permissions applied to data, and it also supports the chown and chmod commands. All of them are available for the Hadoop file system through the hdfs utility:

hdfs dfs -put test.txt /tmp

hdfs dfs -ls /tmp

-rw-r--r--   1 root      hdfs         12 2017-02-10 00:07 /tmp/test.txt

When we put data into HDFS, Hadoop automatically sets the owner according to the account that launched the command and sets the owning group to hdfs. To restore the previous ownership policy we can apply the following commands:

sudo -u hdfs hdfs dfs -chown hdfs:hadoop /tmp/test.txt

sudo -u hdfs hdfs dfs -chmod 660 /tmp/test.txt

-rw-rw----   1 hdfs hadoop         12 2017-02-10 00:07 /tmp/test.txt

If you look into the list of users that exist in the system, you will see that the sandbox contains a separate account for each Hadoop service, and most of them are members of the hadoop group or of the hdfs superuser group:

cat /etc/passwd

flume:x:515:505::/home/flume:/bin/bash
kafka:x:516:505::/home/kafka:/bin/bash
hdfs:x:517:505::/home/hdfs:/bin/bash
sqoop:x:518:505::/home/sqoop:/bin/bash
yarn:x:519:505::/home/yarn:/bin/bash
mapred:x:520:505::/home/mapred:/bin/bash
hbase:x:1002:505::/home/hbase:/bin/bash
knox:x:522:505::/home/knox:/bin/bash
hcat:x:523:505::/home/hcat:/bin/bash

cat /etc/group

hadoop:x:505:hive,storm,zookeeper,infra-solr,atlas,ams,zeppelin,livy,spark,flume,kafka,hdfs,sqoop,yarn,mapred,hbase,knox,hcat,hue,admin

hdfs:x:507:hdfs,admin,mapred

The same set of accounts and groups is propagated to all the nodes of the cluster, and even with the more advanced security models these accounts still have to exist for the cluster to work correctly.
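A quick way to see how Hadoop itself resolves these group memberships is the hdfs groups command, which asks the NameNode which groups it associates with a given account. For the sandbox accounts shown above the output would be something like the following (shown only as an illustration):

hdfs groups hdfs

hdfs : hdfs hadoop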

Step 2. Kerberized cluster

The simple security model can definitely be good enough for a study platform or a development environment, as it abstracts the developer away from the complexity of administering different sets of configurations. But in real production environments, where external users from other domains need to interact with the cluster, such a full-trust policy does not fit. That is why Hadoop introduced the Kerberos protocol as the tool for authenticating all consumers of a Big Data environment. This part may contain a bit of overhead for people who do not have access to a Kerberized platform and have only started to study the technology, but at least I will try to explain the basic concepts of this paradigm. When we decide to Kerberize our cluster, we are saying that from that moment every operation against any Hadoop service will be performed with an extra step: a special security check which helps the target daemon to identify the original caller. So if user Bob from the same domain is trying to access the Hadoop file system, the NameNode will know that it is indeed Bob and not some YARN component. The way he is going to confirm this is through interaction with a special Kerberos component called the Key Distribution Center (KDC), which contains the database of all Kerberos principals, including Bob and the NameNode, and knows how to identify them. Bob sends a request to the KDC to confirm that he is really Bob, and another request to ask for access to the NameNode service. The KDC handles the verification and, in case of success, returns him a special ticket. He then sends this ticket to the target service, which is able to decrypt the information and confirm that the original caller is really Bob:

[Diagram: Kerberos authentication flow between the client, the KDC and the target service]

You should understand that the main purpose of this process is authentication only. At the end the target service will know who sent the request, but not whether this sender is allowed to perform the actions contained in it.

The process of Kerberizing a non-secured cluster is not simple. Before doing it you first have to complete certain manual preparation steps, such as installing and properly configuring the Key Distribution Center on the primary host and deploying Kerberos clients on all nodes of the cluster. Then you need to do plenty of work related to creating principals and mapping them to the configurations of the services. The good thing is that Hadoop vendors usually provide extra tools which can heavily simplify parts of this process. For example, in the Hortonworks distribution this work can be partly automated through the Ambari wizard for enabling Kerberos:

[Screenshot: Ambari Enable Kerberos wizard]

This tool automatically creates all the required principals in the KDC database and performs the necessary configuration changes for all services throughout the environment. This process is described here in detail.

Kerberos logically divides its principals into two categories: service principals and user principals. The first type relates to non-human applications which run standalone and are usually fully automated; the Hadoop NameNode is an example of such an application. These services store the sensitive information required for authentication in special files called keytabs. A keytab is the equivalent of a user password, which is why physical access to such files should be properly managed and restricted to the correct service hosting accounts. The second category is user principals, whose members authenticate to the KDC by manually entering a password. Both approaches can be performed from the shell using the kinit Kerberos utility:

  • User principal authentication:

kinit oleksii.yermolenko@MYDOMAIN.COM

Password for oleksii.yermolenko@MYDOMAIN.COM:

  • Service principal authentication:

kinit -kt /etc/security/keytabs/hdfs.service.keytab hdfs/node1@MYDOMAIN.COM
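If you are not sure which principals a particular keytab contains, klist can read keytab files directly before you try to authenticate with one (the path is the same keytab used in the example above):

klist -kt /etc/security/keytabs/hdfs.service.keytab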

After a successful login to the KDC every principal receives a valid ticket issued for some period of time. This ticket can usually be found at the following location:

ll /tmp/krb5cc_{{userid}}

Another way to check your current ticket is to use the klist Kerberos command:

klist

Ticket cache: FILE:/tmp/krb5cc_0
Default principal: oleksii.yermolenko@MYDOMAIN.COM

Valid starting            Expires 
02/10/17 05:40:17     02/10/17 15:40:20
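When you need to drop the current ticket, for example to re-authenticate as a different principal, the kdestroy utility clears the credential cache:

kdestroy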

If you are already working within a preconfigured Kerberos environment, or you have managed to get a Kerberized instance of the sandbox, you can check the global configuration of your KDC in these files stored on the primary node:

cat /var/kerberos/krb5kdc/kdc.conf

cat /etc/krb5.conf

These two files define the scope of the Kerberos realm by declaring the collection of nodes and rules for the Kerberos environment and the KDC. Besides that, they can define a set of external domains which will be considered trusted; after applying such a configuration, all principals from those domains will be able to authenticate in the Hadoop environment.
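To give a rough idea of what these files contain, here is a minimal sketch of a krb5.conf for the realm used throughout this article; the KDC host name is only an assumption for this sandbox:

[libdefaults]
  # realm assumed when a principal name has no realm part
  default_realm = MYDOMAIN.COM

[realms]
  MYDOMAIN.COM = {
    # KDC and admin server, assumed to run on the primary node
    kdc = node1.mydomain.com
    admin_server = node1.mydomain.com
  }

[domain_realm]
  # maps DNS names of the cluster nodes to the Kerberos realm
  .mydomain.com = MYDOMAIN.COM
  mydomain.com = MYDOMAIN.COM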

Sometimes it is useful to interact with the KDC database directly, and Kerberos provides a special utility, kadmin, for this. It allows you to perform the most commonly used operations on the principals data:

kadmin.local

list_principals

hbase/host1@MYDOMAIN.COM
hbase/host2@MYDOMAIN.COM
hbase/host3@MYDOMAIN.COM
hbase/host4@MYDOMAIN.COM
hdfs/host1@MYDOMAIN.COM
hdfs/host2@MYDOMAIN.COM
hdfs/host3@MYDOMAIN.COM
hdfs/host4@MYDOMAIN.COM
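The same utility also covers the administrative side: creating new principals and exporting their keys into keytab files. A small sketch, run from the same kadmin.local session and assuming a hypothetical new node host5 being added to the cluster:

addprinc -randkey hdfs/host5@MYDOMAIN.COM

ktadd -k /etc/security/keytabs/hdfs.service.keytab hdfs/host5@MYDOMAIN.COM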

As the listing above shows, the KDC contains lots of principals representing each particular service. Just imagine how much work it would take to create a set of permissions for every independent principal in each ACL of every Hadoop daemon. Besides, many principals share the same set of access rules, so it definitely makes sense to generalize them into a single policy. Hadoop does this by introducing a special mapping approach in the form of an abstraction called GroupMappingServiceProvider. It allows you to define a set of rules which describe how your principals are mapped to a more generic user database, Linux accounts for example. We can say that we want all hdfs/*@MYDOMAIN.COM principals to be mapped to the hdfs Linux user account, and after that they will carry the set of permissions which this user has in all Hadoop services. This is achieved by configuring a special property called hadoop.security.auth_to_local in the core-site.xml file:

RULE:[1:$1@$0](hdfs@MYDOMAIN.COM)s/.*/hdfs/
RULE:[1:$1@$0](hbase@MYDOMAIN.COM)s/.*/hbase/
RULE:[1:$1@$0](spark@MYDOMAIN.COM)s/.*/spark/
RULE:[1:$1@$0](ambari@MYDOMAIN.COM)s/.*/ambari-qa/
RULE:[1:$1@$0](.*@.*MYDOMAIN.COM)s/@.*//

This setting defines a set of rules applied one by one, from top to bottom, to any Kerberos principal in order to transform it into an actual local Linux account. There is a great post by Robert Leval which explains the syntax of these rules. In the end, the mapping of every principal can easily be verified by using a hadoop command directly from the shell:

hadoop org.apache.hadoop.security.HadoopKerberosName oleksii.yermolenko@MYDOMAIN.COM 

Name: oleksii.yermolenko@MYDOMAIN.COM to oleksii.yermolenko

Step 3. Authorization

Kerberos allows us to deal with the identification of users and services. But once we know who is who, we need to understand what scope of operations is granted to every particular user within each independent service. From my point of view this is the most complicated part of Hadoop security, as each service implements this step in its own fashion and it is really hard to combine them all into a single security policy. I will try to give a high-level overview of how authorization is implemented in several services:

  • YARN authorization – uses the global Service Level Authorization mechanism, which defines a separate ACL in the hadoop-policy.xml configuration file. This file lists the accounts allowed to perform scheduling operations in the distributed cluster. On each interaction with YARN every principal is checked for sufficient permissions to access the service according to the referenced ACL.
  • HDFS authorization – by default uses the file system security model which I described in Step 1 of this article. The Service Level Authorization feature is also supported by services such as the DataNodes and the NameNode. This approach extends the basic model and, among other extra features, allows applying policies for multiple groups to HDFS resources.
  • Oozie authorization – Oozie is the Hadoop scheduling service responsible for defining and periodically executing complex workflows composed of basic cluster operations. The authorization model of this service is very simple, as it divides all consumers into two categories: users and admins. Users can submit jobs, monitor them, grant access to them and kill them, but only for their own jobs; admins can perform these operations upon all jobs. To handle this division, a separate Oozie admin Linux group is usually created in the environment, and the service is configured to use it through the oozie.service.AuthorizationService.admin.groups property of the oozie-site.xml file. After that, any new admin user can simply be added to this group to gain the superuser level of permissions.
  • HBase authorization – HBase is a distributed data store built upon the Hadoop file system which keeps data in a non-SQL fashion, as sets of key-value pairs in different tables. Such functionality requires quite a serious authorization module which provides different levels of access to the data. This data store has several main levels of permissions:
    • READ – read data from HBase resources
    • WRITE – write data to HBase resources
    • CREATE – create, alter and drop HBase resources
    • ADMIN – disable resources, grant privileges

HBase has its own shell utility which provides an API for working with the data and managing these security rules. The default hbase local Linux user has ADMIN privileges and can grant CREATE, READ and WRITE permissions to other users. This is quite straightforward and can easily be handled through the hbase shell:

hbase shell

grant 'oleksii.yermolenko', 'RWC', 'demotable'
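To double-check the result, the same shell can show the access rules currently defined for the table:

user_permission 'demotable'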

  • Storm authorization – the Storm service provides functionality for online streaming of incoming data through user-defined topologies. This mechanism does not require a significantly complicated security model, which is why it is usually based on an ACL policy similar to YARN.
  • Hive authorization – Hive is a SQL-based data store built upon HDFS. It allows you to work with the data in a normalized fashion using SQL query syntax. The service supports two authorization options: a storage-based model and a SQL-based model. The first one relies on the sets of permissions which exist in HDFS for the underlying resources, so each query is verified against the permissions in the Hadoop file system. The second one should be familiar to database administrators, as it uses GRANT and REVOKE syntax to apply different security rules to the existing content (a short sketch follows after this list).
  • Spark authorization – the Spark service performs complicated computing operations upon the data using the distributed RAM of the cluster. This mechanism also does not require a significantly complicated security model, which is why it is usually based on an ACL policy similar to YARN and Storm.
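As a small sketch of what the SQL-based Hive model looks like in practice, the statements below grant and then revoke read access; the table demo_table and the user bob are hypothetical, and the statements would be executed from a Hive client such as beeline:

GRANT SELECT ON TABLE demo_table TO USER bob;

REVOKE SELECT ON TABLE demo_table FROM USER bob;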

As you may have noticed, each service has its own requirements for its authorization policy. The main problem here is that the administrator has to know all of them and perform user-related changes throughout all the ACLs and other security mechanisms of all services of the cluster. Hortonworks tried to simplify this by introducing a single point of management for all these permissions in a tool called Ranger. It is accessible through Ambari, but users can also reach it directly through its web UI, on port 6080 by default. This product carries two main functions to cover the authorization demands:

  • User synchronization – when you work within a single realm it is not a big job to create all the required users and groups throughout the cluster and to manage them properly. Things get more complicated when you start interacting with other domains; just imagine how much work it would take to move all AD users into the Hadoop system and then keep up with all their changes. (Do not confuse the Kerberos principals with the mapped end accounts which are actually processed by the authorization mechanisms.) The UserSync Ranger module automates this stage by running a series of LDAP queries directly against the domain controller in order to retrieve and synchronize the necessary users and groups into Ranger's database:

[Diagram: Ranger UserSync pulling users and groups from the domain controller]

  • User authorization – as I mentioned, Ranger has its own user database which can be consumed by the supported Hadoop services. This database is populated either by the UserSync module or through manual interaction:

[Screenshot: Ranger user database]

For each supported type of service Ranger can enable a related plugin module which plays the role of an interceptor for all authorization requests. Such a plugin internally contains a set of policies applied to different sets of users. HBase policies, for example, contain permissions to read, write, create and administer tables and column families of the data store, while YARN policies hold permissions to submit and track jobs:

[Screenshot: Ranger policies for a service plugin]

On each request, after successful authentication, the operation is mapped to the specific plugin, which handles the actual authorization by verifying the caller's level of access to the application according to the existing policies. At the time of publishing this article the latest stable version of Ranger supported the following Hadoop modules: HDFS, HBase, Hive, Storm, Kafka, YARN, Knox, Solr and Atlas.

Step 4. Encryption

Authentication and authorization create a stable security ecosystem within the cluster, but they do not cover the scenario of direct physical access to the data. For example, somebody could steal the hard drives from the disk storage, connect them to a local machine, read the NameNode metadata and use it to restore the actual data from the DataNodes, as we did in the previous article. In order to protect information from such a use case, Hadoop provides a special mechanism which allows creating encryption zones for data in HDFS. The management of such zones is driven by the Key Management Service (KMS), which plays a central role in this process. Each encryption zone has an independent key which is available to the Hadoop file system services and to authorized users. On every write of a new file into the zone, a fresh encryption key is generated through KMS and the content of the file is encrypted with this key. The encryption zone key is then used to encrypt this new file key, and the protected version is saved in the file's metadata. So technically every file is encrypted with its own separate key. To decrypt the data, a client first has to get access to the encryption zone key, then decrypt the file encryption key, and finally decrypt the file's content with it. The whole process is transparent to end users, as it only requires a set of configuration steps.
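To make this less abstract, here is a minimal sketch of how an encryption zone is typically created, assuming KMS is already configured as the key provider; the key name mykey and the /secure directory are assumptions, and the commands are run with HDFS superuser privileges:

hadoop key create mykey

hdfs dfs -mkdir /secure

hdfs crypto -createZone -keyName mykey -path /secure

hdfs crypto -listZones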

Conclusion

Of course it is impossible to cover all the aspects of Hadoop security in a single article, but I hope that after reading it you have at least a high-level picture of the security model used in the world of Big Data. The diagram below summarizes the steps described in the article, and you may find it useful for filling in some of the gaps:

[Diagram: summary of the Hadoop security model]
