Scheduling jobs in Hadoop through Oozie

One of the common problems that software engineers face at different stages of application development is scheduling jobs and processes to run on a periodic basis. For this purpose the Windows OS family provides a dedicated component called Task Scheduler, while the Linux world offers its own alternative, the embedded Cron daemon. The Hadoop distributed ecosystem, which runs on top of these underlying operating systems, introduces its own set of scheduling challenges that differ from the typical tasks. Here we need to deal with a category of jobs that run across a number of physical machines and flow between Hadoop services. To simplify the implementation of such workflows, the Big Data ecosystem introduces a special component called Oozie. In this article I would like to give you an overview of this product.

Oozie overview

The history of the Oozie component starts in 2010, when Yahoo created a standalone project for scheduling purposes in the Big Data ecosystem. After a few years of development at Yahoo, the project was moved to the Apache community, and in 2012 the first official release of Oozie was published. Nowadays it has become one of the primary Hadoop components, and you can find it in most distributions provided by popular Hadoop vendors.

To start getting acquainted with this component, from my personal experience I would recommend the following resources:

From a high-level perspective, the Oozie framework consists of a number of components responsible for keeping and managing batch-based workflows. A single workflow represents a multistage Hadoop job organized into a directed acyclic graph which is executed step by step. This concept is similar to other Hadoop processing solutions like Hive or Spark. But while those frameworks work within the scope of their own processes, Oozie allows you to build cross-process workflows where every step can be executed within a different Hadoop component. For example, you can have a Hive query at the start, followed by a Spark action and an email notification at the end. In addition, these steps can be scheduled to run on a periodic basis, which is another important feature of the Oozie framework. That is why I usually divide this product into two logical parts:

  • Workflow organization – responsible for the definition of the actual workflow
  • Workflow coordination – responsible for scheduling the job on a periodic basis

In combination these pieces form a single Oozie application. A set of such applications can be grouped into a collection called an Oozie Bundle. Usually every workflow in a bundle depends on the execution of other workflows, and the output data of one can be the input data of the next.
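As a rough sketch of the bundle idea (the application names and paths below are made up for illustration, and the exact schema version should be checked against your distribution), a bundle definition simply points at the coordinator applications it groups:

<bundle-app name="demo-bundle" xmlns="uri:oozie:bundle:0.2">
    <coordinator name="first-coordinator">
        <app-path>${nameNode}/user/oozie/first-app/coordinator.xml</app-path>
    </coordinator>
    <coordinator name="second-coordinator">
        <app-path>${nameNode}/user/oozie/second-app/coordinator.xml</app-path>
    </coordinator>
</bundle-app>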

Oozie Application Design

From a technical perspective, Oozie is represented by a number of components which allow clients to submit their applications and track their execution state:

  • Oozie Web Server – a server application running inside a standard Tomcat container. It is responsible for interactions with client applications and for the persistence, scheduling and execution of workflows. Usually the server is packaged as a single oozie.war file which contains all required dependencies.
  • Oozie DB – behind the scenes Oozie uses a relational database to persist all the data about user workflows and their state. Usually MySQL or PostgreSQL is used for this purpose, but other options like Oracle or Derby are supported as well.
  • Client application – Oozie provides a set of client APIs for working with the service. In most cases operations are limited to submitting workflows and retrieving information about their execution. The simplest way of doing this is through the native Oozie CLI directly from the shell (a few common calls are sketched below). Apart from that, Oozie provides a RESTful API which can be used for writing custom applications, and for Java developers there is an option to access Oozie through the Java client API, which is a wrapper around the original RESTful API.
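As a quick illustration of the CLI mentioned above, here are a few typical calls. This is a hedged sketch: the server URL matches the Sandbox used later in this article, and <job-id> stands for the identifier returned on submission.

# submit and start an application described by a local driver (properties) file
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -config job.properties -run

# check the status and details of a submitted job
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -info <job-id>

# fetch the job log or kill a running job
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -log <job-id>
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -kill <job-id>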

The following diagram shows the overall high-level design of Oozie:

Oozie.png

Diagram 1. – Oozie overall design

As I’ve mentioned earlier, Oozie works in terms of applications. A typical application consists of a number of files in which the user has to properly describe all required properties. Here is a description of these files and their roles:

  • workflow.xml – contains the actual definition of the Oozie job, represented by an XML document with a number of action nodes that describe the steps of the workflow. The file has a start node and an end node, and between them there can be a number of action nodes referring to operations in external services. Oozie supports Hive, Hive2, Pig, MapReduce, Spark, Sqoop, Email, SSH, DistCp and Shell actions. Apart from that, users can create custom actions with specific logic. These actions can additionally be decorated with conditional and aggregation logic through fork and join nodes (a minimal fork/join skeleton is sketched after this list). Here is an example of a typical workflow: Oozie_workflow.png
  • coordinator.xml – contains the scheduling configuration of the workflow: start date, frequency and end date. The frequency parameter supports the cron syntax well known to Linux users, and I usually use an online cron expression editor to pick the correct value. An Oozie application can be scheduled without a coordinator file, which means the job will be executed only once. A job with a coordinator file is called a coordinator job.
  • bundle.xml – describes how a collection of coordinator jobs is aggregated into a single bundle. It is used for more complicated scenarios which I personally haven’t met in practice.
  • job.properties – contains a set of variables which define different properties of the job, like the address of the NameNode, the HiveServer2 JDBC connection string or the physical location of the application files in the file system. Users can define custom variables in this file and use them within the application workflow file. For example, you can have different job.properties files for the Development, Testing and Production environments while keeping a single workflow file.
  • other files – an application can use other files like Hive queries, Python scripts or MapReduce jar files, which are consumed by the workflow actions.
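To make the fork and join idea more concrete, here is a minimal sketch of a workflow that runs two trivial shell actions in parallel and joins afterwards. The node and action names are made up for illustration, and ${jobTracker} and ${nameNode} are assumed to be supplied through job.properties, as in the example later in this article:

<workflow-app xmlns="uri:oozie:workflow:0.5" name="fork-join-demo">
    <start to="fork-steps"/>

    <!-- run the two shell actions below in parallel -->
    <fork name="fork-steps">
        <path start="first-step"/>
        <path start="second-step"/>
    </fork>

    <action name="first-step">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>first branch finished</argument>
        </shell>
        <ok to="join-steps"/>
        <error to="fail"/>
    </action>

    <action name="second-step">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>second branch finished</argument>
        </shell>
        <ok to="join-steps"/>
        <error to="fail"/>
    </action>

    <!-- the flow continues only after both parallel branches reach the join -->
    <join name="join-steps" to="end"/>

    <kill name="fail">
        <message>One of the parallel steps failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>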

Unfortunately, at this stage there is no way to manage these files through some sort of UI where you could create applications using handy drag-and-drop operations. From my point of view that is the biggest issue in working with this product, as developers and operations specialists have to manually create and manage these XML files together with the other job definitions instead of interacting through a UI-based tool. Still, the benefits you gain for this effort are really great.

Scheduling a custom job on a local Sandbox

Now let’s try to schedule a simple job in our sandbox. We are going to run a workflow with a single query against a Hive table, which will be triggered every day by the Oozie service. We will also decorate our job with an email notification.

First of all, let’s create a directory in our local Linux environment where we will keep all the necessary files for the job:

[root@sandbox ~]# mkdir /ooziejob
[root@sandbox ~]# cd /ooziejob

Now let’s create a query file for our job. I will use one of the tables which came with my Sandbox in the default database, but you can choose any other table you have in your Hive warehouse:

[root@sandbox oozie]# vi query.hql

INSERT OVERWRITE DIRECTORY '/oozieresults'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT * FROM sample_07 LIMIT 100;

:wq

The query is done. It will grab the top 100 rows from our table and persist them into HDFS. Now let’s create a workflow for the execution of this query and the email notification. As you remember, a job’s workflow is represented by an XML file with a set of actions between the start and end nodes. We will create it in the same place as the query:

[root@sandbox oozie]# vi workflow.xml


<?xml version="1.0" encoding="UTF-8"?>

<workflow-app xmlns="uri:oozie:workflow:0.5" name="${appName}">
    <start to="hive-generate-dataset"/>

    <action name="hive-generate-dataset">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <jdbc-url>${jdbcURL}</jdbc-url>
            <script>query.hql</script>
        </hive2>
        <ok to="email-success"/>
        <error to="email-fail"/>
    </action>

    <action name="email-success">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>oerm85@gmail.com</to>
            <subject>Status of the query ${wf:id()}</subject>
            <body>The workflow ${wf:id()} completed successfully</body>
            <attachment>/oozieresults/000000_0</attachment>
        </email>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <action name="email-fail">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>oerm85@gmail.com</to>
            <subject>Status of the query ${wf:id()}</subject>
            <body>The workflow ${wf:id()} failed</body>
        </email>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Hive2 (Beeline) action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>

The next step is the configuration of the driver. Let’s create a file with the appropriate environment properties for the job:

[root@sandbox oozie]# vi job.properties

nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8032
jdbcURL=jdbc:hive2://sandbox.hortonworks.com:10000/default;

appName=query-report
oozie.wf.application.path=${nameNode}/user/oozie/demoreport/workflow.xml
oozie.use.system.libpath=true

At this stage we can test our job and launch it a single time without any coordination. To do this, we need to push our workflow and its supporting files into HDFS:

hdfs dfs -mkdir /user/oozie/demoreport

hdfs dfs -put query.hql /user/oozie/demoreport

hdfs dfs -put workflow.xml /user/oozie/demoreport
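If you want to double-check that both files landed in the expected location before submitting the job, an optional listing looks like this:

hdfs dfs -ls /user/oozie/demoreport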

As you can see, we pushed everything except the driver file into the distributed file system. Our job uses an email action, which means we need to add proper SMTP settings to the Oozie configuration. I’m using a Google account, so I’ve added the required parameters to oozie-site.xml to make my email notifications work:

part14settingsd.png
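The screenshot above shows my actual values. As a hedged sketch, the relevant oozie-site.xml properties look roughly like this; the host, port and credentials below are illustrative placeholders, and additional TLS-related settings may be required for Gmail depending on your Oozie version:

<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.gmail.com</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>587</value>
</property>
<property>
    <name>oozie.email.smtp.auth</name>
    <value>true</value>
</property>
<property>
    <name>oozie.email.smtp.username</name>
    <value>your.account@gmail.com</value>
</property>
<property>
    <name>oozie.email.smtp.password</name>
    <value>your-app-password</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>your.account@gmail.com</value>
</property>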

Now let’s run the oozie command, passing our driver file as an argument:

oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -config job.properties -run

After this our job should be submitted to the Oozie service. We can track it through the Oozie Web UI, which runs on port 11000 of our Sandbox:

part14_oozieJoOozieUIb.png

Apart from that, we can check that our job is running using the Resource Manager UI (the top application in the list):

part14_oozieJob.png

Once the job is done we should receive an email with the results of the query. In this example Oozie launched the query only once, but if we want to schedule the execution on a periodic basis we need to add coordination to the job. Let’s create a coordinator file and update our driver file accordingly:

[root@sandbox oozie]# vi coordinator.xml

<coordinator-app name="rsa-report-coordinator" frequency="${coord:days(1)}" start="${startTime}" end="${endTime}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>

:wq

 

[root@sandbox oozie]# vi job.properties

nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8032
jdbcURL=jdbc:hive2://sandbox.hortonworks.com:10000/default;

appName=query-report
workflowPath=${nameNode}/user/oozie/demoreport/workflow.xml
oozie.coord.application.path=${nameNode}/user/oozie/demoreport/coordinator.xml
oozie.use.system.libpath=true

startTime=2017-10-01T00:00Z
endTime=2100-02-28T15:00Z

:wq

[root@sandbox oozie]# hdfs dfs -put coordinator.xml /user/oozie/demoreport

We added another file into HDFS and updated the driver file with two extra settings: the location of the coordinator file and the start and end dates for our job. Now if you launch the application through the oozie command line again, it will be submitted to the service scheduler and you will start receiving emails on a daily basis (the commands are sketched below). The work is done, and at this stage you should have a scheduled job which sends you lovely emails every day.
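For reference, here is a hedged sketch of this last step: the submission call is the same as before, now driven by the updated job.properties, and the coordinator can then be inspected through the CLI (the job id placeholder is illustrative):

oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -config job.properties -run

# list coordinator jobs known to the server and check the status of ours
oozie jobs -oozie http://sandbox.hortonworks.com:11000/oozie -jobtype coordinator
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -info <coordinator-job-id>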

 

 

 

 
