Top Apache Oozie Interview Questions and Answers

Last updated on Feb 18 2022
Avinash M

What is Oozie?

Oozie is a workflow scheduler for Hadoop. Oozie allows a user to create Directed Acyclic Graphs (DAGs) of workflows, which can be run in parallel or sequentially in Hadoop. It can also run plain Java classes and Pig workflows, and interact with HDFS.

Why use Oozie instead of just cascading one job after another?

Major flexibility: jobs can be started, stopped, re-run, and suspended, and Oozie allows us to restart from the point of failure.

Explain Apache Oozie.

Apache Oozie is a scheduler that lets users schedule and execute Hadoop jobs. Users can run multiple tasks in parallel so that more than one job can be executed simultaneously. It is a scalable, extensible, and reliable system that supports different types of Hadoop jobs, including MapReduce, Hive, Streaming, Sqoop, and Pig jobs.

What is the need for Apache Oozie?

Apache Oozie provides a great way to handle multiple jobs. There are different types of jobs that users want to schedule to run later, or tasks that need to follow a specific sequence during execution. These kinds of executions can be made easy with the help of Apache Oozie. Using Apache Oozie, the administrator or the user can execute various independent jobs in parallel, run jobs back-to-back in a defined sequence, or control jobs from anywhere, which makes it very useful.

What are some of the useful EL functions in the Oozie workflow?

Below is the list of some useful EL functions of Oozie workflow.

  • wf:name() – It returns the application name of the workflow.
  • wf:id() – This function returns the job ID of the currently running workflow job.
  • wf:errorCode(String node) – It returns the error code of the given action node.
  • wf:lastErrorNode() – This function returns the name of the last executed action node in the workflow that exited with an error.
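
For illustration, here is a minimal sketch (the node name and message text are hypothetical) showing how these EL functions are typically referenced inside a workflow definition, for example in a kill node's message:

<kill name="fail">
    <message>Workflow ${wf:id()} failed at ${wf:lastErrorNode()} with error code ${wf:errorCode(wf:lastErrorNode())}</message>
</kill>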

Explain the different nodes supported in Oozie workflow.

Below is the list of action nodes that the Apache Oozie workflow supports and that help in computation tasks.

  • Map Reduce Action. This action node initiates the Hadoop Map-Reduce job
  • Pig Action. This node is used to start the Pig job from Apache Oozie workflow.
  • FS (HDFS) Action. This action node allows the Oozie workflow to manipulate all the HDFS-related files and directories. Also, it supports commands such as mkdir, move, chmod, delete, chgrp, and touchz.
  • Java Action. This action node executes the public static void main(String[] args) method of the specified main Java class in the Oozie workflow.
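
As an illustration, a minimal sketch of a Java action node; the class name com.example.MyMain, the argument, and the ok/error transitions below are assumptions, not part of the original answer:

<action name="java-node">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- class whose main(String[] args) method is executed -->
        <main-class>com.example.MyMain</main-class>
        <arg>input-path</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
</action>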

Explain how the pipeline works in Oozie.

The pipeline in Oozie connects multiple workflow jobs that execute regularly but at different intervals. In this pipeline, the output of multiple runs of one workflow becomes the input of the next scheduled workflow, which executes back-to-back in the pipeline. This joined chain of workflows forms the Oozie pipeline of jobs.

Explain the life cycle of the Oozie workflow job

A job in the Apache Oozie workflow transitions through the below states.

  • PREP – This is the state when the user creates the workflow job. During PREP state, the job is only defined and is not running.
  • RUNNING – When the job starts, it changes to the RUNNING state and remains in this state until the job reaches the end state, an error occurs, or the job is suspended.
  • SUSPENDED – The state of the job in Oozie workflow changes to SUSPENDED if the job is suspended in between. The job will remain in this state until it is killed or resumed.
  • SUCCEEDED – The workflow job becomes SUCCEEDED when the job reaches the end node.
  • KILLED – The workflow job transitions to the KILLED state when the administrator kills a job in the PREP, RUNNING, or SUSPENDED state.
  • FAILED – The job state changes into a FAILED state when the running job fails due to an unexpected error.

What is the command line option to check the status of workflow/coordinator or bundle action in oozie?

oozie job -oozie http://localhost:8080/oozie -info <job-id>

What is bundle in oozie and what controls does it have?

A bundle is a higher-level abstraction in Oozie that batches a set of coordinator applications. Within a bundle, the user has control to start/stop/suspend/resume/rerun jobs at the bundle level, resulting in better and easier operational control.
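
A minimal bundle.xml sketch, assuming a hypothetical coordinator name and application path:

<bundle-app name="my-bundle" xmlns="uri:oozie:bundle:0.2">
    <!-- each coordinator entry points at a coordinator application on HDFS -->
    <coordinator name="my-coord">
        <app-path>${nameNode}/user/${user.name}/my-coord</app-path>
    </coordinator>
</bundle-app>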

List the various states a bundle job transitions through.

PREP, RUNNING, RUNNINGWITHERROR, SUSPENDED, PREPSUSPENDED, SUSPENDEDWITHERROR, PAUSED, PAUSEDWITHERROR, PREPPAUSED, SUCCEEDED, DONEWITHERROR, KILLED, FAILED

Is it mandatory to give a password when the Oozie workflow contains a Hive2 action node that connects to HiveServer2?

Yes, a password is required for a secured HiveServer2 that is backed by authentication such as LDAP.
A non-secured HiveServer2, or one secured with Kerberos, does not require a password.
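
For example, a hive2 action sketch in which the password is passed in as a workflow parameter; the JDBC URL, parameter name, and script name are illustrative assumptions:

<hive2 xmlns="uri:oozie:hive2-action:0.1">
    <!-- connection to the (secured) HiveServer2 instance -->
    <jdbc-url>jdbc:hive2://hs2-host:10000/default</jdbc-url>
    <password>${hivePassword}</password>
    <script>myscript.hql</script>
</hive2>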

Name the Oozie action extension that is used to copy files from one cluster to another, or within the same cluster.

DistCp action.
Example:

<action name="[NODE-NAME]">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode1}</name-node>
        <arg>${nameNode1}/path/to/input.txt</arg>
        <arg>${nameNode2}/path/to/output.txt</arg>
    </distcp>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
</action>

What is the default in-built database where Oozie stores job IDs and job status information?

By default, Oozie is configured to use Embedded Derby.
Oozie works with HSQL, Derby, MySQL, Oracle or PostgreSQL databases.
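
For instance, pointing Oozie at a MySQL database is done through the JPA service properties in oozie-site.xml; the host, database name, and credentials below are placeholders:

<property>
    <name>oozie.service.JPAService.jdbc.driver</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:mysql://db-host:3306/oozie</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.username</name>
    <value>oozie</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.password</name>
    <value>oozie</value>
</property>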

List the various control nodes in Oozie workflow?

Start
End
Kill
Decision
Fork & Join Control nodes

Explain fork & join control nodes?

The fork node splits one path of execution into multiple concurrent paths of execution.
The join node waits until every concurrent execution path of the preceding fork node arrives at it. The fork and join nodes must be used in pairs.

Syntax:

<fork name="[FORK-NODE-NAME]">
    <path start="[NODE-NAME]" />
    <path start="[NODE-NAME]" />
</fork>

<join name="[JOIN-NODE-NAME]" to="[NODE-NAME]" />

What is the use of sub-workflow action in Oozie?

The sub-workflow action is included within the workflow action element, and it runs a child workflow job. The child workflow job can be in the same Oozie system or in another Oozie system.
The parent workflow job will wait until the child workflow job has completed.
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
...
<action name="[NODE-NAME]">
    <sub-workflow>
        <app-path>[WF-APPLICATION-PATH]</app-path>
        <propagate-configuration/>
        <configuration>
            <property>
                <name>[PROPERTY-NAME]</name>
                <value>[PROPERTY-VALUE]</value>
            </property>
        </configuration>
    </sub-workflow>
...

List the different types of Oozie jobs.

Oozie Workflow – a collection of actions arranged in a Directed Acyclic Graph (DAG).
Oozie Coordinator – recurrent Oozie Workflow jobs that are triggered by time and data availability.
Oozie Bundle – a higher-level Oozie abstraction that batches a set of coordinator applications. The user has control to start/stop/suspend/resume/rerun at the bundle level, resulting in better and easier operational control.

How are workflow nodes classified?

Workflow nodes are classified into two types:
Control flow nodes – nodes that control the start and end of the workflow and the workflow job execution path.
Action nodes – nodes that trigger the execution of a computation/processing task.

What is application pipeline in Oozie?

It is often necessary to connect workflow jobs that run regularly but at different time intervals. The outputs of multiple subsequent runs of a workflow become the input to the next workflow. Chaining these workflows together is referred to as a data application pipeline.

What are the extra files we need when we run a Hive action in Oozie?

hive.hql
hive-site.xml
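
A minimal sketch of how these two files are typically wired into a Hive action; the surrounding workflow and the ${jobTracker}/${nameNode} parameters are assumed:

<hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- Hive configuration shipped with the workflow -->
    <job-xml>hive-site.xml</job-xml>
    <!-- the Hive script to run -->
    <script>hive.hql</script>
</hive>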

What is Decision Node in Oozie?

Decision Nodes are switch statements that will run different jobs based on the outcomes of an expression.

How does OOZIE work?

It consists of two parts:

  • Workflow engine. Its responsibility is to store and run workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
  • Coordinator engine. It runs workflow jobs based on predefined schedules and the availability of data.

Oozie is scalable and can manage the timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.

Oozie is very flexible as well. One can easily start, stop, suspend, and rerun jobs, and Oozie makes it very easy to rerun failed workflows. One can easily understand how difficult it can be to catch up on missed or failed jobs due to downtime or failure. It is even possible to skip a specific failed node.

Why use Oozie?

The main purpose of using Oozie is to manage different types of jobs being processed in the Hadoop system.

Dependencies between jobs are specified by the user in the form of Directed Acyclic Graphs. Oozie consumes this information and takes care of their execution in the correct order as specified in the workflow, saving the user the time of managing the complete workflow. In addition, Oozie has a provision to specify the frequency of execution of a particular job.

Features of Oozie

  • Oozie has a client API and a command-line interface which can be used to launch, control, and monitor jobs from a Java application.
  • Using its Web Service APIs, one can control jobs from anywhere.
  • Oozie has a provision to execute jobs which are scheduled to run periodically.
  • Oozie has a provision to send email notifications upon completion of jobs.

What are the main components of the Apache Oozie workflow?

The Apache Oozie workflow consists of the control flow nodes and action nodes.

Below is the explanation of these nodes.

  • Control flow nodes. These nodes define the start and end of the workflow, i.e., start, end, and fail. They also offer a mechanism that manages the execution path in the workflow, i.e., decision, fork, and join.
  • Action nodes. These nodes offer the mechanism that initiates the execution of a processing or computation task. Oozie supports different actions, including Hadoop MapReduce, Pig, and file system actions, as well as system-specific jobs such as HTTP, SSH, and email.

What is the use of Join and Fork nodes in Oozie?

The fork and join nodes in Oozie are used in pairs. The fork node splits the execution path into many concurrent execution paths. The join node joins two or more concurrent execution paths into a single one, and it assumes that all the concurrent paths it joins are children of the same fork node.

What is Oozie Bundle?

Oozie bundle allows the user to execute jobs in batches. Oozie bundle jobs can be started, stopped, suspended, resumed, re-run, or killed in batches, thus providing better operational control.

Explain the Need for Oozie?

With Apache Hadoop becoming the open-source de-facto standard for processing and storing Big Data, many other languages like Pig and Hive have followed – simplifying the process of writing big data applications based on Hadoop.

Although Pig, Hive and many others have simplified the process of writing Hadoop jobs, many times a single Hadoop Job is not sufficient to get the desired output. Many Hadoop Jobs have to be chained, data has to be shared in between the jobs, which makes the whole process very complicated.

Explain Oozie Workflow?

An Oozie Workflow is a collection of actions arranged in a Directed Acyclic Graph (DAG). Control nodes define the job chronology, setting rules for beginning and ending a workflow, and control the workflow execution path with decision, fork, and join nodes. Action nodes trigger the execution of tasks.

Workflow nodes are classified into control flow nodes and action nodes.
Control flow nodes – nodes that control the start and end of the workflow and the workflow job execution path.
Action nodes – nodes that trigger the execution of a computation/processing task.
Workflow definitions can be parameterized. The parameterization of workflow definitions is done using JSP Expression Language syntax, allowing it to support not only variables as parameters but also functions and complex expressions.

What Is an Oozie Workflow Application?

A workflow application is a ZIP file that includes the workflow definition and the necessary files to run all the actions.
It contains the following files.

  • Configuration file – config-default.xml
  • App files – lib/ directory with JAR and SO files
  • Pig scripts

What Are the Properties That We Have to Mention in job.properties?

  • Name Node
  • Job Tracker
  • oozie.wf.application.path
  • Lib Path
  • Jar Path
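
A minimal job.properties sketch covering these entries; the host names and paths are placeholders:

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/my-app
oozie.libpath=${nameNode}/user/oozie/share/lib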

Explain Briefly About Oozie Bundle?

Oozie Bundle is a higher-level Oozie abstraction that batches a set of coordinator applications. The user is able to start/stop/suspend/resume/rerun at the bundle level, resulting in better and easier operational control.

More specifically, the Oozie Bundle system allows the user to define and execute a bunch of coordinator applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a bundle. However, a user could use the data dependency of coordinator applications to create an implicit data application pipeline.
Oozie executes workflows based on:

  • Time Dependency (Frequency)
  • Data Dependency

What are all the actions that can be performed in Oozie?

  • Email Action
  • Hive Action
  • Shell Action
  • SSH Action
  • Sqoop Action
  • Writing a custom Action Executor
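
As one example from this list, a minimal shell action sketch; the script name and argument are hypothetical:

<shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>myscript.sh</exec>
    <argument>arg1</argument>
    <!-- ship the script with the job -->
    <file>myscript.sh</file>
</shell>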

Why Oozie Security?

  • Users are not allowed to alter the jobs of other users.
  • Hadoop does not support end-user authentication on its own.
  • Oozie has to verify and confirm its user before submitting the job to Hadoop.

What additional configuration is needed for Oozie email Action?

The SMTP server configuration has to be present in oozie-site.xml:
oozie.email.smtp.host – the host where the email action may find the SMTP server.
oozie.email.smtp.port – the port to connect to for the SMTP server (25 by default).
oozie.email.from.address – the from address to be used for all outgoing emails.
oozie.email.smtp.auth – Boolean property that toggles whether authentication is to be done (false by default).
oozie.email.smtp.username – if authentication is enabled, the username to log in as (empty by default).
oozie.email.smtp.password – if authentication is enabled, the username's password (empty by default).
oozie.email.attachment.enabled – Boolean property that toggles whether configured attachments are to be placed into the emails (false by default).
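
For illustration, a minimal oozie-site.xml fragment for the first three of these properties, assuming a hypothetical SMTP host and from-address:

<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.example.com</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@example.com</value>
</property>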

<email xmlns="uri:oozie:email-action:0.1">
    <to>bob@initech.com,the.other.bob@initech.com</to>
    <cc>will@initech.com</cc>
    <bcc>yet.another.bob@initech.com</bcc>
    <subject>Email notifications for ${wf:id()}</subject>
    <body>The wf ${wf:id()} successfully completed.</body>
</email>

How will you define Oozie?

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It is a scalable, reliable, and extensible system. It supports several types of Hadoop jobs (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs like Java programs and shell scripts.

What’s the latest stable version of Oozie?

Oozie 4.3.0, released in December 2016.

How long are the Oozie log files retained before being deleted?

They are retained for up to 30 days, or up to a total of 720 log files.

Name the Spark action element tag where the details about the driver & executor memory and additional configuration properties are specified.

It is the <spark-opts> element within the Spark action.
Example:

<master>yarn-cluster</master>
<name>Spark Example</name>
<jar>pi.py</jar>
<spark-opts>--executor-memory 20G --num-executors 50
--conf spark.executor.extraJavaOptions="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"</spark-opts>
<arg>100</arg>

Describe the use of the prepare element tag within the action nodes in Oozie.

The prepare element, if present, indicates a list of paths to delete or create before starting the job. The paths to be created/deleted should be HDFS locations.

<prepare>
    <delete path="[PATH]"/>
    <mkdir path="[PATH]"/>
</prepare>

What kind of application is Oozie?

Oozie is a server-based application that runs in an embedded Tomcat server.
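
Because it runs as a server, a client can check that the server is up with the admin sub-command; the URL below assumes a local server on the default port:

$ oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL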

How to check and verify that the Oozie workflow.xml file is syntactically correct and parses correctly according to the XML format?

This can be checked via the Oozie command-line tool called "validate". It performs an XML Schema validation on the specified workflow XML file.
Example:
$ oozie validate <app-name>/workflow.xml

What is the command line syntax to submit and start either a Workflow/Coordinator or Bundle job?

Submitting a job
----------------
$ oozie job -oozie http://localhost:8080/oozie -config job.properties -submit
job: <job-id>

The parameters for the job must be provided in either a properties file or an XML file, which must be specified with the -config option.
The workflow/coordinator/bundle application path must be specified in that file with the oozie.wf.application.path (or oozie.coord.application.path / oozie.bundle.application.path) property. The specified path must be an HDFS path.
The job will be created, but it will not be started; it will be in PREP status.

Starting a job
--------------
$ oozie job -oozie http://localhost:8080/oozie -start <job-id>
The start option starts a previously submitted workflow job, coordinator job, or bundle job that is in PREP status. The status is then changed to RUNNING.

Name the action node which acts like a switch-case statement in the oozie workflow?

Decision Control Node – A decision node enables a workflow to make a selection on the execution path to follow.

A decision node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated in order of appearance until one of them evaluates to true, and the corresponding transition is taken. If none of the predicates evaluates to true, the default transition is taken.
Predicates are JSP Expression Language (EL) expressions.
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
...
<decision name="[NODE-NAME]">
    <switch>
        <case to="[NODE-NAME]">[PREDICATE]</case>
        ...
        <case to="[NODE-NAME]">[PREDICATE]</case>
        <default to="[NODE-NAME]"/>
    </switch>
</decision>

How to make a workflow?

First, make a Hadoop job and make sure that it works. Make a JAR out of the classes, then make a workflow.xml file and copy all of the job configuration properties into the XML file: input files, output files, input readers and writers, mappers and reducers, and job-specific arguments. Finally, define the job.properties file.
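
A minimal workflow.xml skeleton that such a job can be dropped into; the workflow name, node names, and schema version here are illustrative assumptions:

<workflow-app name="my-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="my-action"/>
    <action name="my-action">
        <!-- map-reduce, pig, java, etc. action definition goes here -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>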

How to run Oozie?

$ oozie job -oozie http://172.20.95.107:11000/oozie -config job.properties -run
(here 172.20.95.107:11000 is the Oozie server node)

This will give the job ID.

To know the status: $ oozie job -oozie http://172.20.95.107:11000/oozie -info <job id>

How to specify Oozie start, end and error nodes?

<start to="[NODE-NAME]" />

<end name="[NODE-NAME]" />

<kill name="[NODE-NAME]">
    <message>[A custom message]</message>
</kill>

(In the workflow XML schema, the error node with its custom message is defined using the kill element.)

How does OOZIE work?

Oozie runs as a service in the cluster and clients submit workflow definitions for immediate or later processing.

Oozie workflow consists of action nodes and control-flow nodes.

An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig, or Hive job, importing data using Sqoop, or running a shell script or a program written in Java.

A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic, wherein different branches may be followed depending on the result of an earlier action node.

Start Node, End Node, and Error Node fall under this category of nodes.

Start Node, designates the start of the workflow job.

End Node, signals end of the job.

Error Node designates the occurrence of an error and corresponding error message to be printed.

At the end of execution of a workflow, HTTP callback is used by Oozie to update the client with the workflow status. Entry-to or exit from an action node may also trigger the callback.
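
A sketch of how such a callback can be configured, assuming Oozie's workflow notification property and a hypothetical endpoint; Oozie substitutes the $jobId and $status tokens when it calls back:

# in job.properties
oozie.wf.workflow.notification.url=http://my-server:8080/callback?jobId=$jobId&status=$status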

Example Workflow Diagram

[Figure: example Oozie workflow diagram]

Packaging and deploying an Oozie workflow application

A workflow application consists of the workflow definition and all the associated resources such as MapReduce Jar files, Pig scripts etc. Applications need to follow a simple directory structure and are deployed to HDFS so that Oozie can access them.

An example directory structure is shown below-

<name of workflow>/
├── lib/
│   └── hadoop-examples.jar
└── workflow.xml

It is necessary to keep workflow.xml (the workflow definition file) in the top-level directory (the parent directory with the workflow name). The lib directory contains JAR files containing the MapReduce classes. A workflow application conforming to this layout can be built with any build tool, e.g., Ant or Maven.

Such a build needs to be copied to HDFS using a command, for example:

% hadoop fs -put hadoop-examples/target/<name of workflow dir> name of workflow

Steps for Running an Oozie workflow job

In this section, we will see how to run a workflow job. To run this, we will use the Oozie command-line tool (a client program which communicates with the Oozie server).

  1. Export the OOZIE_URL environment variable, which tells the oozie command which Oozie server to use (here we’re using one running locally):
% export OOZIE_URL="http://localhost:11000/oozie"
  2. Run the workflow job using:
% oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run

The -config option refers to a local Java properties file containing definitions for the parameters in the workflow XML file, as well as oozie.wf.application.path, which tells Oozie the location of the workflow application in HDFS.

Example contents of the properties file.

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/<name of workflow>
  3. Get the status of the workflow job:

The status of the workflow job can be seen using the subcommand ‘job’ with the ‘-info’ option, specifying the job ID after ‘-info’.

e.g., % oozie job -info <job id>

The output shows the status, which is one of RUNNING, KILLED, or SUCCEEDED.

  4. Results of a successful workflow execution can be seen using a Hadoop command like:
% hadoop fs -cat <location of result>

What Are the Alternatives to Oozie Workflow Scheduler?

  • Azkaban – a batch workflow job scheduler.
  • Apache NiFi – an easy-to-use, powerful, and reliable system to process and distribute data.
  • Apache Falcon – a feed management and data processing platform.

Explain Types of Oozie Jobs?

Oozie supports job scheduling for the full Hadoop stack like Apache MapReduce, Apache Hive, Apache Sqoop and Apache Pig.

It consists of two parts:
Workflow engine. Its responsibility is to store and run workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
Coordinator engine. It runs workflow jobs based on predefined schedules and the availability of data.

Explain Oozie Coordinator?

Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability. Oozie Coordinator can also manage multiple workflows that are dependent on the outcome of subsequent workflows. The outputs of subsequent workflows become the input to the next workflow. This chain is called a ‘data application pipeline’.

Oozie processes coordinator jobs in a fixed timezone with no DST (typically UTC); this timezone is referred to as the ‘Oozie processing timezone’. The Oozie processing timezone is used to resolve coordinator job start/end times, job pause times, and the initial-instance of datasets. Also, all coordinator dataset instance URI templates are resolved to a datetime in the Oozie processing timezone.

The usage of Oozie Coordinator can be categorized into three different segments:
Small – consisting of a single coordinator application with embedded dataset definitions.
Medium – consisting of a single shared dataset definition and a few coordinator applications.
Large – consisting of single or multiple shared dataset definitions and several coordinator applications.
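
A minimal coordinator-app sketch that triggers a workflow once a day; the dates, paths, and schema version are placeholders:

<coordinator-app name="my-coord" frequency="${coord:days(1)}"
        start="2022-01-01T00:00Z" end="2022-12-31T00:00Z" timezone="UTC"
        xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- the workflow application to run on each materialized action -->
            <app-path>${nameNode}/user/${user.name}/my-wf</app-path>
        </workflow>
    </action>
</coordinator-app>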

How to Deploy an Application?

$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount

Mention Workflow Job Parameters?

$ cat job.properties
oozie.wf.application.path=hdfs://bar.com:9000/usr/abc/wordcount
input=/usr/abc/input-data
output=/usr/abc/output-data

How to Execute a Job?

$ oozie job -config job.properties -run
job: 1-20090525161321-oozie-xyz-W

So, this brings us to the end of the Apache Oozie interview questions blog. This Tecklearn ‘Top Apache Oozie Interview Questions and Answers’ helps you with commonly asked questions if you are looking for a job in BigData Hadoop Testing or the Big Data domain. If you wish to learn Apache Oozie and build a career in the Big Data domain, then check out our interactive Big Data Hadoop Testing Training, which comes with 24/7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/big-data-hadoop-testing/

 
