Top Apache Pig Interview Questions and Answers

Last updated on Feb 18 2022
Rajnikanth S

What is Pig?

Pig is an open-source Apache project that runs on Hadoop and provides an engine for parallel data flow on Hadoop. It includes a language called Pig Latin for expressing these data flows. Pig Latin supports operations such as join, sort, and filter, along with the ability to write User Defined Functions (UDFs) for reading, writing, and processing data. Pig uses both HDFS and MapReduce, i.e. for storing and processing respectively.

Highlight the key differences between MapReduce and Apache Pig.

MapReduce vs Apache Pig

  • MapReduce is a low-level data processing paradigm, while Apache Pig is a high-level data flow platform.
  • MapReduce requires complex Java implementations, while Apache Pig does not.
  • MapReduce does not provide nested data types, while Apache Pig provides nested data types such as tuples, bags, and maps.
  • Performing data operations in MapReduce is a humongous task, while Apache Pig provides many built-in operators to support data operations.

What is the difference between logical and physical plans?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs by the compiler. Logical and Physical plans are created during the execution of a pig script.

After performing the basic parsing and semantic checking, the parser produces a logical plan and no data processing takes place during the creation of a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. For each line in the Pig script, syntax check is performed for operators and a logical plan is created. If an error is encountered, an exception is thrown and the program execution ends.

A logical plan contains a collection of operators in the script, but does not contain the edges between the operators.

After the logical plan is generated, the script execution moves to the physical plan, which describes the physical operators Apache Pig will use to execute the script. A physical plan is more or less a series of MapReduce jobs, but it does not contain any reference to how it will be executed in MapReduce.

What are the different ways of executing Pig script?

There are three ways to execute the Pig script:

  • Grunt Shell: This is Pig’s interactive shell provided to execute all Pig Scripts.
  • Script File: Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig Server.
  • Embedded Script: If some functions are unavailable in built-in operators, we can programmatically create User Defined Functions (UDF) to bring that functionality using other languages like Java, Python, Ruby, etc. and embed it in the Pig Latin Script file. Then, execute that script file.
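For reference, a hedged sketch of what each option looks like in practice (the script and jar names here are just placeholders):

Grunt Shell:     pig -x local (or pig -x mapreduce), then enter statements at the grunt> prompt, e.g. run myscript.pig
Script File:     pig myscript.pig
Embedded Script: REGISTER myudfs.jar; is added at the top of the script so the UDFs in that jar can be called from the Pig Latin statements that follow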

What is a bag in Pig Latin?

A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections of tuples while grouping. The size of a bag is limited only by the size of the local disk: when a bag grows too large, Pig spills it to the local disk and keeps only part of it in memory, so the complete bag never needs to fit into memory. We represent bags with “{}”.

What do you understand by an inner bag and outer bag in Pig?

An outer bag, or relation, is simply a bag of tuples. Here relations are similar to relations in relational databases. For example:

{(Linkin Park, California), (Metallica, Los Angeles), (Mega Death, Los Angeles)}

An inner bag contains a bag inside a tuple. For Example:

(Los Angeles, {(Metallica, Los Angeles), (Mega Death, Los Angeles)})

(California, {(Linkin Park, California)})
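As a hedged illustration (the file name and schema are assumptions), inner bags like the ones above are typically produced by a GROUP statement:

bands = LOAD 'bands.txt' USING PigStorage(',') AS (name:chararray, city:chararray);

by_city = GROUP bands BY city;

Each tuple of by_city has the form (city, {bag of matching band tuples}), i.e. one inner bag per group.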

What is UDF?

If some functions are unavailable in built-in operators, we can programmatically create User Defined Functions (UDF) to bring that functionality using other languages like Java, Python, Ruby, etc. and embed it in the Pig Latin Script file.

  • The LoadFunc abstract class has three main methods for loading data, and for most use cases it suffices to extend it.
  • LoadPush has methods to push operations from the Pig runtime into loader implementations.
  • The setUdfContextSignature() method is called by Pig on both the front end and the back end to pass a unique signature to the loader.
  • The load/store UDFs control how data goes into Pig and comes out of Pig.
  • getNext() is called by the Pig runtime to get the next tuple in the data.
  • The loader should use the setLocation() method to communicate the load information to the underlying InputFormat.
  • Through the prepareToRead() method, the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc. The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.
  • The pushProjection() method tells the LoadFunc which fields are required in the Pig script. Pig uses the column index requiredField.index to communicate to the LoadFunc which fields the script needs.
  • LoadCaster has methods to convert byte arrays to specific types.
  • A loader implementation should provide a LoadCaster if casts (implicit or explicit) from DataByteArray fields to other types need to be supported.

Does ‘ILLUSTRATE’ run a MapReduce job?

No, ILLUSTRATE does not launch any MapReduce job; it works on a small sample of the internal data. On the console, it just shows the output of each stage, not the final output.

ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it comes to debugging a script. This command alone might be a good reason for choosing Pig over something else.

Syntax: illustrate relation_name;

List the relational operators in Pig.

All Pig Latin statements operate on relations (and operators are called relational operators). Different relational operators in Pig Latin are:

  • COGROUP: Joins two or more tables and then performs a GROUP operation on the joined result.
  • CROSS: CROSS operator is used to compute the cross product (Cartesian product) of two or more relations.
  • DISTINCT: Removes duplicate tuples in a relation.
  • FILTER: Select a set of tuples from a relation based on a condition.
  • FOREACH: Iterate the tuples of a relation, generating a data transformation.
  • GROUP: Group the data in one or more relations.
  • JOIN: Join two or more relations (inner or outer join).
  • LIMIT: Limit the number of output tuples.
  • LOAD: Load data from the file system.
  • ORDER: Sort a relation based on one or more fields.
  • SPLIT: Partition a relation into two or more relations.
  • STORE: Store data in the file system.
  • UNION: Merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.

What is the difference between the GROUP and COGROUP operators in Pig?

The GROUP and COGROUP operators are functionally identical. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations. GROUP collects all records with the same key. COGROUP is a combination of group and join: it is a generalization of GROUP that, instead of collecting records of one input based on a key, collects records of n inputs based on a key. We can COGROUP up to 127 relations at a time.
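A minimal sketch, assuming two hypothetical relations owners and pets that both have a field called name:

grouped_owners = GROUP owners BY name;

both = COGROUP owners BY name, pets BY name;

The first statement groups a single relation; the second collects the records of both inputs by the shared key.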

You have a file personal_data.txt in an HDFS directory with 100 records. You want to see only the first 5 records from the personal_data.txt file. How will you do this?

To get only 5 of the 100 records, we use the LIMIT operator.

First load the data in Pig:

personal_data = LOAD '/personal_data.txt' USING PigStorage(',') AS (parameter1, parameter2, …);

Then Limit the data to 5 records:

limit_data = LIMIT personal_data 5;
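To see the five records on the console, the result can then be dumped:

DUMP limit_data;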

Is Pig script case sensitive?

Pig Latin is partly case sensitive and partly case insensitive.

User defined functions, field names, and relation names are case sensitive, i.e. EMPLOYEE is not the same as employee, and M = LOAD 'data' is not the same as M = LOAD 'Data'.

Pig keywords, on the other hand, are case insensitive, i.e. LOAD is the same as load.

What does Flatten do in Pig?

Sometimes there is data in a tuple or a bag and we want to remove the level of nesting from that data; the FLATTEN modifier in Pig is used for this. FLATTEN un-nests bags and tuples. For tuples, FLATTEN substitutes the fields of the tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
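A hedged sketch, reusing the hypothetical by_city grouping from the inner-bag example above (any grouped relation would do), which un-nests the bag back into plain tuples:

flat = FOREACH by_city GENERATE group, FLATTEN(bands);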

What are the limitations of the Pig?

Limitations of the Apache Pig are:

  1. As the Pig platform is designed for ETL-type use cases, it is not a good choice for real-time scenarios.
  2. Apache Pig is not a good choice for pinpointing a single record in huge data sets.
  3. Apache Pig is built on top of MapReduce, which is batch processing oriented.

What do we understand by PIG?

Pig is an open-source Apache project that runs on Hadoop, providing the engine for parallel data flow. It includes a language referred to as Pig Latin, which expresses the data flow. It supports various operations like sort, join, and filter, and is capable of scripting UDFs (User Defined Functions) for reading, writing, and processing. Pig uses MapReduce and HDFS for processing and storing the data respectively.

What is the difference in Pig and SQL? 

  • Pig Latin departs from SQL's declarative style of coding and is procedural, whereas Hive's query language is similar to SQL.
  • Pig sits on top of Hadoop and, in principle, could sit on top of other execution engines such as Dryad.
  • Both Hive and Pig compile their commands into MapReduce jobs.

Explain the requirement of MapReduce while we program in Apache Pig.

Apache Pig programs are written in a language referred to as Pig Latin, which is analogous to SQL. To carry out a query, we require an execution engine. The Pig engine converts all queries into MapReduce jobs; thus MapReduce acts as the primary execution engine needed to run the programs.

Explain BloomMapFile.

BloomMapFile is a class that extends the MapFile class. It is generally used in the HBase table format to speed up membership tests for keys, using dynamic Bloom filters.

What is a bag in Pig?

A collection of tuples is known as a bag in Apache Pig.

How does the user communicate with shell in Apache Pig?

Users interact with HDFS or any local file system through Grunt, which is Apache Pig's interactive shell. To start Grunt, users invoke Apache Pig with no command, as follows:

  • Executing the command pig -x local will open the grunt> prompt.
  • Pig Latin scripts can run either in local mode or cluster mode by setting up the configuration in PIG_CLASSPATH.
  • To exit the Grunt shell, press CTRL+D or simply type exit.

What is a function of illustrate in Apache Pig?

Executing Pig scripts on vast data sets is generally time-consuming, so developers run them on sample data first; however, the selected sample may not exercise the script correctly. For example, if the script contains a join operator, there must be at least a few records in the sample data with the same key, or else the join operation may not return any results. To manage these issues, developers use ILLUSTRATE, which takes a sample of the data and, whenever it encounters operators like FILTER or JOIN that remove data, ensures that some records pass through and some are filtered out, by modifying records so that they meet the respective conditions. ILLUSTRATE shows the output of every step but does not execute MapReduce jobs.

Differentiate between HiveQL & PigLatin.

  • PigLatin is procedural language, whereas HiveQL is declarative.
  • In HiveQL it is necessary to specify the schema, whereas in PigLatin it is optional.
  • PigLatin has a nested relational data model, whereas HiveQL has a flat data model.

What are the uses of Apache Pig?

Pig, a big data tool, is specifically used for iterative processing, traditional ETL data pipelines, and research on raw data. Pig operates in situations where the schema is unknown, incomplete, or inconsistent; it is used by developers who want to work with the data before it is loaded into the data warehouse. For building behavior prediction models, it is used by websites to track the response of visitors to various images, ads, articles, etc.

Differentiate between COUNT and COUNT_STAR functions in Pig.

The COUNT_STAR function includes NULL values in its count, whereas the COUNT function does not include NULL values when counting the number of elements in a bag.
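A hedged illustration, assuming a hypothetical relation data whose field f contains some null values:

grpd = GROUP data ALL;

counts = FOREACH grpd GENERATE COUNT(data.f), COUNT_STAR(data.f);

COUNT ignores tuples where f is null, while COUNT_STAR counts them as well.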

Do Pig support multi-line commands?

Pig supports both single-line and multi-line commands. With a single-line command, Pig processes the data but does not store the output as a file, whereas with multi-line commands it stores the data in HDFS.

If I have a relation R then how can I get top 10 tuples from the relation R?

The TOP() function returns the top N tuples from a relation or a bag of tuples. N is passed as a parameter to TOP(), along with the column whose values are to be compared and the relation R.
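A minimal sketch, assuming the values to compare are in column 1 (the second field) of relation R:

grpd = GROUP R ALL;

top10 = FOREACH grpd GENERATE TOP(10, 1, R);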

What are the limitations of Pig Script?

 Following are some of the Limitations of the Apache Pig:

  • Apache Pig isn’t preferable for analytics of a single record in huge data sets.
  • The Pig platform is specifically designed for ETL-type use cases, so it is not a good choice for synchronized or real-time scenarios.
  • Apache Pig is built on top of MapReduce, which is itself batch processing oriented.

How to write Java UDF?

UDFs can be developed by extending the EvalFunc class and overriding the exec() method.
Example: This UDF replaces a given string with another string

package kelly.training.pig.udf;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

// Replaces every occurrence of the string configured in "replace.string"
// with the string configured in "replace.by.string".
public class Transform extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // Read the replacement settings from the job configuration.
        Configuration conf = UDFContext.getUDFContext().getJobConf();
        String from = conf.get("replace.string");
        if (from == null) {
            throw new IOException("replace.string should not be null");
        }
        String to = conf.get("replace.by.string");
        if (to == null) {
            throw new IOException("replace.by.string should not be null");
        }
        try {
            String str = (String) input.get(0);
            return str.replace(from, to);
        } catch (Exception e) {
            throw new IOException("caught exception processing input row", e);
        }
    }
}
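A hedged usage sketch (the jar name, input path, and field name are assumptions) showing how this UDF could be registered and invoked from a Pig Latin script; the two properties set here should be visible to the UDF through UDFContext.getUDFContext().getJobConf() in exec():

REGISTER pig-udf.jar;

set replace.string 'foo';
set replace.by.string 'bar';

lines = LOAD 'input.txt' AS (line:chararray);
replaced = FOREACH lines GENERATE kelly.training.pig.udf.Transform(line);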

What is Grunt Shell?

The Grunt shell is Pig's interactive shell: each statement is executed as soon as it is entered, and the result, whether success or failure, is reported immediately.

What is Pigstorage?

Loads or stores relations using a field-delimited text format.

Each line is broken into fields using a configurable field delimiter (the default is a tab character), and the fields are stored in the tuple's fields. PigStorage is the default storage function when none is specified.
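A minimal sketch (file paths and schema are assumptions) of loading comma-delimited data and storing it back pipe-delimited with PigStorage:

records = LOAD 'input/data.csv' USING PigStorage(',') AS (name:chararray, age:int);

STORE records INTO 'output/data_pipe' USING PigStorage('|');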

What types of applications is Hive used for?

  1. Summarization
     Ex: daily/weekly aggregations of impression/click counts; complex measures of user engagement
  2. Ad Hoc Analysis
     Ex: how many group admins, broken down by state/country
  3. Data Mining (assembling training data)
     Ex: user engagement as a function of user attributes
  4. Spam Detection
     Ex: anomalous patterns for site integrity; application API usage patterns
  5. Ad Optimizations
  6. Document indexing
  7. Customer-facing business intelligence (Ex: Google Analytics), predictive modeling, hypothesis testing

What is Hive QL?

  1. Supports an SQL-like query language called HiveQL for SELECT, JOIN, aggregate, UNION ALL, and subqueries in the FROM clause.
  2. Supports DDL statements such as CREATE TABLE with serialization format, partitioning, and bucketing columns.
  3. Provides commands to load data from external sources and INSERT into Hive tables.
  4. Does not support UPDATE and DELETE.
  5. Supports multi-table INSERT.
  6. Supports user-defined column transformation (UDF) and aggregation (UDAF) functions written in Java.

What is the Difference Between Hive & Pig?

  • Language: Hive uses SQL (HiveQL); Pig uses Pig Latin.
  • Schema: Hive stores table definitions in a metastore; in Pig, a schema is optionally defined at runtime.
  • Programmatic access: Hive is accessed via JDBC/ODBC; Pig is accessed via PigServer.
  • Partitions: Hive has partitions; Pig has no partitions.
  • Server: the Hive server is optional; Pig has no server.
  • Custom Serializer/Deserializer: supported by both.
  • DFS direct access: at run time in Hive; by default in Pig.
  • Join/order/set operations: possible in both.
  • Shell command interface: available in both.
  • Web interface: Hive has one; Pig does not.

What is the Difference Between MapReduce & Pig?

  • MapReduce requires programming-language skills to write the business logic; in Pig, little programming skill is needed because the whole logic is expressed using Pig transformations (operators).
  • Any change to a MapReduce program means changing, recompiling, and repackaging the program and redeploying it to the same cluster environment; in Pig, we deal only with simple scripting and avoid that entire cycle.
  • Pig typically needs about 5% of the MapReduce code and about 5% of the MapReduce development time, which increases programmer productivity, at the cost of roughly 25% more execution time.
  • As a general rule, a Hadoop MapReduce program of about 200 lines of code can be written in about 10 lines of Pig Latin.
  • MapReduce requires multiple stages, leading to long development life cycles, whereas Pig's rapid prototyping increases productivity; Pig suits log analysis and ad hoc queries across various large data sets.

Define Apache Pig

Ans. To analyze large data sets by representing them as data flows, we use Apache Pig. Apache Pig is designed to provide an abstraction over MapReduce, reducing the complexity of writing MapReduce tasks in Java. Moreover, using Apache Pig, we can perform data manipulation operations very easily in Hadoop.

What is the difference between Pig and SQL?

Ans. Here is a list of the major differences between Apache Pig and SQL:

  • Pig is a procedural language, while SQL is a declarative language.
  • In Pig, the schema is optional; we can store data without designing a schema, in which case fields are accessed positionally as $0, $1, etc. In SQL, a schema is mandatory.
  • In Pig, the data model is nested relational, while in SQL the data model is flat relational.
  • In Pig, we have limited opportunity for query optimization, while in SQL there is more opportunity for query optimization.


Explain the architecture of Hadoop Pig.

There are several components in the Hadoop Pig framework. The major components are:

  1. Parser

At first, the Parser handles all the Pig scripts. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The Parser's output is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.
In the DAG (the logical plan), the logical operators of the script are represented as the nodes and the data flows are represented as edges.

  2. Optimizer

Further, the DAG is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

  3. Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

  4. Execution engine

At last, these MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.

What is the difference between Apache Pig and Hive?

Ans. Both Pig and Hive are used to create MapReduce jobs, and at times Hive operates on HDFS just as Pig does. Here are a few significant points that set Apache Pig apart from Hive:

  • Apache Pig uses the Pig Latin language, originally created at Yahoo; Hive uses HiveQL, originally created at Facebook.
  • Pig Latin is a data flow language, whereas HiveQL is a query processing language.
  • Pig Latin is a procedural language that fits the pipeline paradigm, whereas HiveQL is declarative.
  • Pig can handle structured, unstructured, and semi-structured data, whereas Hive mostly works with structured data.

Explain Features of Pig.

There are several features of Pig, such as:

  • Rich set of operators

In order to perform several operations, Pig offers many operators, for example join, sort, filter, and many more.

  • Ease of programming

If you are good at SQL, it is easy to write a Pig script, because Pig Latin is similar to SQL.

  • Optimization opportunities

In Apache Pig, tasks optimize their execution automatically, so programmers need to focus only on the semantics of the language.

  • Extensibility

Through Pig, it is easy to read, process, and write data. It is possible by using the existing operators. Also, users can develop their own functions.

  • UDF’s

Pig allows us to create User Defined Functions in other programming languages, such as Java, and to invoke or embed them in Pig scripts.

What is Pig Storage?

PigStorage is the default load function in Pig. We use PigStorage whenever we want to load data from a file system into Pig. While loading data with PigStorage we can also specify the delimiter of the data (how the fields in the record are separated), and we can specify the schema of the data along with the types of the fields.

While writing evaluate UDF, which method has to be overridden?

We have to override the exec() method while writing a UDF in Pig. The base class differs by UDF type: for a filter UDF we extend FilterFunc, and for an eval UDF we extend EvalFunc. EvalFunc is parameterized and must declare its return type as well.

What is a skewed join?

A skewed join is used when we join on a skewed data set, that is, when a particular key value is repeated many times.
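A hedged sketch, assuming two hypothetical relations big and small joined on a heavily skewed key id; Pig's skewed join is requested with the USING 'skewed' clause:

joined = JOIN big BY id, small BY id USING 'skewed';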

What is Flatten?

Flatten is an operator in Pig that removes a level of nesting. Sometimes we have data in a bag or a tuple and we want to remove the nesting so that the data structure becomes flat; in that case we use Flatten.

In addition, each Flatten produces a cross product of every record in the bag with all of the other expressions in the general statement.

What are the complex data types in pig?

The following are the complex data types in Pig:

  • Tuple

An ordered set of fields is what we call a tuple.
For Example: (Ankit, 32)

  • Bag

A collection of tuples is what we call a bag.
For Example: {(Ankit,32),(Neha,30)}

  • Map

A set of key-value pairs is what we call a Map.

For Example: [ ‘name’#’Ankit’, ‘age’#32]

Why do we use BloomMapFile?

BloomMapFile is a class that extends MapFile, which implies its functionality is similar to MapFile.

Also, to provide quick membership test for the keys, BloomMapFile uses dynamic Bloom filters. We use it in HBase table format.

How will you explain COGROUP in Pig?

In Apache Pig, COGROUP works on tuples. It can be applied to statements involving two or more relations, up to 127 relations at a time. When you use the operator on tables, Pig first groups both tables and then joins them on the grouped columns.

Is the keyword ‘DEFINE’ like a function name?

The keyword ‘DEFINE’ is like a function name. Once we have registered a jar, we have to define an alias for the function in it. Whatever logic has been written in the Java program is packaged in the exported jar that we registered. The compiler will check for the function; when the function is not present in the built-in library, it looks into our jar.

Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?

The keyword ‘FUNCTIONAL’ is not a User Defined Function (UDF). While using UDFs, we have to override certain functions and do our job with the help of those functions. The keyword ‘FUNCTIONAL’, however, is a built-in (predefined) function, therefore it does not work as a UDF.

 Why do we need MapReduce during Pig programming?

Let's understand it this way: Pig is a high-level platform that makes many Hadoop data analysis tasks easier to express, and we use Pig Latin on this platform. A program written in Pig Latin is like a query written in SQL, for which we need an execution engine. Hence, when we write a program in Pig Latin, the Pig compiler converts it into MapReduce jobs. As a result, MapReduce acts as the execution engine.

What are the scalar data types in Pig?

In Apache Pig, Scalar data types are:

  • int - 4 bytes
  • float - 4 bytes
  • double - 8 bytes
  • long - 8 bytes
  • chararray
  • bytearray

What is the different execution mode available in Pig?

In Pig, there are 3 modes of execution available:

  • Interactive Mode (Also known as Grunt Mode)
  • Batch Mode
  • Embedded Mode

Is the Pig Latin language case-sensitive or not?

Pig Latin is only partly case-sensitive. Keywords are case-insensitive; for example, Load is equivalent to load.

However, relation and field names are case-sensitive: A = load 'b' is not equivalent to a = load 'b'.

Note: UDF names are also case-sensitive; here count is not equivalent to COUNT.

What is the purpose of ‘dump’ keyword in Pig?

The keyword “dump” displays the output on the screen.

For Example: dump 'processed'

Does Pig give any warning when there is a type mismatch or missing field?

Pig will not show any warning if there is a missing field or a mismatch. However, if any mismatch occurs, Pig assumes a null value.

What is Grunt shell?

Grunt shell is also what we call as Pig interactive shell. Basically, it offers a shell for users to interact with HDFS.

Differentiate between Hadoop MapReduce and Pig

Hadoop MapReduce vs Pig

  • Type of Language: MapReduce uses a compiled language; Pig is a scripting language.
  • Level of Abstraction: MapReduce offers a low level of abstraction; Pig offers a higher level of abstraction.
  • Code: MapReduce requires more lines of code; Pig requires comparatively fewer lines of code.
  • Code Efficiency: MapReduce code efficiency is high; Pig's code efficiency is relatively lower.

Compare Apache Pig and SQL.

  • Apache Pig differs from SQL in its usage for ETL, its lazy evaluation, its ability to store data at any given point in the pipeline, its support for pipeline splits, and its explicit declaration of execution plans. SQL is oriented around queries that produce a single result, and it has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream.
  • Apache Pig allows user code to be included at any point in the pipeline, whereas with SQL the data first needs to be imported into the database and only then can the process of cleaning and transformation begin.

Explain the need for MapReduce while programming in Apache Pig.

Apache Pig programs are written in a query language known as Pig Latin that is similar to the SQL query language. To execute the query, there is a need for an execution engine. The Pig engine converts the queries into MapReduce jobs and thus MapReduce acts as the execution engine and is needed to run the programs.

 Explain about the BloomMapFile.

BloomMapFile is a class, that extends the MapFile class. It is used in HBase table format to provide quick membership test for the keys using dynamic bloom filters.

 What do you mean by a bag in Pig?

A collection of tuples is referred to as a bag in Apache Pig.

What is the usage of foreach operation in Pig scripts?

The FOREACH operation in Apache Pig is used to apply a transformation to each element in the data bag, so that the respective action is performed to generate new data items.

Syntax- FOREACH data_bagname GENERATE exp1, exp2
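A brief sketch, assuming a relation personal_data (as loaded in the LIMIT example earlier) with fields name and age:

transformed = FOREACH personal_data GENERATE UPPER(name), age + 1;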

Explain about the different complex data types in Pig.

Apache Pig supports 3 complex data types-

  • Maps- These are key, value stores joined together using #.
  • Tuples- Just similar to the row in a table, where different items are separated by a comma. Tuples can have multiple attributes.
  • Bags- Unordered collection of tuples. Bag allows multiple duplicate tuples.

What are the debugging tools used for Apache Pig scripts?

describe and explain are the important debugging utilities in Apache Pig.

  • The explain utility is helpful for Hadoop developers when trying to debug errors or optimize Pig Latin scripts. explain can be applied to a particular alias in the script, or to the entire script in the Grunt interactive shell. The explain utility produces several graphs in text format, which can be printed to a file.
  • The describe debugging utility is helpful to developers writing Pig scripts, as it shows the schema of a relation in the script. Beginners learning Apache Pig can use the describe utility to understand how each operator alters the data. A Pig script can have multiple describes.
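Brief usage in the Grunt shell, assuming the personal_data and limit_data aliases from the earlier LIMIT example:

grunt> DESCRIBE personal_data;

grunt> EXPLAIN limit_data;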

What is illustrate used for in Apache Pig?

Executing pig scripts on large data sets, usually takes a long time. To tackle this, developers run pig scripts on sample data but there is possibility that the sample data selected, might not execute your pig script properly. For instance, if the script has a join operator there should be at least a few records in the sample data that have the same key, otherwise the join operation will not return any results. To tackle these kinds of issues, illustrate is used. illustrate takes a sample from the data and whenever it comes across operators like join or filter that remove data, it ensures that only some records pass through and some do not, by making modifications to the records such that they meet the condition. illustrate just shows the output of each stage but does not run any MapReduce task.

Explain about the execution plans of a Pig Script or Differentiate between the logical and physical plan of an Apache Pig script.

Logical and physical plans are created during the execution of a Pig script, which is checked by the interpreter. The logical plan is produced after basic parsing and semantic checking, and no data processing takes place during its creation. For each line in the Pig script, a syntax check is performed for the operators and a logical plan is created. Whenever an error is encountered within the script, an exception is thrown and program execution ends; otherwise, each statement in the script gets its own logical plan.

A logical plan contains collection of operators in the script but does not contain the edges between the operators.

After the logical plan is generated, the script execution moves to the physical plan, which describes the physical operators Apache Pig will use to execute the script. A physical plan is more or less a series of MapReduce jobs, but the plan does not contain any reference to how it will be executed in MapReduce. During the creation of the physical plan, the COGROUP logical operator is converted into three physical operators, namely Local Rearrange, Global Rearrange, and Package. Load and store functions usually get resolved in the physical plan.

What are some of the Apache Pig use cases you can think of?

Apache Pig, a big data tool, is used in particular for iterative processing, research on raw data, and traditional ETL data pipelines. As Pig can operate in circumstances where the schema is unknown, inconsistent, or incomplete, it is widely used by researchers who want to make use of the data before it is cleaned and loaded into the data warehouse.

To build behaviour prediction models, for instance, it can be used by a website to track the response of the visitors to various types of ads, images, articles, etc.

 Differentiate between PigLatin and HiveQL

  • It is necessary to specify the schema in HiveQL, whereas it is optional in PigLatin.
  • HiveQL is a declarative language, whereas PigLatin is procedural.
  • HiveQL follows a flat relational data model, whereas PigLatin has nested relational data model.

Is PigLatin a strongly typed language? If yes, then how did you come to the conclusion?

In a strongly typed language, the user has to declare the type of all variables upfront. In Apache Pig, when you describe the schema of the data, it expects the data to come in the same format you mentioned. However, when the schema is not known, the script will adapt to the actual data types at runtime. So it can be said that Pig Latin is strongly typed in most cases, but in rare cases it is gently typed, i.e. it continues to work with data that does not live up to its expectations.

What are the various diagnostic operators available in Apache Pig?

  1. Dump Operator: used to display the output of Pig Latin statements on the screen, so that developers can debug the code.
  2. Describe Operator: explained in the debugging-tools question above.
  3. Explain Operator: explained in the debugging-tools question above.
  4. Illustrate Operator: explained in the illustrate question above.

How will you merge the contents of two or more relations and divide a single relation into two or more relations?

This can be accomplished using the UNION and SPLIT operators.
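A hedged sketch, assuming two hypothetical relations A and B whose first field is numeric:

C = UNION A, B;

SPLIT C INTO small IF $0 < 10, large IF $0 >= 10;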

I have a relation R. How can I get the top 10 tuples from the relation R.?

TOP () function returns the top N tuples from a bag of tuples or a relation. N is passed as a parameter to the function top () along with the column whose values are to be compared and the relation R.

What are the commonalities between Pig and Hive?

  • HiveQL and Pig Latin both convert the commands into MapReduce jobs.
  • Neither can be used for online transaction processing (OLTP) workloads, as it is difficult to execute low-latency queries with them.

What are the different types of UDF’s in Java supported by Apache Pig?

Algebraic, Eval and Filter functions are the various types of UDF’s supported in Pig.

How do users interact with HDFS in Apache Pig ?

Using the grunt shell.

What is the use of having Filters in Apache Pig ?

Just like the WHERE clause in SQL, Apache Pig has filters to extract records based on a given condition or predicate. The record is passed down the pipeline if the predicate or condition turns out to be true. Predicates can contain operators such as ==, <=, !=, >=.

Example –

X = LOAD 'inputs' AS (name, address);

Y = FILTER X BY name MATCHES 'Mr.*';

What is a UDF in Pig?

If the in-built operators do not provide some functions then programmers can implement those functionalities by writing user defined functions using other programming languages like Java, Python, Ruby, etc. These User Defined Functions (UDF’s) can then be embedded into a Pig Latin Script.

Can you join multiple fields in Apache Pig Scripts?

Yes, it is possible to join multiple fields in Pig scripts, because the join operation takes records from one input and joins them with another input. This is achieved by specifying the keys for each input; two rows are joined when their keys are equal.
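A hedged sketch joining two hypothetical relations on a composite key of two fields each:

J = JOIN orders BY (cust_id, year), customers BY (id, signup_year);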

 Does Pig support multi-line commands?

Yes

How will you explain co group in Pig?

COGROUP in Pig works on several tuples. The operator can be applied to statements that contain two or more relations, up to one hundred and twenty-seven relations at a time. When you use the operator on tables, Pig first groups both tables and then joins them on the grouped columns.

What is Bloom Map File used for?

The BloomMapFile is a class that extends MapFile. So its functionality is similar to MapFile.

BloomMapFile uses dynamic Bloom filters to provide quick membership test for the keys. It is used in Hbase table format.

What is the difference between Pig and SQL?

Pig Latin is a procedural version of SQL. Pig certainly has similarities with SQL, but there are more differences. SQL is a query language in which the user asks a question in query form; SQL specifies the answer that is wanted but does not say how to compute it. If a user wants to perform multiple operations on tables, they have to write multiple queries and use temporary tables for storing intermediate results. SQL supports subqueries, but intermediate results still require temporary tables, and many SQL users find subqueries confusing and difficult to form properly; using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query. Pig, on the other hand, is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.

What is the difference between logical and physical plans?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

Does ‘ILLUSTRATE’ run MR job?

No, ILLUSTRATE does not launch any MapReduce job; it works on the internal sample data. On the console, it just shows the output of each stage, not the final output.

How does Pig differ from MapReduce?

In MapReduce, the group-by operation is performed on the reducer side, while filter and projection can be implemented in the map phase. Pig Latin provides standard operations similar to MapReduce, such as order by, filter, and group by. We can analyze a Pig script, understand its data flows, and find errors early. Pig Latin is also much lower cost to write and maintain than Java code for MapReduce.

 Is the keyword ‘DEFINE’ like a function name?

Yes, the keyword ‘DEFINE’ is like a function name. Once you have registered, you have to define it. Whatever logic you have written in Java program, you have an exported jar and also a jar registered by you. Now the compiler will check the function in exported jar. When the function is not present in the library, it looks into your jar.

Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?

No, the keyword ‘FUNCTIONAL’ is not a User Defined Function (UDF). While using UDF, we have to override some functions. Certainly, you have to do your job with the help of these functions only. But the keyword ‘FUNCTIONAL’ is a built-in function i.e a pre-defined function, therefore it does not work as a UDF.

What is Pig Useful For?

Pig is used in three categories: 1) ETL data pipelines, 2) research on raw data, and 3) iterative processing.

The most common use case for Pig is the data pipeline. For example, web-based companies collect web logs, and before storing the data in the warehouse, they perform operations on it such as cleaning and aggregation, i.e. transformations on the data.

What are the scalar datatypes in pig?

Scalar data types:

  • int - 4 bytes
  • float - 4 bytes
  • double - 8 bytes
  • long - 8 bytes
  • chararray
  • bytearray

What are the different execution mode available in Pig?

There are 3 modes of execution available in pig

  • Interactive Mode (Also known as Grunt Mode)
  • Batch Mode
  • Embedded Mode

Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kind of scenarios MR jobs will be more useful than PIG?

Let us take a scenario where we want to count the population in two cities. I have a data set and a sensor list of different cities, and I want to count the population of two cities using a single MapReduce job. Let us assume that one city is Bangalore and the other is Noida. I need to treat the key of Bangalore similarly to Noida so that I can bring the population data of these two cities to one reducer. The idea is that I have to instruct the MapReduce program: whenever you find a city named 'Bangalore' or 'Noida', create an alias name that is common to both cities, so that they share a common key and get passed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, when you create a 'key' for the city, the city itself is the key, so whenever the framework comes across a different city, it treats it as a different key. Hence we need a customized partitioner. MapReduce provides for this: you can write your own partitioner and specify that if city = Bangalore or Noida, then emit the same hash code. However, we cannot create a custom partitioner in Pig; as Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.

What is the purpose of ‘dump’ keyword in pig?

The dump keyword displays the output on the screen.

dump 'processed'

Does Pig give any warning when there is a type mismatch or missing field?

No, Pig will not show any warning if there is a missing field or a mismatch, and such problems are therefore difficult to find in the log file. If any mismatch is found, Pig assumes a null value.

What are relational operations in pig Latin?

  1. FOREACH
  2. ORDER BY
  3. FILTER
  4. GROUP
  5. DISTINCT
  6. JOIN
  7. LIMIT

What are the different Relational Operators available in pig language?

Relational operators in pig can be categorized into the following list

  • Loading and Storing
  • Filtering
  • Grouping and joining
  • Sorting
  • Combining and Splitting
  • Diagnostic

What are the different modes available in Pig?

Two modes are available in Pig.

  • Local Mode (Runs on localhost file system)
  • MapReduce Mode (Runs on Hadoop Cluster)

Can we say cogroup is a group of more than 1 data set?

Cogroup can group a single data set, but in the case of more than one data set, cogroup groups all the data sets and joins them based on the common field. Hence, we can say that cogroup is both a group of more than one data set and a join of those data sets.

What does FOREACH do?

FOREACH is used to apply transformations to the data and to generate new data items. The name itself is indicating that for each element of a data bag, the respective action will be performed.

Syntax : FOREACH bagname GENERATE expression1, expression2, …..

The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.

What is bag?

A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections of tuples while grouping. The size of a bag is limited only by the size of the local disk: when a bag grows too large, Pig spills it to the local disk and keeps only part of it in memory, so the complete bag never needs to fit into memory. We represent bags with “{}”.

Why should we use ‘orderby’ keyword in pig scripts?

The ORDER statement sorts your data, producing a total order of your output data. The syntax of ORDER is similar to GROUP: you indicate a key or set of keys by which you wish to order your data.

input2 = load 'daily' as (exchanges, stocks);

grpds = order input2 by exchanges;

Pig Features?

  i) Data Flow Language

The user specifies a sequence of steps where each step specifies only a single high-level data transformation.

  ii) User Defined Functions (UDF)

  iii) Debugging Environment

  iv) Nested Data Model

What are the advantages of pig language?

Pig is easy to learn: it overcomes the need for writing complex MapReduce programs to some extent. Pig works in a step-by-step manner, so it is easy to write and, even better, easy to read.

It can handle heterogeneous data: Pig can handle all types of data – structured, semi-structured, or unstructured.

  • Pig is Faster: Pig’s multi-query approach combines certain types of operations together in a single pipeline, reducing the number of times data is scanned.
  • Pig does more with less: Pig provides the common data operations (filters, joins, ordering, etc.) And nested data types (e.g. Tuples, bags, and maps) which can be used in processing data.
  • Pig is Extensible: Pig is easily extensible through UDFs – in Python, Java, JavaScript, and Ruby – so you can use them to load, aggregate, and analyze data. Pig insulates your code from changes to the Hadoop Java API.

What are the relational operators available related to Grouping and joining in pig language?

Grouping and joining operators are the most powerful operators in the Pig language, because implementing grouping and joins directly in low-level MapReduce code is quite involved.

  1. JOIN
  2. GROUP
  3. COGROUP
  4. CROSS

JOIN is used to join two or more relations. GROUP is used for aggregation of a single relation. COGROUP is used for the aggregation of multiple relations. CROSS is used to create a cartesian product of two or more relations.
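A compact, hedged sketch, assuming two hypothetical relations A and B that share a field called key:

G = GROUP A BY key;
C = COGROUP A BY key, B BY key;
J = JOIN A BY key, B BY key;
X = CROSS A, B;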

Why do we need Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages like Ruby, Python, and Java. Conversely, you can execute Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

What are the different String functions available in pig?

  • UPPER
  • LOWER
  • TRIM
  • SUBSTRING
  • INDEXOF
  • STRSPLIT
  • LAST_INDEX_OF

What is a relation in Pig?

A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

What is a tuple?

A tuple is an ordered set of fields, and a field is a piece of data.

What is the MapReduce plan in pig architecture?

In the MapReduce plan, the output of the physical plan is converted into an actual MapReduce program, which is then executed across the Hadoop cluster.

What is the logical plan in pig architecture?

In the logical plan stage of Pig, statements are parsed for syntax errors. Validation of the input files and of the data structure of the files is also performed. A DAG (directed acyclic graph) with operators as nodes and data flows as edges is then created. Optimization of the Pig script is also applied to the logical plan.

What is UDF in Pig?

Pig has a wide range of built-in functions, but occasionally we need to write complex business logic that cannot be implemented using the primitive functions. Thus, Pig provides support for writing User Defined Functions (UDFs) as a way to stipulate custom processing.

Pig UDFs can presently be implemented in Java, Python, JavaScript, Ruby, and Groovy. The most far-reaching support is provided for Java functions: you can customize all parts of the processing, including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces, such as the Algebraic interface and the Accumulator interface, are supported. Limited support is provided for Python, JavaScript, Ruby, and Groovy functions.

What is bag data type in pig?

The bag data type works as a container for tuples and other bags. It is a complex data type in the Pig Latin language.

Why should we use ‘distinct’ keyword in pig scripts?

The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields:

input2 = load 'daily' as (exchanges, stocks);

grpds = distinct input2;

 What are the different math functions available in pig?

  • ABS
  • ACOS
  • EXP
  • LOG
  • ROUND
  • CBRT
  • RANDOM
  • SQRT

What are the different Eval functions available in pig?

  • AVG
  • CONCAT
  • MAX
  • MIN
  • SUM
  • SIZE
  • COUNT
  • COUNT_STAR
  • DIFF
  • TOKENIZE
  • IsEmpty

Explain about co-group in Pig.

COGROUP operator in Pig is used to work with multiple tuples. COGROUP operator is applied on statements that contain or involve two or more relations. The COGROUP operator can be applied on up to 127 relations at a time. When using the COGROUP operator on two tables at once-Pig first groups both the tables and after that joins the two tables on the grouped columns.

What are the relational operators available related to combining and splitting in pig language?

UNION and SPLIT used for combining and splitting relations in the pig.

What are different modes of execution in Apache Pig?

Apache Pig runs in 2 modes- one is the “Pig (Local Mode) Command Mode” and the other is the “Hadoop MapReduce (Java) Command Mode”. Local Mode requires access to only a single machine where all files are installed and executed on a local host whereas MapReduce requires accessing the Hadoop cluster.

What are the use cases of Apache Pig?

Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache Pig is used for:

  • Research on large raw data sets like data processing for search platforms. For example, Yahoo uses Apache Pig to analyse data gathered from Yahoo search engines and Yahoo News Feeds.
  • Processing huge data sets like Web logs, streaming online data, etc.
  • In customer behavior prediction models like e-commerce websites.

How Pig programming gets converted into MapReduce jobs?

Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. A program written in Pig Latin is a data flow program, which needs an execution engine to execute the query. So, when a program is written in Pig Latin, the Pig compiler converts the program into MapReduce jobs.

What are the components of Pig Execution Environment?

The components of Apache Pig Execution Environment are:

  • Pig Scripts: Pig scripts are submitted to the Apache Pig execution environment which can be written in Pig Latin using built-in operators and UDFs can be embedded in it.
  • Parser: The Parser does the type checking and checks the syntax of the script. The parser outputs a DAG (directed acyclic graph). DAG represents the Pig Latin statements and logical operators.
  • Optimizer: The Optimizer performs the optimization activities like split, merge, transform, reorder operators, etc. The optimizer provides the automatic optimization feature to Apache Pig. The optimizer basically aims to reduce the amount of data in the pipeline.
  • Compiler: The Apache Pig compiler converts the optimized code into MapReduce jobs automatically.
  • Execution Engine: Finally, the MapReduce jobs are submitted to the execution engine. Then, the MapReduce jobs are executed and the required result is produced.

What are the data types of Pig Latin?

Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.

Atomic or scalar data types are the basic data types which are used in all the languages like string, int, float, long, double, char[], byte[]. These are also called the primitive data types.

The complex data types supported by Pig Latin are:

  • Tuple: Tuple is an ordered set of fields which may contain different data types for each field.
  • Bag: A bag is a collection of a set of tuples and these tuples are a subset of rows or entire rows of a table.
  • Map: A map is a set of key-value pairs used to represent data elements. The key must be a chararray and should be unique, like a column name, so it can be indexed and the value associated with it can be accessed on the basis of the key. The value can be of any data type.

How Apache Pig deals with the schema and schema-less data?

The Apache Pig handles both, schema as well as schema-less data.

  • If the schema only includes the field name, the data type of field is considered as a byte array.
  • If you assign a name to the field you can access the field by both, the field name and the positional notation, whereas if field name is missing we can only access it by the positional notation i.e. $ followed by the index number.
  • If you perform any operation which is a combination of relations (like JOIN, COGROUP, etc.) and any of the relations is missing a schema, the resulting relation will have a null schema.
  • If the schema is null, Pig will consider it as a byte array and the real data type of field will be determined dynamically.

How do users interact with the shell in Apache Pig?

Using Grunt i.e. Apache Pig’s interactive shell, users can interact with HDFS or the local file system.

To start Grunt, users should use the pig -x local command. This command will open the Grunt shell prompt. To exit from the Grunt shell, press CTRL+D or just type exit.

List the diagnostic operators in Pig.

Pig supports a number of diagnostic operators that you can use to debug Pig scripts.

  • DUMP: Displays the contents of a relation to the screen.
  • DESCRIBE: Return the schema of a relation.
  • EXPLAIN: Display the logical, physical, and MapReduce execution plans.
  • ILLUSTRATE: Gives the step-by-step execution of a sequence of statements.

What does illustrate do in Apache Pig?

Executing Pig scripts on large data sets, usually takes a long time. To tackle this, developers run Pig scripts on sample data, but there is possibility that the sample data selected, might not execute your Pig script properly. For instance, if the script has a join operator there should be at least a few records in the sample data that have the same key, otherwise the join operation will not return any results.

To tackle these kinds of issues, illustrate is used. Illustrate takes a sample of the data and whenever it comes across operators like join or filter that remove data, it ensures that only some records pass through and some do not, by making modifications to the records such that they meet the condition. Illustrate just shows the output of each stage but does not run any MapReduce task.

Is the keyword ‘DEFINE’ like a function name?

Yes, the keyword ‘DEFINE’ is like a function name.

DEFINE statement is used to assign a name (alias) to a UDF function or to a streaming command.

  • The function has a long package name that you don’t want to include in a script, especially if you call the function several times in that script. The constructor for the function takes string parameters. If you need to use different constructor parameters for different calls to the function you will need to create multiple defines – one for each parameter set.
  • The streaming command specification is complex. The streaming command specification requires additional parameters (input, output, and so on). So, assigning an alias makes it easier to access.
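A short hedged sketch of DEFINE in use; the jar and class names follow the earlier Java UDF example, while the lines relation and its field are assumptions:

REGISTER pig-udf.jar;

DEFINE REPLACE_STR kelly.training.pig.udf.Transform();

replaced = FOREACH lines GENERATE REPLACE_STR(line);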

What is the function of co-group in Pig?

COGROUP takes members of different relations, binds them by similar fields, and creates a bag that contains a single instance of both relations where those relations have common fields. Co-group operation joins the data set by grouping one particular data set only.

It groups the elements by their common field and then returns a set of records containing two separate bags. The first bag consists of the first data set record with the common data set and the second bag consists of the second data set records with the common data set.

Can we say co-group is a group of more than 1 data set?

Co-group can group more than one data set: it groups all the data sets and joins them based on the common field. Hence, we can say that co-group is a group of more than one data set and a join of those data sets as well.

What is a MapFile?

MapFile is a class that provides a file-based map from keys to values.

A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().

The index file is read entirely into memory. Thus, key implementations should try to keep themselves small. Map files are created by adding entries in-order.

What is BloomMapFile used for?

The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.

What are the different execution modes available in Pig?

The execution modes in Apache Pig are:

  • MapReduce Mode: This is the default mode, which requires access to a Hadoop cluster and an HDFS installation. Since this is the default mode, it is not necessary to specify the -x flag (you can run either pig or pig -x mapreduce). The input and output in this mode reside on HDFS.
  • Local Mode: With access to a single machine, all files are installed and run using the local host and file system. Local mode is specified with the -x flag (pig -x local). The input and output in this mode reside on the local file system.

What is Pig Statistics? What are all stats classes in the Java API package available?

Pig Statistics is a framework for collecting and storing script-level statistics for Pig Latin. Characteristics of Pig Latin scripts and the resulting MapReduce jobs are collected while the script is executed. These statistics are then available for Pig users and tools using Pig (such as Oozie) to retrieve after the job is completed.

The stats classes are in the package org.apache.pig.tools.pigstats:

  • PigStats
  • JobStats
  • OutputStats
  • InputStats.

Why do we need the for each operation in Pig scripts?

The FOREACH operation in Apache Pig applies a transformation to each tuple in a relation (each element of a data bag), so that the respective action can be performed on every record to generate new data items.
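
For example, a hypothetical sales relation can be projected and transformed row by row with FOREACH:

A = LOAD 'sales.txt' AS (store:chararray, amount:double, tax:double);
B = FOREACH A GENERATE store, amount + tax AS total;   -- one output tuple per input tuple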


Explain the different data types in Pig.

Following are the three complex data types supported by Apache Pig:

  • Map: a set of key-value pairs, where each key and its value are joined with #.
  • Tuple: similar to a row in a table, with the fields separated by commas. A tuple can hold multiple fields of different types.
  • Bag: an unordered collection of tuples, which may contain duplicate tuples.
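
How each complex type is written, using made-up values:

(John, 25)                        -- a tuple
{(John, 25), (Jane, 30)}          -- a bag of tuples
[city#London, country#UK]         -- a map; key and value are joined with #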

What is the function of Flatten in Pig?

Sometimes the data in a tuple or a bag is nested, and we want to remove that level of nesting. In those cases the FLATTEN modifier in Pig is used. FLATTEN un-nests bags and tuples: for a tuple it substitutes the fields of the tuple in place of the tuple itself, whereas un-nesting a bag is more complex because it requires creating new tuples.

What are describe & explain in Apache Pig scripts?

 Explain & Describe are important utilities for debugging in Apache Pig.

DESCRIBE is helpful when scripting in Pig because it displays the schema of a relation in the script. Developers who are new to Apache Pig can use this utility to understand how each operator modifies the data. A single Pig script can contain many DESCRIBE statements.
EXPLAIN is extremely helpful to Hadoop developers when they are trying to optimize Pig Latin scripts or debug errors. EXPLAIN can be applied to a specific alias in a script, or to the entire script in the Grunt interactive shell. It produces several text-based graphs, which can be printed to files.

What do we know about case sensitivity of Pig?

It is hard to say whether Pig is case sensitive or insensitive. User-defined functions, field names, and relation names in Pig are case sensitive: the function COUNT is not the same as count, and X=load 'foo' is not the same as x=load 'foo'. Keywords in Pig, however, are case insensitive: LOAD is the same as load.

Distinguish between physical & logical plans in an Apache Pig script.

Physical and logical plans are generated while executing a Pig script, after the Pig interpreter has checked it. The logical plan is generated after parsing and semantic verification; no data processing takes place during its generation. A logical plan consists of a collection of operators but does not contain the edges between those operators. After the logical plan is generated, script execution moves to the physical plan, which describes the physical operators Pig will use to execute the script. It is more or less a series of MapReduce jobs, but the plan does not contain any reference to how it will be executed in MapReduce. During the generation of the physical plan, the logical COGROUP operator is converted into three physical operators: Local Rearrange, Global Rearrange, and Package.

Is co-group a group of more than 1 data set?

A group of data sets is referred to as a co-group. When more than one data set is involved, co-group groups all the data sets and then joins them based on a common field. That is why we can say that co-group is a group of more than one data set.

Is Pig Latin a strongly typed language?

In a strongly typed language, the user must declare the type of every variable up front. In Pig, when you describe the schema of the data, Pig expects the data to arrive in that format; if the schema is unknown, the script adapts to the actual data types at runtime. That is why Pig Latin can be called strongly typed in many scenarios but gently (loosely) typed in others: it keeps working even with data that does not match expectations.

Distinguish between COGROUP & GROUP operators.

The GROUP and COGROUP operators are identical and both work with one or more relations. GROUP is usually used to group the data in a single relation, for better readability, while COGROUP gathers the data from two or more relations. COGROUP is a mixture of JOIN and GROUP: it groups the tables based on a column and then joins them on the grouped columns. A COGROUP can involve up to 127 relations at a time.

What do we understand by the outer bag and inner bag in Pig?

 The outer bag is just any relation in Pig whereas any relation within a bag is known as the inner bag.

How can we combine the contents of two or more relations and then divide the resulting relation into two or more relations?

This can be done using the UNION and SPLIT operators: UNION combines the contents of two or more relations into a single relation, and SPLIT then divides that relation into two or more relations, as sketched below.
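
A minimal sketch, assuming two input files with the same schema and a hypothetical threshold on amount:

A = LOAD 'part1.txt' AS (id:int, amount:double);
B = LOAD 'part2.txt' AS (id:int, amount:double);
C = UNION A, B;                                                   -- combine the relations
SPLIT C INTO small IF amount < 100.0, large IF amount >= 100.0;   -- divide into two relations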

What are the various types of UDF’s in Java supported by Apache Pig?

The types of Java User Defined Functions supported in Pig are Eval, Algebraic, and Filter functions.

What functionality is common to Pig and Hive?

Pig Latin and HiveQL both convert their commands into MapReduce jobs, and neither is suitable for online transaction processing, because executing low-latency queries is extremely difficult.

If we have a file employee.txt in an HDFS directory with at least 100 records, and we want to see only the first 25 records from the employee.txt file, how can we do this?

First we load the file employee.txt with the relation name Employee. Then we can pull the first 25 records from that relation using the LIMIT operator: Result = LIMIT Employee 25;

Can we join multiple fields in Apache Pig Scripts?

Yes. We can join on multiple fields in Pig with the JOIN operator, which takes the records of one input and joins them with the records of the other specified input. This is done by specifying the key (or keys) for each input; two rows are joined whenever their keys are equal.
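
A sketch of a join on a composite key (file and field names are assumed):

A = LOAD 'a.txt' AS (id:int, dt:chararray, x:int);
B = LOAD 'b.txt' AS (id:int, dt:chararray, y:int);
C = JOIN A BY (id, dt), B BY (id, dt);   -- rows join only when both id and dt match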

Why do we use Filters in Apache Pig?

Like the WHERE clause in SQL, the FILTER operator in Apache Pig extracts records based on a predicate or specified condition. A record is passed down the pipeline only if the condition evaluates to true. The predicate can use a variety of operators such as ==, <=, !=, and >=. For instance: X = load 'inputs' as (name, address); Y = filter X by name matches 'Mr.*';

What is UDF in Pig?

If the built-in operators do not provide some function you need, you can implement it yourself by writing a User Defined Function (UDF) in a programming language such as Java, Python, or Ruby. The UDF is then registered and invoked from the Pig Latin script.
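
A minimal sketch of registering and calling a Java UDF; the jar name and the UPPER class are hypothetical:

REGISTER myudfs.jar;
A = LOAD 'student.txt' AS (name:chararray, age:int);
B = FOREACH A GENERATE myudfs.UPPER(name), age;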

Where Does Pig Live?

  1. Pig is installed on the user's machine.
  2. Nothing needs to be installed on the Hadoop cluster.
  3. The Pig and Hadoop versions must be compatible.
  4. Pig submits jobs to, and executes them on, the Hadoop cluster.

What is the Difference Between Pig & SQL?

  • Pig is procedural, whereas SQL is declarative.
  • Pig has a nested relational data model, whereas SQL has a flat relational data model.
  • In Pig the schema is optional, whereas in SQL a schema is required.
  • Pig targets scan-centric analytic workloads, whereas SQL targets OLTP and OLAP workloads.
  • Pig offers limited query optimization, whereas SQL offers significant opportunity for query optimization.

Why Do We Need Apache Pig?

Programmers who are not fluent in Java often struggle to work with Hadoop when writing MapReduce tasks. Pig is a boon for such programmers, because:

  • Using Pig Latin, programmers can perform MapReduce tasks easily, without having to type complex code in Java.
  • Since Pig uses multi-query approach, it also helps in reducing the length of codes.
  • It is easy to learn Pig when you are familiar with SQL. It is because Pig Latin is SQL-like language.
  • In order to support data operations, it offers many built-in operators like joins, filters, ordering, and many more. And, it offers nested data types that are missing from MapReduce, for example, tuples, bags, and maps.

What is the difference between Pig and MapReduce?

  • Pig is a data flow language, whereas MapReduce is a data processing paradigm.
  • Pig is a high-level language, whereas MapReduce is low level and rigid.
  • In Apache Pig, performing a join operation is pretty simple, but in MapReduce it is quite difficult to perform a join between data sets.

What are the different UDF’s in Pig?

UDFs can be classified on the basis of the number of records they process at a time. They are of two types:

  • UDF that takes one record at a time, for example, Filter and Eval.
  • UDFs that take multiple records at a time, for example, Avg and Sum.

Also, Pig gives you the facility to write your own UDFs for loading and storing data.

What are the Optimizations a developer can use during joins?

We use a replicated join to join a small dataset with a large dataset. In a replicated join, the small dataset is copied to every machine where a mapper runs, while the large dataset is divided across the nodes; this gives us the advantage of a map-side join.

If your dataset is skewed, i.e. a particular key is repeated many times, then even with a reduce-side join the reducer handling that key will be overloaded and will take a long time. With a skewed join, Pig itself detects the skewed keys and handles them.
And if you have datasets whose records are already sorted on the join field, you can go for a sorted (merge) join; this also happens in the map phase and is very efficient and fast. A sketch of all three joins follows.
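
A rough sketch of the three specialized joins (file and field names are assumed; the small relation must be listed last in a replicated join):

big   = LOAD 'big.txt'   AS (id:int, val:chararray);
small = LOAD 'small.txt' AS (id:int, name:chararray);
rep  = JOIN big BY id, small BY id USING 'replicated';   -- small relation copied to every mapper
skew = JOIN big BY id, small BY id USING 'skewed';       -- handles keys repeated many times
srt  = JOIN big BY id, small BY id USING 'merge';        -- both inputs already sorted on id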

What is the difference between logical and physical plans?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

 Does ‘ILLUSTRATE’ run MR job?

No. ILLUSTRATE pulls only a small sample of the data internally and does not launch any MapReduce job. It simply shows the output of each stage on the console, not the final output.

What co-group does in Pig?

Basically, COGROUP groups the data sets on a common field and then returns a set of records, each containing two separate bags. One bag holds the records of the first data set that match the common field value, while the other bag holds the matching records of the second data set.

What are relational operations in Pig latin?

Relational operations in Pig Latin are:

  • FOREACH
  • ORDER BY
  • FILTER
  • GROUP
  • DISTINCT
  • JOIN
  • LIMIT

What is Pig Useful For?

There are 3 possible categories for which we can use Pig. They are:

1) ETL data pipeline
2) Research on raw data
3) Iterative processing

What does Flatten do in Pig?

Sometimes there is data in a tuple or a bag, and if we want to remove the level of nesting from that data, then the FLATTEN modifier in Pig can be used. FLATTEN un-nests bags and tuples. For tuples, the FLATTEN operator substitutes the fields of a tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
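
A small sketch, assuming each record holds a name and a bag of pets:

A = LOAD 'data' AS (name:chararray, pets:bag{t:(pet:chararray)});
B = FOREACH A GENERATE name, FLATTEN(pets);   -- one output tuple per (name, pet) pair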

How do users interact with the shell in Apache Pig?

Using Grunt, i.e. Apache Pig's interactive shell, users can interact with HDFS or the local file system. To start Grunt, users invoke Apache Pig without specifying a script to run.

For example, executing the command "pig -x local" will result in the prompt:

grunt>

This is where Pig Latin scripts can be run, either in local mode or in cluster mode (the cluster configuration is picked up via PIG_CLASSPATH).

To exit the Grunt shell, press CTRL+D or just type quit.

What do you know about the case sensitivity of Apache Pig?

It is difficult to say whether Apache Pig is case sensitive or case insensitive. For instance, user defined functions, relation names, and field names in Pig are case sensitive, i.e. the function COUNT is not the same as the function count, and X=load 'foo' is not the same as x=load 'foo'. On the other hand, keywords in Apache Pig are case insensitive, i.e. LOAD is the same as load.

What do you understand by an inner bag and outer bag in Pig?

A relation inside a bag is referred to as an inner bag, and an outer bag is just a relation in Pig.

Differentiate between GROUP and COGROUP operators.

Both GROUP and COGROUP operators are identical and can work with one or more relations. GROUP operator is generally used to group the data in a single relation for better readability, whereas COGROUP can be used to group the data in 2 or more relations. COGROUP is more like a combination of GROUP and JOIN, i.e., it groups the tables based on a column and then joins them on the grouped columns. It is possible to cogroup up to 127 relations at a time.

 Explain the difference between COUNT_STAR and COUNT functions in Apache Pig?

The COUNT function does not include NULL values when counting the number of elements in a bag, whereas the COUNT_STAR function includes NULL values while counting.

You have a file employee.txt in the HDFS directory with 100 records. You want to see only the first 10 records from the employee.txt file. How will you do this?

The first step would be to load the file employee.txt into a relation named Employee.

The first 10 records of the employee data can be obtained using the limit operator –

Result = LIMIT Employee 10;
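
Put together (the field names and delimiter are assumed):

Employee = LOAD 'employee.txt' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray);
Result = LIMIT Employee 10;
DUMP Result;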

Explain about the scalar datatypes in Apache Pig.

int, long, float, double, chararray, and bytearray are the available scalar data types in Apache Pig.

Why do we need MapReduce during Pig programming?

Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use for this platform is: Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an execution engine to execute the query. So, when a program is written in Pig Latin, Pig compiler will convert the program into MapReduce jobs. Here, MapReduce acts as the execution engine.

Is the Pig Latin language case sensitive or not?

Pig Latin is only partially case sensitive. For example, the keyword Load is equivalent to load, but:

A=load 'b' is not equivalent to a=load 'b' (relation names are case sensitive).

UDF names are also case sensitive: count is not equivalent to COUNT.

What is grunt shell?

Pig interactive shell is known as Grunt Shell. It provides a shell for users to interact with HDFS.

What co-group does in Pig?

Co-group groups the data sets on their common field and then returns a set of records, each containing two separate bags. The first bag holds the records of the first data set that match the common field value, and the second bag holds the matching records of the second data set.

Why should we use ‘filters’ in pig scripts?

Filters are similar to the WHERE clause in SQL. A FILTER contains a predicate; if the predicate evaluates to true for a given record, that record is passed down the pipeline, otherwise it is not. Predicates can use different operators such as ==, >=, <=, and !=, and == and != can also be applied to maps and tuples.

A = load 'inputs' as (name, address);

B = filter A by name matches 'CM.*';

What is the Physical plan in pig architecture?

At this stage the logical plan of a Pig script is converted into a physical plan: each logical operator is translated into the physical operators that will actually execute it.

What Is the Difference Between MapReduce and Pig?

In MapReduce, you need to write the entire logic yourself for operations like join, group, filter, sum, etc.

  • In Pig, built-in functions are available for these operations.
  • In MapReduce, the number of lines of code required is large even for simple functionality.
  • In Pig, roughly 10 lines of Pig Latin are equivalent to about 200 lines of Java.
  • In MapReduce, the time and effort spent on coding are high.
  • In Pig, what takes around 4 hours to write in Java can take about 15 minutes in Pig Latin (approx.).
  • MapReduce: lower productivity.
  • Pig: high productivity.

 What are the primitive data types in pig?

  1. int
  2. long
  3. float
  4. double
  5. chararray
  6. bytearray

What are the relational operators available related to loading and storing in pig language?

For loading data and storing it into HDFS, Pig uses the following operators.

  1. LOAD
  2. STORE

LOAD loads data from the file system, and STORE stores data into the file system.
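
A minimal sketch of both operators (paths, delimiter, and field names are assumed):

A = LOAD '/data/input.txt' USING PigStorage('\t') AS (id:int, name:chararray);
STORE A INTO '/data/output' USING PigStorage(',');   -- writes part files into the output directory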

How would you diagnose or do exception handling in the pig?

For diagnosing issues or handling problems in a Pig script, we can use the following operators.

  • DUMP
  • DESCRIBE
  • ILLUSTRATE
  • EXPLAIN

DUMP displays the results on the screen. DESCRIBE displays the schema of a particular relation. ILLUSTRATE displays the step-by-step execution of a sequence of Pig statements. EXPLAIN displays the execution plan for Pig Latin statements.

What is the difference between store and dumps commands?

The DUMP command processes the data and displays the result on the terminal, but the output is not stored anywhere. STORE, on the other hand, writes the output into a folder on the local file system or HDFS. In a production environment, Hadoop developers most often use the STORE command to persist data in HDFS.
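
Assuming a relation A has already been loaded, the contrast looks like this:

DUMP A;                          -- results printed to the console only
STORE A INTO '/data/out_dir';    -- results written into a directory on HDFS (or the local file system)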

So, this brings us to the end of the Apache Pig Interview Questions blog. This Tecklearn ‘Top Apache Pig Interview Questions and Answers’ helps you with commonly asked questions if you are looking out for a job in Apache Pig or the Big Data domain. If you wish to learn Apache Pig and build a career in the Big Data domain, then check out our interactive Big Data Hadoop Analyst Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/big-data-hadoop-analyst/

Big Data Hadoop Analyst Training

About the Course

Big Data analysis is emerging as a key advantage in business intelligence for many organizations. Our Big Data and Hadoop training course lets you deep-dive into the concepts of Big Data, equipping you with the skills required for Hadoop Analyst roles. This course will enable an Analyst to work on Big Data and Hadoop which takes into consideration the burgeoning demands of the industry to process and analyse data at high speeds. This training course will give you the right skills to deploy various tools and techniques to be a Hadoop Analyst working with Big Data.

Why Should you take Hadoop Analyst Training?

  • Average salary for a Big Data Hadoop Analyst is $115,819– ZipRecruiter.com.
  • Hadoop Market is expected to reach $99.31B by 2022 growing at a CAGR of 42.1% from 2015 – Forbes.
  • Amazon, Cloudera, Data Stax, DELL, EMC2, IBM, Microsoft & other MNCs worldwide use Hadoop

What you will Learn in this Course?

Hadoop Fundamentals

  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Pig, Hive, and Impala
  • Data Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenarios Explanation

Introduction to Pig

  • What Is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig

Basic Data Analysis with Pig

  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly-Used Functions

Processing Complex Data with Pig

  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-In Functions for Complex Data
  • Iterating Grouped Data

Multi-Dataset Operations with Pig

  • Techniques for Combining Data Sets
  • Joining Data Sets in Pig
  • Set Operations
  • Splitting Data Sets

Pig Troubleshooting and Optimization

  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Your Pig Jobs

Introduction to Hive and Impala

  • What Is Hive?
  • What Is Impala?
  • Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Querying with Hive and Impala

  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Differences Between Hive and Impala Query Syntax
  • Using Hue to Execute Queries
  • Using the Impala Shell

Data Management

  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results

Data Storage and Performance

  • Partitioning Tables
  • Choosing a File Format
  • Managing Metadata
  • Controlling Access to Data

Relational Data Analysis with Hive and Impala

  • Joining Datasets
  • Common Built-In Functions
  • Aggregation and Windowing

Working with Impala 

  • How Impala Executes Queries
  • Extending Impala with User-Defined Functions
  • Improving Impala Performance

Analyzing Text and Complex Data with Hive

  • Complex Values in Hive
  • Using Regular Expressions in Hive
  • Sentiment Analysis and N-Grams
  • Conclusion

Hive Optimization 

  • Understanding Query Performance
  • Controlling Job Execution Plan
  • Bucketing
  • Indexing Data

Extending Hive 

  • SerDes
  • Data Transformation with Custom Scripts
  • User-Defined Functions
  • Parameterized Queries

Choosing the Best Tool for the Job

  • Comparing MapReduce, Pig, Hive, Impala, and Relational Databases

Got a question for us? Please mention it in the comments section and we will get back to you.

 
