Deep Dive into Pig Latin Diagnostic Operators

Last updated on May 30 2022
Inderjeet Chopra

Apache Pig – Diagnostic Operators

The LOAD statement simply loads data into the specified relation in Apache Pig. To verify that the LOAD statement has executed correctly, you use the diagnostic operators. Pig Latin provides four diagnostic operators:
• Dump operator
• Describe operator
• Explain operator
• Illustrate operator
In this blog, we will discuss each of these operators in turn.

Dump Operator

The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes. Pig evaluates statements lazily, so a relation is not actually computed until you Dump (or STORE) it.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name;
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us print the contents of the relation using the Dump operator as shown below.
grunt> Dump student;
Once you execute the above Pig Latin statement, Pig starts a MapReduce job to read the data from HDFS and produces the following output.
2015-10-01 15:05:27,642 [main]
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
100% complete
2015-10-01 15:05:27,652 [main]
INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0      Hadoop  2015-10-01 15:03:11  2015-10-01 15:05:27  UNKNOWN

Success!
Job Stats (time in seconds):

JobId job_1443519499159_0004
Maps 1
Reduces 0
MaxMapTime n/a
MinMapTime n/a
AvgMapTime n/a
MedianMapTime n/a
MaxReduceTime 0
MinReduceTime 0
AvgReduceTime 0
MedianReducetime 0
Alias student
Feature MAP_ONLY
Outputs hdfs://localhost:9000/tmp/temp580182027/tmp757878456,

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"

Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp580182027/tmp757878456"

Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager
spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0

Job DAG: job_1443519499159_0004

2015-10-01 15:06:28,403 [main]
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-01 15:06:28,441 [main] INFO org.apache.pig.data.SchemaTupleBackend -
Key [pig.schematuple] was not set... will not generate code.
2015-10-01 15:06:28,485 [main]
INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
to process : 1
2015-10-01 15:06:28,485 [main]
INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths
to process : 1

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
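Dump is especially handy on intermediate relations while you are building up a script step by step. As a minimal sketch of our own (the students_delhi relation below is illustrative and is not part of the original example), you can filter the student relation and dump only the matching tuples:
grunt> students_delhi = FILTER student BY city == 'Delhi';
grunt> Dump students_delhi;
(3,Rajesh,Khanna,9848022339,Delhi)
Because only the final Dump triggers execution, you can chain as many intermediate relations as you like and inspect each one on demand.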

Apache Pig – Describe Operator

The describe operator is used to view the schema of a relation.
Syntax
The syntax of the describe operator is as follows −
grunt> describe Relation_name;
Example
Assume we have the same file student_data.txt in HDFS and have read it into the relation student using the LOAD statement shown in the Dump operator example above.
Now, let us describe the relation named student and verify the schema as shown below.
grunt> describe student;
Output
Once you execute the above Pig Latin statement, it will produce the following output.
grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }
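describe becomes even more useful once relations are derived from one another, because it shows how the schema evolves at each step. As a hedged sketch of our own (the grouped relation below is not part of the original example), grouping student by city produces a nested schema containing a bag of the original tuples:
grunt> grouped = GROUP student BY city;
grunt> describe grouped;
grouped: {group: chararray,student: {(id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray)}}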

Apache Pig – Explain Operator

The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;
Example
Assume we have the same file student_data.txt in HDFS and have read it into the relation student using the LOAD statement shown in the Dump operator example above.
Now, let us explain the relation named student using the explain operator as shown below.
grunt> explain student;
Output
It will produce the following output.

2015-10-05 11:32:43,660 [main]
INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter,
MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer,
PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
student: (Name: LOStore Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
|
|---student: (Name: LOForEach Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
| |
| (Name: LOGenerate[false,false,false,false,false] Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)ColumnPrune:InputUids=[34, 35, 32, 33, 31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31]
| | |
| | (Name: Cast Type: int Uid: 31)
| | |
| | |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
| | |
| | (Name: Cast Type: chararray Uid: 32)
| | |
| | |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*))
| | |
| | (Name: Cast Type: chararray Uid: 33)
| | |
| | |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*))
| | |
| | (Name: Cast Type: chararray Uid: 34)
| | |
| | |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*))
| | |
| | (Name: Cast Type: chararray Uid: 35)
| | |
| | |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: id#31:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray)
| |
| |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray)
| |
| |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray)
| |
| |---(Name: LOInnerLoad[4] Schema: city#35:bytearray)
|
|---student: (Name: LOLoad Schema: id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray,city#35:bytearray)RequiredFields:null
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
| |
| Cast[int] - scope-21
| |
| |---Project[bytearray][0] - scope-20
| |
| Cast[chararray] - scope-24
| |
| |---Project[bytearray][1] - scope-23
| |
| Cast[chararray] - scope-27
| |
| |---Project[bytearray][2] - scope-26
| |
| Cast[chararray] - scope-30
| |
| |---Project[bytearray][3] - scope-29
| |
| Cast[chararray] - scope-33
| |
| |---Project[bytearray][4] - scope-32
|
|---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
2015-10-05 11:32:43,682 [main]
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
File concatenation threshold: 100 optimistic? false
2015-10-05 11:32:43,684 [main]
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer -
MR plan size before optimization: 1
2015-10-05 11:32:43,685 [main]
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer -
MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-37
Map Plan
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
| |
| Cast[int] - scope-21
| |
| |---Project[bytearray][0] - scope-20
| |
| Cast[chararray] - scope-24
| |
| |---Project[bytearray][1] - scope-23
| |
| Cast[chararray] - scope-27
| |
| |---Project[bytearray][2] - scope-26
| |
| Cast[chararray] - scope-30
| |
| |---Project[bytearray][3] - scope-29
| |
| Cast[chararray] - scope-33
| |
| |---Project[bytearray][4] - scope-32
|
|---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
--------
Global sort: false
----------------
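When the plans get long, explain can also redirect or condense its output instead of flooding the console. As a hedged sketch (flag names per the Apache Pig documentation; verify them against your Pig version), you can print a shortened plan or write all three plans to a file:
grunt> explain -brief student;
grunt> explain -out /tmp/student_plan.txt student;
The output path here (/tmp/student_plan.txt) is just an illustrative location.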

Apache Pig – Illustrate Operator

The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Example
Assume we have the same file student_data.txt in HDFS and have read it into the relation student using the LOAD statement shown in the Dump operator example above.
Now, let us illustrate the relation named student as shown below.
grunt> illustrate student;
Output
On executing the above statement, you will get the following output.

INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases
being processed per job phase (AliasName[line,offset]): M: student[1,10] C: R:
---------------------------------------------------------------------------------------------
|student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
---------------------------------------------------------------------------------------------
|        | 002    | siddarth            | Battacharya        | 9848022338      | Kolkata        |
---------------------------------------------------------------------------------------------
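illustrate is most instructive on multi-step scripts, because it takes a small sample of the input (fabricating rows if necessary) and traces it through every intermediate relation. As a minimal sketch of our own (the students_kolkata relation below is illustrative and not from the original example), illustrating a filtered relation prints one sample table per step:
grunt> students_kolkata = FILTER student BY city == 'Kolkata';
grunt> illustrate students_kolkata;
The output then shows a sample row entering the student relation and the same row surviving the filter into students_kolkata.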
So, this brings us to the end of the blog. This Tecklearn 'Deep Dive into Pig Latin Diagnostic Operators' blog helps you with commonly asked questions if you are looking for a job in the Apache Pig and Big Data domain.
If you wish to learn Apache Pig and build a career in the Apache Pig or Big Data domain, then check out our interactive Big Data Hadoop Analyst Training, which comes with 24*7 support to guide you throughout your learning period. Please find the course details below:

Big Data Hadoop Analyst Training

About the Course

Big Data analysis is emerging as a key advantage in business intelligence for many organizations. Our Big Data and Hadoop training course lets you deep-dive into the concepts of Big Data, equipping you with the skills required for Hadoop Analyst roles. Built around the industry's growing demand to process and analyse data at high speed, the course gives you the right skills to deploy a variety of tools and techniques as a Hadoop Analyst working with Big Data.

Why Should you take Hadoop Analyst Training?

• The average salary for a Big Data Hadoop Analyst is $115,819 – ZipRecruiter.com.
• The Hadoop market is expected to reach $99.31B by 2022, growing at a CAGR of 42.1% from 2015 – Forbes.
• Amazon, Cloudera, DataStax, DELL, EMC2, IBM, Microsoft, and other MNCs worldwide use Hadoop.

What you will Learn in this Course?

Hadoop Fundamentals
• The Motivation for Hadoop
• Hadoop Overview
• Data Storage: HDFS
• Distributed Data Processing: YARN, MapReduce, and Spark
• Data Processing and Analysis: Pig, Hive, and Impala
• Data Integration: Sqoop
• Other Hadoop Data Tools
• Exercise Scenarios Explanation
Introduction to Pig
• What Is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig
Basic Data Analysis with Pig
• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-Used Functions
Processing Complex Data with Pig
• Storage Formats
• Complex/Nested Data Types
• Grouping
• Built-In Functions for Complex Data
• Iterating Grouped Data
Multi-Dataset Operations with Pig
• Techniques for Combining Data Sets
• Joining Data Sets in Pig
• Set Operations
• Splitting Data Sets
Pig Troubleshooting and Optimization
• Troubleshooting Pig
• Logging
• Using Hadoop’s Web UI
• Data Sampling and Debugging
• Performance Overview
• Understanding the Execution Plan
• Tips for Improving the Performance of Your Pig Jobs
Introduction to Hive and Impala
• What Is Hive?
• What Is Impala?
• Schema and Data Storage
• Comparing Hive to Traditional Databases
• Hive Use Cases
Querying with Hive and Impala
• Databases and Tables
• Basic Hive and Impala Query Language Syntax
• Data Types
• Differences Between Hive and Impala Query Syntax
• Using Hue to Execute Queries
• Using the Impala Shell
Data Management
• Data Storage
• Creating Databases and Tables
• Loading Data
• Altering Databases and Tables
• Simplifying Queries with Views
• Storing Query Results
Data Storage and Performance
• Partitioning Tables
• Choosing a File Format
• Managing Metadata
• Controlling Access to Data
Relational Data Analysis with Hive and Impala
• Joining Datasets
• Common Built-In Functions
• Aggregation and Windowing
Working with Impala
• How Impala Executes Queries
• Extending Impala with User-Defined Functions
• Improving Impala Performance
Analyzing Text and Complex Data with Hive
• Complex Values in Hive
• Using Regular Expressions in Hive
• Sentiment Analysis and N-Grams
• Conclusion
Hive Optimization
• Understanding Query Performance
• Controlling Job Execution Plan
• Bucketing
• Indexing Data
Extending Hive
• SerDes
• Data Transformation with Custom Scripts
• User-Defined Functions
• Parameterized Queries
Choosing the Best Tool for the Job
• Comparing MapReduce, Pig, Hive, Impala, and Relational Databases

Got a question for us? Please mention it in the comments section and we will get back to you.