Explanation of Apache Pig Group and Cogroup Operators

Last updated on May 30 2022
Inderjeet Chopra

Table of Contents

Explanation of Apache Pig Group and Cogroup Operators

Apache Pig – Group Operator

The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation name student_details as shown below.
grunt> student_details = LOAD ‘hdfs://localhost:9000/pig_data/student_details.txt’ USING PigStorage(‘,’)
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;

Verification

Verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns −
• One is age, by which we have grouped the relation.
• The other is a bag, which contains the group of tuples, student records with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hydera bad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,984802233 8,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)})
You can see the schema of the table after grouping the data using the describe command as shown below.
grunt> Describe group_data;

group_data: {group: int,student_details: {(id: int,firstname: chararray,
lastname: chararray,age: int,phone: chararray,city: chararray)}}
In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.
$ Illustrate group_data;
It will produce the following output −
————————————————————————————————-
|group_data| group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}|
————————————————————————————————-
| | 21 | { 4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)}|
| | 2 | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(003,Rajesh,Khanna,22,9848022339,Delhi)}|
————————————————————————————————-
Grouping by Multiple Columns
Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump operator as shown below.
grunt> Dump group_multiple;

((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
(24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All

You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;
Now, verify the content of the relation group_all as shown below.
grunt> Dump group_all;

(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334 ,trivendram),
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuw aneshwar),
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyd erabad)})

Apache Pig – Cogroup Operator

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup

Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
employee_details.txt
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.
grunt> student_details = LOAD ‘hdfs://localhost:9000/pig_data/student_details.txt’ USING PigStorage(‘,’)
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

grunt> employee_details = LOAD ‘hdfs://localhost:9000/pig_data/employee_details.txt’ USING PigStorage(‘,’)
as (id:int, name:chararray, age:int, city:chararray);
Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
Verification
Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
Output
It will produce the following output, displaying the contents of the relation named cogroup_data as shown below.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)},
{ })
(22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) },
{ (6,Maggy,22,Chennai),(1,Robin,22,newyork) })
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)},
{(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)},
{ })
(25,{ },
{(4,Sara,25,London)})
The cogroup operator groups the tuples from each relation according to age where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −
• the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
• the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
In case a relation doesn’t have tuples having the age value 21, it returns an empty bag.
So, this brings us to the end of blog. This Tecklearn ‘Explanation of Apache Pig Group and Cogroup operators’ helps you with commonly asked questions if you are looking out for a job in Apache Pig and Big Data Domain.
If you wish to learn Apache Pig and build a career in Apache Pig or Big Data domain, then check out our interactive, Big Data Hadoop Analyst Training, that comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

Big Data Hadoop Analyst

Big Data Hadoop Analyst Training

About the Course

Big Data analysis is emerging as a key advantage in business intelligence for many organizations. Our Big Data and Hadoop training course lets you deep-dive into the concepts of Big Data, equipping you with the skills required for Hadoop Analyst roles. This course will enable an Analyst to work on Big Data and Hadoop which takes into consideration the burgeoning demands of the industry to process and analyse data at high speeds. This training course will give you the right skills to deploy various tools and techniques to be a Hadoop Analyst working with Big Data.

Why Should you take Hadoop Analyst Training?

• Average salary for a Big Data Hadoop Analyst is $115,819– ZipRecruiter.com.
• Hadoop Market is expected to reach $99.31B by 2022 growing at a CAGR of 42.1% from 2015 – Forbes.
• Amazon, Cloudera, Data Stax, DELL, EMC2, IBM, Microsoft & other MNCs worldwide use Hadoop

What you will Learn in this Course?

Hadoop Fundamentals
• The Motivation for Hadoop
• Hadoop Overview
• Data Storage: HDFS
• Distributed Data Processing: YARN, MapReduce, and Spark
• Data Processing and Analysis: Pig, Hive, and Impala
• Data Integration: Sqoop
• Other Hadoop Data Tools
• Exercise Scenarios Explanation
Introduction to Pig
• What Is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig
Basic Data Analysis with Pig
• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-Used Functions
Processing Complex Data with Pig
• Storage Formats
• Complex/Nested Data Types
• Grouping
• Built-In Functions for Complex Data
• Iterating Grouped Data
Multi-Dataset Operations with Pig
• Techniques for Combining Data Sets
• Joining Data Sets in Pig
• Set Operations
• Splitting Data Sets
Pig Troubleshooting and Optimization
• Troubleshooting Pig
• Logging
• Using Hadoop’s Web UI
• Data Sampling and Debugging
• Performance Overview
• Understanding the Execution Plan
• Tips for Improving the Performance of Your Pig Jobs
Introduction to Hive and Impala
• What Is Hive?
• What Is Impala?
• Schema and Data Storage
• Comparing Hive to Traditional Databases
• Hive Use Cases
Querying with Hive and Impala
• Databases and Tables
• Basic Hive and Impala Query Language Syntax
• Data Types
• Differences Between Hive and Impala Query Syntax
• Using Hue to Execute Queries
• Using the Impala Shell
Data Management
• Data Storage
• Creating Databases and Tables
• Loading Data
• Altering Databases and Tables
• Simplifying Queries with Views
• Storing Query Results
Data Storage and Performance
• Partitioning Tables
• Choosing a File Format
• Managing Metadata
• Controlling Access to Data
Relational Data Analysis with Hive and Impala
• Joining Datasets
• Common Built-In Functions
• Aggregation and Windowing
Working with Impala
• How Impala Executes Queries
• Extending Impala with User-Defined Functions
• Improving Impala Performance
Analyzing Text and Complex Data with Hive
• Complex Values in Hive
• Using Regular Expressions in Hive
• Sentiment Analysis and N-Grams
• Conclusion
Hive Optimization
• Understanding Query Performance
• Controlling Job Execution Plan
• Bucketing
• Indexing Data
Extending Hive
• SerDes
• Data Transformation with Custom Scripts
• User-Defined Functions
• Parameterized Queries
Choosing the Best Tool for the Job
• Comparing MapReduce, Pig, Hive, Impala, and Relational Databases

Got a question for us? Please mention it in the comments section and we will get back to you.

 

0 responses on "Explanation of Apache Pig Group and Cogroup Operators"

Leave a Message

Your email address will not be published. Required fields are marked *