Concept of Combiners in Hadoop MapReduce

Last updated on May 30 2022
Sanjay Grover


MapReduce – Combiners

A Combiner, also referred to as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and then passing its output key-value pairs to the Reducer class.
The main function of a Combiner is to summarize the map output records that share the same key. The output (key-value collection) of the combiner is sent over the network to the actual Reducer task as input.

Combiner

The Combiner class is used between the Map class and the Reduce class to reduce the volume of data transferred between Map and Reduce. Usually, the output of the map task is large, so the amount of data transferred to the reduce task is high.
The following MapReduce task diagram shows the Combiner phase.

[Diagram: MapReduce job flow showing the Combiner phase between the Map and Reduce phases]

How Does a Combiner Work?

Here is a brief summary of how a MapReduce Combiner works −
• A combiner does not have a predefined interface; it must implement the Reducer interface's reduce() method.
• A combiner operates on each map output key. It must have the same output key-value types as the Reducer class.
• A combiner can produce summary information from a large dataset, because it replaces the original Map output.
Although the Combiner is optional, it helps segregate data into multiple groups for the Reduce phase, which makes it easier to process. A minimal sketch of what a combiner class looks like is given below.
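The sketch below illustrates these requirements with a stand-alone combiner for a word-count style job. It is a minimal sketch rather than part of the original example: the class name WordCountCombiner is hypothetical, it assumes the same imports as the Mapper and Reducer snippets later in this post, and the example program further down simply reuses its Reducer class as the combiner instead.

public static class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private IntWritable partialSum = new IntWritable();

    // A combiner implements the same reduce() method a Reducer would, and its
    // output key-value types (Text, IntWritable) match what the Reducer expects as input.
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();   // sum the counts emitted by one map task for this key
        }
        partialSum.set(sum);
        context.write(key, partialSum);   // emit one summarized record per key
    }
}

// Registered on the job with:
// job.setCombinerClass(WordCountCombiner.class);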

MapReduce Combiner Implementation

The following example provides a theoretical idea about combiners. Let us assume we have the following input text file named input.txt for MapReduce.
What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance
The important phases of the MapReduce program with Combiner are discussed below.
Record Reader
This is the first phase of MapReduce where the Record Reader reads every line from the input text file as text and yields output as key-value pairs.
Input − Line-by-line text from the input file.
Output − Forms the key-value pairs. The following is the set of expected key-value pairs.
<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>
Map Phase
The Map phase takes input from the Record Reader, processes it, and produces the output as another set of key-value pairs.
Input − The following key-value pairs are the input taken from the Record Reader.
<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>
The Map phase reads each key-value pair, splits each word from the value using StringTokenizer, and treats each word as the key and the count of that word as the value. The following code snippet shows the Mapper class and the map function.

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context) throws IOException, InterruptedException 
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) 
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}

Output − The expected output is as follows −
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>
Combiner Phase
The Combiner phase takes each key-value pair from the Map phase, processes it, and produces the output as key-value collection pairs.
Input − The following key-value pairs are the input taken from the Map phase.
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

The Combiner phase reads each key-value pair, combines the common words as the key and their values as a collection. Usually, the code and operation of a Combiner are similar to those of a Reducer. Following is the code snippet for the Mapper, Combiner, and Reducer class declarations.
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
Output − The expected output is as follows −
<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
Reducer Phase
The Reducer phase takes each key-value collection pair from the Combiner phase, processes it, and passes the output as key-value pairs. Note that the Combiner functionality here is the same as the Reducer's; when IntSumReducer is reused as the combiner, it already sums the values it sees on each map task, so the grouped form shown above is a simplified illustration.
Input − The following key-value pairs are the input taken from the Combiner phase.
<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
The Reducer phase reads each key-value pair and sums the values for each key. Following is the code snippet for the Reducer (the same IntSumReducer class is also registered as the Combiner).

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> 
{
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException 
{
int sum = 0;
for (IntWritable val : values) 
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

Output − The expected output from the Reducer phase is as follows −
<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
Record Writer
This is the last phase of MapReduce where the Record Writer writes every key-value pair from the Reducer phase and sends the output as text.
Input − Each key-value pair from the Reducer phase, along with the Output format.
Output − It gives you the key-value pairs in text format. Following is the expected output.
What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1
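As a side note, the input and output formats used by the Record Reader and Record Writer can also be stated explicitly on the job. This is a minimal optional sketch; TextInputFormat and TextOutputFormat are already the defaults for a job like this one, so these lines only make the choice visible in the driver.

// Optional: declare the default formats explicitly in the driver.
// Assumes the imports org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// and org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);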
Example Program
The following code block is the complete program; it counts the number of occurrences of each word in the input file.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context) throws IOException, InterruptedException 
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) 
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> 
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException 
{
int sum = 0;
for (IntWritable val : values) 
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception 
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Save the above program as WordCount.java. The compilation and execution of the program are explained below.

Compilation and Execution

Let us assume we are in the home directory of Hadoop user (for example, /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1 − Use the following command to create a directory to store the compiled java classes.
$ mkdir units
Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. You can download the jar from mvnrepository.com.
Let us assume the downloaded folder is /home/hadoop/.
Step 3 − Use the following commands to compile the WordCount.java program and to create a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units WordCount.java
$ jar -cvf units.jar -C units/ .
Step 4 − Use the following command to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5 − Use the following command to copy the input file named input.txt in the input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/input.txt input_dir
Step 6 − Use the following command to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7 − Use the following command to run the Word count application by taking input files from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar WordCount input_dir output_dir
Wait a while until the job finishes. After execution, the console output shows the number of input splits, Map tasks, and Reduce tasks.
Step 8 − Use the following command to verify the resultant files in the output folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/
Step 9 − Use the following command to see the output in the part-r-00000 file. This file is generated by the MapReduce job in the output directory.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-r-00000
Following is the output generated by the MapReduce program.
What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1
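As an optional check, the effect of the combiner can be verified from the job's built-in counters. The following is a minimal sketch, assuming a Hadoop 2.x or later client library (where the org.apache.hadoop.mapreduce.TaskCounter enum is available) and the WordCount driver shown above; it reads the combine counters after the job finishes instead of calling System.exit immediately.

import org.apache.hadoop.mapreduce.TaskCounter;

// ... at the end of main(), after configuring the job:
boolean success = job.waitForCompletion(true);

// The framework's task counters report how many records the combiner consumed and emitted.
long combineIn  = job.getCounters().findCounter(TaskCounter.COMBINE_INPUT_RECORDS).getValue();
long combineOut = job.getCounters().findCounter(TaskCounter.COMBINE_OUTPUT_RECORDS).getValue();

System.out.println("Combine input records : " + combineIn);
System.out.println("Combine output records: " + combineOut);

System.exit(success ? 0 : 1);

A large drop from combine input records to combine output records indicates that the combiner is summarizing many duplicate keys before the shuffle.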
So, this brings us to the end of the blog. This Tecklearn 'Concept of Combiners in Hadoop MapReduce' blog helps you with commonly asked questions if you are looking out for a job in the Big Data and Hadoop domain.
If you wish to learn Hadoop MapReduce and build a career in the Big Data or Hadoop domain, then check out our interactive Big Data Hadoop-Architect (All in 1) Combo Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

BigData Hadoop-Architect (All in 1) | Combo Course

Big Data Hadoop-Architect (All in 1) Combo Training

About the Course

Tecklearn’s Big Data Hadoop-Architect (All in 1) combo includes the following Courses:
• BigData Hadoop Analyst
• BigData Hadoop Developer
• BigData Hadoop Administrator
• BigData Hadoop Tester
• Big Data Security with Kerberos
Why Should you take Big Data Hadoop Combo Training?
• Average salary for a Hadoop Administrator ranges from approximately $104,528 to $141,391 per annum – Indeed.com
• Average salary for a Spark and Hadoop Developer ranges from approximately $106,366 to $127,619 per annum – Indeed.com
• Average salary for a Big Data Hadoop Analyst is $115,819– ZipRecruiter.com
What you will Learn in this Course?
Introduction
• The Case for Apache Hadoop
• Why Hadoop?
• Core Hadoop Components
• Fundamental Concepts
HDFS
• HDFS Features
• Writing and Reading Files
• NameNode Memory Considerations
• Overview of HDFS Security
• Using the Namenode Web UI
• Using the Hadoop File Shell
Getting Data into HDFS
• Ingesting Data from External Sources with Flume
• Ingesting Data from Relational Databases with Sqoop
• REST Interfaces
• Best Practices for Importing Data
YARN and MapReduce
• What Is MapReduce?
• Basic MapReduce Concepts
• YARN Cluster Architecture
• Resource Allocation
• Failure Recovery
• Using the YARN Web UI
• MapReduce Version 1
Planning Your Hadoop Cluster
• General Planning Considerations
• Choosing the Right Hardware
• Network Considerations
• Configuring Nodes
• Planning for Cluster Management
Hadoop Installation and Initial Configuration
• Deployment Types
• Installing Hadoop
• Specifying the Hadoop Configuration
• Performing Initial HDFS Configuration
• Performing Initial YARN and MapReduce Configuration
• Hadoop Logging
Installing and Configuring Hive, Impala, and Pig
• Hive
• Impala
• Pig
Hadoop Clients
• What is a Hadoop Client?
• Installing and Configuring Hadoop Clients
• Installing and Configuring Hue
• Hue Authentication and Authorization
Cloudera Manager
• The Motivation for Cloudera Manager
• Cloudera Manager Features
• Express and Enterprise Versions
• Cloudera Manager Topology
• Installing Cloudera Manager
• Installing Hadoop Using Cloudera Manager
• Performing Basic Administration Tasks Using Cloudera Manager
Advanced Cluster Configuration
• Advanced Configuration Parameters
• Configuring Hadoop Ports
• Explicitly Including and Excluding Hosts
• Configuring HDFS for Rack Awareness
• Configuring HDFS High Availability
Hadoop Security
• Why Hadoop Security Is Important
• Hadoop’s Security System Concepts
• What Kerberos Is and How it Works
• Securing a Hadoop Cluster with Kerberos
Managing and Scheduling Jobs
• Managing Running Jobs
• Scheduling Hadoop Jobs
• Configuring the Fair Scheduler
• Impala Query Scheduling
Cluster Maintenance
• Checking HDFS Status
• Copying Data Between Clusters
• Adding and Removing Cluster Nodes
• Rebalancing the Cluster
• Cluster Upgrading
Cluster Monitoring and Troubleshooting
• General System Monitoring
• Monitoring Hadoop Clusters
• Troubleshooting Hadoop Clusters
• Common Misconfigurations
Introduction to Pig
• What Is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig
Basic Data Analysis with Pig
• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-Used Functions
Processing Complex Data with Pig
• Storage Formats
• Complex/Nested Data Types
• Grouping
• Built-In Functions for Complex Data
• Iterating Grouped Data
Multi-Dataset Operations with Pig
• Techniques for Combining Data Sets
• Joining Data Sets in Pig
• Set Operations
• Splitting Data Sets
Pig Troubleshooting and Optimization
• Troubleshooting Pig
• Logging
• Using Hadoop’s Web UI
• Data Sampling and Debugging
• Performance Overview
• Understanding the Execution Plan
• Tips for Improving the Performance of Your Pig Jobs
Introduction to Hive and Impala
• What Is Hive?
• What Is Impala?
• Schema and Data Storage
• Comparing Hive to Traditional Databases
• Hive Use Cases
Querying with Hive and Impala
• Databases and Tables
• Basic Hive and Impala Query Language Syntax
• Data Types
• Differences Between Hive and Impala Query Syntax
• Using Hue to Execute Queries
• Using the Impala Shell
Data Management
• Data Storage
• Creating Databases and Tables
• Loading Data
• Altering Databases and Tables
• Simplifying Queries with Views
• Storing Query Results
Data Storage and Performance
• Partitioning Tables
• Choosing a File Format
• Managing Metadata
• Controlling Access to Data
Relational Data Analysis with Hive and Impala
• Joining Datasets
• Common Built-In Functions
• Aggregation and Windowing
Working with Impala
• How Impala Executes Queries
• Extending Impala with User-Defined Functions
• Improving Impala Performance
Analyzing Text and Complex Data with Hive
• Complex Values in Hive
• Using Regular Expressions in Hive
• Sentiment Analysis and N-Grams
• Conclusion
Hive Optimization
• Understanding Query Performance
• Controlling Job Execution Plan
• Bucketing
• Indexing Data
Extending Hive
• SerDes
• Data Transformation with Custom Scripts
• User-Defined Functions
• Parameterized Queries
Importing Relational Data with Apache Sqoop
• Sqoop Overview
• Basic Imports and Exports
• Limiting Results
• Improving Sqoop’s Performance
• Sqoop 2
Introduction to Impala and Hive
• Introduction to Impala and Hive
• Why Use Impala and Hive?
• Comparing Hive to Traditional Databases
• Hive Use Cases
Modelling and Managing Data with Impala and Hive
• Data Storage Overview
• Creating Databases and Tables
• Loading Data into Tables
• HCatalog
• Impala Metadata Caching
Data Formats
• Selecting a File Format
• Hadoop Tool Support for File Formats
• Avro Schemas
• Using Avro with Hive and Sqoop
• Avro Schema Evolution
• Compression
Data Partitioning
• Partitioning Overview
• Partitioning in Impala and Hive
Capturing Data with Apache Flume
• What is Apache Flume?
• Basic Flume Architecture
• Flume Sources
• Flume Sinks
• Flume Channels
• Flume Configuration
Spark Basics
• What is Apache Spark?
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark
Working with RDDs in Spark
• A Closer Look at RDDs
• Key-Value Pair RDDs
• MapReduce
• Other Pair RDD Operations
Writing and Deploying Spark Applications
• Spark Applications vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Configuring Spark Properties
• Logging
Parallel Programming with Spark
• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File-based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
Spark Caching and Persistence
• RDD Lineage
• Caching Overview
• Distributed Persistence
Common Patterns in Spark Data Processing
• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means
Preview: Spark SQL
• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• Comparing Spark SQL with Impala
Hadoop Testing
• Hadoop Application Testing
• Roles and Responsibilities of Hadoop Testing Professional
• Framework MRUnit for Testing of MapReduce Programs
• Unit Testing
• Test Execution
• Test Plan Strategy and Writing Test Cases for Testing Hadoop Application
Big Data Testing
• BigData Testing
• Unit Testing
• Integration Testing
• Functional Testing
• Non-Functional Testing
• Golden Data Set
System Testing
• Building and Set up
• Testing SetUp
• Solary Server
• Non-Functional Testing
• Longevity Testing
• Volumetric Testing
Security Testing
• Security Testing
• Non-Functional Testing
• Hadoop Cluster
• Security-Authorization RBA
• IBM Project
Automation Testing
• Query Surge Tool
Oozie
• Why Oozie
• Installation Engine
• Oozie Workflow Engine
• Oozie security
• Oozie Job Process
• Oozie terminology
• Oozie bundle
Got a question for us? Please mention it in the comments section and we will get back to you.

 
