
Top Big Data Hadoop Interview Questions and Answers

Last updated on Feb 18 2022
Rajnikanth S


Which companies use Hadoop?

Yahoo! (the biggest contributor to the creation of Hadoop; its search engine uses Hadoop), Facebook (which developed Hive for analysis), Amazon, Netflix, Adobe, eBay, Spotify, and Twitter.

What is the use of RecordReader in Hadoop?

Though InputSplit defines a slice of work, it does not describe how to access it. This is where the RecordReader class comes into the picture: it takes the byte-oriented data from its source and converts it into record-oriented key-value pairs that the Mapper task can read. The InputFormat defines which RecordReader instance is used.

What is Speculative Execution in Hadoop?

One limitation of Hadoop is that, by distributing the tasks on several nodes, there is a chance that a few slow nodes limit the rest of the program. There are various reasons for tasks to be slow, and they are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is speculative execution.

Hadoop creates a duplicate task on another node, so the same input can be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When these tasks finish, the JobTracker is notified. If other copies are still executing speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output.

Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.
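
As a hedged illustration (these are the classic MRv1 property names quoted above; newer releases also use mapreduce.map.speculative and mapreduce.reduce.speculative), a driver might disable speculation like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableSpeculation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Turn speculative execution off for both the map and reduce phases.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        Job job = Job.getInstance(conf, "job-without-speculation");
        System.out.println("Speculative execution disabled for " + job.getJobName());
    }
}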

What are the different vendor-specific distributions of Hadoop?

The different vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere, and Hortonworks (now part of Cloudera).

What are the different Hadoop configuration files?

The different Hadoop configuration files include:

  • hadoop-env.sh
  • mapred-site.xml
  • core-site.xml
  • yarn-site.xml
  • hdfs-site.xml
  • masters and slaves

What are the three modes in which Hadoop can run?

The three modes in which Hadoop can run are:

  1. Standalone mode: This is the default mode. It uses the local File System and a single Java process to run the Hadoop services.
  2. Pseudo-distributed mode: This uses a single-node Hadoop deployment to execute all Hadoop services.
  3. Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.

What are the differences between regular File System and HDFS?

  1. Regular file system: Data is maintained on a single system. If the machine crashes, data recovery is difficult because fault tolerance is low. Seek time is higher, so processing the data takes more time.
  2. HDFS: Data is distributed and maintained across multiple systems. If a DataNode crashes, the data can still be recovered from other nodes in the cluster. However, the time taken to read data is comparatively higher, because of the local disk reads and the coordination of data across multiple systems.

What are the two types of metadata that a NameNode server holds?

The two types of metadata that a NameNode server holds are:

  • Metadata in Disk – This contains the edit log and the FSImage
  • Metadata in RAM – This contains the information about DataNodes

What is the difference between a federation and high availability?

In HDFS Federation, there are multiple independent NameNodes, each managing a portion of the filesystem namespace; they do not coordinate with each other, and all of them share the common pool of DataNodes for storage. In high availability, there are two NameNodes for the same namespace, an active one and a standby; if the active NameNode fails, the standby takes over, so the cluster keeps running without downtime.

If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?

By default, the HDFS block size is 128 MB. Every block of a file, except the last one, will be 128 MB. For an input file of 350 MB, there are three input splits in total, with sizes of 128 MB, 128 MB, and 94 MB.

Why is HDFS fault-tolerant?

HDFS is fault-tolerant because it replicates data on different DataNodes. By default, a block of data is replicated on three DataNodes. The data blocks are stored in different DataNodes. If one node crashes, the data can still be retrieved from other DataNodes.

Explain the architecture of HDFS. 

The architecture of HDFS is described below.


For an HDFS service, we have a NameNode that has the master process running on one of the machines and DataNodes, which are the slave nodes.

NameNode

NameNode is the master service that hosts metadata in disk and RAM. It holds information about the various DataNodes, their location, the size of each block, etc.

DataNode

DataNodes hold the actual data blocks and periodically send block reports and heartbeats to the NameNode. A DataNode stores and retrieves blocks when asked by the NameNode or a client; it serves the client's read and write requests and performs block creation, deletion, and replication based on instructions from the NameNode.

  • Data that is written to HDFS is split into blocks, depending on its size. The blocks are randomly distributed across the nodes. With the auto-replication feature, these blocks are auto-replicated across multiple machines with the condition that no two identical blocks can sit on the same machine.
  • As soon as the cluster comes up, the DataNodes start sending their heartbeats to the NameNode every three seconds. The NameNode stores this information; in other words, it starts building metadata in RAM, which contains information about the DataNodes that are available. This metadata is maintained in RAM, as well as on disk.

How does rack awareness work in HDFS?

HDFS rack awareness refers to the NameNode's knowledge of the different DataNodes and how they are distributed across the racks of a Hadoop cluster.


By default, each block of data is replicated three times on various DataNodes present on different racks. Two identical blocks cannot be placed on the same DataNode. When a cluster is “rack-aware,” all the replicas of a block cannot be placed on the same rack. If a DataNode crashes, you can retrieve the data block from different DataNodes.

How can you restart NameNode and all the daemons in Hadoop?

The following commands will help you restart NameNode and all the daemons:

You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using the ./sbin/hadoop-daemon.sh start namenode command.

You can stop all the daemons with the ./sbin/stop-all.sh command and then start them using the ./sbin/start-all.sh command.

Which command will help you find the status of blocks and FileSystem health?

To check the status of the blocks, use the command:

hdfs fsck <path> -files -blocks

To check the health status of FileSystem, use the command:

hdfs fsck / -files -blocks -locations > dfs-fsck.log

What would happen if you store too many small files in a cluster on HDFS?

Storing a large number of small files on HDFS generates a lot of metadata. Keeping all this metadata in the NameNode's RAM becomes a challenge, because each file, block, or directory takes about 150 bytes of metadata, so the cumulative size of the metadata becomes too large.

How do you copy data from the local system onto HDFS? 

The following command will copy data from the local file system onto HDFS:

hadoop fs -copyFromLocal [source] [destination]

Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv

In the above syntax, the source is the local path and the destination is the HDFS path. Adding the -f flag forces the copy and overwrites the destination file in HDFS if it already exists.
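
For reference, a minimal Java sketch of the same copy using the FileSystem API (the paths are the hypothetical ones from the example above; the overwrite flag mirrors -f):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // delSrc = false keeps the local file; overwrite = true mirrors the -f flag.
        fs.copyFromLocalFile(false, true,
                new Path("/tmp/data.csv"), new Path("/user/test/data.csv"));
        fs.close();
    }
}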

When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?

The commands below are used to refresh the node information while commissioning, or when the decommissioning of nodes is completed.

dfsadmin -refreshNodes

This is used to run the HDFS client and it refreshes node configuration for the NameNode.

rmadmin -refreshNodes

This is used to perform administrative tasks for ResourceManager.

Is there any way to change the replication of files on HDFS after they are already written to HDFS?

Yes, the following are ways to change the replication of files on HDFS:

We can change the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hdfs-site.xml file (hadoop-site.xml in older releases), which applies that replication factor to any new content written afterwards.

If you want to change the replication factor for a particular file or directory, use:

$HADOOP_HOME/bin/hadoop fs -setrep -w 4 /path/to/file

Example: $HADOOP_HOME/bin/hadoop fs -setrep -w 4 /user/temp/test.csv
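
The same change can also be made programmatically; a minimal sketch using the FileSystem API (the file path is reused from the example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Schedule the existing file to be re-replicated with a factor of 4.
        boolean scheduled = fs.setReplication(new Path("/user/temp/test.csv"), (short) 4);
        System.out.println("Replication change scheduled: " + scheduled);
        fs.close();
    }
}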

Who takes care of replication consistency in a Hadoop cluster and what do under/over replicated blocks mean?

In a cluster, it is always the NameNode that takes care of replication consistency. The fsck command provides information regarding over- and under-replicated blocks.

Under-replicated blocks:

These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.

Consider a cluster with three nodes and replication set to three. If one of the DataNodes crashes, the blocks on it become under-replicated: a replication factor was set, but there are not enough replicas to match it. If the NameNode does not receive information about the replicas, it waits for a limited amount of time and then starts re-replicating the missing blocks from the available nodes.

Over-replicated blocks:

These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete excess replicas.

Consider a case of three nodes running with the replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node is back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of blocks from one of the nodes.

What is the distributed cache in MapReduce?

A distributed cache is a mechanism wherein the data coming from the disk can be cached and made available for all worker nodes. When a MapReduce program is running, instead of reading the data from the disk every time, it would pick up the data from the distributed cache to benefit the MapReduce processing.

To copy the file to HDFS, you can use the command:

hdfs dfs -put /user/Simplilearn/lib/jar_file.jar

To set up the application’s JobConf, use the command:

DistributedCache.addFileToClassPath(new Path("/user/Simplilearn/lib/jar_file.jar"), conf);

Then, add it to the driver class.
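
With the newer MapReduce API, the same idea is usually expressed on the Job object; a minimal sketch (the jar path is the hypothetical one used above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetupExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-example");
        // Ship the jar to every task node and put it on the task classpath.
        job.addFileToClassPath(new Path("/user/Simplilearn/lib/jar_file.jar"));
        // For plain side files (lookup data, configs), job.addCacheFile(URI) is used instead.
    }
}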

What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?

RecordReader

This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read.

Combiner

This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works on it, and then passes its output to the reducer phase.

Partitioner

The partitioner decides which reduce task each intermediate key-value pair is sent to; it controls the partitioning of the keys of the intermediate map (or combiner) outputs, typically by hashing the key, so that all values for the same key reach the same reducer. A minimal custom partitioner is sketched below.
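
A minimal custom partitioner sketch (a hypothetical class, not from any particular library) that routes keys to reducers by their first character:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : Character.toLowerCase(k.charAt(0));
        // Keys sharing a first letter land on the same reducer.
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).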

Why is MapReduce slower in processing data in comparison to other processing frameworks?

This is quite a common question in Hadoop interviews; let us understand why MapReduce is slower in comparison to the other processing frameworks:

MapReduce is slower because:

  • It is batch-oriented when it comes to processing data. Here, no matter what, you would have to provide the mapper and reducer functions to work on data.
  • During processing, whenever the mapper function delivers an output, it will be written to HDFS and the underlying disks. This data will be shuffled and sorted, and then be picked up for the reducing phase. The entire process of writing data to HDFS and retrieving it from HDFS makes MapReduce a lengthier process.
  • In addition to the above reasons, MapReduce uses Java, which is verbose and requires many lines of code even for simple jobs.

Is it possible to change the number of mappers to be created in a MapReduce job?

By default, you cannot change the number of mappers, because it is equal to the number of input splits. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.

For example, if you have a 1 GB file that is split into eight blocks of 128 MB each, there will be only eight mappers running on the cluster by default.

Name some Hadoop-specific data types that are used in a MapReduce program.

This is an important question, as you would need to know the different data types if you are getting into the field of Big Data.

For every common data type in Java, there is an equivalent Writable type in Hadoop. The following are some Hadoop-specific data types that you could use in your MapReduce program (a mapper using several of them is sketched after the list):

  • IntWritable
  • FloatWritable
  • LongWritable
  • DoubleWritable
  • BooleanWritable
  • ArrayWritable
  • MapWritable
  • ObjectWritable
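
The mapper sketch below (a standard word-count-style example, not tied to any particular dataset) shows several of these Writable types in use:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // LongWritable is the byte offset of the line; Text is the line itself.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // Text key, IntWritable value
            }
        }
    }
}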


How is identity mapper different from chain mapper?

IdentityMapper is the default mapper class in Hadoop; it performs no processing and simply writes the input key-value pairs straight to the output. ChainMapper allows multiple mapper classes to be used within a single map task, chained so that the output of one mapper becomes the input of the next.

What are the major configuration parameters required in a MapReduce program?

We need to have the following configuration parameters:

  • Input location of the job in HDFS
  • Output location of the job in HDFS
  • Input and output formats
  • Classes containing the map and reduce functions
  • JAR file containing the mapper, reducer, and driver classes

What do you mean by map-side join and reduce-side join in MapReduce?

In a map-side join, the datasets are joined in the map phase itself, before the data reaches the reducers; it is faster, but it requires the input datasets to be sorted and partitioned identically on the join key. In a reduce-side join, the mappers only tag each record with its source, and the actual join happens in the reduce phase after the shuffle and sort; it works for any datasets but is slower because all the data has to move across the network.

What is the role of the OutputCommitter class in a MapReduce job?

As the name indicates, OutputCommitter describes the commit of task output for a MapReduce job.

Example: the old-API class org.apache.hadoop.mapred.OutputCommitter is declared as

public abstract class OutputCommitter extends org.apache.hadoop.mapreduce.OutputCommitter

MapReduce relies on the OutputCommitter for the following:

  • Setting up the job during initialization
  • Cleaning up the job after completion
  • Setting up the task's temporary output
  • Checking whether a task needs a commit
  • Committing the task output
  • Discarding the task commit

Explain the process of spilling in MapReduce.

Spilling is a process of copying the data from memory buffer to disk when the buffer usage reaches a specific threshold size. This happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80 percent of the buffer size is filled.

For a 100 MB buffer, spilling will start once the content of the buffer reaches 80 MB. The relevant properties can be tuned as in the sketch below.
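
A hedged configuration sketch (Hadoop 2 property names; exact defaults can vary by distribution) that sets the buffer and spill threshold described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 100);            // 100 MB in-memory sort buffer
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill once 80% (80 MB) is used
        Job job = Job.getInstance(conf, "spill-tuning-example");
        System.out.println("Configured job: " + job.getJobName());
    }
}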

How can you set the mappers and reducers for a MapReduce job?

The number of mappers and reducers can be set in the command line using:

-D mapred.map.tasks=5 -D mapred.reduce.tasks=2

In the code, one can configure JobConf variables:

job.setNumMapTasks(5); // 5 mappers

job.setNumReduceTasks(2); // 2 reducers

What happens when a node running a map task fails before sending the output to the reducer?

If this ever happens, map tasks will be assigned to a new node, and the entire task will be rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called application master, which takes care of the execution of the application. If a task on a particular node failed due to the unavailability of a node, it is the role of the application master to have this task scheduled on another node.

Can we write the output of MapReduce in different formats?

Yes. Hadoop supports various input and output formats, such as the following (a driver sketch showing how one of them is selected appears after the list):

  • TextOutputFormat – This is the default output format and it writes records as lines of text.
  • SequenceFileOutputFormat – This is used to write sequence files when the output files need to be fed into another MapReduce job as input files.
  • MapFileOutputFormat – This is used to write the output as map files.
  • SequenceFileAsBinaryOutputFormat – This is a variant of SequenceFileOutputFormat. It writes keys and values to a sequence file in binary form.
  • DBOutputFormat – This is used for writing to relational databases and HBase; it sends the reduce output to a SQL table.
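
A driver sketch (hypothetical output path and key/value classes) showing how one of these formats is selected:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Write the reducer output as a sequence file so another job can read it directly.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/test/output"));
    }
}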

What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?

In Hadoop v1, MapReduce performed both data processing and resource management; there was only one master process for the processing layer known as JobTracker. JobTracker was responsible for resource tracking and job scheduling.

Managing jobs with a single JobTracker, and the resulting utilization of computational resources, was inefficient in MapReduce v1. The JobTracker was overburdened with job scheduling, monitoring, and resource management, which led to scalability, availability, and resource-utilization issues. In addition, non-MapReduce jobs could not run on v1.

To overcome this issue, Hadoop 2 introduced YARN as the processing layer. In YARN, there is a processing master called ResourceManager. In Hadoop v2, you have ResourceManager running in high availability mode. There are node managers running on multiple machines, and a temporary daemon called application master. Here, the ResourceManager is only handling the client connections and taking care of tracking the resources.

In Hadoop v2, the following features are available:

  • Scalability – You can have a cluster size of more than 10,000 nodes and you can run more than 100,000 concurrent tasks.
  • Compatibility – The applications developed for Hadoop v1 run on YARN without any disruption or availability issues.
  • Resource utilization – YARN allows the dynamic allocation of cluster resources to improve resource utilization.
  • Multitenancy – YARN can use open-source and proprietary data access engines, as well as perform real-time analysis and run ad-hoc queries.

Explain how YARN allocates resources to an application with the help of its architecture. 

There is a client/application/API that talks to the ResourceManager, which manages resource allocation in the cluster. The ResourceManager has two internal components, the Scheduler and the ApplicationsManager, and it is aware of the resources available on every NodeManager. The Scheduler allocates resources to the various applications running in parallel, based on their requirements, but it does not monitor or track the status of the applications.

The ApplicationsManager accepts job submissions and monitors and restarts ApplicationMasters in case of failure. The ApplicationMaster manages the resource needs of an individual application; it interacts with the Scheduler to acquire the required resources and with the NodeManagers to execute and monitor tasks. The NodeManager tracks the jobs running on its node and monitors each container's resource utilization.

A container is a collection of resources, such as RAM, CPU, or network bandwidth. It provides the rights to an application to use a specific amount of resources.

Let us have a look at the architecture of YARN:

 

Whenever a job submission happens, the ResourceManager requests the NodeManagers to hold some resources for processing, and the NodeManagers guarantee the containers that will be available for processing. Next, the ResourceManager starts a temporary daemon called the ApplicationMaster to take care of the execution. The ApplicationMaster, which the ApplicationsManager launches, runs in one of the containers; the other containers are used for execution. This is, briefly, how YARN takes care of allocation.

Which of the following has replaced JobTracker from MapReduce v1?

  1. NodeManager
  2. ApplicationManager
  3. ResourceManager
  4. Scheduler

The answer is ResourceManager. It is the name of the master process in Hadoop v2.

Write the YARN commands to check the status of an application and kill an application.

The commands are as follows:

a) To check the status of an application:

yarn application -status ApplicationID

b) To kill or terminate an application:

yarn application -kill ApplicationID
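
The same operations can be done from Java through the YarnClient API; a minimal sketch (assumes a recent Hadoop release where ApplicationId.fromString is available):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnAppControl {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // e.g. application_1650000000000_0001 passed on the command line
        ApplicationId appId = ApplicationId.fromString(args[0]);
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println("State: " + report.getYarnApplicationState());

        yarnClient.killApplication(appId);  // equivalent of: yarn application -kill
        yarnClient.stop();
    }
}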

Can we have more than one ResourceManager in a YARN-based cluster?

Yes, Hadoop v2 allows us to have more than one ResourceManager. You can have a high availability YARN cluster where you can have an active ResourceManager and a standby ResourceManager, where the ZooKeeper handles the coordination.

There can only be one active ResourceManager at a time. If an active ResourceManager fails, then the standby ResourceManager comes to the rescue.

What are the different schedulers available in YARN?

The different schedulers available in YARN are:

  • FIFO scheduler – This places applications in a queue and runs them in the order of submission (first in, first out). It is not desirable, as a long-running application might block small applications.
  • Capacity scheduler – A separate dedicated queue allows a small job to start as soon as it is submitted. Large jobs finish later than they would with the FIFO scheduler.
  • Fair scheduler – There is no need to reserve a set amount of capacity, since it dynamically balances resources between all running jobs.

What happens if a ResourceManager fails while executing an application in a high availability cluster?

In a high availability cluster, there are two ResourceManagers: one active and the other standby. If the active ResourceManager fails, the standby is elected as active and instructs the ApplicationMasters to abort. The ResourceManager then recovers its running state by taking advantage of the container statuses sent by all the NodeManagers.

In a cluster of 10 DataNodes, each having 16 GB RAM and 10 cores, what would be the total processing capacity of the cluster?

Every node in a Hadoop cluster has one or more processes running that need RAM, and the machine itself, with its Linux file system, has its own processes that consume a certain amount of RAM. Therefore, with 10 DataNodes, you need to set aside at least 20 to 30 percent of each node's resources for these overheads, cluster-management services, and so on. That leaves roughly 11 to 12 GB of RAM and six to seven cores available on every machine for processing; multiply that by 10, and that is the cluster's total processing capacity.

What happens if requested memory or CPU cores go beyond the size of container allocation?

If an application starts demanding more memory or more CPU cores than can fit into a container allocation, the application will fail, because the requested resources exceed the maximum container size.

Now that you have learned about HDFS, MapReduce, and YARN, let us move to the next section. We’ll go over questions about Hive, Pig, HBase, and Sqoop.

Before moving into the Hive interview questions, let us summarize what Hive is all about. Facebook developed Hive to overcome MapReduce's limitations: users found MapReduce challenging to code because not all of them were well-versed in programming languages, and they wanted a language similar to SQL, which was familiar to all of them. This gave rise to Hive.

Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. It uses a query language called HiveQL, which is similar to SQL. Hive works on structured data. Let us now have a look at a few Hive questions.

What are the different components of a Hive architecture?

The different components of the Hive are:

  • User Interface: This calls the execute interface to the driver and creates a session for the query. Then, it sends the query to the compiler to generate an execution plan for it
  • Metastore: This stores the metadata information and sends it to the compiler for the execution of a query
  • Compiler: This generates the execution plan. The plan is a DAG of stages, where each stage is either a metadata operation, an operation on HDFS, or a map/reduce job.
  • Execution Engine: This acts as a bridge between the Hive and Hadoop to process the query. Execution Engine communicates bidirectionally with Metastore to perform operations, such as create or drop tables.

What is a partition in Hive and why is partitioning required in Hive?

Partition is a process for grouping similar types of data together based on columns or partition keys. Each table can have one or more partition keys to identify a particular partition.

Partitioning provides granularity in a Hive table. It reduces the query latency by scanning only relevant partitioned data instead of the entire data set. We can partition the transaction data for a bank based on month — January, February, etc. Any operation regarding a particular month, say February, will only have to scan the February partition, rather than the entire table data.

Why does Hive not store metadata information in HDFS?

We know that Hive's data is stored in HDFS. However, the metadata is stored either locally or in an RDBMS; it is not stored in HDFS because HDFS read/write operations are time-consuming. Storing the metadata in a metastore backed by an RDBMS gives low latency and faster access.

What are the components used in Hive query processors?

The components used in Hive query processors are:

  • Parser
  • Semantic Analyzer
  • Execution Engine
  • User-Defined Functions
  • Logical Plan Generation
  • Physical Plan Generation
  • Optimizer
  • Operators
  • Type checking

Suppose there are several small CSV files present in /user/input directory in HDFS and you want to create a single Hive table from these files. The data in these files have the following fields: {registration_no, name, email, address}. What will be your approach to solve this, and where will you create a single Hive table for multiple smaller files without degrading the performance of the system?

Using the SequenceFile format and grouping these small files together to form a single sequence file solves this problem: the NameNode then tracks one large file instead of many small ones, and a single Hive table can be created on top of it. A sketch of the packing step follows.
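
A minimal sketch of the packing step, assuming the small files are readable from a local directory (/local/input is hypothetical) and the combined sequence file is keyed by file name:

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/user/input/combined.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            File[] csvFiles = new File("/local/input").listFiles();
            if (csvFiles == null) {
                return; // nothing to pack
            }
            for (File csv : csvFiles) {
                byte[] body = Files.readAllBytes(csv.toPath());
                // Key = original file name, value = raw file contents.
                writer.append(new Text(csv.getName()), new BytesWritable(body));
            }
        }
    }
}

The resulting sequence file can then back a single Hive table declared with STORED AS SEQUENCEFILE, so the NameNode tracks one large file instead of many small ones.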

Write a query to insert a new column (new_col INT) into a Hive table (h_table) at a position before an existing column (x_col).

Hive positions columns with the FIRST and AFTER clauses of ALTER TABLE ... CHANGE COLUMN (there is no BEFORE clause), so the new column is first added and then moved so that it sits just before x_col:

ALTER TABLE h_table ADD COLUMNS (new_col INT);

ALTER TABLE h_table CHANGE COLUMN new_col new_col INT AFTER col_before_x_col;

Here, col_before_x_col stands for whichever column currently precedes x_col; use FIRST instead of the AFTER clause if x_col is the first column of the table.

What are the key differences between Hive and Pig?

Hive uses a declarative, SQL-like language (HiveQL) and is mainly used for reporting and analysis of structured data, while Pig uses a procedural data-flow language (Pig Latin) and is mainly used by programmers for data preparation and transformation, including semi-structured data. Hive relies on a metastore for table schemas, whereas Pig defines its schemas within the script.

Pig is a scripting platform used for data processing. Let us have a look at a few questions involving Pig:

What are the different ways of executing a Pig script?

The different ways of executing a Pig script are as follows:

  • Grunt shell
  • Script file
  • Embedded script

What are the major components of a Pig execution environment?

The major components of a Pig execution environment are:

  • Pig Scripts: They are written in Pig Latin using built-in operators and UDFs, and submitted to the execution environment.
  • Parser: Completes type checking and checks the syntax of the script. The output of the parser is a Directed Acyclic Graph (DAG).
  • Optimizer: Performs optimization using merge, transform, split, etc. Optimizer aims to reduce the amount of data in the pipeline.
  • Compiler: Converts the optimized code into MapReduce jobs automatically.
  • Execution Engine: MapReduce jobs are submitted to execution engines to generate the desired results.

Explain the different complex data types in Pig.

Pig has three complex data types, which are primarily Tuple, Bag, and Map.

Tuple 

A tuple is an ordered set of fields that can contain different data types for each field. It is represented by parentheses ().

Example: (1,3)

Bag 

A bag is a set of tuples represented by curly braces {}.

Example: {(1,4), (3,5), (4,6)}

Map 

A map is a set of key-value pairs used to represent data elements. It is represented in square brackets [ ].

Example: [key#value, key1#value1,….]

What are the various diagnostic operators available in Apache Pig?

Pig has Dump, Describe, Explain, and Illustrate as the various diagnostic operators.

Dump 

The dump operator runs the Pig Latin scripts and displays the results on the screen.

Load the data using the “load” operator into Pig.

Display the results using the “dump” operator.

Describe 

Describe operator is used to view the schema of a relation.

Load the data using “load” operator into Pig

View the schema of a relation using “describe” operator

Explain 

Explain operator displays the physical, logical and MapReduce execution plans.

Load the data using “load” operator into Pig

Display the logical, physical and MapReduce execution plans using “explain” operator

Illustrate 

Illustrate operator gives the step-by-step execution of a sequence of statements.

Load the data using “load” operator into Pig

Show the step-by-step execution of a sequence of statements using “illustrate” operator

State the usage of the group, order by, and distinct keywords in Pig scripts.

The group statement collects various records with the same key and groups the data in one or more relations.

Example: Group_data = GROUP Relation_name BY AGE

The order statement is used to display the contents of relation in sorted order based on one or more fields.

Example: Relation_2 = ORDER Relation_name1 BY field_name (ASC|DESC)

The distinct statement removes duplicate records; it operates on entire records, not on individual fields.

Example: Relation_2 = DISTINCT Relation_name1

What are the relational operators in Pig?

The relational operators in Pig are as follows:

COGROUP

It joins two or more tables and then performs GROUP operation on the joined table result.

CROSS

This is used to compute the cross product (cartesian product) of two or more relations.

FOREACH

This will iterate through the tuples of a relation, generating a data transformation.

JOIN

This is used to join two or more tables in a relation.

LIMIT

This will limit the number of output tuples.

SPLIT

This will split the relation into two or more relations.

UNION

It will merge the contents of two relations.

ORDER

This is used to sort a relation based on one or more fields.

What is the use of filters in Apache Pig?

The FILTER operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.

Example: Filter the products with a sale quantity greater than 1,000

A = LOAD '/user/Hadoop/phone_sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);

B = FILTER A BY quantity > 1000;


Suppose there’s a file called “test.txt” having 150 records in HDFS. Write the PIG command to retrieve the first 10 records of the file.

To do this, we need to use the limit operator to retrieve the first 10 records from a file.

Load the data in Pig:

test_data = LOAD '/user/test.txt' USING PigStorage(',') AS (field1, field2,….);

Limit the data to first 10 records:

Limit_test_data = LIMIT test_data 10;

Now let’s have a look at questions from HBase. HBase is a NoSQL database that runs on top of Hadoop. It is a four-dimensional database in comparison to RDBMS databases, which are usually two-dimensional.

What are the key components of HBase?

This is one of the most common interview questions.

Region Server

A region server contains HBase tables that are divided horizontally into regions based on their key values. It runs on every worker node and decides the size of each region. Each region server is a worker node that handles the read, write, update, and delete requests from clients.


HMaster

This assigns regions to RegionServers for load balancing, and monitors and manages the Hadoop cluster. Whenever a client wants to change the schema and any of the metadata operations, HMaster is used.


ZooKeeper

This provides a distributed coordination service to maintain server state in the cluster. It tracks which servers are alive and available, and provides server failure notifications. Region servers send their statuses to ZooKeeper, indicating whether they are ready for read and write operations.


Explain what row keys and column families in HBase are.

The row key is a primary key for an HBase table. It also allows logical grouping of cells and ensures that all cells with the same row key are located on the same server.

Column families consist of a group of columns that are defined during table creation, and each column family has certain column qualifiers that a delimiter separates.
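
A short HBase client sketch (a hypothetical employee_table with a personal column family) showing where the row key and column family appear in a write:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee_table"))) {
            Put put = new Put(Bytes.toBytes("emp1001"));   // row key
            put.addColumn(Bytes.toBytes("personal"),       // column family
                    Bytes.toBytes("name"),                 // column qualifier
                    Bytes.toBytes("John"));
            table.put(put);
        }
    }
}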


Why do we need to disable a table in HBase and how do you do it?

The HBase table is disabled to allow modifications to its settings. When a table is disabled, it cannot be accessed through the scan command.

To disable the employee table, use the command:

disable 'employee_table'

To check if the table is disabled, use the command:

is_disabled 'employee_table'


Write the code needed to open a connection in HBase.

The following code is used to open a connection in HBase:

Configuration myConf = HBaseConfiguration.create();

HTableInterface usersTable = new HTable(myConf, "users");

What does replication mean in terms of HBase?

The replication feature in HBase provides a mechanism to copy data between clusters. This feature can be used as a disaster recovery solution that provides high availability for HBase.

The following commands alter the hbase1 table and set the replication_scope to 1. A replication_scope of 0 indicates that the table is not replicated.

disable 'hbase1'

alter 'hbase1', {NAME => 'family_name', REPLICATION_SCOPE => '1'}

enable 'hbase1'

Can you import/export in an HBase table?

Yes, it is possible to import and export tables from one HBase cluster to another.

HBase export utility:

hbase org.apache.hadoop.hbase.mapreduce.Export "table name" "target export location"

Example: hbase org.apache.hadoop.hbase.mapreduce.Export "employee_table" "/export/employee_table"

HBase import utility:

create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}

hbase org.apache.hadoop.hbase.mapreduce.Import "table name" "target import location"

Example: create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}

hbase org.apache.hadoop.hbase.mapreduce.Import "emp_table_import" "/export/employee_table"

What is compaction in HBase?

Compaction is the process of merging HBase store files into a single file. This is done to reduce the number of files on disk and the number of disk seeks needed during reads. Once the files are merged, the original files are deleted.

How does Bloom filter work?

The HBase Bloom filter is a mechanism to test whether an HFile contains a specific row or row-column cell. The Bloom filter is named after its creator, Burton Howard Bloom. It is a data structure that predicts whether a given element is a member of a set of data: it can say that an element is definitely not present, or that it may be present. These filters provide an in-memory index structure that reduces disk reads by skipping files that cannot contain the requested row.
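
Bloom filters are declared per column family; a hedged sketch using the HBase 2.x descriptor builders (the table and family names are hypothetical):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterTableExample {
    public static void main(String[] args) {
        // ROW bloom filters answer "might this HFile contain this row key?"
        ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("myfam"))
                .setBloomFilterType(BloomType.ROW)
                .build();
        TableDescriptor table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("employee_table"))
                .setColumnFamily(family)
                .build();
        System.out.println(table);
    }
}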

Does HBase have any concept of the namespace?

A namespace is a logical grouping of tables, analogous to a database in an RDBMS. You can think of an HBase namespace as corresponding to the schema of an RDBMS database.

To create a namespace, use the command:

create_namespace 'namespace name'

To list all the tables that are members of the namespace, use the command: list_namespace_tables 'default'

To list all the namespaces, use the command:

list_namespace

How does the Write Ahead Log (WAL) help when a RegionServer crashes?

If a RegionServer hosting a MemStore crashes, the data that existed in memory but had not yet been persisted is lost. HBase guards against that by writing to the WAL before the write completes. The HBase cluster keeps a WAL to record changes as they happen; if HBase goes down, replaying the WAL recovers data that was not yet flushed from the MemStore to an HFile.

What are the default file formats to import data using Sqoop?

The default file formats are Delimited Text File Format and SequenceFile Format. Let us understand each of them individually:

Delimited Text File Format

This is the default import format and can be specified explicitly using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows.

1,here is a message,2010-05-01

2,strive to learn,2010-01-01

3,third message,2009-11-12

SequenceFile Format

SequenceFile is a binary format that stores individual records in custom record-specific data types. These data types are manifested as Java classes, which Sqoop will automatically generate for you. This format supports the exact storage of all data in binary representations and is appropriate for storing binary data.

What is the importance of the eval tool in Sqoop?

The Sqoop eval tool allows users to execute user-defined queries against respective database servers and preview the result in the console.

Explain how Sqoop imports and exports data between RDBMS and HDFS with its architecture.

Sqoop import works as follows:

  1. Sqoop introspects the database to gather the necessary metadata (for example, primary key information).
  2. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.

Sqoop export works similarly, in the opposite direction:

  1. Sqoop introspects the database to gather the metadata of the target tables.
  2. Sqoop divides the input dataset (the Hadoop files) into splits and uses individual map tasks to push the splits to the RDBMS, exporting the Hadoop files back to RDBMS tables.

Write the HBase command to list the contents and update the column families of a table.

The following code is used to list the contents of an HBase table:

scan 'table_name'

Example: scan 'employee_table'

To update column families in the table, use the following command:

alter 'table_name', 'column_family_name'

Example: alter 'employee_table', 'emp_address'

What are catalog tables in HBase?

The catalog has two tables: hbase:meta and -ROOT-.

The catalog table hbase:meta exists as an HBase table, although it is filtered out of the HBase shell's list command. It keeps a list of all the regions in the system, and the location of hbase:meta itself is stored in ZooKeeper. The -ROOT- table (present in older HBase versions) keeps track of the location of the .META table.

What is hotspotting in HBase and how can it be avoided?

In HBase, all read and write requests should be uniformly distributed across all of the regions in the RegionServers. Hotspotting occurs when a given region serviced by a single RegionServer receives most or all of the read or write requests.

Hotspotting can be avoided by designing the row key so that the data being written is spread across multiple regions of the cluster. Below are the techniques for doing so (a salting sketch follows the list):

  • Salting
  • Hashing
  • Reversing the key
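
A small salting sketch (the bucket count and key format are arbitrary choices for illustration):

import java.nio.charset.StandardCharsets;

public class SaltedRowKey {
    private static final int BUCKETS = 8;

    // Prefix the natural key with a stable, hash-derived salt so that
    // sequential keys spread across several regions instead of one.
    public static byte[] salt(String naturalKey) {
        int bucket = (naturalKey.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        return String.format("%02d-%s", bucket, naturalKey)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(salt("2022-02-18T10:00:00"), StandardCharsets.UTF_8));
    }
}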

Suppose you have a database “test_db” in MySQL. Write the command to connect this database and import tables to Sqoop.

You connect to the database using its JDBC URL and credentials, and then import either a single table (such as test_demo) or all of its tables into HDFS, as shown below.
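
Representative commands (assuming MySQL on localhost and the root/cloudera credentials used elsewhere in this article) would look like this:

sqoop import --connect jdbc:mysql://localhost/test_db --username root --password cloudera --table test_demo --target-dir /user/test_demo -m 1

To import every table in the database at once:

sqoop import-all-tables --connect jdbc:mysql://localhost/test_db --username root --password cloudera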

Explain how to export a table back to RDBMS with an example.

Suppose there is a “departments” table in “retail_db” that is already imported into Sqoop and you need to export this table back to RDBMS.

i) Create a new dept table in the RDBMS (MySQL) to export into.

ii) Export the departments table to the dept table (a representative pair of commands follows).
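
A representative pair of commands (the MySQL connection details and the HDFS export directory are assumptions, matching the style of the other Sqoop examples here): first create the dept table in MySQL with the same columns as departments, and then run:

sqoop export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table dept --export-dir /user/cloudera/departments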

 

How will you update the columns that are already exported? Write a Sqoop command to show all the databases in the MySQL server?

To update a column of a table that has already been exported, we use the --update-key parameter.

The following is an example:

sqoop export --connect jdbc:mysql://localhost/dbname --username root --password cloudera --export-dir /input/dir --table test_demo --fields-terminated-by "," --update-key column_name

To show all the databases on the MySQL server, use:

sqoop list-databases --connect jdbc:mysql://localhost --username root --password cloudera

What is Codegen in Sqoop?

The Codegen tool in Sqoop generates the Data Access Object (DAO) Java classes that encapsulate and interpret imported records.

The following example generates Java code for an “employee” table in the “testdb” database.

$ sqoop codegen \

--connect jdbc:mysql://localhost/testdb \

--username root \

--table employee

Can Sqoop be used to convert data in different formats? If not, which tools can be used for this purpose?

Yes, Sqoop can be used to convert data into different formats, depending on the arguments used for importing, such as --as-textfile, --as-sequencefile, and --as-avrodatafile.

What do you know about the term “Big Data”?

Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, which is why special tools and methods are used to perform operations on such vast collections of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. It also allows companies to make better, data-backed business decisions.

What are the five V’s of Big Data?

The five V's of Big Data are as follows:

  • Volume – The amount of data, which is growing at a high rate, i.e., data volumes in the petabyte range
  • Velocity – The rate at which data grows; social media plays a major role in the velocity of growing data
  • Variety – The different types and formats of data, such as text, audio, and video
  • Veracity – The uncertainty of available data; veracity arises because high volumes of data often bring incompleteness and inconsistency
  • Value – Turning data into value; by turning the big data they access into value, businesses can generate revenue

5 V’s of Big Data

Note: This is one of the basic and significant questions asked in the big data interview. You can choose to explain the five V’s in detail if you see the interviewer is interested to know more. However, the names can even be mentioned if you are asked about the term “Big Data”.

Tell us how big data and Hadoop are related to each other.

Big data and Hadoop are almost synonymous terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make decisions.

Note: This question is commonly asked in a big data interview. You can go further to answer this question and try to explain the main components of Hadoop.

How is big data analysis helpful in increasing business revenue?

Big data analysis has become very important for businesses. It helps businesses differentiate themselves from others and increase their revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. It also enables businesses to launch new products based on customer needs and preferences. These factors help businesses earn more revenue, and thus companies are using big data analytics. Companies may see a significant increase of 5-20% in revenue by implementing big data analytics. Some popular companies using big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, and Bank of America.

Explain the steps to be followed to deploy a Big Data solution.

The following are the three steps followed to deploy a Big Data solution:

  1. Data Ingestion

The first step in deploying a big data solution is data ingestion, i.e., extraction of data from various sources. The data source may be a CRM like Salesforce, an enterprise resource planning system like SAP, an RDBMS like MySQL, or any other log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.

Steps of Deploying Big Data Solution

  2. Data Storage

After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (such as HBase). HDFS storage works well for sequential access, whereas HBase is suited to random read/write access.

  3. Data Processing

The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks, such as Spark, MapReduce, or Pig.

What is the role of the JDBC driver in the Sqoop setup? Is the JDBC driver enough to connect Sqoop to the database?

JDBC is a standard Java API for accessing relational databases, and Sqoop uses a JDBC driver for each database it connects to. Each database vendor is responsible for writing its own driver implementation that communicates with the corresponding database using its native protocol. Users need to download the drivers separately and install them in Sqoop prior to use.

JDBC driver alone is not enough to connect Sqoop to the database. We also need connectors to interact with different databases. A connector is a pluggable piece that is used to fetch metadata and allows Sqoop to overcome the differences in SQL dialects supported by various databases, along with providing optimized data transfer.


Define the respective components of HDFS and YARN.

The two main components of HDFS are:

  • NameNode – This is the master node that manages the metadata information for the data blocks within HDFS
  • DataNode/slave node – These are the nodes that act as slaves and store the actual data, which they serve and process on instructions from the NameNode

In addition to serving client requests, the NameNode can take on either of the following two roles:

  • CheckpointNode – It runs on a different host from the NameNode
  • BackupNode – It is a read-only NameNode that contains the file system metadata information, excluding the block locations

The two main components of YARN are:

  • ResourceManager – This component receives processing requests and allocates them to the respective NodeManagers, depending on processing needs
  • NodeManager – It executes tasks on every single DataNode

Why is Hadoop used for Big Data Analytics?

Data analysis has become one of the key parameters of business, so enterprises are dealing with massive amounts of structured, unstructured, and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major part with its capabilities of

  • Storage
  • Processing
  • Data collection

Moreover, Hadoop is open source and runs on commodity hardware, which makes it a cost-effective solution for businesses.

Do you prefer good data or good models? Why?

How to Approach: This is a tricky question, but it is commonly asked in big data interviews. It asks you to choose between good data and good models. As a candidate, you should try to answer it from your own experience. Many companies follow a strict process of evaluating data, meaning they have already selected their data models; in this case, having good data can be game-changing. The reverse also works, as a model is chosen based on good data.

As we already mentioned, answer it from your experience. However, don’t say that having both good data and good models is important as it is hard to have both in real life projects.

Will you optimize algorithms or code to make them run faster?

How to Approach: The answer to this question should always be “Yes.” Real world performance matters and it doesn’t depend on the data or model you are using in your project.

The interviewer might also be interested to know if you have had any previous experience in code or algorithm optimization. For a beginner, it obviously depends on which projects he worked on in the past. Experienced candidates can share their experience accordingly as well. However, be honest about your work, and it is fine if you haven’t optimized code in the past. Just let the interviewer know your real experience and you will be able to crack the big data interview.

How do you approach data preparation?

How to Approach: Data preparation is one of the crucial steps in big data projects. A big data interview may involve at least one question based on data preparation. When the interviewer asks you this question, he wants to know what steps or precautions you take during data preparation.

As you already know, data preparation is required to get necessary data which can then further be used for modeling purposes. You should convey this message to the interviewer. You should also emphasize the type of model you are going to use and reasons behind choosing that particular model. Last, but not the least, you should also discuss important data preparation terms such as transforming variables, outlier values, unstructured data, identifying gaps, and others.

How would you transform unstructured data into structured data?

How to Approach: Unstructured data is very common in big data. The unstructured data should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two. Once done, you can discuss the methods you use to transform one form into another. You might also share a real-world situation where you did this. If you have recently graduated, you can share information related to your academic projects.

By answering this question correctly, you are signaling that you understand the types of data, both structured and unstructured, and also have the practical experience to work with these. If you give an answer to this question specifically, you will definitely be able to crack the big data interview.

Which hardware configuration is most beneficial for Hadoop jobs?

Dual-processor or dual-core machines with 4 GB to 8 GB of RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies based on the project-specific workflow and process flow, and needs customization accordingly.

What happens when two users try to access the same file in the HDFS?

HDFS NameNode supports exclusive writes only. Hence, only the first user will receive the grant (lease) for file access, and the second user's request will be rejected.

How to recover a NameNode when it is down?

The following steps need to be executed to bring the Hadoop cluster up and running:

  1. Use the FsImage, which is the file system metadata replica, to start a new NameNode.
  2. Configure the DataNodes and the clients so that they acknowledge the newly started NameNode.
  3. Once the new NameNode has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will start serving clients.

In the case of large Hadoop clusters, the NameNode recovery process consumes a lot of time, which makes it an even more significant challenge during routine maintenance.

What do you understand by Rack Awareness in Hadoop?

It is an algorithm applied to the NameNode to decide how blocks and their replicas are placed. Depending on rack definitions, network traffic is minimized between DataNodes within the same rack. For example, with a replication factor of 3, two copies will be placed on one rack and the third copy on a separate rack.

What is the difference between “HDFS Block” and “Input Split”?

HDFS divides the input data physically into blocks for storage; each such physical division is known as an HDFS Block.

An Input Split is a logical division of the data, used by the mapper during the map operation.

Hadoop is one of the most popular Big Data frameworks, and if you are going for a Hadoop interview, prepare yourself with these basic-level interview questions for Big Data Hadoop. These questions will be helpful whether you are going for a Hadoop developer or a Hadoop admin interview.

What are the common input formats in Hadoop?

Below are the common input formats in Hadoop:

  • Text Input Format – The default input format in Hadoop, which reads files line by line.
  • Sequence File Input Format – Used to read files stored in the SequenceFile format.
  • Key Value Input Format – Used for plain text files (files broken into lines) where each line is split into a key and a value by a delimiter.

Explain some important features of Hadoop.

Hadoop supports the storage and processing of big data. It is the best solution for handling big data challenges. Some important features of Hadoop are –

  • Open Source – Hadoop is an open-source framework, which means it is available free of cost. Users are also allowed to change the source code as per their requirements.
  • Distributed Processing – Hadoop supports distributed processing of data, i.e., faster processing. The data in Hadoop HDFS is stored in a distributed manner, and MapReduce is responsible for the parallel processing of data.
  • Fault Tolerance – Hadoop is highly fault-tolerant. By default, it creates three replicas for each block on different nodes; this number can be changed according to requirements. So, we can recover the data from another node if one node fails. The detection of node failure and recovery of data are done automatically.
  • Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of any single machine, so the data stored in the Hadoop environment is not affected by the failure of a machine.
  • Scalability – Another important feature of Hadoop is scalability. It is compatible with commodity hardware, and we can easily add new hardware to the nodes.
  • High Availability – The data stored in Hadoop is available to access even after a hardware failure. In case of hardware failure, the data can be accessed from another path.

Explain the different modes in which Hadoop runs.

 Apache Hadoop runs in the following three modes –

  • Standalone (Local) Mode – By default, Hadoop runs in local mode, i.e., on a non-distributed, single node. This mode uses the local file system to perform input and output operations. It does not use HDFS, so it is mainly used for debugging. No custom configuration is needed for the configuration files in this mode.
  • Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a single node just like the standalone mode, but each daemon runs in a separate Java process. As all the daemons run on a single node, the same node acts as both master and slave.
  • Fully-Distributed Mode – In the fully-distributed mode, all the daemons run on separate individual nodes and thus form a multi-node cluster. There are different nodes for the master and slave roles.

Explain the core components of Hadoop.

 Hadoop is an open-source framework that is meant for storage and processing of big data in a distributed manner. The core components of Hadoop are –

  • HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop. The large data files running on a cluster of commodity hardware are stored in HDFS. It can store data in a reliable manner even when hardware fails.

Core Components of Hadoop

  • Hadoop MapReduce – MapReduce is the Hadoop layer that is responsible for data processing. Applications written with MapReduce process structured and unstructured data stored in HDFS. It handles the parallel processing of high volumes of data by dividing the work into independent tasks. The processing is done in two phases, Map and Reduce: the Map phase is where the main processing logic runs, and the Reduce phase aggregates and summarizes the mapped output.
  • YARN – YARN is the processing framework in Hadoop. It is used for resource management and supports multiple data processing engines, for example, real-time streaming and batch processing.

What are the configuration parameters in a “MapReduce” program?

The main configuration parameters in “MapReduce” framework are:

  • Input locations of Jobs in the distributed file system
  • Output location of Jobs in the distributed file system
  • The input format of data
  • The output format of data
  • The class which contains the map function
  • The class which contains the reduce function
  • JAR file which contains the mapper, reducer and the driver classes

What is a block in HDFS and what is its default size in Hadoop 1 and Hadoop 2? Can we change the block size?

Blocks are the smallest continuous units of data storage on a hard drive. In HDFS, blocks are distributed across the Hadoop cluster.

  • The default block size in Hadoop 1 is: 64 MB
  • The default block size in Hadoop 2 is: 128 MB

Yes, we can change the block size by using the dfs.block.size parameter (dfs.blocksize in newer releases), located in the hdfs-site.xml file.

What is Distributed Cache in the MapReduce Framework?

Distributed Cache is a feature of the Hadoop MapReduce framework used to cache files needed by applications. Hadoop copies the cached files to every node on which a map or reduce task of the job runs, so the tasks can read them as local files.
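
A minimal sketch of how a file might be cached and then read inside a task, using the Job and Mapper APIs; the HDFS path /apps/lookup/countries.txt and the class names are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheExample {

  // In the driver: register an HDFS file so the framework copies it to every task node;
  // the "#countries" fragment creates a local symlink with that name in the task directory
  public static void registerCacheFile(Job job) throws Exception {
    job.addCacheFile(new URI("/apps/lookup/countries.txt#countries"));
  }

  public static class LookupMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void setup(Context context) throws IOException {
      // The cached file is now readable as an ordinary local file
      try (BufferedReader reader = new BufferedReader(new FileReader("countries"))) {
        String line;
        while ((line = reader.readLine()) != null) {
          // populate an in-memory lookup structure (omitted) from each line
        }
      }
    }
  }
}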

What are the three running modes of Hadoop?

The three running modes of Hadoop are as follows:

  1. Standalone or local: This is the default mode and does not need any configuration. In this mode, all the following Hadoop components use the local file system and run in a single JVM –
  • NameNode
  • DataNode
  • ResourceManager
  • NodeManager
  2. Pseudo-distributed: In this mode, all the master and slave Hadoop services are deployed and executed on a single node.
  3. Fully distributed: In this mode, Hadoop master and slave services are deployed and executed on separate nodes.

Explain JobTracker in Hadoop

JobTracker is a JVM process in Hadoop to submit and track MapReduce jobs.

JobTracker performs the following activities in Hadoop in a sequence –

  • JobTracker receives the jobs that a client application submits.
  • JobTracker consults the NameNode to determine the location of the data.
  • JobTracker allocates TaskTracker nodes based on available slots.
  • It submits the work to the allocated TaskTracker nodes.
  • JobTracker monitors the TaskTracker nodes.
  • When a task fails, JobTracker is notified and decides how to reallocate the task.

It is not easy to crack a Hadoop developer interview, but preparation makes all the difference. If you are a fresher, learn the Hadoop concepts and prepare properly. Have a good knowledge of the different file systems, Hadoop versions, commands, system security, etc. Here are a few questions that will help you pass the Hadoop developer interview.

What are the different configuration files in Hadoop?

 The different configuration files in Hadoop are –

core-site.xml – This configuration file contains the Hadoop core configuration settings, for example, I/O settings common to MapReduce and HDFS. It also specifies the hostname and port of the default file system (fs.defaultFS).

mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name
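
For instance, a minimal mapred-site.xml that selects YARN as the execution framework contains just this property:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>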

hdfs-site.xml – This configuration file contains the configuration settings of the HDFS daemons. It also specifies the default block replication and permission checking on HDFS.

yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.

How can you achieve security in Hadoop?

Kerberos is used to achieve security in Hadoop. At a high level, there are three steps to access a service using Kerberos, and each step involves a message exchange with a server (a sample command-line session follows the list).

  1. Authentication – The client first authenticates itself to the authentication server, which issues a time-stamped TGT (Ticket-Granting Ticket) to the client.
  2. Authorization – The client then uses the received TGT to request a service ticket from the TGS (Ticket-Granting Server).
  3. Service Request – In the final step, the client uses the service ticket to authenticate itself to the server that hosts the service.
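
On a Kerberized cluster this flow can be seen from the command line; a typical session (the principal name is illustrative) looks like the following:

kinit hdfs-user@EXAMPLE.COM   # Authentication: obtain a time-stamped TGT from the KDC
klist                         # view the cached TGT and any service tickets
hadoop fs -ls /               # the Hadoop client uses the TGT to obtain a service ticket and access HDFS
kdestroy                      # discard the ticket cache when finished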

What is commodity hardware?

Commodity hardware is a low-cost, non-specialized system that does not promise high availability or premium quality. Such machines still need sufficient RAM, because the Hadoop services and tasks running on them require memory for execution. One doesn't require high-end hardware or supercomputers to run Hadoop; it can run on any commodity hardware.

How is NFS different from HDFS?

There are a number of distributed file systems that work in their own way. NFS (Network File System) is one of the oldest and most popular distributed file storage systems, whereas HDFS (Hadoop Distributed File System) is a more recent system designed to handle big data. The main differences between NFS and HDFS are as follows –

  • NFS stores data on a single dedicated machine and provides no built-in replication, so a hardware failure can make the data unavailable; HDFS distributes data as replicated blocks across a cluster of commodity machines and is therefore fault-tolerant.
  • NFS is suited to comparatively small amounts of data, whereas HDFS is designed to store very large files and to feed them to distributed processing frameworks such as MapReduce.

How do Hadoop MapReduce work?

There are two phases of MapReduce operation.

  • Map phase – In this phase, the input data is divided into splits and processed by map tasks, which run in parallel and produce intermediate key-value pairs.
  • Reduce phase – In this phase, the intermediate output of the map tasks is grouped by key and aggregated across the entire collection to produce the final result.

What is MapReduce? What is the syntax you use to run a MapReduce program?

MapReduce is a parallel programming model in Hadoop for processing large data sets, stored in HDFS, over a cluster of computers.

The syntax to run a MapReduce program is: hadoop jar <jar_file> <main_class> /input_path /output_path.

What are the Port Numbers for NameNode, Task Tracker, and Job Tracker?

  • NameNode – Port 50070
  • Task Tracker – Port 50060
  • Job Tracker – Port 50030

What are the different file permissions in HDFS for files or directory levels?

Hadoop distributed file system (HDFS) uses a specific permissions model for files and directories. Following user levels are used in HDFS –

  • Owner
  • Group
  • Others.

For each of the users mentioned above, the following permissions are applicable –

  • read (r)
  • write (w)
  • execute(x).

The above-mentioned permissions work differently for files and directories.

For files –

  • The r permission is for reading a file
  • The w permission is for writing a file.

For directories –

  • The r permission lists the contents of a specific directory.
  • The w permission creates or deletes a directory.
  • The x permission is for accessing a child directory.
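
These permissions are viewed and changed with the HDFS file shell; for example (the path, user, and group below are illustrative):

hdfs dfs -ls /user/analytics                       # shows permissions such as drwxr-x---
hdfs dfs -chmod 750 /user/analytics                # rwx for the owner, r-x for the group, none for others
hdfs dfs -chown analyst:analysts /user/analytics   # change the owner and group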

What are the basic parameters of a Mapper?

The basic parameters of a Mapper are

  • LongWritable and Text (the input key and value types)
  • Text and IntWritable (the output key and value types)
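
In code, these parameters appear as the generic type arguments of the Mapper class; a minimal illustrative declaration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value types: LongWritable (byte offset) and Text (the line);
// output key/value types: Text and IntWritable
public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws java.io.IOException, InterruptedException {
    context.write(new Text(line.toString()), new IntWritable(1));
  }
}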

The interviewer has more expectations from an experienced Hadoop developer, and thus the questions are a level up. So, if you have gained some experience, don't forget to cover command-based, scenario-based, and real-experience-based questions. Here we bring some sample interview questions for experienced Hadoop developers.

How to restart all the daemons in Hadoop?

To restart all the daemons, it is required to stop all the daemons first. The Hadoop directory contains sbin directory that stores the script files to stop and start daemons in Hadoop.

Use the ./sbin/stop-all.sh command to stop all the daemons and then use the ./sbin/start-all.sh command to start all the daemons again.

What is the use of jps command in Hadoop?

The jps command is used to check if the Hadoop daemons are running properly or not. This command shows all the daemons running on a machine i.e. Datanode, Namenode, NodeManager, ResourceManager etc.

Explain the process that overwrites the replication factors in HDFS.

 There are two methods to overwrite the replication factors in HDFS –

Method 1: On File Basis

In this method, the replication factor is changed on the basis of file using Hadoop FS shell. The command used for this is:

$ hadoop fs -setrep -w 2 /my/test_file

Here, test_file is the file whose replication factor will be set to 2.

Method 2: On Directory Basis

In this method, the replication factor is changed on directory basis i.e. the replication factor for all the files under a given directory is modified.

$ hadoop fs -setrep -w 5 /my/test_dir

Here, test_dir is the name of the directory; the replication factor for the directory and all the files in it will be set to 5.

What will happen with a NameNode that doesn’t have any data?

A NameNode without any data doesn’t exist in Hadoop. If there is a NameNode, it will contain some data in it or it won’t exist.

Explain NameNode recovery process.

The NameNode recovery process involves the below-mentioned steps to make Hadoop cluster running:

  • In the first step of the recovery process, a new NameNode is started using the file system metadata replica (FsImage).
  • The next step is to configure the DataNodes and Clients so that they acknowledge the new NameNode.
  • In the final step, the new NameNode starts serving clients once it has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes.

Note: Don’t forget to mention that this NameNode recovery process consumes a lot of time on large Hadoop clusters, which makes routine maintenance difficult. For this reason, using the HDFS High Availability architecture is recommended.

How Is Hadoop CLASSPATH essential to start or stop Hadoop daemons?

CLASSPATH includes necessary directories that contain jar files to start or stop Hadoop daemons. Hence, setting CLASSPATH is essential to start or stop Hadoop daemons.

However, setting up CLASSPATH every time is not the standard that we follow. Usually CLASSPATH is written inside /etc/hadoop/hadoop-env.sh file. Hence, once we run Hadoop, it will load the CLASSPATH automatically.

Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?

This is due to a performance limitation of the NameNode. The NameNode holds the metadata for every file, directory, and block in memory, and each of these objects consumes a small, fixed amount of memory (on the order of 150 bytes). Storing a large volume of data as many small files therefore generates a huge amount of metadata and exhausts the NameNode's memory, whereas the same volume stored as a few large files needs far less metadata. Hence, HDFS is suited to a smaller number of large files rather than a large number of small files.

Why do we need Data Locality in Hadoop? Explain.

Datasets in HDFS are stored as blocks in the DataNodes of the Hadoop cluster. During the execution of a MapReduce job, each Mapper processes a block (an input split). If the data does not reside on the node where the Mapper is executing, it has to be copied over the network from the DataNode that holds it to the Mapper's node.

Now, if a MapReduce job has more than 100 Mappers and each Mapper tries to copy data from another DataNode in the cluster at the same time, it causes serious network congestion, which is a big performance issue for the overall system. Hence, moving the computation close to the data is an effective and cost-effective solution, technically termed data locality in Hadoop. It helps increase the overall throughput of the system.

Data locality can be of three types:

  • Data local – In this type, the data and the mapper reside on the same node. This is the closest proximity of data and the most preferred scenario.
  • Rack Local – In this scenario, the mapper and the data reside on the same rack but on different DataNodes.
  • Different Rack – In this scenario, the mapper and the data reside on different racks.

If DFS can handle a large volume of data, then why do we need the Hadoop framework?

Hadoop is not only for storing large data but also for processing it. Though a DFS (Distributed File System) can also store the data, it lacks the features below –

  • It is not fault-tolerant
  • Data movement over the network depends on bandwidth.

What is SequenceFileInputFormat?

Hadoop uses a specific file format known as a Sequence file, which stores data as serialized key-value pairs. SequenceFileInputFormat is the input format used to read sequence files.
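
A minimal sketch of selecting this input format on a job (the input path is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SequenceInputExample {
  public static void configure(Job job) throws Exception {
    job.setInputFormatClass(SequenceFileInputFormat.class);           // read serialized key-value records
    FileInputFormat.addInputPath(job, new Path("/data/events.seq"));  // illustrative input path
  }
}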

What are the real-time industry applications of Hadoop?

Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and distributed computing of large volumes of data. It provides rapid, high performance, and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise. It is used in almost all departments and sectors today.

Here are some of the instances where Hadoop is used:

  • Managing traffic on streets
  • Streaming processing
  • Content management and archiving e-mails
  • Processing rat brain neuronal signals using a Hadoop computing cluster
  • Fraud detection and prevention
  • Advertisements targeting platforms are using Hadoop to capture and analyze click stream, transaction, video, and social media data
  • Managing content, posts, images, and videos on social media platforms
  • Analyzing customer data in real time for improving business performance
  • Public sector fields such as intelligence, defense, cyber security, and scientific research
  • Getting access to unstructured data such as output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data

How is Hadoop different from other parallel computing systems?

Hadoop is a distributed storage and processing framework that lets you store and handle massive amounts of data on a cluster of machines, handling data redundancy through replication.

The primary benefit of this is that since data is stored in several nodes, it is better to process it in a distributed manner. Each node can process the data stored on it instead of spending time on moving the data over the network.

On the contrary, in the relational database computing system, we can query data in real time, but it is not efficient to store data in tables, records, and columns when the data is huge.

Hadoop also provides a scheme to build a column database with Hadoop HBase for runtime queries on rows.

In what all modes Hadoop can be run?

Hadoop can be run in three modes:

  • Standalone mode: The default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml, core-site.xml, and hdfs-site.xml files. This mode works much faster than the other modes.
  • Pseudo-distributed mode (Single-node Cluster): In this mode, you need configuration for all the three files mentioned above. All daemons run on one node, and thus both Master and Slave nodes are the same.
  • Fully distributed mode (Multi-node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slaves.

Explain the major difference between HDFS block and InputSplit.

In simple terms, a block is the physical representation of data while split is the logical representation of data present in the block. Split acts as an intermediary between the block and the mapper.
Suppose we have two blocks:

Block 1: ii nntteell

Block 2: Tecklearn

Now, a map task will read Block 1 from ii to ll, but it does not know how to process the record that continues into Block 2. Here Split comes into play: it forms a logical grouping of Block 1 and Block 2 so that they are treated as a single block.

It then forms key–value pairs using InputFormat and RecordReader and sends them to the mappers for further processing, with InputSplit defining each mapper's share of the data. If you have limited resources, you can increase the split size to limit the number of maps. For instance, if there are 10 blocks totaling 640 MB (64 MB each) and resources are limited, you can set the split size to 128 MB. This will form logical groups of 128 MB, with only 5 maps executing at a time.

However, if splitting is disabled (for example, when the input format treats the file as non-splittable), the whole file forms one InputSplit and is processed by a single map, which consumes more time when the file is big.

What is distributed cache? What are its benefits?

Distributed cache in Hadoop is a service provided by the MapReduce framework to cache files when needed.

Once a file is cached for a specific job, Hadoop will make it available on each DataNode both in system and in memory, where map and reduce tasks are executing. Later, you can easily access and read the cache file and populate any collection (like array, hashmap) in your code.

Benefits of using distributed cache are as follows:

  • It distributes simple, read-only text/data files and/or complex types such as jars, archives, and others. These archives are then un-archived at the slave node.
  • Distributed cache tracks the modification timestamps of cache files, which ensures that the files are not modified while a job is executing.

Explain the difference between NameNode, Checkpoint NameNode, and Backup Node.

NameNode is the core of HDFS that manages the metadata—the information of which file maps to which block locations and which blocks are stored on which DataNode. In simple terms, it’s the data about the data being stored. NameNode supports a directory tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses the following files for namespace:

  • fsimage file: It keeps track of the latest Checkpoint of the namespace.
  • edits file: It is a log of the changes that have been made to the namespace since the last Checkpoint.

Checkpoint NameNode has the same directory structure as NameNode and creates Checkpoints for the namespace at regular intervals by downloading the fsimage and edits files and merging them in its local directory. The new image after merging is then uploaded to NameNode. There is a node similar to the Checkpoint NameNode, commonly known as the Secondary NameNode, but it does not support the ‘upload to NameNode’ functionality.

Backup Node provides similar functionality as Checkpoint, enforcing synchronization with NameNode. It maintains an up-to-date in-memory copy of the file system namespace and doesn’t require getting hold of changes after regular intervals. The Backup Node needs to save the current state in-memory to an image file to create a new Checkpoint

What are the most common input formats in Hadoop?

There are three most common input formats in Hadoop:

  • Text Input Format: Default input format in Hadoop
  • KeyValue Input Format: Used for plain text files where each line is split into a key and a value by a separator (a tab character by default)
  • Sequence File Input Format: Used for reading files in sequence

Define DataNode. How does NameNode tackle DataNode failures?

DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each DataNode sends a heartbeat message to notify the NameNode that it is alive. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, the NameNode considers that DataNode dead and starts replicating the blocks that were hosted on it onto other DataNodes. A BlockReport contains the list of all blocks on a DataNode, and using it the system re-replicates whatever was stored on the dead DataNode.

The NameNode manages the replication of the data blocks from one DataNode to another. In this process, the replicated data is transferred directly between DataNodes, so the data never passes through the NameNode.

What are the core methods of a Reducer?

The three core methods of a Reducer are as follows:

setup(): This method is used for configuring various parameters such as input data size and the distributed cache.

protected void setup(Context context)

reduce(): This is the heart of the Reducer; it is called once per key with the associated list of values for the reduce task.

protected void reduce(Key key, Iterable<Value> values, Context context)

cleanup(): This method is called only once, at the end of the task, to clean up temporary files.

protected void cleanup(Context context)
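
Putting the three methods together, a Reducer skeleton looks roughly as follows (types are chosen to match a word-count style job; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void setup(Context context) {
    // read configuration values or distributed-cache files once per task
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));   // called once per key
  }

  @Override
  protected void cleanup(Context context) {
    // release resources / remove temporary files once at the end of the task
  }
}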

What is a SequenceFile in Hadoop?

Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key–value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer, and Sorter classes. The three SequenceFile formats are as follows:

  1. Uncompressed key–value records
  2. Record compressed key–value records—only ‘values’ are compressed here
  3. Block compressed key–value records—both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable

What is the role of a JobTracker in Hadoop?

A JobTracker’s primary function is resource management (managing the TaskTrackers), tracking resource availability, and task life cycle management (tracking the tasks’ progress and fault tolerance).

  • It is a process that runs on a separate node, often not on a DataNode.
  • The JobTracker communicates with the NameNode to identify data location.
  • It finds the best TaskTracker nodes to execute the tasks on the given nodes.
  • It monitors individual TaskTrackers and submits the overall job back to the client.
  • It tracks the execution of MapReduce workloads local to the slave node.

 

What happens if you try to run a Hadoop job with an output directory that is already present?

It will throw an exception saying that the output file directory already exists.

To run the MapReduce job, you need to ensure that the output directory does not exist in the HDFS.

To delete the directory before running the job, we can use shell:

hadoop fs -rm -r /path/to/your/output/

Or the Java API:

FileSystem.get(conf).delete(outputDir, true);

How can you debug Hadoop code?

First, we should check the list of MapReduce jobs currently running. Next, we need to see that there are no orphaned jobs running; if yes, we need to determine the location of RM logs.

  1. Run the following command to locate the ResourceManager process:

ps -ef | grep -i ResourceManager

Then, look for the log directory in the displayed result. Find the job ID from the displayed list and check if there is any error message associated with that job.

  2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
  3. Log in to that node and run the command below:

ps -ef | grep -i NodeManager

  4. Then, examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.

How to configure Replication Factor in HDFS?

The hdfs-site.xml file is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all the files placed in HDFS.
We can also modify the replication factor on a per-file basis using the below:

Hadoop FS Shell: [training@localhost ~]$ hadoop fs -setrep -w 3 /my/file

Conversely, we can also change the replication factor of all the files under a directory:

[training@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir

How to compress a Mapper output not touching Reducer output?

To achieve this compression, we should set:

conf.setBoolean("mapreduce.map.output.compress", true);

conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);

What is the difference between Map-side Join and Reduce-side Join?

Map-side Join is performed while the data is being read by the map tasks, i.e., before the shuffle; it requires the input datasets to follow a strict structure (for example, to be sorted and partitioned in the same way).

On the other hand, Reduce-side Join (Repartitioned Join) is simpler than Map-side Join, since the input datasets need not be structured. However, it is less efficient, as it has to go through the sort and shuffle phases and therefore incurs network overhead.

How can you transfer data from Hive to HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

We can write our query for the data we want to import from Hive to HDFS. The output we receive will be stored in part files in the specified HDFS path.

What is fsck?

fsck stands for File System Check. It is an HDFS command used to check files for inconsistencies and problems; for example, if there are any missing blocks for a file, HDFS reports them through this command.
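
For example, a detailed health report of the whole namespace (or of any sub-path) can be produced with:

hdfs fsck / -files -blocks -locations

The output lists files with their blocks and block locations and flags missing, corrupt, or under-replicated blocks.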

What are the main differences between NAS (Network-attached storage) and HDFS?

 The main differences between NAS (Network-attached storage) and HDFS –

  • HDFS runs on a cluster of machines, while NAS runs on an individual dedicated machine. HDFS therefore keeps redundant copies (replicas) of data across the cluster, whereas NAS uses a different approach and does not replicate data in this way, so the degree of data redundancy is much lower.
  • In HDFS, data is stored as data blocks on the local drives of the machines in the cluster; in NAS, data is stored on dedicated hardware.

What is the Command to format the NameNode?

$ hdfs namenode -format

If you have considerable experience working in the Big Data world, you will be asked a number of questions in your Big Data interview based on your previous experience. These questions may be simply related to your experience or scenario-based. So, get prepared with these Big Data interview questions and answers –

Do you have any Big Data experience? If so, please share it with us.

How to Approach: There is no single correct answer here, as it is a subjective question and the answer depends on your previous experience. By asking this question during a big data interview, the interviewer wants to understand your previous experience and also to evaluate whether you fit the project requirement.

So, how will you approach the question? If you have previous experience, start with your duties in your past position and slowly add details to the conversation. Tell them about your contributions that made the project successful. This question is generally the 2nd or 3rd question asked in an interview. The later questions are based on this one, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.

So, this brings us to the end of the Big Data Hadoop Interview Questions blog. This Tecklearn ‘Top Big Data Hadoop Interview Questions and Answers’ post helps you with commonly asked questions if you are looking for a job in Big Data Hadoop or the Big Data domain. If you wish to learn Big Data Hadoop and build a career in the Big Data domain, then check out our interactive Big Data Hadoop Architect Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/bigdata-hadoop-architect-all-in-1-combo-course/

Big Data Hadoop-Architect (All in 1) Combo Training

About the Course

Tecklearn’s BigData Hadoop-Architect (All in 1) combo includes the following Courses:

  • BigData Hadoop Analyst
  • BigData Hadoop Developer
  • BigData Hadoop Administrator
  • BigData Hadoop Tester
  • Big Data Security with Kerberos

Why Should you take Big Data Hadoop Combo Training?

  • Average salary for a Hadoop Administrator ranges from approximately $104,528 to $141,391 per annum – Indeed.com
  • Average salary for a Spark and Hadoop Developer ranges from approximately $106,366 to $127,619 per annum – Indeed.com
  • Average salary for a Big Data Hadoop Analyst is $115,819– ZipRecruiter.com

What you will Learn in this Course?

Introduction

  • The Case for Apache Hadoop
  • Why Hadoop?
  • Core Hadoop Components
  • Fundamental Concepts

HDFS

  • HDFS Features
  • Writing and Reading Files
  • NameNode Memory Considerations
  • Overview of HDFS Security
  • Using the Namenode Web UI
  • Using the Hadoop File Shell

Getting Data into HDFS

  • Ingesting Data from External Sources with Flume
  • Ingesting Data from Relational Databases with Sqoop
  • REST Interfaces
  • Best Practices for Importing Data

YARN and MapReduce

  • What Is MapReduce?
  • Basic MapReduce Concepts
  • YARN Cluster Architecture
  • Resource Allocation
  • Failure Recovery
  • Using the YARN Web UI
  • MapReduce Version 1

Planning Your Hadoop Cluster

  • General Planning Considerations
  • Choosing the Right Hardware
  • Network Considerations
  • Configuring Nodes
  • Planning for Cluster Management

Hadoop Installation and Initial Configuration

  • Deployment Types
  • Installing Hadoop
  • Specifying the Hadoop Configuration
  • Performing Initial HDFS Configuration
  • Performing Initial YARN and MapReduce Configuration
  • Hadoop Logging

Installing and Configuring Hive, Impala, and Pig

  • Hive
  • Impala
  • Pig

Hadoop Clients

  • What is a Hadoop Client?
  • Installing and Configuring Hadoop Clients
  • Installing and Configuring Hue
  • Hue Authentication and Authorization

Cloudera Manager

  • The Motivation for Cloudera Manager
  • Cloudera Manager Features
  • Express and Enterprise Versions
  • Cloudera Manager Topology
  • Installing Cloudera Manager
  • Installing Hadoop Using Cloudera Manager
  • Performing Basic Administration Tasks Using Cloudera Manager

Advanced Cluster Configuration

  • Advanced Configuration Parameters
  • Configuring Hadoop Ports
  • Explicitly Including and Excluding Hosts
  • Configuring HDFS for Rack Awareness
  • Configuring HDFS High Availability

Hadoop Security

  • Why Hadoop Security Is Important
  • Hadoop’s Security System Concepts
  • What Kerberos Is and How it Works
  • Securing a Hadoop Cluster with Kerberos

Managing and Scheduling Jobs

  • Managing Running Jobs
  • Scheduling Hadoop Jobs
  • Configuring the Fair Scheduler
  • Impala Query Scheduling

Cluster Maintenance

  • Checking HDFS Status
  • Copying Data Between Clusters
  • Adding and Removing Cluster Nodes
  • Rebalancing the Cluster
  • Cluster Upgrading

Cluster Monitoring and Troubleshooting

  • General System Monitoring
  • Monitoring Hadoop Clusters
  • Common Troubleshooting Hadoop Clusters
  • Common Misconfigurations

Introduction to Pig

  • What Is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig

Basic Data Analysis with Pig

  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly-Used Functions

Processing Complex Data with Pig

  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-In Functions for Complex Data
  • Iterating Grouped Data

Multi-Dataset Operations with Pig

  • Techniques for Combining Data Sets
  • Joining Data Sets in Pig
  • Set Operations
  • Splitting Data Sets

Pig Troubleshooting and Optimization

  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Your Pig Jobs

Introduction to Hive and Impala

  • What Is Hive?
  • What Is Impala?
  • Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Querying with Hive and Impala

  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Differences Between Hive and Impala Query Syntax
  • Using Hue to Execute Queries
  • Using the Impala Shell

Data Management

  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results

Data Storage and Performance

  • Partitioning Tables
  • Choosing a File Format
  • Managing Metadata
  • Controlling Access to Data

Relational Data Analysis with Hive and Impala

  • Joining Datasets
  • Common Built-In Functions
  • Aggregation and Windowing

Working with Impala 

  • How Impala Executes Queries
  • Extending Impala with User-Defined Functions
  • Improving Impala Performance

Analyzing Text and Complex Data with Hive

  • Complex Values in Hive
  • Using Regular Expressions in Hive
  • Sentiment Analysis and N-Grams
  • Conclusion

Hive Optimization 

  • Understanding Query Performance
  • Controlling Job Execution Plan
  • Bucketing
  • Indexing Data

Extending Hive 

  • SerDes
  • Data Transformation with Custom Scripts
  • User-Defined Functions
  • Parameterized Queries

Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2

Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Modelling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching

Data Formats

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression

Data Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive

Capturing Data with Apache Flume

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Working with RDDs in Spark

  • A Closer Look at RDDs
  • Key-Value Pair RDDs
  • MapReduce
  • Other Pair RDD Operations

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging

Parallel Programming with Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Spark Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Preview: Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • Comparing Spark SQL with Impala

Hadoop Testing

  • Hadoop Application Testing
  • Roles and Responsibilities of Hadoop Testing Professional
  • Framework MRUnit for Testing of MapReduce Programs
  • Unit Testing
  • Test Execution
  • Test Plan Strategy and Writing Test Cases for Testing Hadoop Application

Big Data Testing

  • BigData Testing
  • Unit Testing
  • Integration Testing
  • Functional Testing
  • Non-Functional Testing
  • Golden Data Set

System Testing

  • Building and Set up
  • Testing SetUp
  • Solary Server
  • Non-Functional Testing
  • Longevity Testing
  • Volumetric Testing

Security Testing

  • Security Testing
  • Non-Functional Testing
  • Hadoop Cluster
  • Security-Authorization RBA
  • IBM Project

Automation Testing

  • Query Surge Tool

Oozie

  • Why Oozie
  • Installation Engine
  • Oozie Workflow Engine
  • Oozie security
  • Oozie Job Process
  • Oozie terminology
  • Oozie bundle

Got a question for us? Please mention it in the comments section and we will get back to you.
