
Top Hadoop Administration Interview Questions and Answers

Last updated on Feb 18 2022
Rajnikanth S

What is Hadoop and its components?

When “Big Data” emerged as a problem, Hadoop evolved as the solution for it. It is a framework that provides various services and tools to store and process Big Data. It also helps analyze Big Data and make business decisions that are difficult to reach with traditional methods. Its two core components are a storage layer (HDFS) and a processing layer (YARN/MapReduce).

How do you read a file from HDFS?

The following are the steps for doing this:

  1. The client uses a Hadoop client program to make the request.
  2. The client program reads the cluster configuration file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
  3. The client contacts the NameNode and requests the file it would like to read.
  4. The client's identity is validated, either by username or by a strong authentication mechanism such as Kerberos.
  5. The client's validated request is checked against the owner and permissions of the file.
  6. If the file exists and the user has access to it, the NameNode responds with the first block ID and provides a list of datanodes on which a copy of the block can be found, sorted by their distance to the client (reader).
  7. The client then contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream. (A minimal command-line read is sketched below.)
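
From the administrator's side, this whole flow is triggered by a simple client command. A minimal sketch, assuming a hypothetical file path:

hdfs dfs -cat /user/analyst/sample.txt       # streams the file block by block from the datanodes
hdfs dfs -get /user/analyst/sample.txt .     # copies the same file to the local working directory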

Name the daemons required to run a Hadoop cluster?

Daemons required to run a Hadoop cluster:

  • DataNode – Stores the data in the Hadoop file system. A cluster contains more than one DataNode, with data replicated across them.
  • NameNode – The core of HDFS. It keeps the directory tree of all files present in the file system and tracks where the file data is kept across the cluster.
  • SecondaryNameNode – A specially dedicated node in the HDFS cluster that keeps checkpoints of the file system metadata present on the NameNode.
  • NodeManager – Responsible for launching and managing containers on a node, which execute tasks as specified by the ApplicationMaster.
  • ResourceManager – The master that manages the distributed applications running on the YARN system by arbitrating all the available cluster resources.

Explain checkpointing in Hadoop and why is it important?

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient Namenode recovery and restart and is an important indicator of overall cluster health.

The NameNode persists filesystem metadata. At a high level, the NameNode's primary responsibility is to store the HDFS namespace: things like the directory tree, file permissions, and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.

This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
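
Checkpoint frequency is tunable. A minimal hdfs-site.xml sketch, assuming Hadoop 2 property names (older releases use fs.checkpoint.period and fs.checkpoint.size instead):

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>   <!-- checkpoint at least once per hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>   <!-- or after one million uncheckpointed transactions -->
</property>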

How can I setup Hadoop nodes (data nodes/namenodes) to use multiple volumes/disks?

Datanodes can store blocks in multiple directories, typically located on different local disk drives. To set up multiple directories, specify a comma-separated list of pathnames as the value of the configuration parameter dfs.data.dir (dfs.datanode.data.dir in Hadoop 2). Datanodes will attempt to place equal amounts of data in each of the directories.

The namenode also supports multiple directories, which store the namespace image and edit logs. To set up multiple directories, specify a comma-separated list of pathnames as the value of the configuration parameter dfs.name.dir (dfs.namenode.name.dir in Hadoop 2). The namenode directories are used for namespace data replication, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
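
A minimal hdfs-site.xml sketch, assuming hypothetical mount points:

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
</property>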

What are schedulers and what are the three types of schedulers that can be used in Hadoop cluster?

Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:

  • FIFO (First in First Out) Scheduler
  • Fair Scheduler
  • Capacity Scheduler

How do you decide which scheduler to use?

The Capacity Scheduler (CS) can be used in the following situations:

  • When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
  • When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
  • When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
  • When you demand scheduler determinism.

The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:

  • When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
  • When you have a lot of variability in the utilization between pools, the Fair Scheduler's pre-emption model can achieve much greater overall cluster utilization by giving away otherwise reserved resources when they're not being used.
  • When you require jobs within a pool to make equal progress rather than running in FIFO order.

Why are ‘dfs.name.dir’ and ‘dfs.data.dir’ parameters used ? Where are they specified and what happens if you don’t specify these parameters?

dfs.name.dir specifies the path of the directory in the NameNode's local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in a DataNode's local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.

If these parameters are not specified, the NameNode's metadata and the DataNodes' block data get stored in /tmp (under a hadoop-<username> directory). This is not a safe place: when nodes are restarted the data will be lost, and it is critical if the NameNode is restarted, as the formatting information will be lost.

What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?

The filesystem checking utility FSCK is used to check and display the health of the file system and the files and blocks in it. When used with a path (bin/hadoop fsck <path> -files -blocks -locations -racks) it recursively shows the health of all files under the path, and when used with '/' it checks the entire file system. By default FSCK ignores files still open for writing by a client; to list such files, run FSCK with the -openforwrite option.

FSCK checks the file system, prints a dot for each healthy file found, and prints a message for the ones that are less than healthy, including those that have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks, or missing replicas.

What is the default block size in HDFS and what are the benefits of its large block size?

Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB (128 MB in Hadoop 2), and it can be set even larger. This allows HDFS to decrease the amount of metadata storage required per file. Furthermore, it allows fast streaming reads of data by keeping large amounts of data sequentially organized on disk. As a result, HDFS is expected to hold very large files that are read sequentially. Unlike a file system such as NTFS or EXT, which holds numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes, or gigabytes, each.

What are two main modules which help you interact with HDFS and what are they used for?

user@machine:hadoop$ bin/hadoop moduleName -cmd args…

The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.

 

The two modules relevant to HDFS are dfs and dfsadmin.

The dfs module, also known as ‘FsShell’, provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.
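
For example, both forms are standard Hadoop shell invocations:

bin/hadoop dfs -ls /          # FsShell: list the files at the root of HDFS
bin/hadoop dfsadmin -report   # dfsadmin: report on the state of the file system as a whole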

What is Rack awareness? And why is it necessary?

Rack awareness is about distributing data nodes across multiple racks. HDFS follows the rack awareness algorithm to place the data blocks. A rack holds multiple servers, and a cluster can span multiple racks. Say a Hadoop cluster is set up with 12 nodes: there could be 3 racks with 4 servers each, all connected so that the 12 nodes form one cluster. While deciding on the rack count, the important point to consider is the replication factor. If 100 GB of data flows in every day with a replication factor of 3, then 300 GB of data will have to reside on the cluster. It is a better option to have the data replicated across the racks, so that even if a node goes down, a replica is available in another rack.
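
Rack locations are not discovered automatically; the administrator supplies a topology script that maps each host to a rack. A minimal core-site.xml sketch, assuming a hypothetical script path (Hadoop 2 uses net.topology.script.file.name for the same purpose):

<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>   <!-- prints a rack id such as /rack1 for each host -->
</property>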

What is the default block size and how is it defined?

128 MB, and it is defined in hdfs-site.xml; it is also customizable depending on the volume of the data and the level of access. Say 100 GB of data flows in per day and gets segregated and stored across the cluster. How many blocks will that be? 800 blocks (1024*100/128, where 1024 converts GB to MB). There are two ways to customize the data block size, as shown below.

  1. On the command line, for example: hadoop fs -D dfs.blocksize=134217728 -put <local file> <hdfs path> (the value is in bytes; 134217728 bytes = 128 MB).
  2. In hdfs-site.xml, set the dfs.blocksize property (dfs.block.size in Hadoop 1) to the size in bytes.

If you change the default size to 512 MB because the data volume is huge, then the number of blocks generated will be 200 (1024*100/512).
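
A minimal hdfs-site.xml sketch for the second option, assuming the Hadoop 2 property name:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>   <!-- 128 MB, expressed in bytes -->
</property>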

How do you get the report of hdfs file system? About disk availability and no.of active nodes?

Command: sudo -u hdfs hdfs dfsadmin -report

This is the information it displays:

  1. Configured Capacity – Total capacity available to HDFS
  2. Present Capacity – The space currently available to HDFS after accounting for non-DFS usage and metadata (fsimage) storage
  3. DFS Remaining – The amount of storage space still available to HDFS to store more files
  4. DFS Used – The storage space that has been used up by HDFS
  5. DFS Used% – The same, expressed as a percentage
  6. Under replicated blocks – Number of under-replicated blocks
  7. Blocks with corrupt replicas – Number of blocks with corrupted replicas, if any
  8. Missing blocks
  9. Missing blocks (with replication factor 1)

What is Hadoop balancer and why is it necessary?

The data spread across the nodes is not always distributed in the right proportion, meaning the utilization of the nodes may not be balanced: one node might be over-utilized while another is under-utilized. This is costly when running any process, because jobs end up relying heavily on the over-used nodes. To solve this, the Hadoop balancer is used to balance the distribution of data across the nodes. Whenever the balancer is executed, data gets moved around so that under-utilized nodes get filled up and over-utilized nodes are freed up.
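
The balancer is typically invoked from the command line; the threshold (in percent) is how far a node's utilization may deviate from the cluster average before blocks are moved. For example:

hdfs balancer -threshold 10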

Difference between Cloudera and Ambari?

Cloudera Manager vs. Ambari:

  • Cloudera Manager is the administration tool for Cloudera distributions; Ambari is the administration tool for Hortonworks distributions.
  • Both monitor and manage the entire cluster and report on usage and any issues.
  • Cloudera Manager comes with Cloudera's paid service; Ambari is open source.

What are the main actions performed by the Hadoop admin?

Monitor health of the cluster – Many application pages have to be monitored whenever processes run (Job History Server, YARN ResourceManager, Cloudera Manager or Ambari depending on the distribution).

Turn on security – SSL or Kerberos.

Tune performance – Hadoop balancer.

Add new data nodes as needed – Infrastructure changes and configurations.

Optionally turn on the MapReduce Job History Server – Sometimes restarting services helps release cache memory; this is done when the cluster has no processes running.

What is Kerberos?

It is an authentication protocol that each service must use in order to run its processes securely, and it is recommended to enable it. Since we are dealing with distributed computing, it is always good practice to have encryption while accessing and processing data, because the nodes are connected and all information passes across a network. With Kerberos, Hadoop does not send passwords across the network; instead, passwords are used to compute encryption keys, and encrypted messages are exchanged between the client and the server. In simple terms, Kerberos lets the nodes prove their identity to one another in a secure manner, with encryption.

Configuration in core-site.xml:
hadoop.security.authentication = kerberos
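
Expressed as a minimal core-site.xml sketch (authorization is usually enabled alongside authentication):

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>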

What is the important list of hdfs commands?

  • hdfs dfs -ls <hdfs path> – List the files in the HDFS filesystem
  • hdfs dfs -put <local file> <hdfs folder> – Copy a file from the local system to the HDFS filesystem
  • hdfs dfs -chmod 777 <hdfs file> – Give read, write, and execute permissions on the file
  • hdfs dfs -get <hdfs folder/file> <local filesystem> – Copy a file from the HDFS filesystem to the local filesystem
  • hdfs dfs -cat <hdfs file> – View the file content from the HDFS filesystem
  • hdfs dfs -rm <hdfs file> – Remove the file from the HDFS filesystem (it is moved to the trash path, like the recycle bin in Windows)
  • hdfs dfs -rm -skipTrash <hdfs file> – Remove the file permanently from the cluster
  • hdfs dfs -touchz <hdfs file> – Create an empty file in the HDFS filesystem

How to check the logs of a Hadoop job submitted in the cluster and how to terminate already running process?

yarn logs -applicationId <application_id> – The ApplicationMaster generates logs on its container, appended with the application ID it generates. This is helpful for monitoring the running status of the process and the log information.

yarn application -kill <application_id> – If an existing process running in the cluster needs to be terminated, the kill command is used, where the application ID identifies the job to terminate in the cluster.

What makes Hadoop an ideal choice for programmers, according to you?

Hadoop comes with many pros. It offers some of the best benefits to programmers compared with other frameworks. It makes it easy for programmers to write code reliably and to detect errors in it. It is purely based on Java, so there are no compatibility issues. As far as functions and distributed systems are concerned, Hadoop has become the number one choice of many programmers all over the world. In addition, handling bulk data very easily is another good thing about this framework.

What exactly do you know about the “Big Data” in the Hadoop?

Relational database management tools often fail to perform their tasks at some stages, which is common when they have to handle a large amount of data. Big Data is nothing but an array of complex data sets. It is an approach that makes it easy for businesses to get the maximum information from their data by properly searching, analyzing, sharing, transferring, capturing, and visualizing it.

Name the 5 Vs which are associated with the Hadoop Framework?

These are:

  • Volume
  • Velocity
  • Variety
  • Veracity
  • Value

What exactly do you know about the Hadoop Components? Tell why they are significant.

Hadoop is basically an approach that makes it easy for users to handle Big Data without facing any problems. Business decisions can be made quickly by getting the most useful information in no time. Hadoop is equipped with components that make it easy for users to keep up the pace even when the data is very large. It has two prime components:

  • Processing Framework (YARN/MapReduce)
  • Storage Unit (HDFS)

Give abbreviation for YARN and HDFS

 YARN stands for Yet Another Resource Negotiator
HDFS stands for Hadoop Distributed File System

Where exactly the Data is stored in the Hadoop in a distributed environment? On what topology does it base on?

Hadoop has a powerful data storage unit known as the Hadoop Distributed File System. Any form of data can be stored in it in the form of blocks. It makes use of a master and slave topology. If the need for extended storage arises, the storage can be scaled out to fulfill it; Hadoop is strong in this aspect.

What are the Name Node and Data Nodes in Hadoop?

These relate to storage in Hadoop. The Name Node is the master node and is responsible for maintaining the metadata information related to the different blocks and the factors associated with them. The Data Nodes are slave nodes, mainly responsible for the storage and management of the actual data.

What exactly do you know about the Resource and Node Manager in the Hadoop Framework?

Both the Resource Manager and the Node Manager are associated with YARN. The Resource Manager is responsible for receiving the requests related to data processing; it then passes them on to the corresponding Node Managers and ensures the processing takes place in a proper manner. On the other side, the Node Manager makes sure that tasks execute properly on every single Data Node.

Which node is responsible for storing and modifying the FSImage in the Hadoop technology?

The Secondary Name Node is responsible for this. It periodically downloads the FsImage and edit log from the Name Node, merges them into a new FsImage (a checkpoint), and ships the updated image back to the Name Node for it to use.

What do you mean by NAS? Compare it with HDFS

NAS stands for Network-Attached Storage and is generally regarded as a file-level storage server. It is connected to a network and is mainly responsible for providing access to a group of users. When it comes to storing and accessing the files, all the responsibility is borne by the NAS, which can be software or hardware. On the other side, HDFS is a distributed file system that runs on commodity hardware.

In the Hadoop technology, data can be stored in two different manners. Explain them and which one you prefer

Well, it is possible to store data in a distributed manner among all the machines within a cluster; the other approach is to choose a single dedicated machine. The distributed approach is the better option because the failure of one machine does not interrupt the entire functionality of the organization. Although a backup can be created for the dedicated machine, accessing the backup data and bringing it back onto the main server takes a lot of time. The distributed option is therefore the more reliable one, although all the machines within the cluster need to be protected in every aspect when confidential data or information is involved.

Can you compare HDFS and Relational Database Management System and bring some key differences?

When it comes to Hadoop, it does not matter whether the data that needs to be stored is structured, unstructured, or semi-structured; the schema of the data is also unknown to Hadoop. On the other side, an RDBMS always works with structured data and cannot process the other kinds, and the schema is always known to it. When it comes to processing capabilities, an RDBMS has a limited set of options, while Hadoop enables processing without any strict upper limit. Another key difference is that Hadoop is open source, while RDBMSs are licensed.

What exactly do you know about a Checkpoint? Which Name Node is responsible for performing the same?

The process of merging the edit log into the FsImage to produce a new FsImage is known as a checkpoint (or checkpointing). This operation saves time during NameNode startup, since it avoids replaying a long edit log. It is performed by the Secondary Name Node.

What is the default replication factor when the Name Node replicates the data to other nodes? Is it possible to change the same?

 The default replication factor is 3. Yes, it can be changed as per need.
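
The factor is controlled by the dfs.replication property in hdfs-site.xml, and it can also be changed per file from the shell. For example, assuming a hypothetical path:

hdfs dfs -setrep -w 2 /user/analyst/sample.txt   # change this file's replication factor to 2 and wait for it to complete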

Among the Name and Data Node, which one according to you have high memory space and Why?

The Name Node stores only metadata related to the different blocks, but it keeps this entire namespace in memory, which is why it needs a large amount of RAM. Data Nodes do not need large memory space; they mainly need large disk space for the blocks themselves.

Suppose you have two situations: a small amount of data distributed across a large number of files, and a large amount of data in a single file. In which situation would you use HDFS?

HDFS works more reliably with a large amount of data stored in a single large file. The Name Node keeps the namespace information in RAM, so it cannot cope with a very large number of files: the more files there are, the more metadata it has to hold, and it is practically impossible to store such a large volume of metadata in RAM.

What do you mean by the term “Block” in the Hadoop?

A block is the smallest unit of storage that HDFS works with: HDFS splits every file into blocks and stores each block as an independent unit across the cluster.

What exactly is the function of jps command in Hadoop?

Hadoop daemons must remain active the whole time a process is running; their failure causes a lot of challenges and issues. The jps command is used to check whether they are working properly or not.

What is Rack Awareness?

It is basically an algorithm that guides the Name Node on how blocks and their replicas are to be placed across racks. Its main aim is to limit the traffic in the network, and it also manages and controls the replicas of each block.

What are the modes in which you can run Hadoop?

 These are:

  1. Pseudo distributed Mode
  2. Fully Distributed Mode
  3. Standalone Mode

How will you handle the issue of a Data Node crashing frequently?

Hadoop runs on commodity hardware, which makes it easy to add or remove a data node when one crashes too frequently. Nodes can also easily be scaled out when data grows at a very quick rate.

What is the general limit on Meta-Data for a file, a directory, or a block that need to be stored on a Name Node?

A general rule is that the metadata for a file, a directory, or a block occupies roughly 150 bytes of Name Node memory, so the number of such objects must stay within what the Name Node's RAM can hold for proper functioning.

What is the default block size in Hadoop 1 and in Hadoop 2?

 In Hadoop 1 it is 64 MB while the same is 128 MB in case of Hadoop 2.

When the Schema validation is done in the Hadoop approach?

It is done mainly after the data is loaded; Hadoop follows a schema-on-read approach. Sometimes this even leads to bugs, but those can be managed at a later stage.

For what purpose Hadoop is a good option to consider and Why?

Hadoop is a good option to consider for OLAP systems, data discovery, and data analysis. Hadoop's features make bulk data handling very easy, and because all these tasks involve a lot of data, the Hadoop approach can be trusted for them.

What are the benefits of Hadoop 2 over Hadoop 1?

Both are good enough to be trusted, but some features of Hadoop 2 make it the better choice over Hadoop 1. One of the leading reasons is that with Hadoop 2 it is possible to run multiple applications at the same time without any issue, which was not possible in the earlier version. The data handling abilities of Hadoop 2 are also better, and in fact quicker, than those of Hadoop 1. In addition, processing goes through a Resource Manager, which arbitrates resources and improves reliability.

In which architecture Active and Passive Name Nodes are present and what role did they play?

They are both part of the HA (High Availability) architecture. The Active NameNode runs and serves the cluster. The Passive (Standby) NameNode is a standby that keeps the same data as the Active one and takes over only when the Active NameNode is not available; it can be considered a backup for the data held by the Active NameNode.

Is it possible to add or remove nodes in a Hadoop Cluster?

Yes, this can simply be done. It is one of the prime tasks of a Hadoop administrator.

Can multiple clients write to the same file in HDFS?

HDFS supports only exclusive writes. If one client is already writing a file and a request from another client arrives to write the same file, HDFS rejects the second request until the first client has finished.

What happens if a Data Node fails? How does the Name Node handle it in Hadoop?

A heartbeat signal is periodically sent to the Name Node by each Data Node, indicating that all is fine with the Data Node. If no signal is received for a while, the Data Node is considered dead. Using the replicas already created, the Name Node then re-replicates the blocks that were on the dead node onto other Data Nodes. There is no need to replace all of the data; only the failed blocks need to be considered.

What are the important configuration files that need to be updated/edited to setup a fully distributed mode of Hadoop cluster 1.x ( Apache distribution)?

The Configuration files that need to be updated to setup a fully distributed mode of Hadoop are:

  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • masters
  • slaves

These files can be found in your Hadoop conf directory. If Hadoop daemons are started individually using 'bin/hadoop-daemon.sh start xxxx', where xxxx is the name of the daemon, then the masters and slaves files need not be updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to start the appropriate daemons. If the Hadoop daemons are started using 'bin/start-dfs.sh' and 'bin/start-mapred.sh', then the masters and slaves configuration files on the namenode machine need to be updated:

masters – IP address/hostname of the node where the secondary namenode will run.

slaves – IP addresses/hostnames of the nodes where the datanodes (and eventually tasktrackers) will run.
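
For illustration, with hypothetical hostnames, the two files might look like this:

conf/masters:
snn.example.com

conf/slaves:
dn1.example.com
dn2.example.com
dn3.example.com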

How will you decide whether you need to use the Capacity Scheduler or the Fair Scheduler?

Fair Scheduling is the process in which resources are assigned to jobs such that all jobs get an equal share of resources over time. The Fair Scheduler can be used under the following circumstances:

  i) If you want the jobs to make equal progress instead of following the FIFO order, use Fair Scheduling.
  ii) If you have slow connectivity and data locality plays a vital role and makes a significant difference to the job runtime, use Fair Scheduling.
  iii) Use Fair Scheduling if there is a lot of variability in the utilization between pools.

The Capacity Scheduler runs the Hadoop MapReduce cluster as a shared, multi-tenant cluster to maximize the utilization and throughput of the Hadoop cluster. The Capacity Scheduler can be used under the following circumstances:

  i) If the jobs require scheduler determinism, the Capacity Scheduler can be useful.
  ii) The Capacity Scheduler's memory-based scheduling is useful if the jobs have varying memory requirements.
  iii) If you want to enforce resource allocation because you know the cluster utilization and workload very well, use the Capacity Scheduler.

What are the daemons required to run a Hadoop cluster?

NameNode, DataNode, TaskTracker and JobTracker

How will you restart a NameNode?

The easiest way of doing this is to run the scripts that stop and start all daemons: first run stop-all.sh, then restart the NameNode (along with the other daemons) by running start-all.sh.
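
To restart just the NameNode without touching the other daemons, the per-daemon script mentioned elsewhere in this post can also be used:

bin/hadoop-daemon.sh stop namenode
bin/hadoop-daemon.sh start namenode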

Explain about the different schedulers available in Hadoop.

  • FIFO Scheduler – This scheduler does not consider the heterogeneity in the system but orders the jobs based on their arrival times in a queue.
  • COSHH- This scheduler considers the workload, cluster and the user heterogeneity for scheduling decisions.
  • Fair Sharing-This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource. Each user can use their own pool to execute the jobs.

List few Hadoop shell commands that are used to perform a copy operation.

  • fs -put
  • fs -copyToLocal

What is jps command used for?

jps command is used to verify whether the daemons that run the Hadoop cluster are working or not. The output of jps command shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker.
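
The exact output depends on which daemons run on the machine; on a single-node Hadoop 1 setup it might look something like this (process IDs are illustrative):

4825 NameNode
4932 SecondaryNameNode
5041 DataNode
5150 JobTracker
5263 TaskTracker
5377 Jps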

What are the important hardware considerations when deploying Hadoop in production environment?

  • Memory-System’s memory requirements will vary between the worker services and management services based on the application.
  • Operating System – a 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.
  • Storage – It is preferable to design a Hadoop platform that moves the compute activity to the data, to achieve scalability and high performance.
  • Capacity – Large Form Factor (3.5”) disks cost less and allow storing more, compared to Small Form Factor disks.
  • Network – Two TOR switches per rack provide better redundancy.
  • Computational Capacity- This can be determined by the total number of MapReduce slots available across all the nodes within a Hadoop cluster.

How many NameNodes can you run on a single Hadoop cluster?

Only one.

What happens when the NameNode on the Hadoop cluster goes down?

The file system goes offline whenever the NameNode is down.

What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?

This file provides an environment for Hadoop to run and consists of the following variables-HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. JAVA_HOME variable should be set for Hadoop to run.

Apart from using the jps command is there any other way that you can check whether the NameNode is working or not.

Use the command: /etc/init.d/hadoop-0.20-namenode status.

In a MapReduce system, the HDFS block size is 64 MB and there are 3 files of size 127 MB, 64 KB and 65 MB with FileInputFormat. Under this scenario, how many input splits are likely to be made by the Hadoop framework?

2 splits each for the 127 MB and 65 MB files, and 1 split for the 64 KB file (5 splits in total).

Which command is used to verify if the HDFS is corrupt or not?

The hadoop fsck (file system check) command is used to check for missing and corrupt blocks.

List some use cases of the Hadoop Ecosystem

Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.

How can you kill a Hadoop job?

hadoop job -kill <job_id>

I want to see all the jobs running in a Hadoop cluster. How can you do this?

Using the command hadoop job -list gives the list of jobs running in a Hadoop cluster.

Is it possible to copy files across multiple clusters? If yes, how can you accomplish this?

Yes, it is possible to copy files across multiple Hadoop clusters, and this can be achieved using distributed copy. The DistCp command is used for intra- or inter-cluster copying.
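
A minimal sketch, assuming hypothetical namenode hostnames and the default RPC port:

hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/dest/path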

Which is the best operating system to run Hadoop?

Ubuntu or another Linux distribution is the most preferred operating system for running Hadoop. Windows can also be used, but it leads to several problems and is not recommended.

What are the network requirements to run Hadoop?

  • SSH is required, to launch server processes on the slave nodes.
  • A passwordless SSH connection is required between the master, the secondary machine, and all the slaves.

The mapred.output.compress property is set to true to make sure that all output files are compressed for efficient space usage on the Hadoop cluster. If a cluster user does not require compressed data for a particular job, what would you suggest that he do?

If the user does not want to compress the data for a particular job, he should create his own configuration file and set the mapred.output.compress property to false. This configuration file should then be loaded as a resource into the job.
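
The same override can also be passed on the command line, assuming the job's driver uses ToolRunner/GenericOptionsParser and hypothetical job and path names:

hadoop jar analytics-job.jar com.example.AnalyticsJob -D mapred.output.compress=false /input/path /output/path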

What is the best practice to deploy a secondary NameNode?

It is always better to deploy a secondary NameNode on a separate standalone machine. When the secondary NameNode is deployed on a separate machine it does not interfere with the operations of the primary node.

How often should the NameNode be reformatted?

The NameNode should never be reformatted. Doing so will result in complete data loss. NameNode is formatted only once at the beginning after which it creates the directory structure for file system metadata and namespace ID for the entire file system.

If Hadoop spawns 100 tasks for a job and one of the tasks fails, what does Hadoop do?

The task will be started again on a new TaskTracker, and if it fails more than 4 times (the default setting, which can be changed), the job will be killed.

How can you add and remove nodes from the Hadoop cluster?

  • To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file and then the DataNode and TaskTracker daemons should be started on the new node.
  • To remove or decommission nodes from the HDFS cluster, the hostnames should be added to the excludes file (and removed from the slaves file), and then dfsadmin -refreshNodes should be executed, as sketched below.
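
A minimal sketch of the decommission step, assuming the excludes file configured via dfs.hosts.exclude in hdfs-site.xml lives at the hypothetical path /etc/hadoop/conf/excludes:

echo "dn5.example.com" >> /etc/hadoop/conf/excludes   # hypothetical hostname of the node to decommission
hadoop dfsadmin -refreshNodes                         # NameNode re-reads the includes/excludes files and starts decommissioning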

You increase the replication level but notice that the data is under replicated. What could have gone wrong?

Nothing has actually gone wrong. If there is a huge volume of data, replication takes time proportional to the data size, because the cluster has to copy the data; it might take a few hours.

Explain about the different configuration files and where are they located.

The configuration files are located in “conf” sub directory. Hadoop has 3 different Configuration files- hdfs-site.xml, core-site.xml and mapred-site.xml

What daemons are needed to run a Hadoop cluster?

DataNode, NameNode, TaskTracker, and JobTracker are required to run Hadoop cluster.

Which OS are supported by Hadoop deployment?

The main OS used for Hadoop is Linux. However, with some additional software it can also be deployed on the Windows platform.

What are the common Input Formats in Hadoop?

Three widely used input formats are:

  1. Text Input: the default input format in Hadoop.
  2. Key Value: used for plain text files where each line is split into a key and a value.
  3. Sequence: used for reading sequence files.

What modes can Hadoop code be run in?

Hadoop can be deployed in

  1. Standalone mode
  2. Pseudo-distributed mode
  3. Fully distributed mode.

What is the main difference between RDBMS and Hadoop?

RDBMS is used for transactional systems to store and process data, whereas Hadoop can be used to store and process huge amounts of data.

What are the important hardware requirements for a Hadoop cluster?

There are no specific requirements for data nodes.

However, the namenode needs a specific amount of RAM to store the filesystem image in memory. This depends on the particular design of the primary and secondary namenode.

How would you deploy different components of Hadoop in production?

You need to deploy jobtracker and namenode on the master node then deploy datanodes on multiple slave nodes.

What do you need to do as Hadoop admin after adding new datanodes?

The Hadoop cluster will recognize the new datanodes automatically. However, to optimize cluster performance, you should start the balancer to redistribute the data evenly between the existing and new datanodes.

What are the Hadoop shell commands can use for copy operation?

The copy operation commands are:

fs -copyToLocal

fs -put

fs -copyFromLocal

What is the Importance of the namenode?

The role of the namenode is very crucial in Hadoop: it is the brain of the cluster. It is largely responsible for managing the distribution of blocks on the system, and it also supplies the specific addresses for the data blocks when a client makes a request.

What is rack awareness?

It is a method which decides how to place blocks based on the rack definitions. Hadoop will try to limit the network traffic between datanodes in the same rack and will only contact remote racks when it has to.

What is the use of ‘jps’ command?

The ‘jps’ command helps us find out whether the Hadoop daemons are running or not. It displays all the Hadoop daemons running on the machine, such as namenode, datanode, node manager, resource manager, etc.

Name some of the essential Hadoop tools for effective working with Big Data?

Hive, HBase, HDFS, ZooKeeper, NoSQL, Lucene/Solr, Avro, Oozie, Flume, Clouds, and SQL are some of the Hadoop tools that enhance the performance of Big Data.

How many times do you need to reformat the namenode?

The namenode only needs to be formatted once, in the beginning. After that, it is never formatted again; in fact, reformatting the namenode would lead to the loss of all the data on it.

What is speculative execution?

If a node is executing a task more slowly than expected, the master node redundantly executes another instance of the same task on another node. The task that finishes first is accepted, and the other one is killed. This process is known as “speculative execution.”
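
Speculative execution can be switched on or off per job or cluster-wide. A minimal mapred-site.xml sketch using the Hadoop 1 property names (Hadoop 2 uses mapreduce.map.speculative and mapreduce.reduce.speculative):

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>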

What is Big Data?

Big data is a term which describes the large volume of data. Big data can be used to make better decisions and strategic business moves.

What is the main difference between an “Input Split” and “HDFS Block”?

“Input Split” is the logical division of the data while the “HDFS Block” is the physical division of the data.

Explain how you will restart a NameNode?

The easiest way of doing this is to run the script that stops all running daemons: run stop-all.sh, then restart the NameNode by running start-all.sh.

What happens when the NameNode is down?

If the NameNode is down, the file system goes offline.

 Is it possible to copy files between different clusters? If yes, How can you achieve this?

Yes, we can copy files between multiple Hadoop clusters. This can be done using distributed copy.

Is there any standard method to deploy Hadoop?

No, there is no standard procedure to deploy data using Hadoop. There are a few general requirements for all Hadoop distributions, but the specific methods will always differ for each Hadoop admin.

What is distcp?

DistCp is a Hadoop copy utility, mainly implemented as a MapReduce job that copies data. One of the key challenges in the Hadoop environment is copying data across clusters, and distcp uses multiple datanodes and map tasks to copy the data in parallel.

What is a checkpoint?

Checkpointing is a method which takes an FsImage and an edit log and compacts them into a new FsImage. Therefore, instead of replaying the edit log, the NameNode can load its final in-memory state directly from the FsImage. This is a much more efficient operation, and it reduces NameNode startup time.

So, this brings us to the end of the Hadoop Administration Interview Questions blog. This Tecklearn ‘Top Hadoop Administration Interview Questions and Answers’ post helps you with commonly asked questions if you are looking out for a job in Hadoop Administration or the Big Data domain. If you wish to learn Hadoop Administration and build a career in the Big Data domain, then check out our interactive Big Data Hadoop Administration Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/big-data-hadoop-administrator/

BigData Hadoop Administrator

About the Course

Big Data analysis is emerging as a key advantage in business intelligence for many organizations. Our Big Data and Hadoop Administrator training course lets you deep-dive into the concepts of Big Data, equipping you with the skills required for Hadoop administration roles. Tecklearn’s Hadoop Administration Certification Training will guide you to gain expertise in maintaining complex Hadoop Clusters. You will learn exclusive Hadoop Admin activities like Planning of the Cluster, Installation, Cluster Configuration, Cluster Monitoring and Tuning. This Hadoop Administration certification course includes fundamentals of Hadoop, Hadoop clusters, HDFS, MapReduce and HBase. The training will make you proficient in working with Hadoop clusters and deploying that knowledge on real-world projects.

Why Should you take Hadoop Administration?

  • Average salary for a Hadoop Administrator ranges from approximately $104,528 to $141,391 per annum – Indeed.com
  • Hadoop Market is expected to reach $99.31B by 2022 growing at a CAGR of 42.1% from 2015 – Forbes
  • Amazon, Cloudera, Data Stax, DELL, EMC2, IBM, Microsoft & other MNCs worldwide use Hadoop Administration

What you will Learn in this Course?

Introduction

  • The Case for Apache Hadoop
  • Why Hadoop?
  • Core Hadoop Components
  • Fundamental Concepts

HDFS (Hadoop Distributed File System)

  • HDFS Features
  • Writing and Reading Files
  • NameNode Memory Considerations
  • Overview of HDFS Security
  • Using the Namenode Web UI
  • Using the Hadoop File Shell

Getting Data into HDFS

  • Ingesting Data from External Sources with Flume
  • Ingesting Data from Relational Databases with Sqoop
  • REST Interfaces
  • Best Practices for Importing Data

YARN and MapReduce

  • What Is MapReduce?
  • Basic MapReduce Concepts
  • YARN Cluster Architecture
  • Resource Allocation
  • Failure Recovery
  • Using the YARN Web UI
  • MapReduce Version 1

Planning Your Hadoop Cluster

  • General Planning Considerations
  • Choosing the Right Hardware
  • Network Considerations
  • Configuring Nodes
  • Planning for Cluster Management

Hadoop Installation and Initial Configuration

  • Deployment Types
  • Installing Hadoop
  • Specifying the Hadoop Configuration
  • Performing Initial HDFS Configuration
  • Performing Initial YARN and MapReduce Configuration
  • Hadoop Logging

Installing and Configuring Hive, Impala, and Pig

  • Hive
  • Impala
  • Pig

Hadoop Clients

  • What is a Hadoop Client?
  • Installing and Configuring Hadoop Clients
  • Installing and Configuring Hue
  • Hue Authentication and Authorization

Cloudera Manager

  • The Motivation for Cloudera Manager
  • Cloudera Manager Features
  • Express and Enterprise Versions
  • Cloudera Manager Topology
  • Installing Cloudera Manager
  • Installing Hadoop Using Cloudera Manager
  • Performing Basic Administration Tasks Using Cloudera Manager

Advanced Cluster Configuration

  • Advanced Configuration Parameters
  • Configuring Hadoop Ports
  • Explicitly Including and Excluding Hosts
  • Configuring HDFS for Rack Awareness
  • Configuring HDFS High Availability

Hadoop Security

  • Why Hadoop Security Is Important
  • Hadoop’s Security System Concepts
  • What Kerberos Is and How it Works
  • Securing a Hadoop Cluster with Kerberos

Managing and Scheduling Jobs

  • Managing Running Jobs
  • Scheduling Hadoop Jobs
  • Configuring the Fair Scheduler
  • Impala Query Scheduling

Cluster Maintenance

  • Checking HDFS Status
  • Copying Data Between Clusters
  • Adding and Removing Cluster Nodes
  • Rebalancing the Cluster
  • Cluster Upgrading

Cluster Monitoring and Troubleshooting

  • General System Monitoring
  • Monitoring Hadoop Clusters
  • Common Troubleshooting Hadoop Clusters
  • Common Misconfigurations

Got a question for us? Please mention it in the comments section and we will get back to you.

 

 
