Detailed understanding of Hadoop Distributed File System (HDFS)

Last updated on May 30 2022
Sanjay Grover

Table of Contents

Detailed understanding of Hadoop Distributed File System (HDFS)

What is HDFS

Hadoop comes with a distributed file system called HDFS. In HDFS data is distributed over several machines and replicated to ensure their durability to failure and high availability to parallel application.
It is cost effective as it uses commodity hardware. It involves the concept of blocks, data nodes and node name.

Where to use HDFS

o Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.
o Streaming Data Access: The time to read whole data set is more important than latency in reading the first. HDFS is built on write-once and read-many-times pattern.
o Commodity Hardware:It works on low cost hardware.

Where not to use HDFS

o Low Latency data access: Applications that require very less time to access the first data should not use HDFS as it is giving importance to whole data rather than time to fetch the first record.
o Lots Of Small Files:The name node contains the metadata of files in memory and if the files are small in size it takes a lot of memory for name node’s memory which is not feasible.
o Multiple Writes:It should not be used when we have to write multiple times.

HDFS Concepts

1. Blocks: A Block is the minimum amount of data that it can read or write.HDFS blocks are 128 MB by default and this is configurable.Files n HDFS are broken into block-sized chunks,which are stored as independent units.Unlike a file system, if the file is in HDFS is smaller than block size, then it does not occupy full block?s size, i.e. 5 MB of file stored in HDFS of block size 128 MB takes 5MB of space only.The HDFS block size is large just to minimize the cost of seek.
2. Name Node: HDFS works in master-worker pattern where the name node acts as master.Name Node is controller and manager of HDFS as it knows the status and the metadata of all the files in HDFS; the metadata information being file permission, names and location of each block.The metadata are small, so it is stored in the memory of name node,allowing faster access to data. Moreover the HDFS cluster is accessed by multiple clients concurrently,so all this information is handled bya single machine. The file system operations like opening, closing, renaming etc. are executed by it.
3. Data Node: They store and retrieve blocks when they are told to; by client or name node. They report back to name node periodically, with list of blocks that they are storing. The data node being a commodity hardware also does the work of block creation, deletion and replication as stated by the name node.
HDFS DataNode and NameNode Image:

bigData 14
bigData

HDFS Read Image:

bigData 15
bigData

HDFS Write Image:

bigData 16
bigData

Since all the metadata is stored in name node, it is very important. If it fails the file system can not be used as there would be no way of knowing how to reconstruct the files from blocks present in data node. To overcome this, the concept of secondary name node arises.
Secondary Name Node: It is a separate physical machine which acts as a helper of name node. It performs periodic check points.It communicates with the name node and take snapshot of meta data which helps minimize downtime and loss of data.

Starting HDFS

The HDFS should be formatted initially and then started in the distributed mode. Commands are given below.
To Format $ hadoop namenode -format
To Start $ start-dfs.sh

HDFS Basic File Operations

1. Putting data to HDFS from local file system
o First create a folder in HDFS where data can be put form local file system.
$ hadoop fs -mkdir /user/test
o Copy the file “data.txt” from a file kept in local folder /usr/home/Desktop to HDFS folder /user/ test
$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
o Display the content of HDFS folder
$ Hadoop fs -ls /user/test
2. Copying data from HDFS to local file system
o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
3. Compare the files and see that both are same
o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
Recursive deleting
o hadoop fs -rmr <arg>
Example:
o hadoop fs -rmr /user/sonoo/

HDFS Other commands

The below is used in the commands
“<path>” means any file or directory name.
“<path>…” means one or more file or directory names.
“<file>” means any filename.
“<src>” and “<dest>” are path names in a directed operation.
“<localSrc>” and “<localDest>” are paths as above, but on the local file system
o put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc><dest>
Identical to -put
o copyFromLocal <localSrc><dest>
Identical to -put
o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
o get [-crc] <src><localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
o cat <filen-ame>
Displays the contents of filename on stdout.
o moveToLocal <src><localDest>
Works like -get, but deletes the HDFS copy on success.
o setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
o test -[ezd] <path>
Returns 1 if path exists; has zero length; or is a directory or 0 otherwise.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

HDFS Features and Goals

The Hadoop Distributed File System (HDFS) is a distributed file system. It is a core part of Hadoop which is used for data storage. It is designed to run on commodity hardware.
Unlike other distributed file system, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle the application that contains large data sets.
Let’s see some of the important features and goals of HDFS.

Features of HDFS

o Highly Scalable – HDFS is highly scalable as it can scale hundreds of nodes in a single cluster.
o Replication – Due to some unfavorable conditions, the node containing the data may be loss. So, to overcome such problems, HDFS always maintains the copy of data on a different machine.
o Fault tolerance – In HDFS, the fault tolerance signifies the robustness of the system in the event of failure. The HDFS is highly fault-tolerant that if any machine fails, the other machine containing the copy of that data automatically become active.
o Distributed data storage – This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored into nodes.
o Portable – HDFS is designed in such a way that it can easily portable from platform to another.

Goals of HDFS

o Handling the hardware failure – The HDFS contains multiple server machines. Anyhow, if any machine fails, the HDFS goal is to recover it quickly.
o Streaming data access – The HDFS applications usually run on the general-purpose file system. This application requires streaming access to their data sets.
o Coherence Model – The application that runs on HDFS require to follow the write-once-ready-many approach. So, a file once created need not to be changed. However, it can be appended and truncate.
So, this brings us to the end of blog. This Tecklearn ‘Detailed understanding of Hadoop Distributed File Systems (HDFS) ’ helps you with commonly asked questions if you are looking out for a job in Big Data and Hadoop Domain.
If you wish to learn Hive and build a career in Big Data or Hadoop domain, then check out our interactive, Big Data Hadoop-Architect (All in 1) Combo Training, that comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

BigData Hadoop-Architect (All in 1) | Combo Course

Big Data Hadoop-Architect (All in 1) Combo Training

About the Course

Tecklearn’s Big Data Hadoop-Architect (All in 1) combo includes the following Courses:
• BigData Hadoop Analyst
• BigData Hadoop Developer
• BigData Hadoop Administrator
• BigData Hadoop Tester
• Big Data Security with Kerberos

Why Should you take Big Data Hadoop Combo Training?

• Average salary for a Hadoop Administrator ranges from approximately $104,528 to $141,391 per annum – Indeed.com
• Average salary for a Spark and Hadoop Developer ranges from approximately $106,366 to $127,619 per annum – Indeed.com
• Average salary for a Big Data Hadoop Analyst is $115,819– ZipRecruiter.com

What you will Learn in this Course?

Introduction
• The Case for Apache Hadoop
• Why Hadoop?
• Core Hadoop Components
• Fundamental Concepts
HDFS
• HDFS Features
• Writing and Reading Files
• NameNode Memory Considerations
• Overview of HDFS Security
• Using the Namenode Web UI
• Using the Hadoop File Shell
Getting Data into HDFS
• Ingesting Data from External Sources with Flume
• Ingesting Data from Relational Databases with Sqoop
• REST Interfaces
• Best Practices for Importing Data
YARN and MapReduce
• What Is MapReduce?
• Basic MapReduce Concepts
• YARN Cluster Architecture
• Resource Allocation
• Failure Recovery
• Using the YARN Web UI
• MapReduce Version 1
Planning Your Hadoop Cluster
• General Planning Considerations
• Choosing the Right Hardware
• Network Considerations
• Configuring Nodes
• Planning for Cluster Management
Hadoop Installation and Initial Configuration
• Deployment Types
• Installing Hadoop
• Specifying the Hadoop Configuration
• Performing Initial HDFS Configuration
• Performing Initial YARN and MapReduce Configuration
• Hadoop Logging
Installing and Configuring Hive, Impala, and Pig
• Hive
• Impala
• Pig
Hadoop Clients
• What is a Hadoop Client?
• Installing and Configuring Hadoop Clients
• Installing and Configuring Hue
• Hue Authentication and Authorization
Cloudera Manager
• The Motivation for Cloudera Manager
• Cloudera Manager Features
• Express and Enterprise Versions
• Cloudera Manager Topology
• Installing Cloudera Manager
• Installing Hadoop Using Cloudera Manager
• Performing Basic Administration Tasks Using Cloudera Manager
Advanced Cluster Configuration
• Advanced Configuration Parameters
• Configuring Hadoop Ports
• Explicitly Including and Excluding Hosts
• Configuring HDFS for Rack Awareness
• Configuring HDFS High Availability
Hadoop Security
• Why Hadoop Security Is Important
• Hadoop’s Security System Concepts
• What Kerberos Is and How it Works
• Securing a Hadoop Cluster with Kerberos
Managing and Scheduling Jobs
• Managing Running Jobs
• Scheduling Hadoop Jobs
• Configuring the Fair Scheduler
• Impala Query Scheduling
Cluster Maintenance
• Checking HDFS Status
• Copying Data Between Clusters
• Adding and Removing Cluster Nodes
• Rebalancing the Cluster
• Cluster Upgrading
Cluster Monitoring and Troubleshooting
• General System Monitoring
• Monitoring Hadoop Clusters
• Common Troubleshooting Hadoop Clusters
• Common Misconfigurations
Introduction to Pig
• What Is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig
Basic Data Analysis with Pig
• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-Used Functions
Processing Complex Data with Pig
• Storage Formats
• Complex/Nested Data Types
• Grouping
• Built-In Functions for Complex Data
• Iterating Grouped Data
Multi-Dataset Operations with Pig
• Techniques for Combining Data Sets
• Joining Data Sets in Pig
• Set Operations
• Splitting Data Sets
Pig Troubleshooting and Optimization
• Troubleshooting Pig
• Logging
• Using Hadoop’s Web UI
• Data Sampling and Debugging
• Performance Overview
• Understanding the Execution Plan
• Tips for Improving the Performance of Your Pig Jobs
Introduction to Hive and Impala
• What Is Hive?
• What Is Impala?
• Schema and Data Storage
• Comparing Hive to Traditional Databases
• Hive Use Cases
Querying with Hive and Impala
• Databases and Tables
• Basic Hive and Impala Query Language Syntax
• Data Types
• Differences Between Hive and Impala Query Syntax
• Using Hue to Execute Queries
• Using the Impala Shell
Data Management
• Data Storage
• Creating Databases and Tables
• Loading Data
• Altering Databases and Tables
• Simplifying Queries with Views
• Storing Query Results
Data Storage and Performance
• Partitioning Tables
• Choosing a File Format
• Managing Metadata
• Controlling Access to Data
Relational Data Analysis with Hive and Impala
• Joining Datasets
• Common Built-In Functions
• Aggregation and Windowing
Working with Impala
• How Impala Executes Queries
• Extending Impala with User-Defined Functions
• Improving Impala Performance
Analyzing Text and Complex Data with Hive
• Complex Values in Hive
• Using Regular Expressions in Hive
• Sentiment Analysis and N-Grams
• Conclusion
Hive Optimization
• Understanding Query Performance
• Controlling Job Execution Plan
• Bucketing
• Indexing Data
Extending Hive
• SerDes
• Data Transformation with Custom Scripts
• User-Defined Functions
• Parameterized Queries
Importing Relational Data with Apache Sqoop
• Sqoop Overview
• Basic Imports and Exports
• Limiting Results
• Improving Sqoop’s Performance
• Sqoop 2
Introduction to Impala and Hive
• Introduction to Impala and Hive
• Why Use Impala and Hive?
• Comparing Hive to Traditional Databases
• Hive Use Cases
Modelling and Managing Data with Impala and Hive
• Data Storage Overview
• Creating Databases and Tables
• Loading Data into Tables
• HCatalog
• Impala Metadata Caching
Data Formats
• Selecting a File Format
• Hadoop Tool Support for File Formats
• Avro Schemas
• Using Avro with Hive and Sqoop
• Avro Schema Evolution
• Compression
Data Partitioning
• Partitioning Overview
• Partitioning in Impala and Hive
Capturing Data with Apache Flume
• What is Apache Flume?
• Basic Flume Architecture
• Flume Sources
• Flume Sinks
• Flume Channels
• Flume Configuration
Spark Basics
• What is Apache Spark?
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark
Working with RDDs in Spark
• A Closer Look at RDDs
• Key-Value Pair RDDs
• MapReduce
• Other Pair RDD Operations
Writing and Deploying Spark Applications
• Spark Applications vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Configuring Spark Properties
• Logging
Parallel Programming with Spark
• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File-based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
Spark Caching and Persistence
• RDD Lineage
• Caching Overview
• Distributed Persistence
Common Patterns in Spark Data Processing
• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means
Preview: Spark SQL
• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• Comparing Spark SQL with Impala
Hadoop Testing
• Hadoop Application Testing
• Roles and Responsibilities of Hadoop Testing Professional
• Framework MRUnit for Testing of MapReduce Programs
• Unit Testing
• Test Execution
• Test Plan Strategy and Writing Test Cases for Testing Hadoop Application
Big Data Testing
• BigData Testing
• Unit Testing
• Integration Testing
• Functional Testing
• Non-Functional Testing
• Golden Data Set
System Testing
• Building and Set up
• Testing SetUp
• Solary Server
• Non-Functional Testing
• Longevity Testing
• Volumetric Testing
Security Testing
• Security Testing
• Non-Functional Testing
• Hadoop Cluster
• Security-Authorization RBA
• IBM Project
Automation Testing
• Query Surge Tool
Oozie
• Why Oozie
• Installation Engine
• Oozie Workflow Engine
• Oozie security
• Oozie Job Process
• Oozie terminology
• Oozie bundle
Got a question for us? Please mention it in the comments section and we will get back to you.

 

0 responses on "Detailed understanding of Hadoop Distributed File System (HDFS)"

Leave a Message

Your email address will not be published. Required fields are marked *