
Top Apache HBase Interview Questions and Answers

Last updated on Feb 18 2022
Sunder Rangnathan


Explain what HBase is.

HBase is a column-oriented database management system that runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language like SQL.

In HBase, a master node regulates the cluster, while region servers store portions of the tables and perform the work on the data.

Mention the key components of HBase.

  • Zookeeper: does the coordination work between the client and the HBase Master
  • HBase Master: monitors the Region Server
  • RegionServer: monitors the Region
  • Region: contains the in-memory data store (MemStore) and the HFile
  • Catalog Tables: consist of ROOT and META

Explain what HBase consists of.

  • HBase consists of a set of tables
  • Each table contains rows and columns, like a traditional database
  • Each table must contain an element defined as a primary key
  • An HBase column denotes an attribute of an object

Explain why to use HBase.

  • High-capacity storage system
  • Distributed design to cater to large tables
  • Column-oriented store
  • Horizontally scalable
  • High performance and availability
  • The base goal of HBase is millions of columns, thousands of versions, and billions of rows
  • Unlike HDFS (Hadoop Distributed File System), it supports random real-time CRUD operations

When should you use HBase?

  • Data size is huge: when you have millions of records to operate on
  • Complete redesign: when you move from an RDBMS to HBase, treat it as a complete redesign rather than merely a change of ports
  • SQL-less commands: you can do without RDBMS features such as transactions, inner joins, typed columns, etc.
  • Infrastructure investment: you need a large enough cluster for HBase to be really useful

Mention how many operational commands there are in HBase.

There are five types of operational commands in HBase (a Java sketch follows the list):

  • Get
  • Put
  • Delete
  • Scan
  • Increment
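
A minimal Java sketch of all five commands, assuming the HBase 1.x/2.x client API and a pre-existing table named "users" with a column family "cf" (both names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOpsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Put: add or update a cell
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Get: read a single row back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
            // Increment: atomically bump a counter cell
            table.increment(new Increment(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("visits"), 1L));
            // Scan: iterate over a range of rows
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
            // Delete: remove the row's cells (a tombstone marker is written)
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}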

Explain what WAL and HLog are in HBase.

WAL (Write Ahead Log) is similar to the MySQL BIN log; it records all the changes that occur in the data. It is a standard Hadoop sequence file, and it stores HLogKeys. These keys consist of a sequential number as well as the actual data and are used to replay data that had not yet been persisted after a server crash. So, in case of server failure, the WAL works as a lifeline and retrieves the lost data.

In HBase, what are column families?

Column families comprise the basic unit of physical storage in HBase, to which features like compression are applied.

Explain what the row key is.

The row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows the logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.
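
A small, hypothetical sketch of such a row key: combining a user ID with a reversed timestamp co-locates all of a user's rows and sorts them newest-first (Bytes is org.apache.hadoop.hbase.util.Bytes; the table setup from the earlier connection example is assumed):

// Composite row key: "<userId><Long.MAX_VALUE - timestamp>", so rows for
// user42 sort together, most recent first.
long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
byte[] rowKey = Bytes.add(Bytes.toBytes("user42"), Bytes.toBytes(reversedTs));
Put put = new Put(rowKey);
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("event"), Bytes.toBytes("login"));
table.put(put);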

Explain deletion in HBase. What are the three types of tombstone markers in HBase?

When you delete a cell in HBase, the data is not actually deleted; instead, a tombstone marker is set, making the deleted cells invisible. HBase deletes are actually removed during compactions.

There are three types of tombstone markers (a Java sketch follows the list):

  • Version delete marker: marks a single version of a column for deletion
  • Column delete marker: marks all the versions of a column for deletion
  • Family delete marker: marks all the columns of a column family for deletion
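
In the Java client, each marker type corresponds to a different Delete method. A brief sketch, assuming the 1.x/2.x client API (table, family, and qualifier names are hypothetical):

Delete delete = new Delete(Bytes.toBytes("row1"));
// Version delete marker: removes one specific version of cf:q
delete.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), 1234567890L);
// Column delete marker: removes all versions of cf:q
delete.addColumns(Bytes.toBytes("cf"), Bytes.toBytes("q"));
// Family delete marker: removes all columns of the family cf
delete.addFamily(Bytes.toBytes("cf"));
table.delete(delete);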

Explain how HBase actually deletes a row.

In HBase, whatever you write is stored from RAM to disk, and these disk writes are immutable, barring compaction. During the deletion process in HBase, major compactions remove the delete markers while minor compactions don't. A normal delete results in a delete tombstone marker; the deleted data it represents is removed during compaction.

Also, if you delete data and then add more data with an earlier timestamp than the tombstone timestamp, subsequent Gets may be masked by the delete/tombstone marker, and hence you will not receive the inserted value until after the major compaction.

Explain what happens if you alter the block size of a column family on an already occupied database.

When you alter the block size of a column family, the new data occupies the new block size while the old data remains within the old block size. During compaction, old data will take on the new block size. New files, as they are flushed, have the new block size, whereas existing data will continue to be read correctly. After the next major compaction, all data will have been transformed to the new block size.
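
A sketch of the corresponding HBase shell session ('mytable' and 'cf' are hypothetical names); the final major compaction rewrites the old files into the new block size:

hbase> disable 'mytable'
hbase> alter 'mytable', {NAME => 'cf', BLOCKSIZE => '131072'}
hbase> enable 'mytable'
hbase> major_compact 'mytable'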

Mention the differences between HBase and a relational database.

HBase                                     Relational Database
It is schema-less                         It is a schema-based database
It is a column-oriented data store        It is a row-oriented data store
It is used to store de-normalized data    It is used to store normalized data
It contains sparsely populated tables     It contains thin tables
Automated partitioning is done in HBase   There is no built-in support for partitioning

What is the HBaseFsck class?

HBase ships with a tool called hbck, which is implemented by the HBaseFsck class. It offers several command-line switches that influence its behavior.

What are the main key structures of HBase?

Row key and column key are the two most important key structures used in HBase.

Discuss how you can use filters in Apache HBase

Filters were introduced in Apache HBase 0.92. They help you conduct server-side filtering when accessing HBase over the HBase shell or Thrift.

Does HBase support an SQL-like syntax structure?

No, unfortunately, native SQL support for HBase is not available. However, by using Apache Phoenix, we can retrieve data from HBase through SQL queries.
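
As a hedged sketch, Phoenix exposes HBase through a standard JDBC driver, so plain SQL can be issued from Java. The ZooKeeper quorum "zkhost" and the "users" table are hypothetical, and the Phoenix client jar must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQueryDemo {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URL format: jdbc:phoenix:<zookeeper quorum>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name FROM users WHERE id = 'user42'")) {
            while (rs.next()) {
                System.out.println(rs.getString("name"));
            }
        }
    }
}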

Explain JMX concerning HBase

Java Management Extensions (JMX) is the standard technology for Java applications to export their status.

What is the use of Master Server?

The Master Server helps you assign a region to a region server. It also handles the load balancing of the cluster.

Define the Term Thrift

Apache Thrift is written in C++. It provides schema compilers for various programming languages like C++, Perl, PHP, Python, Ruby, and more.

Why use the HColumnDescriptor class?

The details regarding a column family, such as compression settings and the number of versions, are stored in an HColumnDescriptor.

What is a cell in HBase?

A cell in HBase is the smallest unit of an HBase table. It holds a piece of data in the form of a tuple {row, column, version}.

What is a Bloom filter?

HBase supports Bloom Filters, which help you improve the overall throughput of the cluster. An HBase Bloom Filter is a space-efficient mechanism to test whether an HFile includes a certain row or row-col cell.

What is the meaning of compaction in HBase?

At the time of heavy incoming writes, it is impossible to achieve optimal performance with just one file per store. HBase combines all these HFiles to reduce the number of disk seeks for every read. This process is known as compaction in HBase.

How will you implement joins in HBase?

HBase does not support joins directly, but by using MapReduce jobs, join queries can be implemented by retrieving data from multiple HBase tables.

Tell me about the types of HBase Operations?

Two types of HBase operations are:

  • Read Operation
  • Write Operation

What is the use of HBase HMaster?

Main responsibilities of a master are:

  1. Coordinating the region servers
  2. Admin functions

Which technique can you use in HBase to access HFile directly without the help of HBase?

To access an HFile directly without using HBase, we use the HFile.main() method.

Can the region server be located on all DataNodes?

Yes, Region Servers run on the same machines as DataNodes.

Name the filter which accepts the page size as the parameter in HBase

A filter named PageFilter accepts the page size as the parameter.


What are the key components of HBase?

The key components of HBase are Zookeeper, RegionServer and HBase Master.

Key components of HBase
Component Description
Region Server A table can be divided into several regions. A group of regions is served to the clients by a Region Server
HMaster It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
ZooKeeper ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.

When would you use HBase?

  • HBase is used in cases where we need random read and write operations, and it can perform a number of operations per second on large data sets.
  • HBase gives strong data consistency.
  • It can handle very large tables with billions of rows and millions of columns on top of commodity hardware clusters.

Define column families?

A column family is a collection of columns, whereas a row is a collection of column families.

Define standalone mode in HBase?

It is a default mode of HBase. In standalone mode, HBase does not use HDFS—it uses the local filesystem instead—and it runs all HBase daemons and a local ZooKeeper in the same JVM process.

What are decorating filters?

It is useful to modify, or extend, the behavior of a filter to gain additional control over the returned data. These types of filters are known as decorating filters. Examples include SkipFilter and WhileMatchFilter.

What is RegionServer?

A table can be divided into several regions. A group of regions is served to the clients by a Region Server.

What are the data manipulation commands of HBase?

The data manipulation commands of HBase are listed below (a short shell session follows the list):

  • put– Puts a cell value at a specified column in a specified row in a particular table.
  • get– Fetches the contents of a row or a cell.
  • delete– Deletes a cell value in a table.
  • deleteall– Deletes all the cells in a given row.
  • scan– Scans and returns the table data.
  • count– Counts and returns the number of rows in a table.
  • truncate– Disables, drops, and recreates a specified table.
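
A short shell session exercising these commands (the table "users" and the family "cf" are hypothetical):

hbase> put 'users', 'row1', 'cf:name', 'Alice'
hbase> get 'users', 'row1'
hbase> scan 'users'
hbase> count 'users'
hbase> delete 'users', 'row1', 'cf:name'
hbase> deleteall 'users', 'row1'
hbase> truncate 'users'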

Which code is used to open a connection in HBase?

The following code is used to open an HBase connection; here, users is my HBase table:

Configuration myConf = HBaseConfiguration.create();
HTable table = new HTable(myConf, "users");

What is the use of truncate command?

It is used to disable, drop and recreate the specified tables.

What happens when you issue a delete command in HBase?

Once you issue a delete command in HBase for a cell, column, or column family, it is not deleted instantly. Instead, a tombstone marker is inserted. A tombstone is specified data, stored along with the standard data, and it hides all the deleted data.

The actual data is deleted at the time of major compaction. In a major compaction, HBase merges and recommits the smaller HFiles of a region into a new HFile. In this process, the same column families are placed together in the new HFile, and deleted and expired cells are dropped. All results from scans and gets filter out the deleted cells.

What is the use of get() method?

The get() method is used to read the data from the table.

Define the difference between Hive and HBase?

Apache Hive is a data warehousing infrastructure built on top of Hadoop. It helps in querying data stored in HDFS for analysis using Hive Query Language (HQL), a SQL-like language that gets translated into MapReduce jobs. Hive performs batch processing on Hadoop.

Apache HBase is a NoSQL key/value store that runs on top of HDFS. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. HBase partitions the tables, and the tables are further split into column families.

Hive and HBase are two different Hadoop based technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database of Hadoop. We can use them together. Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from HBase to Hive and vice-versa.

Explain the data model of HBase.

HBase comprises:

  • Set of tables.
  • Each table consists of column families and rows.
  • Row key acts as a Primary key in HBase.
  • Any access to HBase tables uses this Primary Key.
  • Each column qualifier present in HBase denotes attributes corresponding to the object which resides in the cell.

What are different tombstone markers in HBase?

There are three types of tombstone markers in HBase:

  • Version Marker: marks only one version of a column for deletion.
  • Column Marker: marks the whole column (i.e., all versions) for deletion.
  • Family Marker: marks the whole column family (i.e., all the columns in the column family) for deletion.

HBase blocksize is configured on which level?

The blocksize is configured per column family and the default value is 64 KB. This value can be changed as per requirements.

Which command is used to run HBase Shell?

The ./bin/hbase shell command is used to run the HBase shell. Execute this command in the HBase directory.

Which command is used to show the current HBase user?

The whoami command is used to show the current HBase user.

What is the full form of MSLAB?

MSLAB stands for MemStore-Local Allocation Buffer. Whenever a request thread needs to insert data into a MemStore, it doesn't allocate the space for that data from the heap at large but rather from a memory arena dedicated to the target region.

Define LZO?

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that focuses on decompression speed.

What filters are available in Apache HBase?

The filters supported by HBase are listed below (a combined usage sketch follows the list):

  • ColumnPrefixFilter: takes a single argument, a column prefix. It returns only those key-values present in a column that starts with the specified column prefix.
  • TimestampsFilter: takes a list of timestamps. It returns those key-values whose timestamps match any of the specified timestamps.
  • PageFilter: takes one argument, a page size. It returns page-size number of rows from the table.
  • MultipleColumnPrefixFilter: takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes.
  • ColumnPaginationFilter: takes two arguments, a limit and an offset. It returns limit number of columns after offset number of columns. It does this for all the rows.
  • SingleColumnValueFilter: takes a column family, a qualifier, a comparison operator and a comparator. If the specified column is not found, all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted.
  • RowFilter: takes a comparison operator and a comparator. It compares each row key with the comparator using the comparison operator and if the comparison returns true, it returns all the key-values in that row.
  • QualifierFilter: takes a comparison operator and a comparator. It compares each qualifier name with the comparator using the comparison operator and if the comparison returns true, it returns all the key-values in that column.
  • ColumnRangeFilter: takes either minColumn, maxColumn, or both. Returns only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not. If you don’t want to set the minColumn or the maxColumn, you can pass in an empty argument.
  • ValueFilter: takes a comparison operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value.
  • PrefixFilter: takes a single argument, a prefix of a row key. It returns only those key-values present in a row that start with the specified row prefix.
  • SingleColumnValueExcludeFilter: takes the same arguments and behaves same as SingleColumnValueFilter. However, if the column is found and the condition passes, all the columns of the row will be omitted except for the tested column value.
  • ColumnCountGetFilter: takes one argument, a limit. It returns the first limit number of columns in the table.
  • InclusiveStopFilter: takes one argument, a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row.
  • DependentColumnFilter: takes two required arguments, a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp.
  • FirstKeyOnlyFilter: takes no arguments. Returns the key portion of the first key-value pair.
  • KeyOnlyFilter: takes no arguments. Returns the key portion of each key-value pair.
  • FamilyFilter: takes a comparison operator and comparator. It compares each family name with the comparator using the comparison operator and if the comparison returns true, it returns all the key-values in that family.
  • CustomFilter: You can create a custom filter by extending the Filter class.
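
As promised above, here is a brief sketch combining two of these filters with a FilterList (assuming the HBase 2.x Java client; table, family, and qualifier names are hypothetical):

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
// Keep only rows whose key starts with "user"...
filters.addFilter(new PrefixFilter(Bytes.toBytes("user")));
// ...and whose cf:age value is byte-wise greater than "30"
filters.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("age"),
        CompareOperator.GREATER, Bytes.toBytes("30")));
scan.setFilter(filters);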

While reading data from HBase, from which three places data will be reconciled before returning the value?

The read process will go through the following process sequentially:

  • For reading the data, the scanner first looks for the row cell in the block cache, where all the recently read key-value pairs are stored.
  • If the scanner fails to find the required result, it moves to the MemStore, which, as we know, is the write cache. There, it searches for the most recently written data that has not yet been dumped into an HFile.
  • At last, it uses Bloom filters and the block cache to load the data from the HFile.

Can you explain data versioning?

In addition to being a schema-less database, HBase is also versioned.

Every time you perform an operation on a cell, HBase implicitly stores a new version. Creating, modifying, and deleting a cell are all treated identically; they all create new versions. When a cell exceeds the maximum number of versions, the extra records are dropped during the next major compaction.

Instead of operating on an entire cell, you can operate on a specific version within that cell. Values within a cell are versioned, and a version is identified by its timestamp. If a version is not specified, the current timestamp is used to retrieve the latest version. The default number of cell versions is three.
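
A sketch of explicit versioning with the Java client (table and family names are hypothetical; readVersions assumes the HBase 2.x API, and table is obtained as in the earlier connection examples):

// Write two versions of cf:city with explicit timestamps
Put older = new Put(Bytes.toBytes("row1"));
older.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("city"), 1000L, Bytes.toBytes("Pune"));
table.put(older);
Put newer = new Put(Bytes.toBytes("row1"));
newer.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("city"), 2000L, Bytes.toBytes("Delhi"));
table.put(newer);

// Read back up to three versions; the newest ("Delhi") is returned first
Get get = new Get(Bytes.toBytes("row1"));
get.readVersions(3);
Result result = table.get(get);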

What is a Bloom filter and how does it help in searching rows?

HBase supports Bloom Filters to improve the overall throughput of the cluster. An HBase Bloom Filter is a space-efficient mechanism to test whether an HFile contains a specific row or row-col cell.

Without a Bloom Filter, the only way to decide if a row key is present in an HFile is to check the HFile's block index, which stores the start row key of each block in the HFile. Many rows can fall between two start keys, so HBase has to load the block and scan the block's keys to figure out whether that row key actually exists.
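
A sketch of enabling a row-level Bloom filter when creating a table, assuming the HBase 2.x builder API (table and family names are hypothetical; admin is an org.apache.hadoop.hbase.client.Admin obtained from a Connection):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("users"))
        .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                .setBloomFilterType(BloomType.ROW)   // per-HFile row-presence test
                .build())
        .build());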

Compare HBase & Cassandra

Criteria                HBase          Cassandra
Basis for the cluster   Hadoop         Peer-to-peer
Best suited for         Batch jobs     Data writes
The API                 REST/Thrift    Thrift

Give the name of the key components of HBase

The key components of HBase are Zookeeper, RegionServer, Region, Catalog Tables and HBase Master.

What is S3?

S3 stands for Simple Storage Service, and it is one of the file systems that can be used by HBase.


What is the reason for using HBase?

HBase is used because it provides random read and write operations, and it can perform a number of operations per second on large data sets.

In how many modes HBase can run?

There are two run modes of HBase, i.e., standalone and distributed.

Define the difference between Hive and HBase?

HBase supports record-level operations, but Hive does not support record-level operations.

Define column families?

It is a collection of columns, whereas a row is a collection of column families.


What are decorating filters?

It is useful to modify, or extend, the behavior of a filter to gain additional control over the returned data.

What is the full form of YCSB?

YCSB stands for Yahoo! Cloud Serving Benchmark.

What is the use of YCSB?

It can be used to run comparable workloads against different storage systems.


Which operating system is supported by HBase?

HBase supports those operating systems that support Java, such as Windows and Linux.

What is the most common file system of HBase?

The most common file system for HBase is HDFS, i.e., the Hadoop Distributed File System.

Define pseudo-distributed mode?

A pseudo-distributed mode is simply a distributed mode that is run on a single host.

What is the regionservers file?

It is a file (conf/regionservers) which lists the known region server names.

Define MapReduce.

MapReduce as a process was designed to solve the problem of processing data in excess of terabytes in a scalable way.

What are the operational commands of HBase?

Operational commands of HBase are Get, Delete, Put, Increment, and Scan.

Which code is used to open the connection in HBase?

The following code is used to open a connection:

Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");

Which command is used to show the version?

The version command is used to show the version of HBase.

Syntax – HBase> version

What is the use of the tools command?

This command is used to list the HBase surgery tools.

What is the use of shutdown command?

It is used to shut down the cluster.

What is the use of truncate command?

It is used to disable, drop, and recreate the specified tables.

Which command is used to run HBase Shell?

The $ ./bin/hbase shell command is used to run the HBase shell.

Which command is used to show the current HBase user?

The whoami command is used to show HBase user.

How do you delete a table with the shell?

To delete a table, first disable it, then drop it.

What is the use of InputFormat in the MapReduce process?

InputFormat splits the input data and then returns a RecordReader instance that defines the classes of the key and value objects and provides a next() method that is used to iterate over each input record.

What is the full form of MSLAB?

MSLAB stands for Memstore-Local Allocation Buffer.

Define LZO?

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed and is written in ANSI C.

What is HBaseFsck?

HBase comes with a tool called hbck which is implemented by the HBaseFsck class. It provides various command-line switches that influence its behavior.

What is REST?

REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.

Define Thrift?

Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.

What are the fundamental key structures of HBase?

The fundamental key structures of HBase are row key and column key.

What is JMX?

The Java Management Extensions technology is the standard for Java applications to export their status.

What is nagios?

Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.

What is the syntax of the describe command?

The syntax of the describe command is:

HBase> describe 'tablename'

What is the use of the exists command?

The exists command is used to check whether the specified table exists or not.

What is the use of MasterServer?

The MasterServer is used to assign a region to the region server and also to handle load balancing.

What is HBase Shell?

The HBase shell is a command-line tool (a JRuby-based shell) by which we communicate with HBase.

What is the use of ZooKeeper?

ZooKeeper is used to maintain the configuration information and communication between region servers and clients. It also provides distributed synchronization.

Define catalog tables in HBase?

Catalog tables are used to maintain the metadata information.

Define cell in HBase?

The cell is the smallest unit of an HBase table, which stores the data in the form of a tuple.

What is the use of the HColumnDescriptor class?

HColumnDescriptor stores the information about a column family, such as compression settings, the number of versions, etc.

What is the function of HMaster?

It is the master server, which is responsible for monitoring all RegionServer instances in a cluster.

How many compaction types are in HBase?

There are two types of Compaction i.e. Minor Compaction and Major Compaction.

Define HRegionServer in HBase

It is a RegionServer implementation which is responsible for managing and serving regions.

Which filter accepts the pagesize as the parameter in HBase?

PageFilter accepts the pagesize as the parameter.

Which method is used to access HFile directly without using HBase?

The HFile.main() method is used to access an HFile directly without using HBase.
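
For example (a hedged sketch; the HFile path is hypothetical), the same class doubles as a command-line pretty-printer:

$ ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -f /hbase/data/default/users/region/cf/hfile -p -m
# -f: the HFile to inspect, -p: print key/values, -m: print the file's meta block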

Which types of data can HBase store?

HBase can store any type of data that can be converted into bytes.

What is the use of Apache HBase?

Apache HBase is used when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

What are the features of Apache HBase?

  • Linear and modular scalability
  • Strictly consistent reads and writes
  • Automatic and configurable sharding of tables
  • Automatic failover support between RegionServers
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
  • Easy-to-use Java API for client access
  • Block cache and Bloom Filters for real-time queries
  • Query predicate push-down via server-side Filters
  • Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data encoding options
  • Extensible JRuby-based (JIRB) shell
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX

How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+?

In HBase 0.96, the project moved to a modular structure. Adjust your project's dependencies to rely upon the hbase-client module or another module as appropriate, rather than a single JAR. You can model your Maven dependency after one of the following, depending on your targeted version of HBase. See Section 3.5, "Upgrading from 0.94.x to 0.96.x" or Section 3.3, "Upgrading from 0.96.x to 0.98.x" for more information.

Maven dependency for HBase 0.98:

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>0.98.5-hadoop2</version>
</dependency>

Maven dependency for HBase 0.96:

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>0.96.2-hadoop2</version>
</dependency>

Maven dependency for HBase 0.94:

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>0.94.3</version>
</dependency>

How should I design my schema in HBase?

HBase schemas can be created or updated using ‘The Apache HBase Shell’ or by using ‘Admin in the Java API’.
Tables must be disabled when making ColumnFamily modifications, for example:

Configuration config = HBaseConfiguration.create();

Connection connection = ConnectionFactory.createConnection(config);

Admin admin = connection.getAdmin();

TableName table = TableName.valueOf("myTable");

admin.disableTable(table);

HColumnDescriptor cf1 = …;

admin.addColumn(table, cf1); // adding new ColumnFamily

HColumnDescriptor cf2 = …;

admin.modifyColumn(table, cf2); // modifying existing ColumnFamily

admin.enableTable(table);

What is the Hierarchy of Tables in Apache HBase?

The hierarchy for tables in HBase is as follows:

Tables >> Column Families >> Rows >> Columns >> Cells

When a table is created, one or more column families are defined as high-level categories for storing data corresponding to an entry in the table. As is suggested by HBase being “column-oriented”, column family data for all table entries, or rows, are stored together.
For a given (row, column family) combination, multiple columns can be written at the time the data is written. Therefore, two rows in an HBase table need not necessarily share the same columns, only column families. For each (row, column-family, column) combination HBase can store multiple cells, with each cell associated with a version, or timestamp corresponding to when the data was written. HBase clients can choose to only read the most recent version of a given cell, or read all versions.

How can I troubleshoot my HBase cluster?

Always start with the master log. Normally, it's just printing the same lines over and over again. If not, then there's an issue. Google or search-hadoop.com should return some hits for the exceptions you're seeing.

An error rarely comes alone in Apache HBase; usually, when something gets screwed up, what follows may be hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log up to where it all began. For example, one trick with RegionServers is that they will print some metrics when aborting, so grepping for Dump should get you around the start of the problem.

RegionServer suicides are 'normal', as this is what they do when something goes wrong. For example, if ulimit and max transfer threads (the two most important initial settings; see [ulimit] and dfs.datanode.max.transfer.threads) aren't changed, it will at some point become impossible for DataNodes to create new threads, which from the HBase point of view is seen as if HDFS were gone. Think about what would happen if your MySQL database was suddenly unable to access files on your local file system; well, it's the same with HBase and HDFS.

Another very common reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout. For more information on GC pauses, see the three-part blog post by Todd Lipcon.

Compare HBase with Cassandra?

Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL.

Both are designed to manage extremely large data sets. HBase documentation proclaims that an HBase database should have hundreds of millions or — even better — billions of rows. Anything less, and you’re advised to stick with an RDBMS.
Both are distributed databases, not only in how data is stored but also in how the data can be accessed. Clients can connect to any node in the cluster and access any data.

In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is, therefore, important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition — such as a column that stores the state field of a customer’s mailing address.

HBase lacks built-in support for secondary indexes but offers a number of mechanisms that provide secondary index functionality. These are described in HBase’s online reference guide and on HBase community.

Compare HBase with Hive?

Hive can help the SQL-savvy run MapReduce jobs. Since it's JDBC compliant, it also integrates with existing SQL-based tools. Running Hive queries could take a while, since they go over all of the data in the table by default. Nonetheless, the amount of data can be limited via Hive's partitioning feature. Partitioning allows running a filter query over data that is stored in separate folders and reading only the data which matches the query. It could be used, for example, to only process files created between certain dates, if the files include the date format as part of their name.

HBase works by storing data as key/value. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be deleted every now and then to clear space via HBase compactions). Although HBase includes tables, a schema is only required for tables and column families, but not for columns, and it includes increment/counter functionality.

Hive and HBase are two different Hadoop-based technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. But hey, why not use them both? Just like Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.

What version of Hadoop do I need to run HBase?

Different versions of HBase require different versions of Hadoop. Consult the table below to find which version of Hadoop you will need:

HBase Release Number Hadoop Release Number
0.1.x 0.16.x
0.2.x 0.17.x
0.18.x 0.18.x
0.19.x 0.19.x
0.20.x 0.20.x
0.90.x (current stable)

Releases of Hadoop can be found on the Apache Hadoop releases page. We recommend using the most recent version of Hadoop possible, as it will contain the most bug fixes. Note that HBase-0.2.x can be made to work on Hadoop-0.18.x. HBase-0.2.x ships with Hadoop-0.17.x, so to use Hadoop-0.18.x you must recompile Hadoop-0.18.x, remove the Hadoop-0.17.x jars from HBase, and replace them with the jars from Hadoop-0.18.x.

Also note that after HBase-0.2.x, the HBase release numbering schema will change to align with the Hadoop release number on which it depends.

What exactly do you know about HBase, and what exactly do you find different in it as compared to other platforms in its class?

It is one of the best available database management systems based on Hadoop. As compared to others, it is actually not a relational DBMS, and it cannot be considered when it comes to any structured query language. All the clusters are generally managed by a master node in this approach, and this is exactly what makes it simply the best.

Can you name a few operational commands which are present in HBase?

These are: Put, Scan, Delete, Get, and, last, Increment.

What would be the best reasons to prefer HBase as the DBMS according to you?

One of the best things about HBase is that it is scalable in all aspects and modules. Users can cater to a very large number of tables in a short time period. In addition, it has vast support for all the CRUD operations. It is capable of storing more data and can manage it simply. Also, the stores are column-oriented, and there are a very large number of rows and columns available, which enables users to keep the pace up all the time.

How many tombstone markers are there in the HBase? Name them

There are a total of 3 tombstone markers which you can consider anytime. They are Version delete, Family delete, and Column delete.

Tell a few scenarios in which you would consider HBase.

When there is a need to shift an entire database, this approach is generally opted for. In addition, HBase can be considered for data operations that are too large to handle otherwise. Moreover, when a lot of features, such as inner joins and transaction maintenance, would need to be handled frequently, HBase can be considered easily.

How can you say that HBase is capable of offering high availability?

There is a special feature known as region replication, under which several replicas of each region of a table are kept open. The load balancer in HBase makes sure that the replicas are not all hosted on servers with similar regions. This is exactly what ensures the high availability of HBase all the time.

What do you mean by WAL?

It stands for Write Ahead Log. It is basically a log that is responsible for recording all the changes in the data, irrespective of the mode of their change. It is generally considered a standard sequence file. It is actually very useful after issues like a server crash or failure, as users can still access the data through it during such problems.

Can you tell a few important components of HBase that are useful to data managers?

With HBase, users are able to handle larger amounts of data through a special component called "Region". HBase has another component called "Zookeeper", which is mainly responsible for the coordination between the master and the client on the other side. There are also "Catalog Tables", which consist of ROOT and META.



What is the use of ZooKeeper?

The ZooKeeper is used to maintain the configuration information and communication between region servers and clients. It also provides distributed synchronization. It helps in maintaining server state inside the cluster by communicating through sessions.

Every Region Server, along with the HMaster server, sends a continuous heartbeat at regular intervals to ZooKeeper, which checks which servers are alive and available. ZooKeeper also provides server-failure notifications so that recovery measures can be executed.


Define compaction in HBase?

HBase combines HFiles to reduce storage and the number of disk seeks needed for a read. This process is called compaction. Compaction chooses some HFiles from a region and combines them. There are two types of compactions (a shell sketch follows the list).

  • Minor Compaction: HBase automatically picks smaller HFiles and recommits them to bigger HFiles.
  • Major Compaction: In Major compaction, HBase merges and recommits the smaller HFiles of a region to a new HFile.
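
Both can also be requested manually from the HBase shell, for example (the table name is hypothetical):

hbase> flush 'users'            # flush the MemStore to a new HFile
hbase> compact 'users'          # request a minor compaction
hbase> major_compact 'users'    # request a major compaction (also drops deleted and expired cells)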

Can you directly delete a cell from HBase?

No, it is not possible in most cases. When users do so, the cells become invisible but remain present on the server in the form of a tombstone marker. The markers are generally removed during the compaction periods; direct deletion doesn't work in most cases.

What is the significance of Data management according to you?

Generally, organizations have to work with bulk data. When that data is structured or managed, it is easy to utilize or to deploy it for any task. Of course, being well-managed cuts down the overall time period required to accomplish a task. Users are always free to keep up the pace with structured or properly managed data. There are a lot of other reasons too that matter and always let users assure error-free outcomes.

What do you know about the set of tables in the HBase?

They consist of a long series of rows and columns, quite similar to a traditional database. There is one element in every table, and the same is called the primary key. The columns generally denote an attribute of the concerned objects.

Can you tell one basic condition to be fulfilled when it comes to getting the best out of HBase?

Users must make sure that there are enough nodes and clusters so that HBase can perform its tasks reliably and easily. With more nodes, more efficiency can simply be assured.

Is HBase an OS independent approach?

Yes, it is totally independent of the operating system, and users are free to run it on Windows, Linux, Unix, etc. The only basic requirement is that Java support be installed.

You might have used a relational database; can you tell some of the major differences you noticed in it as compared to HBase?

Well, the first difference is that HBase is not based on a schema, whereas a relational database is. Automated partitioning can easily be done in HBase, while a relational database lacks this feature. HBase tables are sparsely populated, while relational tables are thin. Also, a relational database is a row-oriented data store, while HBase is a column-oriented data store.

What is Apache HBase?

It is a column-oriented database which is used to store sparse data sets. It runs on top of the Hadoop Distributed File System. Apache HBase is a database that runs on a Hadoop cluster. Clients can access HBase data through either a native Java API or through a Thrift or REST gateway, making it accessible from any language. Some of the key properties of HBase include:

NoSQL:  HBase is not a traditional relational database (RDBMS). HBase relaxes the ACID (Atomicity, Consistency, Isolation, Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.

Wide-Column:  HBase stores data in a table-like format with the ability to store billions of rows with millions of columns. Columns can be grouped together in “column families” which allows physical distribution of row values onto different cluster nodes.

Distributed and Scalable:  HBase groups rows into "regions", which define how table data is split over multiple nodes in a cluster. If a region gets too large, it is automatically split to share the load across more servers.

Consistent:  HBase is architected to have “strongly-consistent” reads and writes, as opposed to other NoSQL databases that are “eventually consistent”. This means that once a write has been performed, all read requests for that data will return the same value.

How can you make sure of logical grouping of cells in the HBase?

This can be assured by paying attention to the row key. Users can make sure that all the cells with a similar row key are located next to each other and are present on the same server. If a particular sort order is needed, the row key can be defined accordingly.

Tell something about the procedure of deleting a row in HBase?

The best part about HBase is that everything written to RAM gets stored automatically on disk, and these disk writes are immutable, barring compaction. Compactions can be categorized into two parts, major and minor. A major compaction can delete the files marked for deletion, while minor compactions are restricted from doing the same.

What do you know about an HFile, and what is it actually related to in HBase?

It is basically a defined storage format for HBase, and it is generally related to a column family. There is no strict upper limit on them in the column families, and users can easily deploy HFiles to store data.

Is it possible for the users to alter the column family’s block size?

Yes, it is possible. Generally, when this is done, the fresh data simply occupies the new block size without affecting the old data, and the old data takes on the new block size during compaction.

Compare HBase and Hive and tell the noticeable differences

Both are based on Hadoop, but they differ from one another. Hive is generally considered one of the best available data warehouse infrastructures. The operations of HBase are limited compared to Hive, but when it comes to handling real-time operations, HBase is better. On the other hand, Hive is preferred when querying of data is the prime need.

At the record and at the table level, what are the different operational commands you can find?

At the table level, the commonly used commands are drop, list, scan, and disable, whereas get, put, scan, and increment are the commands related to the record level.

What do you mean by the region server?

Generally, databases have a huge volume of data to deal with, and it is not always possible or necessary for all the data to be linked to a single server. There is a central controller that specifies the server on which a specific piece of data is placed, and the same is known as the Region server. It is also considered a file on the system that lets users display the defined server names which are associated.

What is standalone mode in the HBase?

When the users don’t need the HBase to use the HDFS, this mode can be turned on. It is basically a default mode in the HBase and the users are generally free to use it anytime they want.  Instead of HDFS, the HBase make use of a file system when this mode is activated by the user. It is possible to save a lot of time while enabling this mode during the performance of some important tasks. It is also possible to apply or to remove various time restrictions on the data during this mode.

What is HBase shell?

It is basically a shell (JRuby-based) that is used for establishing a connection with HBase. Users need not worry about anything when it comes to connectivity; they are free to keep up the pace without worrying about the connection details when the HBase shell is deployed.

Tell a few important features of the Apache HBase?

  1. HBase is capable of being used for a lot of tasks which need modular or linear scaling
  2. All the tables are distributed on the cluster through regions
  3. With respect to the growth in the data, the regions automatically grow and split
  4. There are several Bloom filters that HBase supports
  5. The use of the block cache in HBase is fully allowed
  6. HBase is capable of handling high-volume query optimization when the data needs are complex

Name any 5 important filters in HBase with which you are familiar?

Page Filter, Family Filter, Column Filter, Row Filter and Inclusive Stop Filter

Is it possible for the users to perform iteration through the rows? Why or why not?

Yes, it is possible to iterate forward through the rows. However, performing the same task in reverse order is not allowed. This is because the column values are stored on disk with their length defined first, and the bytes of the value are written after it. To iterate in reverse order, these values would have to be stored a second time, which could create compatibility problems and affect the memory of HBase. Thus, it is not allowed.

Why HBase is a schema-less database?

This is because users need not worry about defining the data ahead of time. You only need to define the column family name and nothing else. This makes HBase a schema-less database.

What is the procedure to write data in the HBase?

During any modification or change to the data, it is first sent to a commit log, also known as the WAL. After this, the data is stored in memory. When the data in memory exceeds the defined limit, it is flushed to disk as an HFile. The users are then free to discard the commit logs and proceed with the stored data.

Define TTL in HBase?

It is basically a technique that is useful for data retention. It is possible for users to preserve the versions of a cell for a defined time period; they get deleted automatically upon the expiry of that time.
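
A sketch in the HBase shell, setting a one-day TTL (in seconds) on a hypothetical table at creation time:

hbase> create 'events', {NAME => 'cf', TTL => 86400}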

What is HBase?

It is a column-oriented database developed by the Apache Software Foundation. Running on top of a Hadoop cluster, HBase is used to store semi-structured and unstructured data. So, it does not have a rigid schema like that of a traditional relational database. Also, it does not support an SQL syntax structure. HBase stores and operates on data through a master node regulating the cluster and region servers.

What are the reasons for using HBase?

HBase offers a high-capacity storage system and random read and write operations. It can handle large datasets, performing several operations per second. The distributed and horizontally scalable design makes HBase a popular choice for real-time applications.

Explain the key components of HBase.

The working parts of HBase include Zookeeper, HBase Master, RegionServer, Region, and Catalog Tables. The purpose of each element can be described as follows:

  • Zookeeper coordinates between the client and the HBase Master
  • HBase Master monitors the RegionServer and takes care of the admin functions
  • RegionServer supervises the Region
  • Region contains the MemStore and HFile
  • Catalog Tables comprise ROOT and META

Basically, HBase consists of a set of tables with each table having rows, columns, and a primary key. It is the HBase column that denotes an object’s attribute.

What are the different types of operational commands in HBase?

There are five crucial operational commands in HBase: Get, Delete, Put, Increment, and Scan.

Get is used to read the table. Executed via HTable.get, it returns data or attributes of a specific row from the table. Delete removes rows from a table, whereas Put adds or updates rows. Increment enables increment operations on a single row. Finally, Scan is used to iterate over multiple rows for certain attributes.

What do you understand by WAL and Hlog?

  • WAL stands for Write Ahead Log and is quite similar to the BIN log in MySQL. It records all the changes in the data.
  • HLog is Hadoop's standard sequence file, which maintains the HLogKey records.

WAL and HLog serve as lifelines in the events of server failure and data loss. If the RegionServer crashes or becomes unavailable, WAL files ensure that the data changes can be replayed.

Describe some situations wherein you would use HBase.

It is suitable to use HBase when:

  • The size of your data is vast, requiring you to operate on millions of records.
  • You are implementing a complete redesign and overhauling the conventional RDBMS.
  • You have the resources to undertake infrastructure investment in clusters.
  • You can do without particular SQL features, such as transactions, typed columns, inner joins, etc.

What do you mean by columns families and row keys?

Column families constitute the basic storage units in HBase. These are defined during table creation and stored together on the disk, later allowing for the application of features like compression.

A row key enables the logical grouping of cells. It is prefixed to the combined key, letting the application define the sort order. In this way, all the cells with the same row key can be saved on the same server.

How does HBase differ from a relational database?

HBase is different from a relational database as it is a schema-less, column-oriented data store containing sparsely populated tables. A relational database is schema-based, row-oriented, and stores normalized data in thin tables. Moreover, HBase has the advantage of automated partitioning, whereas there is no such built-in support in RDBMS.


What constitutes a cell in HBase?

Cells are the smallest units of HBase tables, holding the data in the form of tuples. A tuple is a data structure having multiple parts. In HBase, it consists of {row, column, version}.

Define compaction in HBase.

Compaction is the process used to merge HFiles into a single file before the old files are removed from the database.

Can you access HFile directly without using HBase?

Yes, there is a unique technique to access HFile directly without the aid of HBase. The HFile.main method can be used for this purpose.

Discuss deletion and tombstone markers in HBase.

In HBase, a normal deletion process results in a tombstone marker. The deleted cells become invisible, but the data represented by them is actually removed during compaction. HBase has three types of tombstone markers:

  • Version delete marker:  It marks a single version of a column for deletion
  • Column delete marker:  It marks all versions of a column
  • Family delete marker:  It sets up all columns of a column family for deletion

Here, it needs to be noted that a row in HBase would be entirely deleted after major compaction. Therefore, when you delete and add more data, the Gets may be masked by tombstone markers, and you may not see the inserted values until after the compactions.

What happens when you alter the block size of a column family?

If your database is already occupied and you wish to alter your column family's block size in HBase, the old data remains in the old block size for the time being. During compaction, the old and new data behave like this:

  • Existing data continues to be read correctly and takes on the new block size as it is compacted.
  • Newly flushed files have the new block size immediately.

In this way, all data transforms to the desired block size by the next major compaction.

Define the different modes that HBase can run.

HBase can either run in standalone mode or the distributed mode. Standalone is the default mode of HBase that uses the local files system instead of HDFS. As for the distributed mode, it can be further subdivided into:

  • Pseudo-distributed mode:  All daemons run on a single node
  • Fully-distributed mode:  Daemons run across all nodes in the cluster

How would you implement joins in HBase?

HBase uses MapReduce jobs to process terabytes of data in a scalable fashion. It does not directly support joins, but join queries can be implemented by retrieving data from multiple HBase tables.


Discuss the purpose of filters in HBase.

Filters were introduced in Apache HBase 0.92 to help users access HBase over Shell or Thrift. So, they take care of your server-side filtering needs. There are also decorating filters that extend the uses of filters to gain additional control over returned data. Here are some examples of filters in HBase:

  • Bloom Filter:  Typically used for real-time queries, it is a space-efficient way of knowing whether an HFile includes a specific row or cell
  • Page Filter:  Accepting the page size as a parameter, the Page Filter can optimize the scan of individual HRegions

Compare HBase with (i) Cassandra (ii) Hive.

(i) HBase and Cassandra:

Both Cassandra and HBase are NoSQL databases designed to manage large datasets. However, the syntax of Cassandra Query Language (CQL) is modeled after SQL. In both data stores, the row key forms the primary index. Cassandra can create secondary indexes on column values. Hence, it can improve data access in columns with high levels of repetition. HBase lacks this provision but has other mechanisms to bring in the secondary-index functionality. These methods can be easily found in online reference guides.

(ii) HBase and Hive:

Both of them are Hadoop-based technologies. As discussed above, HBase is a NoSQL key/value database. On the other hand, Hive is an SQL-like engine capable of running sophisticated MapReduce jobs. You can perform read and write data operations from Hive to HBase and vice-versa. While Hive is more suitable for analytical tasks, HBase is an excellent solution for real-time querying.

Where can the compression feature be applied in HBase?

Compression is applied at the column family level, since column families comprise the unit of physical storage in HBase. There are no complex restrictions that need to be fulfilled for this.

Define compaction in HBase?

Compaction is a process which is used to merge the HFiles into one file; after the merged file is created, the old files are deleted. There are different types of tombstone markers which make cells invisible, and these tombstone markers are deleted during compaction.



What is the use of HColumnDescriptor class?

HColumnDescriptor stores the information about a column family like compression settings, number of versions etc. It is used as input when creating a table or adding a column.

Which filter accepts the pagesize as the parameter in HBase?

PageFilter accepts the page size as the parameter. It is an implementation of the Filter interface that limits results to a specific page size. It terminates scanning once the number of filter-passed rows is greater than the given page size.

Syntax: PageFilter (<page_size>)
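
A brief Java sketch (table is obtained as in the earlier connection examples; names are hypothetical). Because the filter runs independently on each region server, a scan can return slightly more rows than the page size, so clients usually re-check the count:

Filter pageFilter = new PageFilter(10);
Scan scan = new Scan();
scan.setFilter(pageFilter);
try (ResultScanner scanner = table.getScanner(scan)) {
    int count = 0;
    for (Result row : scanner) {
        if (++count > 10) break; // enforce the page size on the client side
        // process the row here
    }
}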

How will you design or modify schema in HBase programmatically?

HBase schemas can be created or updated using the Apache HBase Shell or by using Admin in the Java API.

Creating a table schema:

Configuration config = HBaseConfiguration.create();

HBaseAdmin admin = new HBaseAdmin(config);

// Commands are executed through the admin object

// Instantiating the table descriptor class

HTableDescriptor t1 = new HTableDescriptor(TableName.valueOf("employee"));

// Adding column families to t1

t1.addFamily(new HColumnDescriptor("professional"));

t1.addFamily(new HColumnDescriptor("personal"));

// Create the table through admin

admin.createTable(t1);

For modification:

String table = "myTable";

admin.disableTable(table);

HColumnDescriptor cf2 = …; // the modified column family descriptor

admin.modifyColumn(table, cf2); // modifying existing ColumnFamily

admin.enableTable(table);

What is HBase Fsck?

HBase comes with a tool called hbck which is implemented by the HBaseFsck class. HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase. It works in two basic modes – a read-only inconsistency identifying mode and a multi-phase read-write repair mode.

How do we back up a HBase cluster?

There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has benefits and limitations.

Full Shutdown Backup

Some environments can tolerate a periodic full shutdown of their HBase cluster, for example, if it is being used as a back-end process and not serving front-end webpages.

  • Stop HBase:   Stop the HBase services first.
  • Distcp:  Distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or to a different cluster.
  • Restore:  The backup of the HBase directory in HDFS is copied onto the 'real' HBase directory via distcp. Copying these files creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required: this restore (via distcp) covers a specific HDFS directory (i.e., the HBase part), not the entire HDFS file system.

Live Cluster Backup

Environments that cannot handle downtime use the live cluster backup approach; command-line sketches follow the list.

  • CopyTable:  The CopyTable utility can be used to copy data from one table to another on the same cluster, or to a table on another cluster.
  • Export:  Export approach dumps the content of a table to HDFS on the same cluster.
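
Hedged sketches of the two live-cluster utilities named above (the peer cluster address, table names, and output path are hypothetical):

# Copy table "users" to table "users_backup" on a remote cluster
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
      --peer.adr=backup-zk:2181:/hbase --new.name=users_backup users

# Dump the content of "users" to an HDFS directory on the same cluster
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.Export users /backup/users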

How HBase Handles the write failure?

Failures are common in large distributed systems, and HBase is no exception.

If the server hosting a MemStore that has not yet been flushed crashes, the data that was in memory but not yet persisted is lost. HBase safeguards against that by writing to the WAL before the write completes.

Every server that's part of the HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write isn't considered successful until the new WAL entry is successfully written. This guarantee makes HBase as durable as the file system backing it. Most of the time, HBase is backed by the Hadoop Distributed File System (HDFS). If HBase goes down, the data that was not yet flushed from the MemStore to the HFile can be recovered by replaying the WAL.

 

So, this brings us to the end of the Apache HBase Interview Questions blog. This Tecklearn 'Top Apache HBase Interview Questions and Answers' post helps you with commonly asked questions if you are looking out for a job in Apache HBase or the Big Data domain. If you wish to learn Apache HBase and build a career in the Big Data domain, then check out our interactive Apache HBase Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/apache-hbase-training/

 

Apache HBase Training

About the Course

Tecklearn's Apache HBase training will help you master the powerful NoSQL distributed database. You will learn HBase architecture, data analytics using HBase, integration with Hive, monitoring a cluster using ZooKeeper, and working on real-life industry projects. Build your career as a certified HBase professional through our hands-on training with real-world examples. Upon completion of this online training, you will hold a solid understanding of and hands-on experience with Apache HBase.

Why Should you take Apache HBase Training?

  • HBase now powers some of the largest data-driven services, including the Facebook Messaging Platform.
  • There is Strong demand for HBase qualified professionals and they are paid big bucks for the right skills.
  • According to indeed.com, the average pay of an HBase developer stands at $81,422 per annum.

What you will Learn in this Course

Introduction to HBase and NoSQL

  • Introduction to HBase
  • Fundamentals of HBase
  • What is NoSQL
  • NoSQL Vs RDBMS
  • Why HBase
  • Where to use HBase

HBase Data Modelling

  • Data Modelling
  • HDFS vs. HBase
  • HBase Use Cases

HBase Architecture and Components

  • HBase Architecture
  • Components of HBase Cluster

HBase Installation

  • Prerequisites for HBase Installation
  • Installation Steps

Programming in HBase

  • Create an Eclipse Project for HBase
  • Simple Table Creation from Java in HBase
  • HBase API
  • HBase Shell
  • Primary operations and advanced operations

Integration of Hive with HBase

  • Create a table and insert data into it
  • Integration of Hive with HBase
  • HBase Mapping

Deep Dive into HBase

  • Input Data into HBase
  • File Loading
  • HDFS File
  • HBase handling files in File System
  • WAL
  • Seek Vs Transfer
  • HBase ACID Properties

Got a question for us? Please mention it in the comments section and we will get back to you.
