Top Apache Hive Interview Questions and Answers

Last updated on Feb 18 2022
Anudhati Reddy

Define the difference between Hive and HBase.

Hive vs HBase

  • HBase is a database built on top of HDFS, whereas Hive is a data warehousing infrastructure on top of Hadoop.
  • HBase operations run in real time on its database, whereas Hive queries are executed internally as MapReduce jobs.
  • HBase provides low-latency access to single rows from huge datasets, whereas Hive has high latency for queries over huge datasets.
  • HBase provides random access to data, whereas Hive provides only batch (sequential) access.

What kind of applications are supported by Apache Hive?

Hive supports client applications written in any of the following languages by exposing its Thrift server:

  • Java
  • PHP
  • Python
  • C++
  • Ruby

Suppose I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?

The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. a local or remote metastore configuration, in Apache Hive to allow multiple clients concurrent access.

Following are the steps to configure a MySQL database as the local metastore in Apache Hive (a sample configuration is sketched after this list):

  • Make the following changes in hive-site.xml:
    • The javax.jdo.option.ConnectionURL property should be set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
    • The javax.jdo.option.ConnectionDriverName property should be set to com.mysql.jdbc.Driver.
    • Also set the username and password:
      • javax.jdo.option.ConnectionUserName is set to the desired username.
      • javax.jdo.option.ConnectionPassword is set to the desired password.
  • The JDBC driver JAR file for MySQL must be on Hive’s classpath, i.e. the JAR file should be copied into Hive’s lib directory.
  • Now, after restarting the Hive shell, Hive will automatically connect to the MySQL database running as a standalone metastore.
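As an illustration, the relevant hive-site.xml entries could look like the following (the host, database name, and credentials are placeholders, not values from this article):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysqlhost/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>  <!-- placeholder username -->
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>  <!-- placeholder password -->
</property>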

What is the difference between external table and managed table?

Here is the key difference between an external table and a managed table (a short sketch follows the list):

  • In case of a managed table, if one drops the table, the metadata information along with the table data is deleted from the Hive warehouse directory.
  • On the contrary, in case of an external table, Hive just deletes the metadata information regarding the table and leaves the table data present in HDFS untouched.
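A minimal sketch of the two behaviors (the table names and path are illustrative, not from this article):

CREATE TABLE managed_demo (id INT);                                            -- data lives under the Hive warehouse directory
CREATE EXTERNAL TABLE external_demo (id INT) LOCATION '/data/external_demo';

DROP TABLE managed_demo;    -- deletes both the metadata and the data files
DROP TABLE external_demo;   -- deletes the metadata only; files under /data/external_demo remain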

What is dynamic partitioning and when is it used?

In dynamic partitioning, the values for the partition columns are known only at runtime, i.e. they become known while the data is being loaded into the Hive table.

One may use dynamic partitioning in the following two cases:

  • Loading data from an existing non-partitioned table, to improve sampling and therefore decrease the query latency.
  • When one does not know all the values of the partitions beforehand, so finding these partition values manually from huge datasets would be a tedious task.

Scenario:

Suppose, I create a table that contains details of all the transactions done by the customers of year 2016:

CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for each month. But, Hive is taking too much time in processing this query. How will you solve this problem and list the steps that I will be taking in order to do so?

We can solve this problem of query latency by partitioning the table according to each month. So, for each month we will be scanning only the partitioned data instead of whole data sets.

As we know, we can’t partition an existing non-partitioned table directly. So, we will take the following steps to solve this problem:

  1. Create a partitioned table, say partitioned_transaction:

CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

  2. Enable dynamic partitioning in Hive:

SET hive.exec.dynamic.partition = true;

SET hive.exec.dynamic.partition.mode = nonstrict;

  3. Transfer the data from the non-partitioned table into the newly created partitioned table:

INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;

Now, we can perform the query using each partition and therefore, decrease the query time.

How does Hive distribute the rows into buckets?

Hive determines the bucket number for a row by using the formula: hash_function (bucketing_column) modulo (num_of_buckets). Here, hash_function depends on the column data type. For integer data type, the hash_function will be:

hash_function(int_type_column) = value of int_type_column
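For example, with 4 buckets, a row whose cust_id is 10 lands in bucket 10 mod 4 = 2. A minimal sketch of a bucketed table definition (the table and column names are illustrative):

CREATE TABLE bucketed_transaction (cust_id INT, amount FLOAT)
CLUSTERED BY (cust_id) INTO 4 BUCKETS;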

What will happen in case you have not issued the command ‘SET hive.enforce.bucketing=true;’ before bucketing a table in Apache Hive 0.x or 1.x?

The command ‘SET hive.enforce.bucketing=true;’ allows one to have the correct number of reducers while using the ‘CLUSTER BY’ clause for bucketing a column. If it is not issued, one may find that the number of files generated in the table directory is not equal to the number of buckets. As an alternative, one may also set the number of reducers equal to the number of buckets by using SET mapred.reduce.tasks = num_buckets.

What is indexing and why do we need it?

The Hive index is one of Hive’s query optimization methods. An index is used to speed up access to a column or set of columns in a Hive database, because with an index the database system does not need to read all the rows in the table to find the data that one has selected.
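A hedged sketch of creating and building a compact index on the transaction_details table from the earlier scenario (the index name is illustrative; note that the indexing feature was removed entirely in Hive 3.0):

CREATE INDEX idx_country ON TABLE transaction_details (country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX idx_country ON transaction_details REBUILD;   -- populates the index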

Scenario:

Suppose, I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive table corresponding to these files. The data in these files are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files.

So, how will you solve this problem where we want to create a single Hive table for lots of small files without degrading the performance of the system?

One can use the SequenceFile format which will group these small files together to form a single sequence file. The steps that will be followed in doing so are as follows:

  • Create a temporary table:

CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

  • Load the data into temp_table:

LOAD DATA INPATH '/input' INTO TABLE temp_table;

  • Create a table that will store data in SequenceFile format:

CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;

  • Transfer the data from the temporary table into the sample_seqfile table:

INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;

Hence, a single SequenceFile is generated which contains the data present in all of the input files and therefore, the problem of having lots of small files is finally eliminated.

Differentiate between Pig and Hive.

Criteria for comparing Apache Pig and Apache Hive:

  • Nature – Pig: procedural data-flow language; Hive: declarative SQL-like language.
  • Application – Pig: used for programming; Hive: used for report creation.
  • Used by – Pig: researchers and programmers; Hive: mainly data analysts.
  • Operates on – Pig: the client side of a cluster; Hive: the server side of a cluster.
  • Accessing raw data – Pig: not as fast as HiveQL; Hive: faster, with built-in features.
  • Schema or data type – Pig: always defined in the script itself; Hive: stored in the metastore database.
  • Ease of learning – Pig: takes a little extra time and effort to master; Hive: easy to learn for those with database experience.

What is a Hive variable? What do we use it for?

Hive variables are created in the Hive environment and referenced by Hive scripts. They allow passing values to a Hive query when the query starts executing. They can be set with the SET command (or the --hivevar command-line option) and are substituted into the query before it runs.
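A small sketch of variable substitution (the variable name and query are illustrative, reusing the transaction_details table from the earlier scenario):

SET hivevar:target_month='Jan';
SELECT SUM(amount) FROM transaction_details WHERE month = ${hivevar:target_month};

The same variable could also be supplied from the command line with the --hivevar option.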

Explain the process to access subdirectories recursively in Hive queries.

By using the below commands, we can access subdirectories recursively in Hive:

hive> SET mapred.input.dir.recursive=true;
hive> SET hive.mapred.supports.subdirectories=true;

Hive tables can be pointed to the higher level directory, and this is suitable for the directory structure like:

/data/country/state/city/

Can we change the settings within a Hive session? If yes, how?

Yes, we can change the settings within a Hive session using the SET command. It helps change the Hive job settings for a particular query. For example, the following command ensures that buckets are populated according to the table definition:

hive> SET hive.enforce.bucketing=true;

We can see the current value of any property by using SET with the property name. SET on its own will list all the properties with the values set by Hive.

hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true

This list will not include the defaults of Hadoop. So, we should use the below code:

SET -v

It will list all the properties including the Hadoop defaults in the system.

Is it possible to add 100 nodes when we already have 100 nodes in Hive? If yes, how?

Yes, we can add the nodes by following the below steps:

Step 1: Take a new system; create a new username and password
Step 2: Install SSH and set up SSH connections with the master node
Step 3: Add the SSH public RSA key to the authorized_keys file
Step 4: Add the new DataNode’s hostname, IP address, and other details to /etc/hosts and to the slaves file:

192.168.1.102 slave3.in slave3

Step 5: Start the DataNode on the new node
Step 6: Log in to the new node as the hadoop user, for example:

ssh -X hadoop@192.168.1.103

Step 7: Start HDFS on the newly added slave node by using the following command:

./bin/hadoop-daemon.sh start datanode

Step 8: Check the output of the jps command on the new node

How to change the column data type in Hive? Explain RLIKE in Hive.

We can change the column data type by using ALTER and CHANGE as follows:

ALTER TABLE table_name CHANGE column_name column_name new_datatype;

For example, if we want to change the data type of the salary column from integer to bigint in the employee table, we can use the following:

ALTER TABLE employee CHANGE salary salary BIGINT;

RLIKE: It is a relational operator in Hive that performs pattern matching with Java regular expressions. A RLIKE B evaluates to true if any substring of string A matches the Java regular expression B.
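For instance (reusing the employee table from the ALTER TABLE example above; the name column is assumed for illustration):

SELECT * FROM employee WHERE name RLIKE '^A.*';   -- employees whose name starts with 'A'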

What are Buckets in Hive?

Buckets in Hive are used in segregating Hive table data into multiple files or directories. They are used for efficient querying.

What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?

Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS have put restrictions on what Hive can do. However, Hive is most suitable for data warehouse applications because it:

  • Analyzes relatively static data
  • Does not require fast response times
  • Does not make rapid changes in the data

Although Hive doesn’t provide fundamental features required for Online Transaction Processing (OLTP), it is suitable for data warehouse applications in large datasets. There are two types of tables in Hive:

  • Managed tables
  • External tables

What is the precedence order of Hive configuration?

Hive uses the following precedence hierarchy for setting properties:

  1. The SET command in Hive
  2. The command-line --hiveconf option
  3. hive-site.xml
  4. hive-default.xml
  5. hadoop-site.xml
  6. hadoop-default.xml

If you run a select * query in Hive, why doesn’t it run MapReduce?

The hive.fetch.task.conversion property of Hive allows simple queries (for example, SELECT with filter and LIMIT clauses) to be executed as fetch tasks, skipping MapReduce entirely and thereby avoiding the latency of MapReduce overhead.

How can we improve the performance with ORC format tables in Hive?

We can store Hive data in a highly efficient manner in the Optimized Row Columnar (ORC) file format. It overcomes many limitations of other Hive file formats. We can improve performance by using ORC files while reading, writing, and processing data.

SET hive.compute.query.using.stats=true;
SET hive.stats.dbclass=fs;

CREATE TABLE orc_table (
  id INT,
  name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\:'
LINES TERMINATED BY '\n'
STORED AS ORC;

Explain the functionality of ObjectInspector.

ObjectInspector helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory, such as:

  • An instance of Java class
  • A standard Java object
  • A lazily initialized object

ObjectInspector tells the structure of the object and also the ways to access the internal fields inside the object.

Differentiate between Hive and HBase.

  • Hive enables most SQL queries; HBase does not allow SQL queries.
  • Hive operations do not run in real time; HBase operations run in real time.
  • Hive is a data warehouse framework; HBase is a NoSQL database.
  • Hive runs on top of MapReduce; HBase runs on top of HDFS.

 How can we access the subdirectories recursively?

By using the below commands, we can access subdirectories recursively in Hive:

hive> SET mapred.input.dir.recursive=true;
hive> SET hive.mapred.supports.subdirectories=true;

Hive tables can be pointed to the higher level directory, and this is suitable for the directory structure:

/data/country/state/city/

What are the uses of Hive Explode?

Explode takes an array as input and converts its elements into separate table rows. To convert complicated data types into desired table formats, Hive uses Explode.

What is the available mechanism for connecting applications when we run Hive as a server?

  • Thrift Client: Using Thrift, we can call Hive commands from various programming languages, such as C++, PHP, Java, Python, and Ruby.
  • JDBC Driver: JDBC Driver enables accessing data with JDBC support, by translating calls from an application into SQL and passing the SQL queries to the Hive engine.
  • ODBC Driver: It implements the ODBC API standard for the Hive DBMS, enabling ODBC-compliant applications to interact seamlessly with Hive.

Mention various date types supported by Hive.

Hive supports the TIMESTAMP data type, which stores date/time values in the java.sql.Timestamp format, as well as a DATE data type.

Three collection data types in Hive are:

  • Arrays
  • Maps
  • Structs

Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? If yes, how? Give an example.

Yes, we can run UNIX shell commands from Hive by using an ‘!’ mark before the command. For example, !pwd at the Hive prompt will display the current directory.
We can execute Hive queries from script files using the source command.

Example:

Hive> source /path/to/file/file_with_query.hql

What is a Hive Variable and What Is It Used For?

Referenced by Hive scripting languages, a Hive variable is created in the Hive environment. Once Hive queries begin executing, the Hive variable provides its value to the queries.

What is Hive Composed Of?

Tell your interviewer that Hive is made up of three main components: Hive Services, Hive Clients, and Hive Storage and Computing. You should also briefly explain to your interviewer what each component is capable of and the differences between each part.

What Are the Main Components of Hive Architecture?

You’ll first want to answer this question by naming each of the main components: Driver, User Interface, Execute Engine, Compiler, and Metastore. You’ll really demonstrate your Hive knowledge to your interviewer if you’re able to explain the capabilities of each component as well.

What Options Are Available When It Comes to Attaching Applications to the Hive Server?

Explain the three different ways (Thrift Client, JDBC Driver, and ODBC Driver) you can connect applications to the Hive Server. You’ll also want to explain the purpose for each option: for example, using JDBC will support the JDBC protocol.

What do you understand by Hive?

Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive’s query language closely resembles SQL and is called HiveQL. Hive also permits traditional MapReduce programs to plug in custom mappers and reducers when it is awkward or inefficient to express the logic in HiveQL (User-Defined Functions, UDFs).

How can a developer utilize Hive?

Hive is helpful for building data warehouse applications when you are dealing with static data rather than dynamic data:

  • When the application can tolerate high latency (high response time)
  • When a large data set is maintained
  • When we are using queries instead of scripting

Specify the different modes of Hive?

The various modes of Hive are as follows:

  • Local mode
  • MapReduce mode

Hive can work in the above modes depending upon the size of the data nodes in Hadoop.

What are the key components of Hive Architecture?

Key components of Hive Architecture include:

  • Metastore
  • User Interface
  • Driver
  • Execute Engine
  • Compiler

What is the utilization of HCatalog?

HCatalog can be utilized to share data structures with external systems. HCatalog gives other tools on Hadoop access to the Hive metastore so that they can read and write data to Hive’s data warehouse.

Say what the ObjectInspector functionality is in Hive?

In Hive, the analysis of the internal structure of columns, rows, and complex objects is done using the ObjectInspector functionality. ObjectInspector also makes the internal fields present inside the objects accessible.

What is (HS2) Hive Server2?

Hive Server2 is a server interface. The functions performed by Hive Server2 are as follows:

  • Enables remote clients to execute queries against Hive.
  • Retrieves the results of the specified queries.

Advanced features:

  • Multi-client concurrency
  • Authentication

Specify the partitions in Hive?

Here are the key points about partitions in Hive:

  • Partitions are a method for dividing tables into different parts based on partition keys.
  • Partitioning is utilized when the table has at least one partition key.
  • Partition keys act as the basic elements that decide how the information is stored in the table.

Clarify how Hive serializes and de-serializes the information?

Serialization and de-serialization formats are popularly known as SerDes. Hive enables the framework to read or write information in a specific format. These formats parse the structured or unstructured data bytes stored in HDFS according to the definitions of the Hive tables.

Say what the views are in Hive?

Views are similar to tables in Hive; they are produced based on various requirements:

  • Any result set data can be saved as a view in Hive.
  • The usage is similar to views used in SQL.
  • Read-style DML operations (such as SELECT) can be performed on a view; Hive views themselves are read-only.

What is the major difference between local and remote meta-store?

  • Local Metastore: In the local metastore design, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
  • Remote Metastore: In the remote metastore design, the metastore service runs in its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server utilizing Thrift network APIs. You can have one or more metastore servers in this case to provide greater availability.

What is the main difference between HBase and Hive?

Both Hive and HBase are technologies that depend on Hadoop. Hive is a data warehouse infrastructure that is utilized on Hadoop, while HBase is a NoSQL key-value store that runs on Hadoop itself. Hive enables those who know SQL to run MapReduce jobs, while HBase supports four primary operations: put, get, scan, and delete. HBase is useful for record-level querying of current data, whereas Hive is useful for analytical querying of data gathered over a period of time.

Clarify about the SMB Join in Hive?

In an SMB join in Hive, every mapper reads a bucket from the first table and the corresponding bucket from the second table, and after that a merge-sort join is performed. Sort Merge Bucket (SMB) join in Hive is mostly used where there are no limits on file, partition, or table size for the join. SMB join is best utilized when the tables are large. In an SMB join the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in an SMB join.

What are the various uses of explode in Hive?

Hadoop developers take an array as input and convert it into separate table rows. To convert complex data types into the desired table formats, Hive basically utilizes explode.

What is the use of Hcatalog?

Hcatalog can be used to share data structures with external systems. Hcatalog provides access to hive metastore to users of other tools on Hadoop so that they can read and write data to hive’s data warehouse.

Write a query to rename a table Student to Student_New.

ALTER TABLE Student RENAME TO Student_New;

Where is table data stored in Apache Hive by default?

hdfs://namenode_server/user/hive/warehouse

When executing Hive queries in different directories, why is metastore_db created in all places from where Hive is launched?

When running Hive in embedded mode, it creates a local metastore. When you run a query, it first checks whether a metastore already exists or not. The property javax.jdo.option.ConnectionURL defined in hive-site.xml has the default value jdbc:derby:;databaseName=metastore_db;create=true.

The value implies that embedded Derby will be used as the Hive metastore and that the location of the metastore is metastore_db, which will be created only if it does not exist already. The location metastore_db is a relative path, so when you run queries from different directories it gets created in every place from which you launch Hive. This property can be altered in the hive-site.xml file to an absolute path so that a single metastore location is used instead of creating a metastore_db subdirectory multiple times.

How will you read and write HDFS files in Hive?

  • TextInputFormat – This class is used to read data in plain text file format.
  • HiveIgnoreKeyTextOutputFormat – This class is used to write data in plain text file format.
  • SequenceFileInputFormat – This class is used to read data in the Hadoop SequenceFile format.
  • SequenceFileOutputFormat – This class is used to write data in the Hadoop SequenceFile format.

I want to see the present working directory in UNIX from hive. Is it possible to run this command from hive?

Hive allows execution of UNIX commands with the use of the exclamation (!) symbol. Just use the ! symbol before the command to be executed at the Hive prompt. To see the present working directory in UNIX from Hive, run !pwd at the Hive prompt.

What is the use of explode in Hive?

Explode in Hive is used to convert complex data types into desired table formats. The explode UDTF basically emits all the elements of an array as separate rows.
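A minimal sketch, assuming a hypothetical orders table with an ARRAY<STRING> column named items:

SELECT order_id, item
FROM orders
LATERAL VIEW explode(items) exploded_items AS item;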

Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.

SORT BY – Data is ordered at each of ‘N’ reducers where the reducers can have overlapping range of data.

ORDER BY- This is similar to the ORDER BY in SQL where total ordering of data takes place by passing it to a single reducer.

DISTRIBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same DISTRIBUTE BY columns will go to the same reducer.

CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY where each of the N reducers gets non overlapping range of data which is then sorted by those ranges at the respective reducers.
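The four clauses side by side, using the transaction_details table from the earlier scenario as an illustration:

SELECT * FROM transaction_details SORT BY amount;         -- ordering within each reducer only
SELECT * FROM transaction_details ORDER BY amount;        -- total ordering through a single reducer
SELECT * FROM transaction_details DISTRIBUTE BY cust_id;  -- rows with the same cust_id go to the same reducer
SELECT * FROM transaction_details CLUSTER BY cust_id;     -- DISTRIBUTE BY cust_id plus SORT BY cust_id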

Is it possible to change the default location of Managed Tables in Hive, if so how?

Yes, we can change the default location of Managed tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value to the LOCATION keyword.
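A minimal sketch (the table name and path are hypothetical):

CREATE TABLE managed_custom (id INT, name STRING)
LOCATION '/custom/hive/managed_custom';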

How does data transfer happen from HDFS to Hive?

If data is already present in HDFS, then the user need not use LOAD DATA (which moves the files into /user/hive/warehouse/). Instead, the user just has to define the table using the keyword EXTERNAL, which creates the table definition in the Hive metastore without moving the data:

CREATE EXTERNAL TABLE table_name (
  id INT,
  myfields STRING
)
LOCATION '/my/location/in/hdfs';

In case of embedded Hive, can the same metastore be used by multiple users?

We cannot use the embedded metastore in sharing mode. It is suggested to use a standalone database such as PostgreSQL or MySQL.

Is it possible to compress json in Hive external table?

Yes; you need to gzip your files and put them as-is (*.gz) into the table location.
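A sketch of one way to do this for JSON data, assuming the JsonSerDe shipped with hive-hcatalog-core is on the classpath (the table name, columns, and path are illustrative):

CREATE EXTERNAL TABLE json_events (id INT, msg STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/json_events';   -- gzipped *.gz text files here are decompressed transparently on read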

Mention what are the different modes of Hive?

Depending on the size of data nodes in Hadoop, Hive can operate in two modes.

These modes are,

  • Local mode
  • Map reduce mode

Mention key components of Hive Architecture?

Key components of Hive Architecture include,

  • User Interface
  • Compiler
  • Metastore
  • Driver
  • Execute Engine

Mention what Hive is composed of?

Hive consists of 3 main parts,

  1. Hive Clients
  2. Hive Services
  3. Hive Storage and Computing

Mention what is (HS2) HiveServer2?

It is a server interface that performs the following functions:

  • It allows remote clients to execute queries against Hive
  • Retrieves the results of the mentioned queries

Some advanced features, based on Thrift RPC, in its latest version include:

  • Multi-client concurrency
  • Authentication

Mention what Hive query processor does?

The Hive query processor converts SQL into a graph of MapReduce jobs, along with the execution-time framework, so that the jobs can be executed in the order of their dependencies.

Mention what are the components of a Hive query processor?

The components of a Hive query processor include,

  • Logical Plan Generation
  • Physical Plan Generation
  • Execution Engine
  • Operators
  • UDF’s and UDAF’s
  • Optimizer
  • Parser
  • Semantic Analyzer
  • Type Checking

What are Buckets in Hive?

  • The data present in the partitions can be divided further into Buckets
  • The division is performed based on the hash of a particular column that is selected in the table.

In Hive, how can you enable buckets?

In Hive, you can enable buckets by using the following command,

SET hive.enforce.bucketing=true;

Can you overwrite the Hadoop MapReduce configuration in Hive?

Yes, you can overwrite Hadoop MapReduce configuration in Hive.

 Explain how can you change a column data type in Hive?

You can change a column data type in Hive by using command,

ALTER TABLE table_name CHANGE column_name column_name new_datatype;

Where does the data of a Hive table get stored?

By default, the Hive table is stored in an HDFS directory – /user/hive/warehouse. One can change it by specifying the desired directory in hive.metastore.warehouse.dir configuration parameter present in the hive-site.xml.

What is a metastore in Hive?

Metastore in Hive stores the metadata information using an RDBMS and an open-source ORM (Object Relational Model) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.

Why Hive does not store metadata information in HDFS?

Hive stores metadata information in the metastore using RDBMS instead of HDFS. The reason for choosing RDBMS is to achieve low latency as HDFS read/write operations are time consuming processes.

What is the difference between local and remote metastore?

Local Metastore:

In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.

Remote Metastore:

In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.

What is the default database provided by Apache Hive for metastore?

By default, Hive provides an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.

Scenario:

Is it possible to change the default location of a managed table?

Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause – LOCATION ‘<hdfs_path>’.

When should we use SORT BY instead of ORDER BY?

We should use SORT BY instead of ORDER BY when we have to sort huge datasets, because the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute.

What is a partition in Hive?

Hive organizes tables into partitions for grouping similar type of data together based on a column or partition key. Each Table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory in the table directory.

Why do we perform partitioning in Hive?

Partitioning provides granularity in a Hive table and therefore, reduces the query latency by scanning only relevant partitioned data instead of the whole data set.

For example, we can partition a transaction log of an e-commerce website based on month, like Jan, Feb, etc. So, any analytics regarding a particular month, say Jan, will have to scan only the Jan partition (sub-directory) instead of the whole table data.

How can you add a new partition for the month December in the above partitioned table?

For adding a new partition to the above table partitioned_transaction, we will issue the command given below:

ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';

What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it?

By default, the number of maximum partitions that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:

SET hive.exec.max.dynamic.partitions.pernode = <value> 

Note: You can set the total number of dynamic partitions that can be created by one statement by using: SET hive.exec.max.dynamic.partitions = <value>

Scenario:

I am inserting data into a table based on partitions dynamically. But, I received an error – FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least one static partition column. How will you remove this error?

To remove this error, one has to execute following commands:

SET hive.exec.dynamic.partition = true;

SET hive.exec.dynamic.partition.mode = nonstrict;


  • By default, the hive.exec.dynamic.partition configuration property is set to false if you are using a Hive version prior to 0.9.0.
  • hive.exec.dynamic.partition.mode is set to strict by default. Only in nonstrict mode does Hive allow all partitions to be dynamic.

 Why do we need buckets?

There are two main reasons for performing bucketing on a partition:

  • A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about cases where your partition key differs from the join key? In these cases, you can perform a map-side join by bucketing the table on the join key.
  • Bucketing makes the sampling process more efficient and therefore allows us to decrease the query time (see the sketch after this list).
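As an illustration of the sampling benefit, TABLESAMPLE can read a single bucket instead of the whole table (reusing the illustrative bucketed_transaction table sketched earlier; the SET flag shown is the classic bucket-map-join switch):

SET hive.optimize.bucketmapjoin = true;   -- enables map-side joins on bucketed tables
SELECT * FROM bucketed_transaction TABLESAMPLE(BUCKET 1 OUT OF 4 ON cust_id) t;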

Scenario:

Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following entries:

id first_name last_name email gender ip_address

1 Hugh Jackman hughjackman@cam.ac.uk Male 136.90.241.52

2 David Lawrence dlawrence1@gmail.com Male 101.177.15.130

3 Andy Hall andyhall2@yahoo.com Female 114.123.153.64

4 Samuel Jackson samjackson231@sun.com Male 89.60.227.31

5 Emily Rose rose.emily4@surveymonkey.com Female 119.92.21.19

How will you consume this CSV file into the Hive warehouse using a built-in SerDe?

SerDe stands for serializer/deserializer. A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive. SerDes are implemented using Java. Hive comes with several built-in SerDes and many other third-party SerDes are also available.

Hive provides a specific SerDe for working with CSV files. We can use this SerDe for the sample.csv by issuing following commands:

CREATE EXTERNAL TABLE sample (
  id INT, first_name STRING,
  last_name STRING, email STRING,
  gender STRING, ip_address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/temp';

Now, we can perform any query on the table ‘sample’:

SELECT first_name FROM sample WHERE gender = 'Male';

How to skip header rows from a table in Hive?

Imagine that header records in a table are as follows:

System=…
Version=…
Sub-version=…

Suppose we do not want to include the above three lines of headers in our Hive query. To skip the header lines from our table in Hive, we will set a table property:

CREATE EXTERNAL TABLE employee (name STRING, job STRING, dob STRING, id INT, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES("skip.header.line.count"="3");

What are the components used in Hive Query Processor?

Following are the components of a Hive Query Processor:

  • Parse and Semantic Analysis (ql/parse)
  • Metadata Layer (ql/metadata)
  • Type Interfaces (ql/typeinfo)
  • Sessions (ql/session)
  • Map/Reduce Execution Engine (ql/exec)
  • Plan Components (ql/plan)
  • Hive Function Framework (ql/udf)
  • Tools (ql/tools)
  • Optimizer (ql/optimizer)

What is the definition of Hive? What is the present version of Hive? Explain ACID transactions in Hive.

Hive is an open-source data warehouse system. We can use Hive for analyzing and querying large datasets using a language similar to SQL. The Hive version referenced here is 0.13.1, which introduced support for ACID (Atomicity, Consistency, Isolation, and Durability) transactions. ACID transactions are provided at the row level through the following operations:

  • Insert
  • Delete
  • Update

What is the maximum size of a string data type supported by Hive? Explain how Hive supports binary formats.

The maximum size of a string data type supported by Hive is 2 GB. Hive supports the text file format by default, and it also supports several binary formats: sequence files, ORC files, Avro data files, and Parquet files.

  • Sequence file: It is a splittable, compressible, and row-oriented file with a general binary format.
  • ORC file: The Optimized Row Columnar (ORC) format file is a record-columnar, column-oriented storage file. It divides the table into row splits, and within each split the values of each column are stored together.
  • Avro data file: Like a sequence file, it is splittable, compressible, and row-oriented, and it additionally supports schema evolution and multi-language bindings.
  • Parquet file: In the Parquet columnar format, column values are stored adjacent to each other, so datasets are partitioned both horizontally and vertically.

Whenever we run a Hive query, a new metastore_db is created. Why?

A local metastore is created when we run Hive in an embedded mode. Before creating, it checks whether the metastore exists or not, and this metastore property is defined in the configuration file, hive-site.xml. The property is:

javax.jdo.option.ConnectionURL

with the default value:

jdbc:derby:;databaseName=metastore_db;create=true

Therefore, we have to change the location to an absolute path so that the same metastore is used from wherever Hive is launched.
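For instance, pointing the property at an absolute path (the path below is hypothetical) keeps a single metastore_db regardless of the directory from which Hive is launched:

jdbc:derby:;databaseName=/home/hive/metastore_db;create=true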

How do we write our own custom SerDe?

Mostly, end users prefer writing a Deserializer instead of a full SerDe because they want to read their own data format rather than write in it; e.g., the RegexDeserializer deserializes data with the help of the configuration parameter ‘regex’ and a list of column names.

If our SerDe supports DDL (i.e., SerDe with parameterized columns and column types), we will probably implement a protocol based on DynamicSerDe, instead of writing a SerDe. This is because the framework passes DDL to SerDe through the ‘Thrift DDL’ format and it’s totally unnecessary to write a “Thrift DDL” parser.

What Are the Different Modes in the Hive?

This may seem like an easy question, but again, sometimes interviewers like to ask these basic questions to see how confident you are when it comes to your Hive knowledge. Answer by saying that Hive can operate in two modes, which are MapReduce mode and local mode. Explain that the choice depends on the size of the DataNodes in Hadoop.

What is Hive Bucketing?

When performing queries on large datasets in Hive, bucketing can offer better structure to Hive tables. You’ll also want to take your answer a step further by explaining some of the specific bucketing features, as well as some of the advantages of bucketing in Hive. For example, bucketing can give programmers more flexibility when it comes to record-keeping and can make it easier to debug large datasets when needed.

What Variations of Tables Are Available in Hive?

This is a fairly straightforward question for someone experienced in Hive, so it’s important to know the answer without hesitation: The two types of tables are managed tables and external tables.

What Are Partitions?

In Hive, tables are organized and divided into partitions. You’ll want to include this in your answer, as well as explain why partitions are useful in Hive.

What File Formats and Applications Does Hive Support?

The answer to this question will include a lot of information, so it’s important to be prepared to list as many supported file formats and applications as possible. Applications written in C++, Python, Java, PHP, and Ruby are generally supported in Hive. When it comes to filing formats, Hive supports text file formats by default but also supports binary file formats, such as Avro data, ORC, Sequence, and Parquet files.

Specify the different types of tables accessible in Hive?

Managed and external tables are the two different kinds of tables in Hive, used to control how information is loaded, managed, and stored.

The two types of tables are:

  • Managed table: A managed table is also known as an internal table. This is the default table type in Hive. When we create a table in Hive without specifying it as external, we get a managed table. A managed table is created in a specific location in HDFS (the Hive warehouse directory).
  • External table: An external table is made for external use, i.e., when the information is also utilized outside Hive. Whenever we need to erase the table’s metadata while keeping the table’s data as it is, we use an external table. Dropping an external table erases just the schema of the table.

What does metastore mean in Hive?

The metastore in Hive stores the metadata information utilizing an RDBMS and an open-source ORM (Object Relational Model) layer called DataNucleus, which converts the object representation into a relational schema.

The Hive metastore comprises two major units:

  • A service that gives other Apache Hive services access to the metastore.
  • Disk storage for the Hive metadata, which is separate from HDFS storage.

What is Hive made out of?

Hive is made out of:

  • Clients
  • Services
  • Storage and Computing

List down the components of a Hive query processor?

The components of a Hive query processor are as follows:

  • Logical Plan Generation
  • Physical Plan Generation
  • UDF’s and UDAF’s
  • Execution Engine
  • Operators
  • Semantic Analyzer
  • Optimizer
  • Type Checking
  • Parser

Say when to pick an “Internal Table” and an “External Table” in Hive?

In Hive, you can pick an internal table:

  • If the data to be processed is available in the local file system.
  • If we need Hive to deal with the entire lifecycle of the data, including its deletion.

You can pick an external table:

  • If the data to be processed is available in HDFS.
  • When the files are being utilized outside of Hive.

What does a Hive variable mean? How can we utilize it?

A Hive variable is created in the Hive environment and can be referenced by Hive scripts. It is utilized to pass some values to Hive queries when the queries begin executing.

What is the difference between HBase and Hive?

Hive vs HBase

  • HBase does not allow execution of SQL queries, whereas Hive allows execution of most SQL queries.
  • HBase runs on top of HDFS, whereas Hive runs on top of Hadoop MapReduce.
  • HBase is a NoSQL database, whereas Hive is a data warehouse framework.
  • HBase supports record-level insert, update, and delete operations, whereas Hive does not.

 Can you list few commonly used Hive services?

  • Command Line Interface (cli)
  • Hive Web Interface (hwi)
  • HiveServer (hiveserver)
  • Printing the contents of an RC file using the tool rcfilecat.
  • Jar
  • Metastore


Explain the difference between partitioning and bucketing.

  • Partitioning and bucketing of tables are done to improve query performance. Partitioning helps execute queries faster only if the partitioning scheme matches common filtering, e.g., by timestamp ranges or by location. Bucketing is not applied by default; it has to be enforced.
  • Partitioning helps eliminate data when used in a WHERE clause. Bucketing helps organize the data inside a partition into multiple files so that the same set of data is always written to the same bucket. Bucketing helps when joining on the bucketed columns.
  • In the partitioning technique, a partition is created for every unique value of the column, so there could be a situation where many tiny partitions have to be created. With bucketing, however, one can limit the data to a specific number of buckets into which it is decomposed.
  • Basically, a bucket is a file in Hive, whereas a partition is a directory.

 Explain about the different types of partitioning in Hive?

Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.

Based on how data is loaded into the table, requirements for data and the format in which data is produced at source- static or dynamic partition can be chosen. In dynamic partitions the complete data in the file is read and is partitioned through a MapReduce job based into the tables based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.

When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.

What are the components of a Hive query processor?

The query processor in Apache Hive converts SQL into a graph of MapReduce jobs, along with the execution-time framework, so that the jobs can be executed in the order of their dependencies. The various components of the query processor are:

  • Parser
  • Semantic Analyser
  • Type Checking
  • Logical Plan Generation
  • Optimizer
  • Physical Plan Generation
  • Execution Engine
  • Operators
  • UDF’s and UDAF’s.

Differentiate between describe and describe extended.

Describe database/schema- This query displays the name of the database, the root location on the file system and comments if any.

Describe extended database/schema- Gives the details of the database or schema in a detailed manner.

Is it possible to overwrite Hadoop MapReduce configuration in Hive?

Yes, the Hadoop MapReduce configuration can be overwritten by changing the Hive conf settings file.

Write a hive query to view all the databases whose name begins with “db”

SHOW DATABASES LIKE 'db.*';

How can you prevent a large job from running for a long time?

This can be achieved by setting MapReduce jobs to execute in strict mode:

set hive.mapred.mode=strict;

Strict mode ensures that queries on partitioned tables cannot execute without a WHERE clause that filters on the partition column.


What is a Hive Metastore?

Hive Metastore is a central repository that stores Hive metadata in an external database.

Are multiline comments supported in Hive?

No

What is Object Inspector functionality?

ObjectInspector is used to analyse the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.

Explain about the different types of join in Hive.

HiveQL has 4 different types of joins –

JOIN – Similar to an inner join in SQL; it returns the rows that satisfy the join condition in both tables.

FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.

LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.

RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.

 How can you configure remote metastore mode in Hive?

To configure a remote metastore in Hive, the hive-site.xml file has to be configured with the below property, whose value gives the IP address (or hostname) and port of the metastore host:

hive.metastore.uris
thrift://node1 (or IP address):9083

The partition of a Hive table has been modified to point to a new directory location. Do I have to move the data to the new location, or will the data be moved automatically?

Changing the partition location will not move the data to the new location. The data has to be moved manually from the old location to the new one.
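A sketch of the two steps (the paths are hypothetical): first repoint the partition, then move the files yourself, e.g. with hdfs dfs -mv:

ALTER TABLE partitioned_transaction PARTITION (month='Dec')
SET LOCATION 'hdfs://namenode_server/new_warehouse/month=Dec';
-- then, outside Hive: hdfs dfs -mv /old_warehouse/month=Dec /new_warehouse/month=Dec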

What will be the output of cast('XYZ' as INT)?

It will return a NULL value.


 What are the different components of a Hive architecture?

Hive Architecture consists of:

  • User Interface – The UI component of the Hive architecture calls the execute interface of the Driver.
  • Driver – Creates a session handle for the query and sends the query to the Compiler to generate an execution plan for it.
  • Metastore – Sends the metadata to the Compiler for the execution of the query on receiving the sendMetaData request.
  • Compiler – Generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
  • Execute Engine – Responsible for submitting each of these stages to the relevant components, managing the dependencies between the various stages in the execution plan generated by the Compiler.

What happens on executing the below query? After executing the below query, if you modify the column, how will the changes be tracked?

hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

The query creates an index named index_bonuspay which points to the bonus column in the employee table. Whenever the value of bonus is modified, the change will be tracked using the index.

What is the default database provided by Hive for Metastore?

Derby is the default database.

Mention Hive default read and write classes?

Hive default read and write classes are

  1. TextInputFormat/HiveIgnoreKeyTextOutputFormat
  2. SequenceFileInputFormat/SequenceFileOutputFormat

Mention what are the different modes of Hive?

The different modes of Hive depend on the size of the data nodes in Hadoop.

These modes are,

  • Local mode
  • Map reduce mode

Why is Hive not suitable for OLTP systems?

Hive is not suitable for OLTP systems because it does not provide row-level insert and update operations.

Mention what is the difference between HBase and Hive?

The differences between HBase and Hive are:

  • Hive enables most SQL queries, but HBase does not allow SQL queries.
  • Hive does not support record-level insert, update, and delete operations on a table, whereas HBase does.
  • Hive is a data warehouse framework, whereas HBase is a NoSQL database.
  • Hive runs on top of MapReduce; HBase runs on top of HDFS.

Explain what is a Hive variable? What for we use it?

Hive variable is created in the Hive environment that can be referenced by Hive scripts. It is used to pass some values to the hive queries when the query starts executing.

Mention what is ObjectInspector functionality in Hive?

ObjectInspector functionality in Hive is used to analyze the internal structure of the columns, rows, and complex objects. It allows access to the internal fields inside the objects.

Mention what is Partitions in Hive?

Hive organizes tables into partitions.

  • It is one of the ways of dividing tables into different parts based on partition keys.
  • Partition is helpful when the table has one or more Partition keys.
  • Partition keys are basic elements for determining how the data is stored in the table.

Mention when to choose “Internal Table” and “External Table” in Hive?

In Hive, you can choose an internal table:

  • If the data to be processed is available in the local file system
  • If we want Hive to manage the complete lifecycle of the data, including its deletion

You can choose an external table:

  • If the data to be processed is available in HDFS
  • When the files are being used outside of Hive

Mention if we can name view same as the name of a Hive table?

No. The name of a view must be unique compared to all other tables and views present in the same database.

Mention what are views in Hive?

In Hive, Views are Similar to tables. They are generated based on the requirements.

  • We can save any result set data as a view in Hive.
  • The usage is similar to views used in SQL.
  • Read-style DML operations (such as SELECT) can be performed on a view; Hive views themselves are read-only.

Explain how Hive deserializes and serializes the data?

Usually, while reading data, Hive first communicates with the InputFormat, which connects to a RecordReader to read the records. The SerDe then deserializes each record into a row, using an ObjectInspector to access the fields. Serialization follows the reverse path: the row is serialized by the SerDe and written out through the OutputFormat.

Mention what is the difference between order by and sort by in Hive?

  • SORT BY will sort the data within each reducer. You can use any number of reducers for a SORT BY operation.
  • ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer.

Explain when to use explode in Hive?

Hadoop developers sometimes take an array as input and convert it into separate table rows. To convert complex data types into desired table formats, Hive uses explode.

Mention how you can stop a partition from being queried?

You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.

So, this brings us to the end of the Apache Hive Interview Questions blog. This Tecklearn ‘Top Apache Hive Interview Questions and Answers’ blog helps you with commonly asked questions if you are looking out for a job in Apache Hive or the Big Data domain. If you wish to learn Apache Hive and build a career in the Big Data domain, then check out our interactive Big Data Hadoop-Architect (All in 1) Combo Training, which comes with 24*7 support to guide you throughout your learning period.

 

https://www.tecklearn.com/course/bigdata-hadoop-architect-all-in-1-combo-course/

 

BigData Hadoop-Architect (All in 1) Combo Training

About the Course

Tecklearn’s BigData Hadoop-Architect (All in 1) combo includes the following Courses:

  • BigData Hadoop Analyst
  • BigData Hadoop Developer
  • BigData Hadoop Administrator
  • BigData Hadoop Tester
  • Big Data Security with Kerberos

Why Should you take BigData Hadoop Combo Training?

  • Average salary for a Hadoop Administrator ranges from approximately $104,528 to $141,391 per annum – Indeed.com
  • Average salary for a Spark and Hadoop Developer ranges from approximately $106,366 to $127,619 per annum – Indeed.com
  • Average salary for a Big Data Hadoop Analyst is $115,819– ZipRecruiter.com

What you will Learn in this Course?

Introduction

  • The Case for Apache Hadoop
  • Why Hadoop?
  • Core Hadoop Components
  • Fundamental Concepts

HDFS

  • HDFS Features
  • Writing and Reading Files
  • NameNode Memory Considerations
  • Overview of HDFS Security
  • Using the Namenode Web UI
  • Using the Hadoop File Shell

Getting Data into HDFS

  • Ingesting Data from External Sources with Flume
  • Ingesting Data from Relational Databases with Sqoop
  • REST Interfaces
  • Best Practices for Importing Data

YARN and MapReduce

  • What Is MapReduce?
  • Basic MapReduce Concepts
  • YARN Cluster Architecture
  • Resource Allocation
  • Failure Recovery
  • Using the YARN Web UI
  • MapReduce Version 1

Planning Your Hadoop Cluster

  • General Planning Considerations
  • Choosing the Right Hardware
  • Network Considerations
  • Configuring Nodes
  • Planning for Cluster Management

Hadoop Installation and Initial Configuration

  • Deployment Types
  • Installing Hadoop
  • Specifying the Hadoop Configuration
  • Performing Initial HDFS Configuration
  • Performing Initial YARN and MapReduce Configuration
  • Hadoop Logging

Installing and Configuring Hive, Impala, and Pig

  • Hive
  • Impala
  • Pig

Hadoop Clients

  • What is a Hadoop Client?
  • Installing and Configuring Hadoop Clients
  • Installing and Configuring Hue
  • Hue Authentication and Authorization

Cloudera Manager

  • The Motivation for Cloudera Manager
  • Cloudera Manager Features
  • Express and Enterprise Versions
  • Cloudera Manager Topology
  • Installing Cloudera Manager
  • Installing Hadoop Using Cloudera Manager
  • Performing Basic Administration Tasks Using Cloudera Manager

Advanced Cluster Configuration

  • Advanced Configuration Parameters
  • Configuring Hadoop Ports
  • Explicitly Including and Excluding Hosts
  • Configuring HDFS for Rack Awareness
  • Configuring HDFS High Availability

Hadoop Security

  • Why Hadoop Security Is Important
  • Hadoop’s Security System Concepts
  • What Kerberos Is and How it Works
  • Securing a Hadoop Cluster with Kerberos

Managing and Scheduling Jobs

  • Managing Running Jobs
  • Scheduling Hadoop Jobs
  • Configuring the Fair Scheduler
  • Impala Query Scheduling

Cluster Maintenance

  • Checking HDFS Status
  • Copying Data Between Clusters
  • Adding and Removing Cluster Nodes
  • Rebalancing the Cluster
  • Cluster Upgrading

Cluster Monitoring and Troubleshooting

  • General System Monitoring
  • Monitoring Hadoop Clusters
  • Common Troubleshooting Hadoop Clusters
  • Common Misconfigurations

Introduction to Pig

  • What Is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig

Basic Data Analysis with Pig

  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly-Used Functions

Processing Complex Data with Pig

  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-In Functions for Complex Data
  • Iterating Grouped Data

Multi-Dataset Operations with Pig

  • Techniques for Combining Data Sets
  • Joining Data Sets in Pig
  • Set Operations
  • Splitting Data Sets

Pig Troubleshooting and Optimization

  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Your Pig Jobs

Introduction to Hive and Impala

  • What Is Hive?
  • What Is Impala?
  • Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Querying with Hive and Impala

  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Differences Between Hive and Impala Query Syntax
  • Using Hue to Execute Queries
  • Using the Impala Shell

Data Management

  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results

Data Storage and Performance

  • Partitioning Tables
  • Choosing a File Format
  • Managing Metadata
  • Controlling Access to Data

Relational Data Analysis with Hive and Impala

  • Joining Datasets
  • Common Built-In Functions
  • Aggregation and Windowing

Working with Impala 

  • How Impala Executes Queries
  • Extending Impala with User-Defined Functions
  • Improving Impala Performance

Analyzing Text and Complex Data with Hive

  • Complex Values in Hive
  • Using Regular Expressions in Hive
  • Sentiment Analysis and N-Grams
  • Conclusion

Hive Optimization 

  • Understanding Query Performance
  • Controlling Job Execution Plan
  • Bucketing
  • Indexing Data

Extending Hive 

  • SerDes
  • Data Transformation with Custom Scripts
  • User-Defined Functions
  • Parameterized Queries

Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2

Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Modelling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching

Data Formats

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression

Data Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive

Capturing Data with Apache Flume

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Working with RDDs in Spark

  • A Closer Look at RDDs
  • Key-Value Pair RDDs
  • MapReduce
  • Other Pair RDD Operations

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging

Parallel Programming with Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Spark Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Preview: Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • Comparing Spark SQL with Impala

Hadoop Testing

  • Hadoop Application Testing
  • Roles and Responsibilities of Hadoop Testing Professional
  • Framework MRUnit for Testing of MapReduce Programs
  • Unit Testing
  • Test Execution
  • Test Plan Strategy and Writing Test Cases for Testing Hadoop Application

Big Data Testing

  • BigData Testing
  • Unit Testing
  • Integration Testing
  • Functional Testing
  • Non-Functional Testing
  • Golden Data Set

System Testing

  • Building and Set up
  • Testing SetUp
  • Solary Server
  • Non-Functional Testing
  • Longevity Testing
  • Volumetric Testing

Security Testing

  • Security Testing
  • Non-Functional Testing
  • Hadoop Cluster
  • Security-Authorization RBA
  • IBM Project

Automation Testing

  • Query Surge Tool

Oozie

  • Why Oozie
  • Installation Engine
  • Oozie Workflow Engine
  • Oozie security
  • Oozie Job Process
  • Oozie terminology
  • Oozie bundle

 

Got a question for us? Please mention it in the comments section and we will get back to you.

 

 
