Top Apache Sqoop Interview Questions and Answers

Last updated on Feb 18 2022
Rajnikanth S

What is Sqoop?

Sqoop is an acronym for SQL-to-Hadoop. It is a command-line interface application used for transferring data between relational databases (MySQL, PostgreSQL, Oracle, SQL Server, DB2) and Hadoop (HDFS, Hive, HBase). Sqoop is robust and benefits from strong community support and contributions.


Mention the best features of Apache Sqoop.

Apache Sqoop is a tool in the Hadoop ecosystem that offers several advantages, such as:

  1. Parallel import/export
  2. Connectors for all major RDBMS Databases
  3. Import results of SQL query
  4. Incremental Load
  5. Full Load
  6. Kerberos Security Integration
  7. Load data directly into Hive / HBase
  8. Compression
  9. Support for Accumulo

What is Sqoop Job?

A Sqoop job is a saved Sqoop command that can be re-executed later. When a saved job is configured to perform an incremental import, state about the most recently imported rows is updated in the saved job after each run, which allows the job to continually import only the newest rows.
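
For illustration only, here is a minimal sketch of creating and running such a saved incremental job; the connect string, table and column names are hypothetical:
sqoop job --create daily_orders -- import --connect jdbc:mysql://localhost/sales --username root -P --table orders --incremental append --check-column order_id --last-value 0
sqoop job --exec daily_orders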

Does Apache Sqoop have a default database?

Yes, MySQL is the default database.

How will you list all the columns of a table using Apache Sqoop?

There is no direct command such as sqoop-list-columns to list all the columns of a table. The indirect way of achieving this is to retrieve the columns of the desired table and redirect them to a file, which can then be viewed manually and contains the column names of that table. For example:
sqoop import -m 1 --connect 'jdbc:sqlserver://nameofmyserver;database=nameofmydatabase;username=DeZyre;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'

If the source data gets updated every now and then, how will you synchronize the data in HDFS that is imported by Sqoop?

We can synchronize the data by using the incremental parameter with the data import.
The incremental parameter can be used with one of two options:
i) append
We should use incremental import with the append option when the table is continuously getting new rows with increasing row id values. Sqoop checks the column specified with --check-column and inserts only the rows whose value is greater than the last imported value.
ii) lastmodified
In this kind of incremental import, the source has a date column which is checked. Any records that have been updated after the last import, based on the lastmodified column in the source, are imported and their values updated.
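
As a rough sketch (the connect string, table and column names below are made up), the two modes look like this:
sqoop import --connect jdbc:mysql://localhost/sales --table orders --incremental append --check-column order_id --last-value 1000 --target-dir /data/orders
sqoop import --connect jdbc:mysql://localhost/sales --table orders --incremental lastmodified --check-column updated_at --last-value "2016-07-20 00:00:00" --merge-key order_id --target-dir /data/orders
In lastmodified mode, --merge-key (or --append) is needed once the target directory already exists.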

Name a few import control commands. How can Sqoop handle large objects?

To import RDBMS data, we use import control commands such as:
--append: append data to an existing dataset in HDFS.
--columns <col,col,...>: columns to import from the table.
--where: WHERE clause to use during import.
The common large objects are BLOB and CLOB. If the object is less than 16 MB, it is stored inline with the rest of the data. Bigger objects are temporarily stored in files under the _lobs subdirectory of the import target directory and are then materialized in memory for processing. If we set the LOB limit to zero (0), all large objects are stored in this external storage.
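
For reference, the inline threshold can be tuned with the --inline-lob-limit import argument (the value is in bytes); the command below is only a sketch with made-up names, and setting the limit to 0 pushes every large object to the external LobFile storage:
sqoop import --connect jdbc:mysql://localhost/sales --table documents --inline-lob-limit 0 --target-dir /data/documents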

How can we import data from a particular row or column? What are the destination types allowed in the Sqoop import command?

Sqoop allows us to export and import the data from a table based on a WHERE clause. The relevant arguments are:
--columns <col1,col2,...>
--where
--query

For example:
sqoop import --connect jdbc:mysql://db.one.com/corp --table TECKLEARN_EMP --where "start_date > '2016-07-20'"
sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM tecklearn_emp LIMIT 20"
sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"
Sqoop supports importing data into the following services:

  1. HDFS
  2. Hive
  3. Hbase
  4. Hcatalog
  5. Accumulo

What is Sqoop Import? Explain its purpose.

When it comes to importing tables from an RDBMS to HDFS, we use the Sqoop import tool. Each row in a table is treated as a record in HDFS. In text files, all records are stored as text data, whereas in Avro and sequence files they are stored as binary data. To be more specific, the import tool imports individual tables from the RDBMS into HDFS.

What is the default file format to import data using Apache Sqoop?

Sqoop allows data to be imported using two file formats:

i) Delimited Text File Format

This is the default file format for importing data with Sqoop. It can be explicitly specified by passing the --as-textfile argument to the import command. Passing this argument produces a string-based representation of all the records in the output files, with delimiter characters between rows and columns.

ii) Sequence File Format

The sequence file format is a binary file format. Its records are stored in custom record-specific data types which are exposed as Java classes. Sqoop automatically creates these data types and manifests them as Java classes.

How can I import large objects (BLOB and CLOB objects) in Apache Sqoop?

Direct import of BLOB and CLOB large objects is not supported by the Apache Sqoop import command. So, in order to import large objects in Sqoop, JDBC-based imports have to be used without the direct argument to the import utility.

How can you execute a free-form SQL query in Sqoop to import the rows in a sequential manner?

By using the –m 1 option in the Sqoop import command we can accomplish it. Basically, it will create only one MapReduce task which will then import rows serially.

When to use –target-dir and when to use –warehouse-dir while importing data?

Basically, we use –target-dir to specify a particular directory in HDFS. Whereas we use –warehouse-dir to specify the parent directory of all the sqoop jobs. So, in this case under the parent directory sqoop will create a directory with the same name as the table.
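
A quick sketch of the difference (the connect string, table and paths are hypothetical):
sqoop import --connect jdbc:mysql://localhost/sales --table orders --target-dir /data/orders
sqoop import --connect jdbc:mysql://localhost/sales --table orders --warehouse-dir /data/sales
The second command writes the imported files under /data/sales/orders.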

What is the process to perform an incremental data load in Sqoop?

In Sqoop, an incremental data load synchronizes the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data is loaded through the incremental options of the Sqoop import command, which append the data into Hive or HDFS without overwriting it. The attributes that need to be specified during an incremental load are:
1) Mode (incremental)
It tells Sqoop how to determine which rows are new. Its value can be append or lastmodified.
2) Col (check-column)
It specifies the column that should be examined to find the rows to be imported.
3) Value (last-value)
It denotes the maximum value of the check column from the previous import operation.

What is the significance of using –compress-codec parameter?

We use the --compress-codec parameter to get the output file of a Sqoop import in formats other than .gz, such as .bz2.
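
In the Sqoop CLI the flag is spelled --compression-codec and takes a Hadoop codec class. A hedged example (the connect string and table are made up) that produces .bz2 output:
sqoop import --connect jdbc:mysql://localhost/sales --table orders --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec --target-dir /data/orders_bz2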

Can free-form SQL queries be used with Sqoop import command? If yes, then how can they be used?

Yes, in Sqoop we can use free-form SQL queries with the import command. We should use the import command with the -e or --query option to execute a free-form SQL query. Note that the --target-dir value must be specified when using the -e or --query option with the import command.

What is the importance of eval tool?

Basically, Sqoop Eval helps to run sample SQL queries against Database as well as preview the results on the console. Moreover, it helps to know what data we can import or that desired data is imported or not.

How can you import only a subset of rows from a table?

In the sqoop import statement, by using the WHERE clause we can import only a subset of rows.

What are the limitations of importing RDBMS tables into Hcatalog directly?

By making use of the --hcatalog-database option with --hcatalog-table, we can import RDBMS tables into HCatalog directly. However, there is one limitation: it does not support several arguments such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir.

What is the advantage of using –password-file rather than -P option while preventing the display of password in the sqoop import statement?

The --password-file option can be used inside a Sqoop script, whereas the -P option reads the password from standard input, which prevents automation.
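
A minimal sketch of the usual workflow, assuming an HDFS home directory of /user/hadoop (all names here are hypothetical):
echo -n "mypassword" > sqoop.pwd
hdfs dfs -put sqoop.pwd /user/hadoop/sqoop.pwd
hdfs dfs -chmod 400 /user/hadoop/sqoop.pwd
sqoop import --connect jdbc:mysql://localhost/sales --username root --password-file /user/hadoop/sqoop.pwd --table orders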

What do you mean by Free Form Import in Sqoop?

Free-form import means that Sqoop can import data from a relational database using any SQL query rather than only using table and column name parameters.

What is the role of JDBC driver in Sqoop?

Sqoop needs a connector to connect to the different relational databases. Almost every database vendor makes this connector available as a JDBC driver that is specific to that database. Hence, Sqoop needs the JDBC driver of each database it has to interact with.

Is JDBC driver enough to connect sqoop to the databases?

No. To connect to a database, Sqoop needs both a JDBC driver and a connector.

What is InputSplit in Hadoop?

An InputSplit is a chunk of the input data: when a Hadoop job runs, it splits the input files into chunks and assigns each split to a mapper for processing.

How can you schedule a sqoop job using Oozie?

However, Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed.

What is the importance of the --split-by clause in running parallel import tasks in Sqoop?

It mentions the column name based on whose values the data will be divided into groups of records. These groups of records are then read in parallel by the MapReduce tasks.

What is a sqoop metastore?

The Sqoop metastore is a tool that hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) stored in this metastore.
In addition, clients must be configured to connect to the metastore, either in sqoop-site.xml or with the --meta-connect argument.
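
As a sketch (the metastore host is an assumption; 16000 is the usual default port), a remote client could create a job against the shared metastore like this:
sqoop job --create nightly_orders --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop -- import --connect jdbc:mysql://db.example.com/sales --table orders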

What is the purpose of sqoop-merge?

The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets.
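
A hedged sketch of a merge invocation (every path, class and key name here is hypothetical; the jar and class come from the codegen step of the earlier import):
sqoop merge --new-data /data/orders_new --onto /data/orders_old --target-dir /data/orders_merged --jar-file orders.jar --class-name orders --merge-key order_id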

How can you see the list of stored jobs in sqoop metastore?

sqoop job –list

Which database the sqoop metastore runs on?

It runs on a shared HSQLDB database instance hosted on the machine where sqoop-metastore is launched.

Where can the metastore database be hosted?

Anywhere, it means we can host metastore database within or outside of the Hadoop cluster.

Give the sqoop command to see the content of the job named myjob?

sqoop job --show myjob

How can you control the mapping between SQL data types and Java types?

We can configure the mapping between SQL data types and Java types by using the --map-column-java property.
For example:
$ sqoop import ... --map-column-java id=String,value=Integer

Is it possible to add a parameter while running a saved job?

Yes, by using the --exec option we can add an argument to a saved job at runtime:
sqoop job --exec jobname -- --newparameter

What is the usefulness of the options file in Sqoop?

The options file in Sqoop lets us specify command-line values in a file and reuse them across Sqoop commands.
For example:
The --connect parameter's value and the --username value can be stored in a file and used again and again with different Sqoop commands.
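
For example, assuming a file named import-opts.txt (a made-up name) with one option or value per line:
import
--connect
jdbc:mysql://localhost/sales
--username
root
it can then be reused as: sqoop --options-file /home/hadoop/import-opts.txt --table orders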

How can you avoid importing tables one-by-one when importing a large number of tables from a database?

Using the command
sqoop import-all-tables
--connect
--username
--password
--exclude-tables table1,table2 ..
Basically, this will import all the tables except the ones mentioned in the exclude-tables clause.

How can you control the number of mappers used by the sqoop command?

To control the number of mappers executed by a sqoop command we use the parameter –num-mappers. Moreover, we should start with choosing a small number of map tasks and then gradually scale up as choosing high number of mappers initially may slow down the performance on the database side.

What is the default extension of the files produced from a sqoop import using the –compress parameter?

.gz

What is the work of Export in Hadoop sqoop?

The export tool transfers data from HDFS to the RDBMS.

Use of Codegen command in Hadoop sqoop?

Basically, Codegen command generates code to interact with database records

Use of Help command in Hadoop sqoop?

Help command in Hadoop sqoop generally list available commands

What is the significance of using –compress-codec parameter?

We use the --compress-codec parameter to get the output file of a Sqoop import in formats other than .gz, such as .bz2.

What is a disadvantage of using –direct parameter for faster data load by sqoop?

The native utilities used by databases to support faster load do not work for binary data formats like SequenceFile.

How will you update the rows that are already exported?

To update existing rows we can use the --update-key parameter. It takes a comma-separated list of columns that uniquely identifies a row. All of these columns are used in the WHERE clause of the generated UPDATE query, and all other table columns are used in the SET part of the query.
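
A hedged example of such an export (the connect string, table, directory and key column are assumptions):
sqoop export --connect jdbc:mysql://localhost/sales --table orders --export-dir /data/orders --update-key order_id --update-mode updateonly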

What are the basic commands in Apache Sqoop and its uses?

The basic commands of Apache Sqoop are:
codegen, create-hive-table, eval, export, help, import, import-all-tables, list-databases, list-tables, version.
Moreover, uses of Apache Sqoop basic commands are:

  1. Codegen- It helps to generate code to interact with database records.
  2. Create-hive-table- It helps to import a table definition into Hive
  3. Eval- It helps to evaluate a SQL statement and display the results
  4. Export- It helps to export an HDFS directory into a database table
  5. Help- It helps to list the available commands
  6. Import- It helps to import a table from a database to HDFS
  7. Import-all-tables- It helps to import tables from a database to HDFS
  8. List-databases- It helps to list available databases on a server
  9. List-tables- It helps to list tables in a database
  10. Version- It helps to display the version information

How did the word Sqoop come about? Which type of tool is Sqoop, and what is the main use of Sqoop?

Sqoop word came from SQL+HADOOP=SQOOP.
Basically, it is a data transfer tool. We use Sqoop to import and export a large amount of data from RDBMS to HDFS and vice versa.

What is Sqoop Validation?

Sqoop validation means validating the data copied, either by import or by export, by comparing the row counts from the source and the target after the copy. We use this option to compare the row counts between the source and the target just after the data is imported into HDFS. If rows are deleted or added during the import, Sqoop tracks this change and updates the log file.
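
For reference, validation can be switched on with the --validate flag; a minimal sketch with made-up names:
sqoop import --connect jdbc:mysql://localhost/sales --table orders --target-dir /data/orders --validate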

What is the Purpose of Validation in Sqoop?

The main purpose of validation in Sqoop is to validate the data copied, for either import or export, by comparing the row counts from the source and the target after the copy.

What is Sqoop Import Mainframe Tool and its Purpose?

The import-mainframe tool is used to import all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to a directory on open systems. In such a dataset, the records can only contain character data, and each record is stored as a single text field holding the entire record.
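
A rough sketch of the tool (the mainframe host and dataset name are hypothetical):
sqoop import-mainframe --connect z390.example.com --dataset SALES.ORDERS --username mfuser -P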

What is the purpose of Sqoop List Tables?

The main purpose of sqoop-list-tables is to list the tables present in a database.
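
For example (the connect string and credentials are assumptions):
sqoop list-tables --connect jdbc:mysql://localhost/sales --username root -P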

Difference Between Apache Sqoop vs Flume.

So, let’s discuss all the differences on the basis of features.

  1. Data Flow
    Apache Sqoop – Sqoop works with any type of relational database system (RDBMS) that has basic JDBC connectivity, and it can also import data from NoSQL databases like MongoDB and Cassandra. It transfers the data to Apache Hive or HDFS.
    Apache Flume – Flume works with streaming data sources that are generated continuously in Hadoop environments, such as log files.
  2. Type of Loading
    Apache Sqoop – Sqoop data loading is not driven by events.
    Apache Flume – Here, data loading is completely event-driven.
  3. When to use
    Apache Sqoop – If the data is available in Teradata, Oracle, MySQL, PostgreSQL or any other JDBC-compatible database, Sqoop is considered an ideal fit.
    Apache Flume – When moving bulk streaming data from sources like JMS or spooling directories, Flume is the best choice.
  4. Link to HDFS
    Apache Sqoop – In Apache Sqoop, HDFS is the destination for importing data.
    Apache Flume – In Apache Flume, data generally flows to HDFS through channels.
  5. Architecture
    Apache Sqoop – It has a connector-based architecture, which means the connectors know how to connect to the various data sources and fetch the data accordingly.
    Apache Flume – It has an agent-based architecture, which means the code written in Flume is called an agent, and it is responsible for fetching the data.

Shed light on the versatile features of Sqoop

 It is important to note that Apache Sqoop is also known as a tool in the Hadoop ecosystem which carries with it several benefits. Here is the list of them.

  • Import and export in a parallel manner
  • It supports Accumulo
  • Compression of data
  • Full load taking capabilities
  • Incremental load bearing capabilities
  • Security Integration in a proper way
  • Can connect to a majority of RDBMS databases
  • It can import the results of SQL queries

How can you import large objects like BLOB and CLOB in Sqoop?

 The direct import of CLOB and BLOB objects is not supported by Sqoop. Hence, if you have to import large objects, you can use JDBC-based imports. This is done without using the direct argument of the import utility.

What is the default database of Apache Sqoop?

 The default database of Apache Sqoop is MySQL.

Describe the process of executing a free-form SQL query to import rows

 To import the rows from a free-form SQL query in a sequential manner, you have to use the -m 1 option. This creates only one MapReduce task, which then imports the rows serially.

Describe the importance of using –compress-codec parameter

 The --compress-codec parameter can be used to get the output file of a Sqoop import in compressed formats other than the default .gz, for example .bz2.

What is the significance of Eval tool?

 Sqoop Eval would help you to make use of the sample SQL queries. This can be against the database as it can preview the results that are displayed on the console. Interestingly, with the help of the Eval tool, you would be well aware of the fact that the desired data can be imported correctly or not.

What is the meaning of Free form import in Sqoop?

 With the use of free-form import, Sqoop can import the results of any relational database query rather than only using column and table name parameters.

Shed light on the advantage of utilizing –password-file rather than –P option

 The --password-file option is usually used inside a Sqoop script file. On the other hand, the -P option reads the password from standard input, which prevents automation.

Is the JDBC driver fully capable to connect Sqoop on the databases?

 The JDBC driver is not capable to connect Sqoop on the databases. This is the reason that Sqoop requires both the connector and JDBC driver.

What is the meaning of Input Split in Hadoop?

 An InputSplit is the function associated with splitting the input files into chunks; each split is then assigned to a mapper for processing.

Illustrate the utility of the Help Command in Sqoop

 The help command in Sqoop can be utilized to list the various available commands.

Shed light on the service of Codegen command in Sqoop

 The Codegen command is associated with the generation of code so that it can appropriately interact with the database records.

Describe the procedure involved in executing an incremental data load in Sqoop

 You should be well aware of the fact that in Sqoop, an incremental data load updates only the newly modified or added data, often referred to as delta data. In Sqoop, this delta data can be loaded with the incremental options of the import command, which load the data into Hive or HDFS without overwriting it, so efficiency is maintained in a significant manner.

It is also essential for you to illustrate the attributes of an incremental data load. They are as follows:
Incremental mode: This determines how Sqoop identifies the new rows. Its value can be append or lastmodified.

Value: This denotes the maximum value that is derived from the check column in the previous import operation.

The check column feature: This function is helpful in specifying the column that should be examined to determine the rows to be imported.

Illustrate on the process of listing all the columns of a table with the help of Apache Sqoop

 To list all the columns, you do not have any direct command like sqoop-list-columns. However, you can achieve this indirectly. You can do that by retrieving the columns of the desired table and then redirecting them to a file that can be viewed manually. This file then contains the column names of that particular table.

What is the default file format in order to import data with the utilization of Apache Sqoop?

 At the time of answering this question, you should know that there are two file formats that can be used when importing data. These are as follows:

Sequence file format

A sequence file format is a binary file format. The records of these binary files are stored in custom record-specific data types. Moreover, Sqoop automatically creates these data types and manifests them as Java classes.

Delimited text file format

This is the default file format for importing data. To use it explicitly with the import command in Sqoop, you can specify the --as-textfile argument. Passing this argument produces a string-based representation of all the records in the output files, with delimiter characters between columns and rows.

List all the basic commands in Apache Sqoop along with their applications

 The basic commands in Apache Sqoop along with their uses are:

  1. Export: This function helps to export an HDFS directory into a database table.
  2. List tables: This function helps the user to list all the tables in a particular database.
  3. Codegen: This function helps you to generate code to interact with database records.
  4. Create-hive-table: This function allows a user to import a table definition into Hive.
  5. Eval: This function helps you to evaluate a SQL statement and display the results.
  6. Version: This function helps you to display the version information.
  7. Import-all-tables: This function helps a user to import all the tables from a database to HDFS.
  8. List-databases: This function helps a user to list the available databases on a particular server.

What is the meaning of Sqoop Validation?

 It refers to the manner in which the copied data is validated, for either an export or an import. It is done with a basic comparison of the row counts between the source and the target after the copy, so you can make sure the two match. If rows are deleted or added during the import, Sqoop keeps a tab on the changes that have been effected.

Give a basic introduction to Sqoop

 When it comes to transferring data between relational database servers and Hadoop, you should know that Sqoop is one of the best tools. To be more specific, you use it for importing data from relational databases such as MySQL or Oracle into Hadoop (HDFS, Hive, HBase), and you also have the option to export data from the Hadoop file system back to a relational database. This functionality is provided by the Apache Software Foundation.

It is also important to mention that Sqoop provides two main tools, Sqoop export and Sqoop import. With the help of these two tools, you can move data between Hadoop and varied types of databases.

What are the limitations of importing the RDBMS tables into the Hcatalog directly?

 In order to import tables into HCatalog directly, you have to use the --hcatalog-database option together with --hcatalog-table. However, this path has a limitation: it does not support a number of arguments such as --direct, --as-avrodatafile, --as-sequencefile, --target-dir and --export-dir.

Shed light on the procedure of updating the rows that have been directly exported

 In order to update the existing rows that have been exported, you have to use the --update-key parameter. It takes a comma-separated list of columns that identify a row uniquely. These columns are used in the WHERE clause of the generated UPDATE query, while all the other table columns are used in the SET portion of the generated query.

What is the significance of the Sqoop Import Mainframe tool? 

 The Sqoop import-mainframe tool is used to import all the sequential datasets which lie in a partitioned dataset (PDS) on a mainframe into HDFS. A PDS is akin to a directory on open systems. In such a dataset, each record is stored as a single text field containing the entire record.

Define Sqoop meta store

 It is a shared metadata repository with the help of which local and remote users can define and execute saved jobs. In order to connect to the metastore, you have to configure it in sqoop-site.xml or pass the --meta-connect argument.

Does Sqoop uses the maps reduce function? If it does then shed light on the reasons

 Apache Sqoop uses the MapReduce framework of Hadoop to obtain data from relational databases in parallel. During the process of importing data, Sqoop controls the mappers and their number so that the RDBMS is not overwhelmed by too many concurrent connections. Hence, it can be said that with the help of Sqoop, big data can be efficiently managed.

Describe the practicality of opting for Sqoop nowadays

 Apache Sqoop is regarded as an excellent help for those individuals who face challenges in transferring data out of a data warehouse. It is used for importing data from an RDBMS to HDFS, and users can import more than one table at a time. Interestingly, with Apache Sqoop, selected columns of data can also be easily exported. Furthermore, Sqoop is compatible with a majority of JDBC databases.

Name a few import control commands. How can Sqoop handle large objects?

Import control commands are used to import RDBMS data

Append: Append data to an existing dataset in HDFS. (--append)

Columns: columns to import from the table. (--columns <col,col,...>)

Where: WHERE clause to use during import. (--where)

The common large objects are BLOB and CLOB. If the object is less than 16 MB, it is stored inline with the rest of the data. Bigger objects are temporarily stored in files under the _lobs subdirectory of the import target directory and are then materialized in memory for processing. If we set the LOB limit to zero (0), all large objects are stored in this external storage.

How can we import data from a particular row or column? What are the destination types allowed in the Sqoop import command?

Sqoop allows to Export and Import the data from the data table based on the where clause. The syntax is

–columns

<col1,col2……> –where

–query

Example:

sqoop import --connect jdbc:mysql://db.one.com/corp --table TECKLEARN_EMP --where "start_date > '2016-07-20'"

sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM tecklearn_emp LIMIT 20"

sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"

Sqoop supports data imported into following services:

HDFS

Hive

Hbase

Hcatalog

Accumulo

Role of JDBC driver in sqoop setup? Is the JDBC driver enough to connect the sqoop to the database?

Sqoop needs a connector to connect to the different relational databases. Almost all database vendors make a JDBC connector available that is specific to that database, so Sqoop needs the JDBC driver of the database for interaction.
No, the JDBC driver alone is not enough; Sqoop needs both the JDBC driver and a connector to connect to a database.

Using Sqoop command how can we control the number of mappers?

We can control the number of mappers by passing the --num-mappers parameter in the Sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and then scale up gradually, as starting with a high number of mappers may slow down performance on the database side.

Syntax: -m, --num-mappers

How will you update the rows that are already exported? Write sqoop command to show all the databases in MySQL server.

By using the --update-key parameter we can update existing rows. It takes a comma-separated list of columns that uniquely identify a row. All of these columns are used in the WHERE clause of the generated UPDATE query, and all other table columns are used in the SET part of the query.
The command below is used to show all the databases in the MySQL server.

$ sqoop list-databases --connect jdbc:mysql://database.test.com/

Define Sqoop metastore? What is the purpose of Sqoop-merge?

The Sqoop metastore is a tool for hosting a shared metadata repository. Multiple users and remote users can define and execute saved jobs defined in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the

--meta-connect argument.

The purpose of sqoop-merge is:
This tool combines two datasets where entries in one dataset overwrite entries of an older dataset, preserving only the newest version of the records between both datasets.

Explain the saved job process in Sqoop.

Sqoop allows us to define saved jobs which make this process simple. A saved job records the configuration information required to execute a Sqoop command at a later time. sqoop-job tool describes how to create and work with saved jobs. Job descriptions are saved to a private repository stored in $HOME/.sqoop/.

We can configure Sqoop to instead use a shared metastore, which makes saved jobs offered to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.

How did the word Sqoop come about? Which type of tool is Sqoop, and what is the main use of Sqoop?

Sqoop word came from SQL+HADOOP=SQOOP. And Sqoop is a data transfer tool.
The main use of Sqoop is to import and export the large amount of data from RDBMS to HDFS and vice versa.

How to enter into Mysql prompt, and explain the command parameter indicates?

The command for entering the MySQL prompt is "mysql -u root -p"
-u indicates the user
root indicates the username
-p indicates the password.

I am getting connection failure exception during connecting to Mysql through Sqoop, what is the root cause and fix for this error scenario?

This happens when there is a lack of permissions to access the MySQL database over the network. We can try the command below to confirm the connection to the MySQL database from the Sqoop client machine.
$ mysql --host=<MySQL node> --database=test --user= --password=
We can grant the permissions with the commands below.

mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'localhost';

mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'localhost';

I am getting java.lang.IllegalArgumentException during importing tables from an Oracle database. What might be the root cause and fix for this error scenario?

Sqoop is case-sensitive about table names and user names.
Specifying the above two values in UPPER case will resolve the issue.
In case the source table is created under a different user namespace, the table name should be given as USERNAME.TABLENAME, as shown below:
sqoop import
--connect jdbc:oracle:thin:@tecklearn.testing.com/TECKLEARN
--username SQOOP
--password sqoop
--table COMPANY.EMPLOYEES

How can you list all the columns of a table using Apache sqoop?

There is no direct command like sqoop-list-columns to list all the columns of a table in Apache Sqoop, so we first retrieve the columns of the particular table and redirect them to a file containing the column names of that table. The syntax is:

sqoop import -m 1 --connect 'jdbc:sqlserver://servername;database=databasename;username=DeZyre;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'

What are the basic commands in Hadoop Sqoop and its uses?

The basic commands of Hadoop Sqoop are

  • Codegen, Create-hive-table, Eval, Export, Help, Import, Import-all-tables, List-databases, List-tables, Version.
  • Uses of the Hadoop Sqoop basic commands:
  • Codegen- It helps to generate code to interact with database records.
  • Create-hive-table- It helps to import a table definition into Hive.
  • Eval- It helps to evaluate a SQL statement and display the results.
  • Export- It helps to export an HDFS directory into a database table.
  • Help- It helps to list the available commands.
  • Import- It helps to import a table from a database to HDFS.
  • Import-all-tables- It helps to import tables from a database to HDFS.
  • List-databases- It helps to list the available databases on a server.
  • List-tables- It helps to list the tables in a database.
  • Version- It helps to display the version information.

Is sqoop same as to distcp in hadoop?

No. The distcp import command is similar to the Sqoop import command in that both submit parallel map-only jobs, but the two commands' functions are different: distcp is used to copy any type of file from the local filesystem to HDFS, whereas Sqoop is used for transferring data records between an RDBMS and Hadoop ecosystem services.

For each sqoop copying into HDFS how many MapReduce jobs and tasks will be submitted?

For each Sqoop copy into HDFS, one map-only MapReduce job is submitted with 4 map tasks by default, and no reduce tasks are scheduled.

How can Sqoop be used in Java programs?

In the Java code, the Sqoop jar is included in the classpath. The required parameters are created programmatically, just as for the CLI (command line interface), and then the Sqoop.runTool() method is invoked.

I am having around 500 tables in a database. I want to import all the tables from the database except the tables named Table 498, Table 323, and Table 199. How can we do this without having to import the tables one by one?

This can be accomplished using the import-all-tables command in Sqoop and specifying the exclude-tables option with it as follows:
sqoop import-all-tables
--connect --username --password --exclude-tables Table498,Table323,Table199

What is Sqoop?

Sqoop is an open-source tool that enables users to transfer bulk data between Hadoop eco system and relational databases.

What are the relational databases supported in Sqoop?

Below are the list of RDBMSs that are supported by Sqoop Currently.

MySQL

PostGreSQL

Oracle

Microsoft SQL

IBM’s Netezza

Teradata

What are the destination types allowed in Sqoop Import command?

Currently Sqoop Supports data imported into below services.

HDFS

Hive

HBase

HCatalog

Accumulo

When Importing tables from MySQL to what are the precautions that needs to be taken care w.r.t to access?

In MySQL, we need to make sure that we have granted all privileges on the databases, that needs to be accessed, should be given to all users at destination hostname. If Sqoop is being run under localhost and MySQL is also present on the same then we can grant the permissions with below two commands from MySQL shell logged in with ROOT user.

$ mysql -u root -p

mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'localhost';

mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'localhost';

What if my MySQL server is running on MachineA and Sqoop is running on MachineB for the above question?

From MachineA login to MySQL shell and perform the below command as root user. If using hostname of second machine, then that should be added to /etc/hosts file of first machine.

$ mysql -u root -p

mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'MachineB hostname or Ip address';

mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'MachineB hostname or Ip address';

How Many Mapreduce jobs and Tasks will be submitted for Sqoop copying into HDFS?

For each sqoop copying into HDFS only one mapreduce job will be submitted with 4 map tasks. There will not be any reduce tasks scheduled.

How can we control the parallel copying of RDBMS tables into hadoop ?

We can control/increase/decrease speed of copying by configuring the number of map tasks to be run for each sqoop copying process. We can do this by providing argument -m 10 or  –num-mappers 10 argument to sqoop import command. If we specify -m 10 then it will submit 10 map tasks parallel at a time. Based on our requirement we can increase/decrease this number to control the copy speed.

What are the criteria for specifying parallel copying in Sqoop with multiple parallel map tasks?

To use multiple mappers in Sqoop, RDBMS table must have one primary key column (if present) in a table and the same will be used as split-by column in Sqoop process. If primary key is not present, we need to provide any unique key column or set of columns to form unique values and these should be provided to -split-by column argument.

While loading tables from MySQL into HDFS, if we need to copy tables with maximum possible speed, what can you do ?

We need to use –direct argument in import command to use direct import fast path and this –direct can be used only with MySQL and PostGreSQL as of now.

Is Sqoop similar to distcp in hadoop?

Partially yes, hadoop’s distcp command is similar to Sqoop Import command. Both submits parallel map-only jobs but distcp is used to copy any type of files from Local FS/HDFS to HDFS and Sqoop is for transferring the data records only between RDMBS and Hadoop eco system services, HDFS, Hive and HBase.

What are the majorly used commands in Sqoop?

In Sqoop Majorly Import and export commands are used. But below commands are also useful some times.

codegen

eval

import-all-tables

job

list-databases

list-tables

merge

metastore

What is the example connect string for Oracle database to import tables into HDFS?

We need to use Oracle JDBC Thin driver while connecting to Oracle database via Sqoop. Below is the sample import command to pull table employees from oracle database testdb.

sqoop import \

--connect jdbc:oracle:thin:@oracle.example.com/testdb \

--username SQOOP \

--password sqoop \

--table employees

While connecting to MySQL through Sqoop, I am getting Connection Failure exception what might be the root cause and fix for this error scenario?

This might be due to insufficient permissions to access your MySQL database over the network. To confirm this we can try the below command to connect to MySQL database from Sqoop’s client machine.

$ mysql --host=<MySQL node> --database=test --user= --password=

If this is the case then we need grant permissions user @ sqoop client machine as per the answer to Question 6 in this post.

While importing tables from Oracle database, Sometimes I am getting java.lang.IllegalArgumentException: Attempted to generate class with no columns! or NullPointerException what might be the root cause and fix for this error scenario?

While dealing with Oracle database from Sqoop, Case sensitivity of table names and user names matters highly. Most probably by specifying these two values in UPPER case will solve the issue unless actual names are mixed with Lower/Upper cases. If these are mixed, then we need to provide them within double quotes.

In case, the source table is created under different user namespace, then we need to provide table name as USERNAME.TABLENAME as shown below.

sqoop import \

--connect jdbc:oracle:thin:@oracle.example.com/ORACLE \

--username SQOOP \

--password sqoop \

--table SIVA.EMPLOYEES

What is Sqoop metastore?

 Sqoop metastore is a shared metadata repository for remote users to define and execute saved jobs created using sqoop job defined in the metastore. The sqoop –site.xml should be configured to connect to the metastore.

What are the two file formats supported by sqoop for import?

 Delimited text and Sequence Files.

What is the difference between Sqoop and DistCP command in Hadoop?

 Both distCP (Distributed Copy in Hadoop) and Sqoop transfer data in parallel but the only difference is that distCP command can transfer any kind of data from one Hadoop cluster to another whereas Sqoop transfers data between RDBMS and other components in the Hadoop ecosystem like HBase, Hive, HDFS, etc.

What do you mean by Free Form Import in Sqoop?

 Sqoop can import data from a relational database using any SQL query rather than only using table and column name parameters.

Does Apache Sqoop have a default database?

 Yes, MySQL is the default database

How can you execute a free form SQL query in Sqoop to import the rows in a sequential manner?

 This can be accomplished using the –m 1 option in the Sqoop import command. It will create only one MapReduce task which will then import rows serially.

I have around 300 tables in a database. I want to import all the tables from the database except the tables named Table298, Table 123, and Table299. How can I do this without having to import the tables one by one?

 This can be accomplished using the import-all-tables import command in Sqoop and by specifying the exclude-tables option with it as follows-

sqoop import-all-tables

–connect –username –password –exclude-tables Table298, Table 123, Table 299

How can I import large objects (BLOB and CLOB objects) in Apache Sqoop?

 The Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To import large objects in Sqoop, JDBC-based imports have to be used without the direct argument to the import utility.

How will you list all the columns of a table using Apache Sqoop?

 Unlike sqoop-list-tables and sqoop-list-databases, there is no direct command like sqoop-list-columns to list all the columns. The indirect way of achieving this is to retrieve the columns of the desired tables and redirect them to a file which can be viewed manually containing the column names of a particular table.

sqoop import -m 1 --connect 'jdbc:sqlserver://nameofmyserver;database=nameofmydatabase;username=DeZyre;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'

The incoming value from HDFS for a particular column is NULL. How will you load that row into RDBMS in which the columns are defined as NOT NULL?

 Using the --input-null-string parameter, a default value can be specified so that the row gets inserted with the default value for the column that has a NULL value in HDFS.
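
For reference, Sqoop export exposes --input-null-string and --input-null-non-string to declare which marker string in the HDFS files should be treated as SQL NULL; the paths and marker below are only an illustrative assumption:
sqoop export --connect jdbc:mysql://localhost/sales --table orders --export-dir /data/orders --input-null-string '\\N' --input-null-non-string '\\N'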

What is the significance of using –split-by clause for running parallel import tasks in Apache Sqoop?

 The --split-by clause is used to specify the column of the table that is used to generate splits for data imports. It determines how the work is split when importing the data into the Hadoop cluster, and it helps achieve improved performance through greater parallelism. Apache Sqoop creates splits based on the values present in the column specified in the --split-by clause of the import command. If the --split-by clause is not specified, the primary key of the table is used to create the splits during data import. At times the primary key of the table might not have evenly distributed values between the minimum and maximum range. Under such circumstances the --split-by clause can be used to specify some other column that has an even distribution of data, so that the splits are balanced and the data import is efficient.
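
A hedged example of forcing the split column (all names are hypothetical):
sqoop import --connect jdbc:mysql://localhost/sales --table orders --split-by customer_id --num-mappers 4 --target-dir /data/orders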

What is the default file format to import data using Apache Sqoop?

 Sqoop allows data to be imported using two file formats

  • Delimited Text File Format

This is the default file format to import data using Sqoop. This file format can be explicitly specified using the –as-textfile argument to the import command in Sqoop. Passing this as an argument to the command will produce the string based representation of all the records to the output files with the delimited characters between rows and columns.

  • Sequence File Format

It is a binary file format where records are stored in custom record-specific data types which are shown as Java classes. Sqoop automatically creates these data types and manifests them as java classes.

You successfully imported a table using Apache Sqoop to HBase but when you query the table it is found that the number of rows is less than expected. What could be the likely reason?

 If the imported records have rows that contain null values for all the columns, then probably those records might have been dropped off during import because HBase does not allow null values in all the columns of a record.

Explain the significance of using –split-by clause in Apache Sqoop?

 The --split-by clause is used to specify the column of the table that is used to generate splits for data imports when importing the data into the Hadoop cluster. It helps to improve performance through greater parallelism, and it also lets you pick a column with an even distribution of data so that the splits created for the import are balanced.

How can you see the list of stored jobs in sqoop metastore?

 sqoop job –list

Give a sqoop command to import data from all tables in the MySql DB DB1.

 sqoop import-all-tables --connect jdbc:mysql://host/DB1

Where can the metastore database be hosted?

 The metastore database can be hosted anywhere within or outside of the Hadoop cluster.

If the source data gets updated every now and then, how will you synchronise the data in HDFS that is imported by Sqoop?

 Data can be synchronized using the incremental parameter with the data import.

The incremental parameter can be used with one of two options:

  1. append - If the table is getting updated continuously with new rows and increasing row id values, then incremental import with the append option should be used. Sqoop checks the column specified with --check-column and imports only the rows whose value is greater than the last imported value.
  2. lastmodified - In this kind of incremental import, the source has a date column which is checked. Any records that have been updated after the last import, based on the lastmodified column in the source, are imported and their values updated.

How can Sqoop be used in Java programs?

 In the Java code, the Sqoop jar is included in the classpath. The required parameters are created programmatically, just as for the CLI (command line interface), and then the Sqoop.runTool() method is invoked.

The command below is used to specify the connect string, which contains localhost as the hostname to connect to MySQL and test_db as the database name:

--connect jdbc:mysql://localhost/test_db

Is the above command the best way to specify the connect string in case I want to use Apache Sqoop with a distributed hadoop cluster?

When using Sqoop with a distributed Hadoop cluster the URL should not be specified with localhost in the connect string because the connect string will be applied on all the DataNodes with the Hadoop cluster. So, if the literal name localhost is mentioned instead of the IP address or the complete hostname then each node will connect to a different database on their localhosts. It is always suggested to specify the hostname that can be seen by all remote nodes.

I am having around 500 tables in a database. I want to import all the tables from the database except the tables named Table498, Table 323, and Table199. How can we do this without having to import the tables one by one?

 This can be accomplished using the import-all-tables command in Sqoop and specifying the exclude-tables option with it as follows:
sqoop import-all-tables
--connect --username --password --exclude-tables Table498,Table323,Table199

You use –split-by clause but it still does not give optimal performance then how will you improve the performance further.

 Use the --boundary-query clause. Generally, Sqoop uses the query select min(<split column>), max(<split column>) from <table> to find out the boundary values for creating splits. However, if this query is not optimal, then with the --boundary-query argument any custom query that returns two numeric columns can be supplied to generate the boundary values.

During sqoop import, you use the clause –m or –numb-mappers to specify the number of mappers as 8 so that it can run eight parallel MapReduce tasks, however, sqoop runs only four parallel MapReduce tasks. Why?

 Hadoop MapReduce cluster is configured to run a maximum of 4 parallel MapReduce tasks and the sqoop import can be configured with number of parallel tasks less than or equal to 4 but not more than 4.

What is sqoop used for?

Sqoop is primarily used for conveying large amounts of data between relational databases and the Hadoop ecosystem. An entire database or individual tables can be imported into the ecosystem (HDFS) and, after modification, exported back to the database. Sqoop also supports multiple loads into one database table.

What is sqoop eval?

The eval tool in Sqoop is used for a sample preview of the data to be imported. It permits the user to run sample RDBMS queries and examine the results on the console, so we know what the output will be and what kind of data will be imported.

Does sqoop use MapReduce?

Yes, MapReduce is used in Sqoop to import and export the data in parallel between the database and the Hadoop file system.

It also provides fault tolerance.

What is accumulo in sqoop?

Accumulo, which Sqoop can load data into, is a sorted, distributed key/value store. It provides robust, extensible data storage and retrieval, it is stable, and it has its own security for keys and values. It can store, retrieve and manage large amounts of data on top of HDFS.

What is default file format in Sqoop?

The default file type is a text file format. It is the same as specifying –as-textfile clause to sqoop import command.

What is sqoop Metastore?

The metastore is a tool used to share saved jobs in a central repository. Multiple users can create a job and load some data; remote users can then access that job and run it again. The metastore in Sqoop is the central place for job information.

It helps users and developers to collaborate.

What is sqoop import?

Sqoop import helps to import table data into the Hadoop file system in the form of text or binary files.

The syntax of sqoop import is below.

$sqoop import (generic-args) (import-args)

$sqoop-import (generic-args) (import-args)

$ sqoop import --connect jdbc:mysql://localhost/database --username user_name --table table_name -m <num-mappers> --target-dir /table_name_imported

Explain relaxed isolation in sqoop?

--relaxed-isolation is an argument of sqoop import. By default, Sqoop transfers committed data from the relational database to the Hadoop file system; with this argument, the mappers read data using the read-uncommitted isolation level, so uncommitted data can also be transferred.

How are large objects handled in Sqoop?

If the object is less than 16 MB, it is stored inline with the other common-sized data. The large objects are CLOBs and BLOBs; they are handled by importing each large object into a LobFile, meaning a large object file. The LobFile is an external storage file that can store records of large objects.

List some sqoop commands you have used?

In the sqoop mostly import and export command are used. Apart from these two other commands are used.

This list is below

  • codegen
  • eval
  • import-all-tables
  • job, list-database
  • list-tables
  • Merge
  • metastore

How to check sqoop version in hortonworks?

The following command is used to know the version in hortonworks.

# ssh username@127.0.0.1 -p 2222

Enter password: hadoop

[username@sandbox ~]# sqoop version

What is the use of split by in sqoop?

Sqoop imports and exports data in parallel, so the data can be split into multiple chunks to transfer. --split-by in Sqoop selects the column (such as an id number) of the table used to split the data.

sqoop import --connect jdbc:mysql:// localhost/database_name --username user_name --password 1234
--query 'select * from table_name where id=3 AND $CONDITIONS' --split-by table.id_no --target-dir /dir

--split-by helps to distribute the data properly when creating the splits.

Why do we use $conditions in sqoop?

$CONDITIONS works together with the split: Sqoop replaces it with the predicate that decides which slice of data each map task transfers. With a single mapper the whole query runs in one task; otherwise each mapper gets its own range, so the data is transferred without overlap.

sqoop import --connect jdbc:mysql://localhost/database_name --username user_name --password 1234
--query 'select * from table_name where id=3 AND $CONDITIONS' --split-by table.id_no -m 1 --target-dir /dir

How to do job scheduling in sqoop?

Sqoop jobs can be scheduled through Oozie workflows: Oozie has an in-built Sqoop action inside which the Sqoop commands are executed. A saved job can also be created with sqoop job --create job_name.

List some major features of Apache Sqoop?

The following list of apache sqoop features.

  • Compression
  • Connectors for all major RDBMS databases
  • Kerberos security integration
  • Full load
  • Incremental load
  • Import results of database query
  • Load data directly into hive and HBase
  • Parallel import and export connection
  • Support for accumulo

What is use of Sqoop Export?

Transferring data from HDFS to an RDBMS (relational database) is called Sqoop export. Before transferring the data, the Sqoop tool fetches the table metadata from the database, therefore the target table must already exist in the database.

The syntax of export is below.

$ sqoop export (generic-args) (export-args)

$ sqoop-export (generic-args) (export-args)
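
For example, a minimal export sketch (connection, table, and directory names are placeholders):

sqoop export --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --export-dir /data/orders --input-fields-terminated-by ','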

What is –direct mode in sqoop?

By default Sqoop transfers data through JDBC. The --direct argument tells Sqoop to use the database's native bulk utilities instead (for example mysqldump for MySQL), which is usually faster. It is available only for databases whose connectors support direct mode, and it does not work with binary formats such as SequenceFile.
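
A minimal sketch with placeholder names, assuming the MySQL connector (which supports direct mode via mysqldump):

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --direct --target-dir /data/orders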

What are reducers in sqoop?

In MapReduce, a reducer is used for accumulation or aggregation of the mapper output. Sqoop launches map-only jobs: each mapper transfers its own slice of data directly between the database and Hadoop, so no aggregation step is needed and Sqoop uses no reducers.

What is boundary query in sqoop?

The boundary query is used to decide how the values of the split column of a database table are divided.

By default, Sqoop runs SELECT MIN(col), MAX(col) on the split-by column; with a boundary query we supply our own query that returns the minimum and maximum values, and Sqoop uses that range to create the splits when importing data from the database to HDFS.

Example:

--boundary-query "SELECT min(id_value), max(id_value) FROM table_name"

Compare Sqoop and Flume

Sqoop is used for bulk transfer of structured data between relational databases and Hadoop, and the transfer is driven by the source table's schema. Flume is used to collect and move streaming data such as logs and events into Hadoop through agents made up of sources, channels, and sinks.

sq3

What Is the Role Of Jdbc Driver In A Sqoop Set Up?

To connect to different relational databases, Sqoop needs a connector. Almost every database vendor makes a JDBC driver available that is specific to its database, so Sqoop needs the JDBC driver of each database it has to interact with.

Is Jdbc Driver Enough To Connect Sqoop To The Databases?

No. Sqoop needs both JDBC and connector to connect to a database.

When to Use Target-dir And When To Use Warehouse-dir While Importing Data?

To specify a particular directory in HDFS use --target-dir, but to specify the parent directory for all Sqoop imports use --warehouse-dir. In the latter case, Sqoop creates a directory with the same name as the table under that parent directory.
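
For illustration (directory and table names are placeholders):

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --target-dir /data/orders_today

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --warehouse-dir /data/sqoop_imports

The second command writes its files under /data/sqoop_imports/orders.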

How Can You Import Only A Subset Of Rows From A Table?

By using the --where clause in the sqoop import statement, we can import only a subset of rows.
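
A minimal sketch (table, column, and directory names are placeholders):

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --where "order_date >= '2022-01-01'" --target-dir /data/recent_orders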

How Can We Import A Subset Of Rows From A Table Without Using The Where Clause?

We can run a filtering query on the database and save the result to a temporary table in the database, then use the sqoop import command on that temporary table without a where clause.

What Is The Advantage Of Using Password-file Rather Than -p Option While Preventing The Display Of Password In The Sqoop Import Statement?

The --password-file option can be used inside a Sqoop script or saved job, while the -P option reads the password from standard input, which prevents automation.
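
A sketch of the password-file approach (the path is a placeholder; the file should contain only the password and be readable only by the job's user):

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user --password-file /user/sqoop_user/.db.password --table orders --target-dir /data/orders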

What Is The Default Extension Of The Files Produced From A Sqoop Import Using The –compress Parameter?

.gz

What Is The Significance Of Using Compress-codec Parameter?

To get the output files of a sqoop import compressed in a format other than .gz (for example .bz2), we use the --compression-codec parameter along with --compress.
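
For example (placeholder names; the codec class is the standard Hadoop BZip2 codec):

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec --target-dir /data/orders_bz2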

What Is A Disadvantage Of Using Direct Parameter For Faster Data Load By Sqoop?

The native utilities used by databases for faster loading do not work with binary data formats such as SequenceFile.

How Can You Control The Number Of Mappers Used By The Sqoop Command?

The --num-mappers (or -m) parameter controls the number of mappers executed by a sqoop command. We should start with a small number of map tasks and scale up gradually, because choosing a high number of mappers initially may degrade performance on the database side.
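
For example (placeholder names), running the import with four mappers:

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --num-mappers 4 --target-dir /data/orders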

How Can You Avoid Importing Tables One-by-one When Importing A Large Number Of Tables From A Database?

Using the import-all-tables command with the following options:

sqoop import-all-tables
--connect <jdbc-uri>
--username <user>
--password <password>
--exclude-tables table1,table2

This will import all the tables except the ones mentioned in the exclude-tables clause.

When the Source Data Keeps Getting Updated Frequently, What Is The Approach To Keep It In Sync With The Data In Hdfs Imported By Sqoop?

Sqoop has two approaches:

Use the incremental parameter with the append option, where the values of specified columns are checked and only rows with new or modified values are imported as new rows.

Use the incremental parameter with the lastmodified option, where a date column in the source is checked for records that have been updated after the last import.

What Is the Usefulness of The Options File In Sqoop?

The options file is used in Sqoop to keep command-line values in a file and reuse them across Sqoop commands.

For example, the --connect parameter's value and the --username value can be stored in a file and used again and again with different Sqoop commands.
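
A sketch, assuming an options file at /home/sqoop_user/import_opts.txt that lists one option or value per line:

import
--connect
jdbc:mysql://localhost/sales_db
--username
sqoop_user

It can then be reused as:

sqoop --options-file /home/sqoop_user/import_opts.txt --table orders --target-dir /data/orders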

Is It Possible To Add A Parameter While Running A Saved Job?

Yes, we can add an argument to a saved job at runtime by using the --exec option:

sqoop job --exec jobname -- --newparameter

How Do You Fetch Data Which Is The Result Of A Join Between Two Tables? How Can We Slice The Data To Be Imported Into Multiple Parallel Tasks?

The result of a join is fetched with a free-form query passed through the --query parameter (as in the example in the next answer). Using the --split-by parameter we specify the column based on which Sqoop divides the data into multiple chunks that are imported in parallel.

How Can You Choose A Name For The Mapreduce Job Which Is Created On Submitting A Free-form Query Import?

By using the --mapreduce-job-name parameter. Below is an example of the command.

sqoop import
--connect jdbc:mysql://mysql.example.com/sqoop
--username sqoop
--password sqoop
--query 'SELECT normcities.id,
countries.country,
normcities.city
FROM normcities
JOIN countries USING(country_id)
WHERE $CONDITIONS'
--split-by id
--target-dir cities
--mapreduce-job-name normcities

Before Starting the Data Transfer Using Mapreduce Job, Sqoop Takes A Long Time to Retrieve the Minimum And Maximum Values Of Columns Mentioned In –split-by Parameter. How Can We Make It Efficient?

We can use the --boundary-query parameter to specify the min and max values of the column on which the split into multiple MapReduce tasks is based. This makes it faster because the query inside --boundary-query is executed first, so the job knows how many MapReduce tasks to create before executing the main query.

What Is The Difference Between The Parameters?

sqoop.export.records.per.statement and sqoop.export.statements.per.transaction

The parameter “sqoop.export.records.per.statement” specifies the number of records that will be used in each insert statement.

But the parameter "sqoop.export.statements.per.transaction" specifies how many insert statements will be executed in each transaction.

How Will You Implement All-or-nothing Load Using Sqoop?

Using the --staging-table option, Sqoop first loads the data into a staging table and then moves it into the final target table only if the staging load is successful.

How Do You Clear the Data in A Staging Table Before Loading It by Sqoop?

By specifying the --clear-staging-table option, Sqoop empties the staging table before it is loaded, so the export can be retried until the staging data is correct; a combined sketch is shown below.
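
A minimal sketch combining both options (names are placeholders; the staging table must have the same structure as the target table):

sqoop export --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --staging-table orders_stg --clear-staging-table --export-dir /data/orders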

How Will You Update the Rows That Are Already Exported?

The --update-key parameter can be used to update existing rows. It takes a comma-separated list of columns that uniquely identifies a row. All of these columns are used in the WHERE clause of the generated UPDATE query, and all other table columns are used in the SET part of the query.
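
For example (placeholder names):

sqoop export --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --update-key order_id --export-dir /data/orders_updates

Adding --update-mode allowinsert would also insert rows that do not yet exist in the target table.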

What Is A Sqoop Metastore?

It is a tool with which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) in this metastore.

Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
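
For illustration, a sketch assuming the metastore runs on a host called metastore.example.com on the default port 16000: the metastore is started on that host with the sqoop metastore command, and a client can then save a job into it.

sqoop job --create nightly_orders_import --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop -- import --connect jdbc:mysql://localhost/sales_db --username sqoop_user --table orders --target-dir /data/orders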

What is the Purpose of Sqoop-merge?

The merge tool combines two datasets, letting the entries in the newer dataset overwrite the entries of the older one, so that only the newest version of each record is preserved between the two datasets.
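
A sketch of the merge tool (all paths, the jar, the class name, and the key column are placeholders; the jar and class typically come from a previous sqoop codegen run):

sqoop merge --new-data /data/orders_new --onto /data/orders_old --target-dir /data/orders_merged --jar-file orders.jar --class-name orders --merge-key order_id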

How Can You See the List of Stored Jobs In Sqoop Metastore?

sqoop job --list

Give the Sqoop Command to See The Content Of The Job Named Myjob?

sqoop job --show myjob

Which Database the Sqoop Metastore Runs On?

Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.

Where Can The Metastore Database Be Hosted?

The metastore database can be hosted anywhere within or outside of the Hadoop cluster.

Explain the significance of using –split-by clause in Apache Sqoop?

The --split-by clause specifies the column of the table used to generate splits when importing data into the Hadoop cluster. It improves performance through greater parallelism, and choosing a column with an even distribution of values produces evenly sized splits for the import.

How Can You Sync An Exported Table With Hdfs Data In Which Some Rows Are Deleted?

Truncate the target table and load it again.

How Can You Export Only A Subset of Columns To A Relational Table Using Sqoop?

By using the --columns parameter, in which we mention the required column names as a comma-separated list of values.

How Can We Load To A Column In A Relational Table Which Is Not Null But The Incoming Value From Hdfs Has A Null Value?

By using the --input-null-string parameter we can specify a default value, which allows the row to be inserted into the target table.

How Can You Schedule A Sqoop Job Using Oozie?

Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed.

Sqoop Imported A Table Successfully To Hbase But It Is Found That the Number Of Rows Is Fewer Than Expected. What Can Be The Cause?

Some of the imported records might have null values in all the columns. As HBase does not store a row in which every column is null, those rows get dropped.

Give A Sqoop Command To Show All The Databases In A Mysql Server.?

$ sqoop list-databases --connect jdbc:mysql://database.example.com/

What Do You Mean by Free Form Import in Sqoop?

Sqoop can import data from a relational database using any SQL query rather than only using the table and column name parameters.

How Can You Force Sqoop To Execute A Free Form Sql Query Only Once and Import The Rows Serially?

By using the -m 1 clause in the import command, Sqoop creates only one MapReduce task, which imports the rows sequentially.

In A Sqoop Import Command You Have Mentioned To Run 8 Parallel Mapreduce Task But Sqoop Runs Only 4. What Can Be The Reason?

The MapReduce cluster is configured to run only 4 parallel tasks, so the Sqoop command cannot use more parallel tasks than the MapReduce cluster allows.

What Is the Importance Of –split-by Clause In Running Parallel Import Tasks In Sqoop?

The --split-by clause specifies the column whose values are used to divide the data into groups of records. These groups of records are read in parallel by the MapReduce tasks.

What Does This Sqoop Command Achieve?

$ sqoop import --connect <connect-str> --table foo --target-dir /dest

It imports the table foo from the database into files under the HDFS directory /dest.

What Happens When A Table Is Imported Into A Hdfs Directory Which Already Exists Using The --append Parameter?

Using the –append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.

How Can You Control the Mapping Between Sql Data Types and Java Types?

By using the --map-column-java property we can configure the mapping between SQL types and Java types.

Below is an example: $ sqoop import ... --map-column-java id=String,value=Integer

How To Import Only The Updated Rows From A Table Into Hdfs Using Sqoop Assuming The Source Has Last Update Timestamp Details For Each Row?

By using the lastmodified incremental mode. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
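
A minimal sketch (table, column, and timestamp are placeholders); on repeated runs into the same directory, --append or --merge-key would also be required:

sqoop import --connect jdbc:mysql://localhost/sales_db --username sqoop_user -P --table orders --incremental lastmodified --check-column last_update_ts --last-value "2022-01-01 00:00:00" --target-dir /data/orders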

What are the Two File Formats Supported By Sqoop For Import?

Delimited text and Sequence Files.

Give A Sqoop Command To Import The Columns

Employee_id,first_name,last_name From The Mysql Table Employee?

$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES
--columns "employee_id,first_name,last_name"

Give A Sqoop Command To Run Only 8 Mapreduce Tasks In Parallel?

$ sqoop import --connect jdbc:mysql://host/dbname --table table_name
-m 8

What Does The Following Query Do?

$ sqoop import --connect jdbc:mysql://host/dbname --table Employees --where "start_date > '2017-03-31'"

It imports the employees who have joined after 31-Mar-2017.

Give A Sqoop Command To Import All The Records From Employee Table Divided Into Groups Of Records By The Values In The Column Department_id.?

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
--split-by dept_id

What Does The Following Query Do?
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table Sometable
--where "id > 1000" --target-dir /incremental_dataset --append

It performs an incremental import of new data, after having already imported the first 1,000 rows of the table.

Give A Sqoop Command To Import Data From All Tables In The Mysql Db Db1.?

sqoop import-all-tables --connect jdbc:mysql://host/DB1

Give A Command To Export Data From A Hdfs Directory Named Dir1 Into A Mysql Db Named Db1 By Calling A Stored Procedure Named Proc1?

$ sqoop export --connect jdbc:mysql://host/DB1 --call proc1
--export-dir /Dir1

So, this brings us to the end of the Apache Sqoop Interview Questions blog. This Tecklearn 'Top Apache Sqoop Interview Questions and Answers' helps you with commonly asked questions if you are looking for a job in the Apache Sqoop or Big Data domain. If you wish to learn Apache Sqoop and build a career in the Big Data domain, then check out our interactive Big Data Spark and Hadoop Developer Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/big-data-spark-and-hadoop-developer/

BigData Spark and Hadoop Developer Training

About the Course

Big Data analysis is emerging as a key advantage in business intelligence for many organizations. In this Big Data course, you will master MapReduce, Hive, Pig, Sqoop, Oozie and Flume, Spark framework and RDD, Scala and Spark SQL, Machine Learning using Spark, Spark Streaming, etc. It is a comprehensive Hadoop Big Data training course designed by industry experts considering current industry job requirements to help you learn Big Data Hadoop and Spark modules. This Cloudera Hadoop and Spark training will prepare you to clear Cloudera CCA175 Big Data certification.

Why Should you take Spark and Hadoop Developer Training?

  • Average salary for a Spark and Hadoop Developer ranges from approximately $106,366 to $127,619 per annum – Indeed.com.
  • Hadoop Market is expected to reach $99.31B by 2022 growing at a CAGR of 42.1% from 2015 – Forbes.
  • Amazon, Cloudera, Data Stax, DELL, EMC2, IBM, Microsoft & other MNCs worldwide use Hadoop

What you will Learn in this Course?

Introduction to Hadoop and the Hadoop Ecosystem

  • Problems with Traditional Large-scale Systems
  • Hadoop!
  • The Hadoop EcoSystem

Hadoop Architecture and HDFS

  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS
  • Resource Management: YARN Architecture
  • Resource Management: Working with YARN

Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2

Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Modelling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching

Data Formats

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression

Data Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive

Capturing Data with Apache Flume

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Working with RDDs in Spark

  • A Closer Look at RDDs
  • Key-Value Pair RDDs
  • MapReduce
  • Other Pair RDD Operations

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging

Parallel Programming with Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Spark Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Preview: Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • Comparing Spark SQL with Impala

Got a question for us? Please mention it in the comments section and we will get back to you.

 
