How to export data back from Hadoop HDFS to RDBMS and Create and maintain the Sqoop jobs

Last updated on May 30 2022
Swati Dogra


Sqoop – Export

Sqoop export transfers data from HDFS back into an RDBMS. The target table must already exist in the target database. The files given to Sqoop as input contain records, which become rows in the table. They are read and parsed into a set of records using a user-specified delimiter.

The default operation is to insert all records from the input files into the database table using the INSERT statement. In update mode, Sqoop instead generates UPDATE statements that modify existing records in the database.

Syntax

The following is the syntax for the export command.

$ sqoop export (generic-args) (export-args)

$ sqoop-export (generic-args) (export-args)

Example

Let us take an example of employee data stored in a file in HDFS. The employee data is available in the emp_data file in the '/emp/' directory in HDFS. The contents of emp_data are as follows.

1201,gopal,manager,50000,TP

1202,manisha,preader,50000,TP

1203,kalil,php dev,30000,AC

1204,prasanth,php dev,30000,AC

1205,kranthi,admin,20000,TP

1206,satish p,grp des,20000,GR
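
If you need to stage this file in HDFS yourself, a minimal sketch (assuming the records above are saved locally in a file named emp_data and that your user can write to the target directory) could look like this:

$ hdfs dfs -mkdir -p /emp                  # create the target directory if it does not exist

$ hdfs dfs -put emp_data /emp/emp_data     # copy the local file into HDFS

$ hdfs dfs -cat /emp/emp_data              # confirm the contents landed as expected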

It is mandatory that the table into which the data is exported already exists in the target database; it must be created manually before running the export.

The following statement is used to create the table 'employee' from the MySQL command line.

$ mysql

mysql> USE db;

mysql> CREATE TABLE employee (

   id INT NOT NULL PRIMARY KEY,

   name VARCHAR(20),

   deg VARCHAR(20),

   salary INT,

   dept VARCHAR(10));

The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL server.

$ sqoop export \

--connect jdbc:mysql://localhost/db \

--username root \

--table employee \

--export-dir /emp/emp_data
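
By default, Sqoop export expects comma-separated fields and newline-separated records. If your HDFS files use a different field delimiter, or you prefer to be prompted for the database password rather than rely on a passwordless account, a hedged variant of the same command could look like the sketch below (the comma given to --input-fields-terminated-by is only illustrative; substitute whatever delimiter your files actually use):

$ sqoop export \

--connect jdbc:mysql://localhost/db \

--username root -P \

--table employee \

--export-dir /emp/emp_data \

--input-fields-terminated-by ','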

The following command is used to verify the table from the MySQL command line.

mysql> SELECT * FROM employee;

If the data was exported successfully, the query returns the following employee rows.

+------+----------+---------+--------+------+
| id   | name     | deg     | salary | dept |
+------+----------+---------+--------+------+
| 1201 | gopal    | manager |  50000 | TP   |
| 1202 | manisha  | preader |  50000 | TP   |
| 1203 | kalil    | php dev |  30000 | AC   |
| 1204 | prasanth | php dev |  30000 | AC   |
| 1205 | kranthi  | admin   |  20000 | TP   |
| 1206 | satish p | grp des |  20000 | GR   |
+------+----------+---------+--------+------+
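
As mentioned earlier, Sqoop can also run the export in update mode instead of issuing plain INSERT statements. A minimal sketch, assuming the id column uniquely identifies each row and that only existing rows should be modified, could look like this:

$ sqoop export \

--connect jdbc:mysql://localhost/db \

--username root \

--table employee \

--export-dir /emp/emp_data \

--update-key id \

--update-mode updateonly

With --update-mode allowinsert, rows whose key does not match an existing record are inserted instead of being skipped.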

Sqoop – Job

This section of the blog describes how to create and maintain Sqoop jobs. A Sqoop job creates and saves import and export commands, along with the parameters needed to identify and recall the saved job later. Re-executing a saved job is particularly useful for incremental imports, which bring only the newly added or updated rows from an RDBMS table into HDFS.

Syntax

The following is the syntax for creating a Sqoop job.

$ sqoop job (generic-args) (job-args)

   [-- [subtool-name] (subtool-args)]

$ sqoop-job (generic-args) (job-args)

   [-- [subtool-name] (subtool-args)]

Create Job (--create)

Here we create a job named myjob, which imports table data from an RDBMS table into HDFS. The following command creates a job that imports data from the employee table in the db database into HDFS.

$ sqoop job --create myjob \

-- import \

--connect jdbc:mysql://localhost/db \

--username root \

--table employee -m 1
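
Saved jobs are particularly useful for incremental imports, because Sqoop records the last imported value and reuses it on the next run (see the incremental.last.value entry in the --show output below). A minimal sketch, assuming id is a monotonically increasing primary key and using the illustrative job name incjob, could look like this:

$ sqoop job --create incjob \

-- import \

--connect jdbc:mysql://localhost/db \

--username root \

--table employee \

--incremental append \

--check-column id \

--last-value 1206 \

-m 1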

Verify Job (--list)

The '--list' argument is used to verify the saved jobs. The following command lists the saved Sqoop jobs.

$ sqoop job --list

It shows the list of saved jobs.

Available jobs:

   myjob

Inspect Job (--show)

The '--show' argument is used to inspect or verify a particular saved job and its details. The following command and sample output verify a job called myjob.

$ sqoop job --show myjob

It shows the tool and the options used in myjob.

Job: myjob

 Tool: import
 Options:

 ----------------------------

 direct.import = true

 codegen.input.delimiters.record = 0

 hdfs.append.dir = false

 db.table = employee

 ...

 incremental.last.value = 1206

 ...

Execute Job (--exec)

The '--exec' option is used to execute a saved job. The following command executes the saved job called myjob.

$ sqoop job --exec myjob

It shows output similar to the following.

10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation

...
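
The options stored with a saved job can also be overridden at execution time by appending them after a -- separator. For example, the Sqoop documentation shows invocations of the following form to run a job as a different database user (someuser is only a placeholder):

$ sqoop job --exec myjob -- --username someuser -P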


So, this brings us to the end of the blog. This Tecklearn 'How to export data back from Hadoop HDFS to RDBMS and Create and maintain the Sqoop jobs' blog helps you with commonly asked questions if you are looking out for a job in Apache Sqoop and as a Big Data Hadoop Developer.

If you wish to learn Sqoop and build a career in the Big Data Hadoop domain, then check out our interactive Big Data Hadoop Developer Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/big-data-spark-and-hadoop-developer/

Big Data Spark and Hadoop Developer Training

About the Course

Big Data analysis is emerging as a key advantage in business intelligence for many organizations. In this Big Data course, you will master MapReduce, Hive, Pig, Sqoop, Oozie and Flume, Spark framework and RDD, Scala and Spark SQL, Machine Learning using Spark, Spark Streaming, etc. It is a comprehensive Hadoop Big Data training course designed by industry experts considering current industry job requirements to help you learn Big Data Hadoop and Spark modules. This Cloudera Hadoop and Spark training will prepare you to clear Cloudera CCA175 Big Data certification.

Why Should you take Spark and Hadoop Developer Training?

  • Average salary for a Spark and Hadoop Developer ranges from approximately $106,366 to $127,619 per annum – Indeed.com.
  • Hadoop Market is expected to reach $99.31B by 2022 growing at a CAGR of 42.1% from 2015 – Forbes.
  • Amazon, Cloudera, DataStax, Dell, EMC, IBM, Microsoft and other MNCs worldwide use Hadoop.

What you will Learn in this Course?

Introduction to Hadoop and the Hadoop Ecosystem

  • Problems with Traditional Large-scale Systems
  • Hadoop!
  • The Hadoop Ecosystem

Hadoop Architecture and HDFS

  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS
  • Resource Management: YARN Architecture
  • Resource Management: Working with YARN

Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2

Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Modelling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching

Data Formats

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression

Data Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive

Capturing Data with Apache Flume

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Working with RDDs in Spark

  • A Closer Look at RDDs
  • Key-Value Pair RDDs
  • MapReduce
  • Other Pair RDD Operations

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the Spark Context
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging

Parallel Programming with Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Spark Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Preview: Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • Comparing Spark SQL with Impala

 

Got a question for us? Please mention it in the comments section and we will get back to you.
