Knowledge about Apache Sqoop and all of its basic commands to import and export data


Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

We can say that Sqoop is a connector between an RDBMS and Hadoop: RDBMS to Hadoop (import) or Hadoop to RDBMS (export).

Options while importing data to Hadoop:
1. Importing table data from an RDBMS table to HDFS (file system)
2. Importing table data from an RDBMS table to a Hive table
3. Importing table data from an RDBMS table to HBase

Options while exporting data to RDBMS:
1. Exporting data from HDFS (file system) to an RDBMS table
2. Exporting data from Hadoop (Hive) to an RDBMS table

List all databases in MySQL
sqoop-list-databases --connect jdbc:mysql://localhost --username hadoop --password hadoop

List all tables in MySQL
sqoop-list-tables --connect jdbc:mysql://localhost/hive --username hadoop -P

Note: If we pass -P as a parameter, Sqoop prompts for the password at run time, so the password is not hard-coded on the command line for security reasons.
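
Another way to keep the password off the command line is --password-file, which reads the first line of a protected file. A minimal sketch, assuming a file /user/hadoop/.mysql.password (hypothetical path) stored in HDFS with permissions 400:
sqoop-list-tables --connect jdbc:mysql://localhost/hive --username hadoop --password-file /user/hadoop/.mysql.password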

Pass a parameter file (options file) to Sqoop
First create the options file:
cat > sqoop_param
--connect
jdbc:mysql://localhost/hive
--username
hadoop
--password
hadoop

Then pass it to any Sqoop tool:
sqoop-list-tables --options-file /root/Anamika_Singh/sqoop_param

Import table data to HDFS (the output file will be delimited text by default)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1
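
The storage format can also be changed at import time. A minimal sketch, assuming the same options file and Employee table, that writes the output as compressed Avro instead of delimited text:
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --as-avrodatafile --compress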


If there is no primary key in the table, we need to specify the number of mappers as 1, i.e., a sequential import of the data.
If there is no primary key in the table and we still need a parallel import, we can use --split-by "some column name" and explicitly specify any number of mappers; that many part-m files will be generated.

By default, if the table has a primary key or --split-by is used, Sqoop runs 4 mappers, and we can see 4 mapper output files in HDFS.

sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee --split-by EDept
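
To control the degree of parallelism explicitly, the mapper count can be set together with --split-by. A sketch, assuming ENo is a numeric key column of Employee and 8 is just an illustrative value:
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee --split-by ENo --num-mappers 8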

Import table data to HDFS (Import only specific columns)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --columns "ENo,ESal"

Import table data to HDFS (Tab separated file format)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --fields-terminated-by '\t' --table Employee -m 1

Import table data to HDFS (Save to target directory)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --target-dir /user/Employee

Import table data to HDFS (Where condition)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1  --where "ESal > 50000"

Import table data to HDFS (Parallel import using a split-by column)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee --split-by EDept

Import table data to Hive (Create table & load data)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --hive-import --create-hive-table

Import table data to Hive (table already exists, Only load data)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --hive-import --hive-table emp_hive

Import table data to Hive (By passing Partition Key & Values)
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-partition-key order_status --hive-partition-value "CLOSED" --query 'select order_id, order_date, order_customer_id from orders1 where order_status="CLOSED" and $CONDITIONS' --hive-overwrite --create-hive-table --delete-target-dir -m 1 --hive-database Anamika_Singh --target-dir /user/hive/warehouse/Anamika_Singh.db/orders;

The same import into the already existing partitioned Hive table, without --create-hive-table and --hive-overwrite:
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-partition-key order_status --hive-partition-value "CLOSED" --query 'select order_id, order_date, order_customer_id from orders1 where order_status="CLOSED" and $CONDITIONS' --delete-target-dir -m 1 --hive-database Anamika_Singh --target-dir /user/hive/warehouse/Anamika_Singh.db/orders;

Arguments:
--create-hive-table                    Fail if the target Hive table already exists
--hive-import                          Import tables into Hive
--hive-overwrite                       Overwrite existing data in the Hive table
--hive-table <table-name>              Sets the table name to use when importing to Hive
--hive-database <database-name>        Sets the database name to use when importing to Hive
--delete-target-dir                    Delete the target directory in the default path before importing
--exclude-tables                       Exclude the listed tables from an import-all-tables run

import-all-tables:
For the import-all-tables tool to be useful, the following conditions must be met:
1. Each table must have a single-column primary key (composite primary keys are not supported).
2. You must intend to import all columns of each table.
3. You must not intend to use a WHERE clause.

sqoop import-all-tables --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --warehouse-dir /user/hive/warehouse/Anamika_Singh.db -m 1 --exclude-tables orders1,orders_temp;
This tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.

Incremental data load to a Hive table with last value (append mode)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-database Anamika_Singh --hive-table orders1 --incremental append --check-column order_id --last-value 22 -m 1 --hive-import --table orders1;

Incremental data load to a Hive table with last modified date (lastmodified mode)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-database Anamika_Singh --incremental lastmodified --check-column order_date --last-value '2013-07-25 00:00:08' -m 1 --table orders1 --append;
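
For a lastmodified import into an existing HDFS directory (rather than Hive), newly changed records can be merged with the previously imported ones on a key column. A sketch, assuming order_id is the primary key and /user/orders_inc is a hypothetical target directory:
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders1 --target-dir /user/orders_inc --incremental lastmodified --check-column order_date --last-value '2013-07-25 00:00:08' --merge-key order_id -m 1;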

Export data from Hadoop (Hive) to an RDBMS table
sqoop-export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table emp --export-dir '/user/hive/warehouse/Anamika_Singh.db/emp_bk' --input-null-string '\\N' --input-null-non-string '\\N';
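
If the export may be re-run over rows that already exist in the target table (which would otherwise hit duplicate-key errors), an upsert-style export is possible. A sketch, assuming emp has a primary-key column empno (hypothetical column name):
sqoop-export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table emp --export-dir '/user/hive/warehouse/Anamika_Singh.db/emp_bk' --update-key empno --update-mode allowinsert --input-null-string '\\N' --input-null-non-string '\\N';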

Export Failure Cases:
Exports may fail for a number of reasons:
1.      Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
2.      Attempting to INSERT a row which violates constraints (for example, inserting a duplicate primary key value)
3.      Attempting to parse records using incorrect delimiters
4.      Capacity issues (such as insufficient RAM or disk space)
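
Because a failed export can leave the target table partially updated (points 1, 2 and 4 above), the export can go through a staging table first. A sketch, assuming an empty staging table emp_stage (hypothetical) with the same schema as emp; rows are written to emp_stage and moved to emp in a single transaction only if the whole export succeeds:
sqoop-export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table emp --staging-table emp_stage --clear-staging-table --export-dir '/user/hive/warehouse/Anamika_Singh.db/emp_bk';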

codegen generates the .java, .class and .jar files for the specified table (the same record class Sqoop generates internally when an import or export job runs)
sqoop codegen --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table ordr;

eval runs a SQL query against the database and prints the output on the screen
sqoop eval --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --query "select * from emp where deptno=20";
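
eval is not limited to SELECT statements; it can also run DML. A sketch, assuming the emp table exists and the values below match its columns (hypothetical values):
sqoop eval --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --query "insert into emp values (7999, 'SMITH', 'CLERK', 20)";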

Writing a Sqoop job
sqoop job --create myjob \
 -- import \
 --connect jdbc:mysql://localhost/db \
 --username root \
 --table employee -m 1
(Note: the tool name, import, is separated from the job definition arguments by a standalone --.)

sqoop job --list
sqoop job --show myjob
sqoop job --exec myjob
sqoop job --delete myjob;

Sqoop Incremental Job :
sqoop job \
 --create myjob \
 -- import \
 --connect jdbc:mysql://localhost/retail_db \
 --username root \
 --password cloudera \
 --incremental append \
 --check-column order_id \
 --last-value 0 \
 --hive-import \
 --table orders1;

sqoop job --list;
sqoop job --exec myjob;
sqoop job --delete myjob;

Importing data from an RDBMS table to HBase:

1. To an existing HBase table (only load)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --hbase-table orders_hbase --column-family cf1 --hbase-row-key order_id

2. To a new HBase table (create & load)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --hbase-table orders_hbase --hbase-create-table --column-family cf1 --hbase-row-key order_id
