Knowledge about Apache Sqoop and all its basic commands to import and export data
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. We can say that Sqoop is a connector between RDBMS and Hadoop: RDBMS to Hadoop (import) or Hadoop to RDBMS (export).
Options while importing data to Hadoop:
. Importing table data from an RDBMS table to HDFS (file system)
. Importing table data from an RDBMS table to a Hive table
. Importing table data from an RDBMS table to HBase
Options while exporting data to RDBMS:
. Exporting data from HDFS (file system) to an RDBMS table
. Exporting data from Hadoop (Hive) to an RDBMS table
List all databases in MySQL
sqoop-list-databases --connect jdbc:mysql://localhost --username hadoop --password hadoop
List all tables in MySQL
sqoop-list-tables --connect jdbc:mysql://localhost/hive --username hadoop -P
Note: If we pass -P as a parameter, Sqoop prompts for the password at run time, so the password is not hard-coded, for security reasons.
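Another way to avoid hard-coding the password (an addition to these notes) is to keep it in a file and point Sqoop at it with --password-file; the file path below is only an illustration:
sqoop-list-tables --connect jdbc:mysql://localhost/hive --username hadoop --password-file /user/hadoop/.mysql-password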
Pass parameter file to Sqoop
sqoop-list-tables --options-file /root/Anamika_Singh/sqoop_param
cat > sqoop_param
--connect
jdbc:mysql://localhost/hive
--username
hadoop
--password
hadoop
(Press Ctrl+D to end the file. In an options file, each option and its value go on separate lines.)
Import table data to HDFS (the output file is delimited text by default)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1
If there is no primary key in the table, then we need to specify the number of mappers as 1, i.e., a sequential import of the data. We can explicitly specify the number of mappers for a parallel import.
If there is no primary key in the table and we need a parallel import, use --split-by "some column name" and specify any number of mappers; that many part-m files will be generated.
By default, if the table has a primary key or we use --split-by, Sqoop uses 4 mappers, so we see 4 part-m output files in HDFS.
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee --split-by EDept
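A variant of the same import with an explicit mapper count (the value 2 is only an illustration); this generates two part-m files:
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee --split-by EDept -m 2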
Import table data to HDFS (import only specific columns)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --columns "ENo,ESal"
Import table data to HDFS (tab-separated file format)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --fields-terminated-by '\t' --table Employee -m 1
Import table data to HDFS (save to a target directory)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --target-dir /user/Employee
Import table data to HDFS (where condition)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --where "ESal > 50000"
Import table data to Hive (create table & load data)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --hive-import --create-hive-table
Import table data to Hive (table already exists, only load data)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --hive-import --hive-table emp_hive
Import table data to Hive (by passing partition key & value)
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-partition-key order_status --hive-partition-value "CLOSED" --query 'select order_id, order_date, order_customer_id from orders1 where order_status="CLOSED" and $CONDITIONS' --hive-overwrite --create-hive-table --delete-target-dir -m 1 --hive-database Anamika_Singh --target-dir /user/hive/warehouse/Anamika_Singh.db/orders;
The same import without --create-hive-table and --hive-overwrite (appends into the already existing partitioned Hive table instead of recreating or overwriting it):
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-partition-key order_status --hive-partition-value "CLOSED" --query 'select order_id, order_date, order_customer_id from orders1 where order_status="CLOSED" and $CONDITIONS' --delete-target-dir -m 1 --hive-database Anamika_Singh --target-dir /user/hive/warehouse/Anamika_Singh.db/orders;
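To confirm the partition landed, a quick check from the Hive CLI (a sketch, assuming the database and table names used above and that the hive command is available):
hive -e "SHOW PARTITIONS Anamika_Singh.orders1;"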
Arguments:
--create-hive-table              Fail if the target Hive table already exists
--hive-import                    Import tables into Hive
--hive-overwrite                 Overwrite existing data in the Hive table
--hive-table <table-name>        Sets the table name to use when importing into Hive
--hive-database <database-name>  Sets the database name to use when importing into Hive
--delete-target-dir              Delete the target directory in the default path
--exclude-tables                 Exclude some tables from the import
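A minimal sketch combining a couple of the arguments above for a re-runnable plain HDFS import (it reuses the Employee table and options file from the earlier examples; --delete-target-dir removes the directory first if it already exists):
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --delete-target-dir --target-dir /user/Employee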
--import-all-tables:
For the import-all-tables tool to be useful, the following conditions must be met:
1. Each table must have a single-column primary key (it must not have a composite, multi-column primary key).
2. You must intend to import all columns of each table.
3. You must not intend to use a WHERE clause.
sqoop import-all-tables --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --warehouse-dir /user/hive/warehouse/Anamika_Singh.db -m 1 --exclude-tables orders1,orders_temp;
This tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.
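To see the per-table directories the import created, a quick listing (a sketch, assuming the warehouse directory used above):
hdfs dfs -ls /user/hive/warehouse/Anamika_Singh.db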
Incremental data load to a Hive table with a last value (append mode)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-database Anamika_Singh --hive-table orders1 --incremental append --check-column order_id --last-value 22 -m 1 --hive-import --table orders1;
Incremental data load to a Hive table with a last modified date (lastmodified mode)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-database Anamika_Singh --incremental lastmodified --check-column order_date --last-value '2013-07-25 00:00:08' -m 1 --table orders1 --append;
Export data from Hadoop (Hive) to an RDBMS table
sqoop-export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table emp --export-dir '/user/hive/warehouse/Anamika_Singh.db/emp_bk' --input-null-string '\\N' --input-null-non-string '\\N';
Export Failure Cases:
Exports may fail for a number of reasons:
1. Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault or server software crashes)
2. Attempting to INSERT a row which violates constraints (for example, inserting a duplicate primary key value)
3. Attempting to parse records using incorrect delimiters
4. Capacity issues (such as insufficient RAM or disk space)
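Because a failed export can leave the target table partially updated, Sqoop can stage the data first and move it to the final table in one transaction only after the full export succeeds. A minimal sketch, assuming an empty table emp_stage with the same schema as emp has been created beforehand:
sqoop-export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table emp --staging-table emp_stage --clear-staging-table --export-dir '/user/hive/warehouse/Anamika_Singh.db/emp_bk';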
codegen generates the .java, .class & .jar files for the executed Sqoop job
sqoop codegen --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table ordr;
eval prints the output of the query on the screen
sqoop eval --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --query "select * from emp where deptno=20";
Writing a Sqoop Job
sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1
sqoop job --list
sqoop job --show myjob
sqoop job --exec myjob
sqoop job --delete myjob;
Sqoop Incremental Job (after each --exec, the job's stored --last-value is updated automatically, so the next run imports only the new rows):
sqoop job \
--create myjob \
-- import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--incremental append \
--check-column order_id \
--last-value 0 \
--hive-import \
--table orders1;
sqoop job --list;
sqoop job --exec myjob;
sqoop job --delete myjob;
Importing data from an RDBMS table to HBase:
1. To an existing HBase table (only load)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --hbase-table orders_hbase --column-family cf1 --hbase-row-key order_id
2. To a new HBase table (create & load)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --hbase-table orders_hbase --hbase-create-table --column-family cf1 --hbase-row-key order_id
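To verify the imported rows, a quick scan from the HBase shell (a sketch, assuming the orders_hbase table created above):
echo "scan 'orders_hbase', {LIMIT => 5}" | hbase shell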