Knowledge about Apache Sqoop and all its basic commands to import and export data
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. We can say that Sqoop is a connector between RDBMS and Hadoop: RDBMS to Hadoop (import) or Hadoop to RDBMS (export).
Options while importing data to Hadoop:
. Importing table data from an RDBMS table to HDFS (file system)
. Importing table data from an RDBMS table to a Hive table
. Importing table data from an RDBMS table to HBase
Options while exporting data to RDBMS:
. Exporting data from HDFS (file system) to an RDBMS table
. Exporting data from Hadoop (Hive) to an RDBMS table
List all databases in MySQL
sqoop-list-databases --connect jdbc:mysql://localhost --username hadoop --password hadoop
List all tables in MySQL
sqoop-list-tables --connect jdbc:mysql://localhost/hive --username hadoop -P
Note: If we pass -P as a parameter, Sqoop prompts for the password at run time, so the password is not hard-coded, for security reasons.
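Another way to avoid hard-coding the password (an addition to these notes) is to keep it in a file and point Sqoop at it with --password-file; the file path below is only an illustration:
sqoop-list-tables --connect jdbc:mysql://localhost/hive --username hadoop --password-file /user/hadoop/.mysql-password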
Pass parameter file to Sqoop
sqoop-list-tables --options-file /root/Anamika_Singh/sqoop_param
cat > sqoop_param
--connect
jdbc:mysql://localhost/hive
--username
hadoop
--password
hadoop
(Press Ctrl+D to end the file. In an options file, each option and its value go on separate lines.)
Import table data to HDFS (the output file is delimited text by default)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1
If there is no primary key in the table, then we need to specify the number of mappers as 1, i.e., a sequential import of the data. We can explicitly specify the number of mappers for a parallel import.
If there is no primary key in the table and we need a parallel import, use --split-by "some column name" and specify any number of mappers; that many part-m files will be generated.
By default, if the table has a primary key or we use --split-by, Sqoop uses 4 mappers, so we see 4 part-m output files in HDFS.
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee --split-by EDept
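A variant of the same import with an explicit mapper count (the value 2 is only an illustration); this generates two part-m files:
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee --split-by EDept -m 2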
Import table data to HDFS (import only specific columns)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --columns "ENo,ESal"
Import table data to HDFS (tab-separated file format)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --fields-terminated-by '\t' --table Employee -m 1
Import table data to HDFS (save to a target directory)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --target-dir /user/Employee
Import table data to HDFS (where condition)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --where "ESal > 50000"
Import table data to Hive (create table & load data)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --hive-import --create-hive-table
Import table data to Hive (table already exists, only load data)
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --hive-import --hive-table emp_hive
Import table data to Hive (by passing partition key & value)
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-partition-key order_status --hive-partition-value "CLOSED" --query 'select order_id, order_date, order_customer_id from orders1 where order_status="CLOSED" and $CONDITIONS' --hive-overwrite --create-hive-table --delete-target-dir -m 1 --hive-database Anamika_Singh --target-dir /user/hive/warehouse/Anamika_Singh.db/orders;
The same import without --create-hive-table and --hive-overwrite (appends into the already existing partitioned Hive table instead of recreating or overwriting it):
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-partition-key order_status --hive-partition-value "CLOSED" --query 'select order_id, order_date, order_customer_id from orders1 where order_status="CLOSED" and $CONDITIONS' --delete-target-dir -m 1 --hive-database Anamika_Singh --target-dir /user/hive/warehouse/Anamika_Singh.db/orders;
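To confirm the partition landed, a quick check from the Hive CLI (a sketch, assuming the database and table names used above and that the hive command is available):
hive -e "SHOW PARTITIONS Anamika_Singh.orders1;"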
Arguments:
--create-hive-table              Fail if the target Hive table already exists
--hive-import                    Import tables into Hive
--hive-overwrite                 Overwrite existing data in the Hive table
--hive-table <table-name>        Sets the table name to use when importing into Hive
--hive-database <database-name>  Sets the database name to use when importing into Hive
--delete-target-dir              Delete the target directory in the default path
--exclude-tables                 Exclude some tables from the import
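A minimal sketch combining a couple of the arguments above for a re-runnable plain HDFS import (it reuses the Employee table and options file from the earlier examples; --delete-target-dir removes the directory first if it already exists):
sqoop-import --options-file /root/Anamika_Singh/sqoop_param --table Employee -m 1 --delete-target-dir --target-dir /user/Employee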
--import-all-tables:
For the import-all-tables tool to be useful, the following conditions must be met:
1. Each table must have a single-column primary key (it must not have a composite, multi-column primary key).
2. You must intend to import all columns of each table.
3. You must not intend to use a WHERE clause.
sqoop import-all-tables --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --warehouse-dir /user/hive/warehouse/Anamika_Singh.db -m 1 --exclude-tables orders1,orders_temp;
This tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.
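To see the per-table directories the import created, a quick listing (a sketch, assuming the warehouse directory used above):
hdfs dfs -ls /user/hive/warehouse/Anamika_Singh.db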
Incremental data load to a Hive table with a last value (append mode)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-database Anamika_Singh --hive-table orders1 --incremental append --check-column order_id --last-value 22 -m 1 --hive-import --table orders1;
Incremental data load to a Hive table with a last modified date (lastmodified mode)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --hive-import --hive-table orders1 --hive-database Anamika_Singh --incremental lastmodified --check-column order_date --last-value '2013-07-25 00:00:08' -m 1 --table orders1 --append;
Export data from Hadoop (Hive) to an RDBMS table
sqoop-export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table emp --export-dir '/user/hive/warehouse/Anamika_Singh.db/emp_bk' --input-null-string '\\N' --input-null-non-string '\\N';
Export Failure Cases:
Exports may fail for a number of reasons:
1. Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault or server software crashes)
2. Attempting to INSERT a row which violates constraints (for example, inserting a duplicate primary key value)
3. Attempting to parse records using incorrect delimiters
4. Capacity issues (such as insufficient RAM or disk space)
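Because a failed export can leave the target table partially updated, Sqoop can stage the data first and move it to the final table in one transaction only after the full export succeeds. A minimal sketch, assuming an empty table emp_stage with the same schema as emp has been created beforehand:
sqoop-export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table emp --staging-table emp_stage --clear-staging-table --export-dir '/user/hive/warehouse/Anamika_Singh.db/emp_bk';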
codegen generates the .java, .class & .jar files for the executed Sqoop job
sqoop codegen --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table ordr;
eval prints the output of the query on the screen
sqoop eval --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --query "select * from emp where deptno=20";
Writing a Sqoop Job
sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1
sqoop job --list
sqoop job --show myjob
sqoop job --exec myjob
sqoop job --delete myjob;
Sqoop Incremental Job (after each --exec, the job's stored --last-value is updated automatically, so the next run imports only the new rows):
sqoop job \
--create myjob \
-- import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--incremental append \
--check-column order_id \
--last-value 0 \
--hive-import \
--table orders1;
sqoop job --list;
sqoop job --exec myjob;
sqoop job --delete myjob;
Importing data from an RDBMS table to HBase:
1. To an existing HBase table (only load)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --hbase-table orders_hbase --column-family cf1 --hbase-row-key order_id
2. To a new HBase table (create & load)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --hbase-table orders_hbase --hbase-create-table --column-family cf1 --hbase-row-key order_id
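To verify the imported rows, a quick scan from the HBase shell (a sketch, assuming the orders_hbase table created above):
echo "scan 'orders_hbase', {LIMIT => 5}" | hbase shell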