Knowledge about HIVE Tables


Dear friends, today we will learn to create the hive table (internal and external table), load the data from local and hdfs location, see the table structures, drop the tables & database and another basic things related to this, so let’s begin :
Assume that we have the given csv file/datasets to load into the tables :

Data without header :
1,Divyansh,Laptop,75000,Delhi
2,Anjali,Laptop,50000,Delhi
3,Sushil,Hard Disk,20000,Delhi
4,Sidharth,Mobile,75000,Kolkata
5,Surendra,Desktop,70000,Kolkata
6,Mohit,Pen Drive,25000,Kolkata
7,Mohit,Laptop,65000,Hyderabad
8,Prabhu,Laptop,50000,Hyderabad
9,Rajesh,Head Phone,10000,Hyderabad
10,Mohit,Mobile,40000,Chennai
11,Sonia,Desktop,25000,Chennai
12,Deepthi,Head Phone,10000,Chennai
13,Priyanka,Desktop,56000,Noida
14,Prashant,Hard Disk,35000,Noida
15,Ayush,Mobile,15000,Noida




Data with header :
Cust_no,cust_name,orders,price,city
1,Divyansh,Laptop,75000,Delhi
2,Anjali,Laptop,50000,Delhi
3,Sushil,Hard Disk,20000,Delhi
4,Sidharth,Mobile,75000,Kolkata
5,Surendra,Desktop,70000,Kolkata
6,Mohit,Pen Drive,25000,Kolkata
7,Mohit,Laptop,65000,Hyderabad
8,Prabhu,Laptop,50000,Hyderabad
9,Rajesh,Head Phone,10000,Hyderabad
10,Mohit,Mobile,40000,Chennai
11,Sonia,Desktop,25000,Chennai
12,Deepthi,Head Phone,10000,Chennai
13,Priyanka,Desktop,56000,Noida

14,Prashant,Hard Disk,35000,Noida
15,Ayush,Mobile,15000,Noida






Types of tables in hive :
1.    Internal table/Managed Table
2.    External Table
Internal tables : Hive moves data into its warehouse directory. If the table is dropped, then the table metadata and the data will be deleted.
External Tables : Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.
For External Tables -
·        External table stores files on the HDFS server but tables are not linked to the source file completely.
·         If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “test” in HIVE using HIVE-QL and link the table to file “test_file”then deleting table “test” from HIVE will not delete “test_file” from HDFS.
·         External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.
·         Meta data is maintained on master node, and deleting an external table from HIVE only deletes the metadata not the data/file.


For Internal Tables-
·         Stored in a directory based on settings in  hive.metastore.warehouse.dir, by default internal tables are stored in the following directory “/user/hive/warehouse” you can change it by updating the location in the config file .
·         Deleting the table deletes the metadata and data from master-node and HDFS respectively.
·         Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends on organization).


Use EXTERNAL tables when:
·         The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
·         Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
·         Hive should not own data and control settings, directories, etc., you may have another program or process that will do those things.
·         You are not creating table based on existing table (AS SELECT).

Use INTERNAL tables when:
·         The data is temporary.
·         You want Hive to completely manage the life-cycle of the table and data.









Let’s start : Create a new database learning and use this database to create tables under this :
create database learning;
use learning;      

Print the current database :

set hive.cli.print.current.db=true;



Create a hive table test1 and load the data without header as we have to create column name in hive.

Internal Table or Managed table :

create table test1(cust_no int, cust_name string, orders string, price bigint, city String) row format delimited fields terminated by ',';

see the list of tables : show tables;

load the data from local :

load data local inpath '/home/g.aspiresit.001/Anamika/test_file.csv' into table test1;

See the data of the table :

Select * from test1;




Now we will see the data with column name :

set hive.cli.print.header=true;

select * from test1;




Now we will try to create external table with location and without location. And this time we will load data from hdfs location.
If we are creating table with location then no load command, automatic it will read the data from that file :


In above scenario I loaded the data which had no header/column names. Now in this table I will load data with header.
But we have to be careful that if have any file with header then at loading time we should skip the header line because we have already column name and that header will become first record of our data.
To skip the header of the file we have to set table proberties while creating the table :
TBLPROPERTIES("skip.header.line.count"="1");

Create table without location and load the data :
create external table test2(cust_no int, cust_name string, orders string, price bigint, city String)
row format delimited fields terminated by ','
TBLPROPERTIES("skip.header.line.count"="1");

load the data from hdfs path :

load data inpath '/user/g.aspiresit.001/Anamika/test_file_with_header.csv' into table test2;






Create table with location and now it will read the data from that location without any load or insert data :
create external table test3(cust_no int, cust_name string, orders string, price bigint, city String)
row format delimited fields terminated by ',' location '/user/g.aspiresit.001/Anamika/'
TBLPROPERTIES("skip.header.line.count"="1");





List the tables :
Show tables ;




See the Table Creation :

Show create table;




See the table’s columns list with data type :

desc test1;
or
describe test1;



See the detailed table information : like using this command we can know that table is Managed table or External table  and location of table along with other information :

desc formatted test1;
or
describe formatted test1;     

Above both command will return the same output as above example.




See the detailed table information :

desc extended test1;
or
describe extended test1;



In above example first we created the Managed/Internal table :
Now after create the table test1 and load the data, we can see the table name with loaded data file in hdfs location/hive warehouse directory  as below screenshot :


So Now we will drop this table and see that including schema in hive, data file also deleted from its hdfs location (hive warehouse location) :




So now in below screenshot we can see that after drop the table, data also deleted from hdfs location :



Now same we will do for external table test2 and we can see that in this scenario only schema/metadata will be deleted from hive, but our our data will be remain in hdfs location.

First verify the table type :



Below we can see that loaded data in hdfs location will be safe after drop the table test2:
To test it first I will see the data before drop the table and again after drop the table as below screen shot.

We can use the hdfs command via hive shell using symbol  !
As


!hadoop fs -ls 
hdfs://nnscbhaastest/apps/hive/warehouse/learning.db/test2;






Drop table in hive with PURGE : 

Now if we don't use purge the table, it goes to a Trash directory, 

from there the table can be recovered after drop it. But if you do 

use purge table won't go to Trash directory, so it can't be recovered.

PURGE clause:
The optional PURGE keyword, available in CDH 5.5 / Impala 2.3 and higher, causes Impala to remove the associated HDFS data files immediately, rather than going through the HDFS trashcan mechanism. Use this keyword when dropping a table if it is crucial to remove the data as quickly as possible to free up space, or if there is a problem with the trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone than the data files.


Now, Suppose we have not to drop a single table, I want to drop the entire database which have for example 100 tables, then how you will drop it?
To use this case I created the below database and tables under this as below:

create database learning_drop;

use learning_drop;

create external table test1(cust_no int, cust_name string, orders string, price bigint, city String);

create external table test2(cust_no int, cust_name string, orders string, price bigint, city String);

create external table test3(cust_no int, cust_name string, orders string, price bigint, city String);

create external table test4(cust_no int, cust_name string, orders string, price bigint, city String);

create external table test5(cust_no int, cust_name string, orders string, price bigint, city String);

show tables;



Now try to drop this database :

drop database learning_drop;


So as per this screenshot we are getting the exception that InvalidOperationException(message:Database learning_drop is not empty. One or more tables exist.)


Then how we will drop this? First drop all the tables then finally drop the database? But if we have thousands of tables in the same database then? Drop the tables thousands times??

Well No!, we have the option to drop the database having number of tables in a single shot using the option CASCADE as per below :




So for now thanks to read it, we will meet again with next topic Partitions in Hive table.
If you have any questions or suggestions, please fell free to comment. I’ll be reach back to you soon :)

Thank You !!








Comments

Popular posts from this blog

Transformations and Actions in Spark

How to Convert a Spark DataFrame to Map in Scala

How to Handle and Convert DateTime format in Spark-Scala.