Partitions in Hive




Hello friends, 
Today we will discuss about partitioning in Hive and ways to use it. The bigger problem with hive is that when we apply where clause in our query then even a simple query in Hive also reads the entire dataset and this situation decreases the efficiency and becomes a bottleneck when we are required to run the queries on large tables, but this issue can be overcome by implementing partitions on hive tables.


Partitions in hive :
Hive Partitions is a way to organizes tables into partitions by dividing tables into different parts based on partition keys such as date, city, and department. 
Partition is helpful when the table has one or more partitions keys. Partition keys are basic elements for determining how the data is stored in the table.

In the case of tables which are not partitioned, all the files in a table’s data directory is read and then filters are applied on it as a subsequent phase. This becomes a slow and expensive affair especially in cases of large tables. Without partitioning, Hive reads all the data in the directory and applies the query filters on it. This is slow and expensive since all data has to be read.
Very often users need to filter the data on specific column values. To apply the partitioning in hive, users need to understand the domain of the data on which they are doing analysis.
With this knowledge, identification of the frequently queried or accessed columns becomes easy and then partitioning feature of Hive can be applied on the selected columns.
Owing the fact that Partitions are horizontal slices of data, larger sets of data can be separated into more manageable chunks/parts.

When to use Hive Partitioning:
When any user wants data contained within a table to be split across multiple sections in hive table, then we can go for partitions.
If we are applying partitioning in Hive table, The entries for the various columns of dataset are segregated and stored in their respective partition. When we write the query to fetch the values from table, only the required partitions of the table are queried, which reduces the time taken by query to yield the result.

Types of partitions in Hive :
·         Static partitioning
·         Dynamic partitioning.



A sample Use Case :
Below are the sample data files which have records of Employees like, Employee Code, employee Name, Employee Salary and Employee’s Joining Year, so using below datasets we want to partition the table by last column Employee’s joining year.
As per below files, three files are already segregated/separate files for per year, so these files can be use for static partition and last file having 4 years of data in a single file, so in this case we can go for dynamic partition, so automatically data will be segregated and 4 partitions will be create with segregated data.


Sample data 1 to use for static partition : (file name : static_part_file_2015.csv)
011,Ram,15000,2015
022,Shyam,20000,2015
033,Mohan,18000,2015
044,Sohan,17000,2015




Sample data 2 to use for static partition : (file name : static_part_file_2016.csv)
055,Naveen,21000,2016
066,Prakash,25000,2016
077,Madhav,30000,2016
088,Kisan,28000,2016
099,Jay,24000,2016




Sample data 3 to use for static partition : (file name : static_part_file_2017.csv)
111,Priya,27000,2017
222,Anshuman,23000,2017
333,Anjali,26000,2017
444,Manav,21000,2017
555,Anuj,25000,2017




Sample Data to use for Dynamic partition :  (file name : dynamic_part_file.csv)
011,Ram,15000,2015
022,Shyam,20000,2015
033,Mohan,18000,2015
044,Sohan,17000,2015
055,Naveen,21000,2016
066,Prakash,25000,2016
077,Madhav,30000,2016
088,Kisan,28000,2016
099,Jay,24000,2016
111,Priya,27000,2017
222,Anshuman,23000,2017
333,Anjali,26000,2017
444,Manav,21000,2017
555,Anuj,25000,2017
666,Arjun,40000,2018
777,Smita,35000,2018
888,Karan,36000,2018
999,Anuj,32000,2018




How to create partition table in hive? : The way of creating partition table in Hive either Static Partition or Dynamic Partition both are same, different is only while loading/inserting the data based upon data/files.
While creating the partition tables, partition columns(column on which we want to partition the table/ segregate the data) should be mentioned at the end with partitioned by clause.


Query for creation of Partition table :
Create table table_name (column1 datatype, column2 datatype) partitioned by (partition column_column3 datatype);

Below I am creating two tables to use Static and Dynamic Partitions :

1.    create table static_part_table (emp_cd int, emp_nm string, emp_sal bigint) partitioned by (emp_join_year int) row format delimited fields terminated by ‘,’ ;

2.    create table dynamic_part_table (emp_cd int, emp_nm string, emp_sal bigint) partitioned by (emp_join_year int) row format delimited fields terminated by ‘,’ ;



Static Partition :
When to use static partitioning:
Static partitioning needs to be applied when we know data(supposed to be inserted) belongs to which partition. In case of static partitions, we need to have separate files for each partition or we want to insert data from any exists table in that case we can insert using where clause in HQL in specific partition.
If we have different segregated files then we can go for static partition and every time we load the data, need to specify the partition value and file name.

Now we’ll load the separate data files in static_part_table using below commands : 
load data local inpath '/home/g.aspiresit.001/static_part_file_2015.csv' into table static_part_table partition (emp_join_year = '2015');

load data local inpath '/home/g.aspiresit.001/static_part_file_2016.csv' into table static_part_table partition (emp_join_year = '2016');

load data local inpath '/home/g.aspiresit.001/static_part_file_2017.csv' into table static_part_table partition (emp_join_year = '2017');



Now we will check the created partitions in this table :




Below we can verify the data in different partitions :




Now we can see the partitioned data in hive warehouse path (hdfs location), here we can see that partitions are created as a directory and inside that our loaded file is available.





Verify the loaded partitioned data of a partition emp_join_year=2017:



or we can insert data in static partition table with segregate data using Where clause in HQL from any existed table. Suppose I have a Hive table all_year_data_table where data is available for the year 2015, 2016, 2017 and 2018 as per below screenshot.



Now, I want to insert year 2018 data in my table static_part_table in partition 2018, we can do this using below query :

INSERT OVERWRITE TABLE  static_part_table PARTITION(emp_join_year = '2018')
SELECT
emp_cd,
emp_nm,
emp_sal
FROM all_year_data_table
WHERE emp_join_year = '2018';


In below screenshot we can verify the no. of partitions before and after run the above query.




Now verify the data of newly created partition in hive : emp_join_year=2018 :






Dynamic Partition :

When to use dynamic partitioning:
In static partitioning every partitioning needs to be backed with individual hive statement which is not feasible for large number of partitions as it will require writing of lot of hive statements. In that scenario dynamic partitioning is suggested as automatically it will create as many number of partitions with single hive statement.
In this we will see that while insert the data using single query the data will be segregate automatically and create possible no. of partitions with that segregated data.
Above we have already created a table to use dynamic partition : dynamic_part_table

To use the dynamic partition in Hive first we need to set below parameters:
We can use the below parameters in hive shell or in hive-site.xml file.
In this case we have set the below properties in hive shell.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Now we will try to load the data in this table from a single file which have data for all the year, and now we want to create partition for each year data with a single query :
But to use dynamic partition we can not load data file directly in the table as it needs to read all the file line by line, then after full scan it creates possible no. of partitions, then load the segregated data in each partition, whether we load the data in table, blindly it loads the file in the table without read or scan it. So we will create a temporary table  temp_table, load the dynamic_part_file.csv in this table and then from that temporary table we will insert data into main target table dynamic_part_table.


Benefit of Temporary Table : Temporary tables are session oriented tables, as soon we will come out from the hive session, table will be automatic drop, so for temporary type of work, we can go for this type of table to save space.

Syntax to create Temporary Table :
create temporary table temp_table (emp_cd int, emp_nm string, emp_sal bigint, emp_join_year int) row format delimited fields terminated by ',' ;

In below screenshot we can verify that as soon we come out from hive session, table automatic dropped.




So now we will insert data in dynamic_part_table from temporary table temp_table to create dynamic partitions using below single query :

INSERT OVERWRITE TABLE  dynamic_part_table
PARTITION(emp_join_year)
SELECT
emp_cd,
emp_nm,
emp_sal,
emp_join_year
FROM temp_table;



In our data file we have 4 years data : 2015, 2016, 2017 and 2018, So it should create 4 partitions respectively for these years. Now We will verify these partitions in Hive along with data :


Now in hive warehouse path (hdfs) we will verify these partitions :

Next, we will verify the loaded/inserted data in these partitions :

And at last if you have noticed then we can see that while we loaded the data using load data …. Command then in hive warehouse path we can see directly exact loaded file name under partition(partition is a directory in hdfs),
And while inserted the data from another existed table then it is scanning the source table, running MapReduce job, then after internal screening its inserting the data. So in this case we are not able to see the file name, here we can see the data file inside partitions as 000000_0, 000000_1, 000000_2 as per no. of blocks of data/file.
 


Citing the all above explanation we come to the conclusion that data management becomes very efficient with the usage of partitioning in hive as its working as a kind of optimization to avoid full table scan. And finally, we cover the topic of static and dynamic partitions in hive. In next post we will discuss about nested partitions, Bucketing in hive tables and skewed table in hive.

Thank you so much to read this post, if you have any suggestions or query, kindly feel free to leave your valuable comments and feedback.

Thank You!



Comments

Popular posts from this blog

Transformations and Actions in Spark

How to convert XML String to JSON String in Spark-Scala

How to Convert a Spark DataFrame to Map in Scala