Window Functions in Hive



Dear friends, today we'll learn about window functions in Hive.

Window functions in Hive let us define a window over a set of rows related to the current row, and then compute aggregations over that window.

Window functions in Hive include the following:

- LEAD:
  - The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
  - Returns NULL when the lead for the current row extends beyond the end of the window.

- LAG:
  - The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
  - Returns NULL when the lag for the current row extends before the beginning of the window.

- FIRST_VALUE
- LAST_VALUE



The OVER clause

- OVER with standard aggregates:
  - SUM
  - MIN
  - MAX
  - AVG
  - COUNT

- OVER with a PARTITION BY clause with one or more partitioning columns.

- OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns.




Analytics functions:
- RANK
- DENSE_RANK
- ROW_NUMBER
- CUME_DIST
- PERCENT_RANK
- NTILE




Here I'll give you a brief idea of these window functions in Hive; we will use stock market data to demonstrate them.


Below is the sample stock data for the stock table (stock_market_data.csv). Each line holds the date, stock ticker, open, high, low, close, and volume for the day:

20180721,A,26.68,28.2,26.41,26.58,45628
20180722,A,26.95,28.87,26.95,28.72,38494
20180723,A,28.56,29.41,28.76,29.3,32153
20180726,A,29.22,29.67,29.11,29.64,20254
20180727,A,29.73,29.73,28.81,28.87,31415
20180721,B,10.95,11.01,10.53,10.59,250893
20180722,B,10.8,10.99,10.73,10.82,258437
20180723,B,10.87,11.12,10.67,11.05,250988
20180726,B,11.14,11.37,11.01,11.23,264268
20180727,B,11.35,11.42,11.11,11.21,247625
20180721,C,264.99,265.15,254,254.24,423763
20180722,C,257.72,260,255.31,259.024,230470
20180723,C,257.21,260.38,256.28,259.94,190495
20180726,C,259.98,260.0998,257.71,259.28,150196
20180727,C,260.68,264.8,260.3,264.08,208846
20180721,D,32.35,32.44,31.85,32.01,31942
20180722,D,32.76,32.46,29.53,29.63,169784
20180723,D,29.67,30.72,28.95,30.53,180316
20180726,D,30.61,30.9,29.95,30.5,88487
20180727,D,30.61,30.63,26.82,29.14,192412
20180721,E,47.73,49.38,44.59,48.68,141958
20180722,E,49.02,49.54,48.34,49.01,107537
20180723,E,48.98,49.16,48.58,48.93,87209
20180726,E,48.89,49.37,48.82,49.24,59181
20180727,E,49.5,49.5,48.96,49.26,46345
20180721,F,26.94,28.28,26.52,26.67,83223
20180722,F,26.84,28.74,26.8,28.58,61384
20180723,F,28.59,29.35,28.54,29.23,77031
20180726,F,29.2,29.28,28.805,29.17,50617
20180727,F,29.34,29.39,28.83,29.22,51753
20180721,G,53.8,54.01,52.35,52.47,55067
20180722,G,52.98,53.25,52.64,52.91,49081
20180723,G,52.73,53.14,52.5101,52.75,34742
20180726,G,52.65,53.16,52.55,52.82,42832
20180727,G,53,53.55,52.98,53.47,40933




Now we will create a table in Hive and load this stock market data using the code below:

create table stock_market (
  date_ string, stock_ticker string, open double, high double,
  low double, close double, volume_for_the_day int
) row format delimited fields terminated by ',';

load data local inpath '/home/g.aspiredev.001/stock_market_data.csv' into table stock_market;



Now let's verify the data in the Hive table by selecting the first 10 records:

select * from stock_market limit 10;





Now we will try to deep dive into the window functions in Hive.

Lag : 
This function returns the value from a previous row. We can optionally specify an integer offset that designates how many rows back to look; the default offset is 1.

Below is the sample function for lag :

select stock_ticker, date_, close, lag(close,1) over(partition by stock_ticker order by date_) as yesterday_price from stock_market;

Here, lag displays the previous day's closing price for each stock_ticker. Lag must be used with the OVER clause, inside which we can use PARTITION BY and ORDER BY; the ORDER BY date_ is what makes "previous" mean the previous trading day.

In the output, we can see the closing price of the stock for each day alongside the previous day's price; the first row of each stock_ticker has no previous day, so yesterday_price is NULL.
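If you want to try this without a Hive cluster, here is a minimal local sketch using Python's built-in sqlite3 module; SQLite 3.25+ implements the same standard window-function semantics as Hive. The three rows are ticker A's first three days from the sample data, with the table trimmed to the columns the query needs.

```python
import sqlite3

# Local preview of LAG using SQLite, which shares standard window-function
# semantics with Hive. Rows: ticker A's first three days of sample data.
rows = [
    ("A", "20180721", 26.58),
    ("A", "20180722", 28.72),
    ("A", "20180723", 29.30),
]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, date_ text, close real)")
con.executemany("insert into stock_market values (?, ?, ?)", rows)

lag_rows = con.execute("""
    select date_, close,
           lag(close, 1) over (partition by stock_ticker order by date_) as yesterday_price
    from stock_market
    order by date_
""").fetchall()
# The first day has no previous row, so yesterday_price is NULL (None in Python).
```

The OVER clause inside the query carries over to Hive unchanged.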




Lead :
This function returns the value from a following row. We can optionally specify an integer offset that designates how many rows ahead to look; the default offset is 1.

Using the lead function, we can find whether the following day's closing price is higher or lower than today's, as follows:

select stock_ticker, date_, close, case (lead(close,1) over(partition by stock_ticker order by date_) - close) > 0 when true then "higher" when false then "lower" end as changes from stock_market;

In the output, each day is labelled "higher" or "lower"; the last day of each stock_ticker has no following row, so changes is NULL.
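The LEAD comparison can be previewed locally the same way; this sketch uses Python's sqlite3 (SQLite 3.25+) with a searched CASE, which is equivalent to the boolean CASE form used in the Hive query above, on ticker A's first three days.

```python
import sqlite3

# Local preview of LEAD: label each day by whether the next close is higher.
rows = [("A", "20180721", 26.58), ("A", "20180722", 28.72), ("A", "20180723", 29.30)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, date_ text, close real)")
con.executemany("insert into stock_market values (?, ?, ?)", rows)

lead_rows = con.execute("""
    select date_, close,
           case when lead(close, 1) over (partition by stock_ticker order by date_) > close
                then 'higher'
                when lead(close, 1) over (partition by stock_ticker order by date_) < close
                then 'lower'
           end as changes
    from stock_market
    order by date_
""").fetchall()
# The last day has no following row, so its label is NULL (None in Python).
```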




FIRST_VALUE :
It returns the value of the first row in the window. With the query below, we can see the first day's high price of each stock_ticker on every row.

select stock_ticker, first_value(high) over(partition by stock_ticker order by date_) as first_high from stock_market;




LAST_VALUE :
It is the reverse of FIRST_VALUE: it returns the value of the last row in the window. Note that once ORDER BY is present, the default window frame ends at the current row, so we widen the frame to the whole partition to get the last day's high price of each stock_ticker.

select stock_ticker, last_value(high) over(partition by stock_ticker order by date_ rows between unbounded preceding and unbounded following) as last_high from stock_market;
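The frame behaviour of FIRST_VALUE and LAST_VALUE can be checked locally with Python's sqlite3 (SQLite 3.25+), again on ticker A's first three highs from the sample data:

```python
import sqlite3

# FIRST_VALUE works with the default frame; LAST_VALUE needs the frame widened
# to the whole partition, otherwise it would just return the current row.
rows = [("A", "20180721", 28.20), ("A", "20180722", 28.87), ("A", "20180723", 29.41)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, date_ text, high real)")
con.executemany("insert into stock_market values (?, ?, ?)", rows)

fv_rows = con.execute("""
    select date_,
           first_value(high) over (partition by stock_ticker order by date_) as first_high,
           last_value(high) over (partition by stock_ticker order by date_
               rows between unbounded preceding and unbounded following) as last_high
    from stock_market
    order by date_
""").fetchall()
# Every row shows the partition's first high (28.2) and last high (29.41).
```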





Let us now see the usage of the aggregate functions with OVER.


Sum :
It returns the sum of the values of the expression over the window defined in the OVER clause. With the query below, we can find the sum of all the closing prices for each particular stock_ticker.

select stock_ticker, sum(close) over(partition by stock_ticker) as total_sum from stock_market;

For each stock_ticker, the sum of all its closing prices is calculated and repeated on every row of that partition.




Find running total :
Suppose we want the running total of volume_for_the_day across the days for every stock_ticker; we can get it with the query below:

select stock_ticker, date_, volume_for_the_day, sum(volume_for_the_day) over(partition by stock_ticker order by date_) as running_total from stock_market;


In the output, we can see volume_for_the_day for each day, and running_total is the cumulative sum of the volumes elapsed so far; adding ORDER BY to the OVER clause restricts the default frame to the rows up to and including the current one.
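The running-total behaviour can be verified locally with Python's sqlite3 (SQLite 3.25+), using ticker A's first three volumes from the sample data:

```python
import sqlite3

# ORDER BY inside OVER gives a default frame of "unbounded preceding to the
# current row", so the SUM accumulates day by day.
rows = [("A", "20180721", 45628), ("A", "20180722", 38494), ("A", "20180723", 32153)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, date_ text, volume_for_the_day int)")
con.executemany("insert into stock_market values (?, ?, ?)", rows)

running = con.execute("""
    select date_, volume_for_the_day,
           sum(volume_for_the_day) over (partition by stock_ticker order by date_) as running_total
    from stock_market
    order by date_
""").fetchall()
# running_total: 45628, then 45628+38494=84122, then 84122+32153=116275
```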




Find the percentage of each row value :
Now let's take a scenario where we need to find each day's volume_for_the_day as a percentage of the total volume for that particular stock_ticker; that can be done as follows:

select stock_ticker, date_, volume_for_the_day, (volume_for_the_day*100/(sum(volume_for_the_day) over(partition by stock_ticker)))  as percentage from stock_market;


In the output, we can see each day's percentage contribution to the total volume for that stock_ticker; the percentages within each stock_ticker add up to 100.
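A local sketch of the percentage calculation with Python's sqlite3 (SQLite 3.25+), on ticker A's first three volumes. One caveat when porting: Hive's `/` always produces a double, but SQLite performs integer division on integers, so the sketch uses `100.0` to force floating-point arithmetic.

```python
import sqlite3

rows = [("A", "20180721", 45628), ("A", "20180722", 38494), ("A", "20180723", 32153)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, date_ text, volume_for_the_day int)")
con.executemany("insert into stock_market values (?, ?, ?)", rows)

# 100.0 (not 100) keeps the division floating-point in SQLite.
pct = con.execute("""
    select date_, volume_for_the_day,
           volume_for_the_day * 100.0 / sum(volume_for_the_day)
               over (partition by stock_ticker) as percentage
    from stock_market
    order by date_
""").fetchall()
# The three percentages sum to 100 (up to floating-point rounding).
```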





Min :
It returns the minimum value of the column over the window defined by the OVER clause. We can find the minimum closing price for each particular stock_ticker using the query below.

select stock_ticker, min(close) over(partition by stock_ticker) as minimum from stock_market;




Max :
It returns the maximum value of the column over the window defined by the OVER clause. We can find the maximum closing price for each particular stock_ticker using the query below.

select stock_ticker, max(close) over(partition by stock_ticker) as maximum from stock_market;



Avg :
It returns the average value of the column over the window defined by the OVER clause. We can find the average closing price for each particular stock_ticker using the query below.

select stock_ticker, avg(close) over(partition by stock_ticker) as average from stock_market;




Count :
It returns the count of the values of the expression over the window defined by the OVER clause. We can find the number of rows present for each stock_ticker using the query below.

select stock_ticker, count(stock_ticker) over(partition by stock_ticker) as count from stock_market;

For each partition, the count of rows is calculated and repeated on every row of that partition, as we can see in the output.
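All four aggregates can be previewed together in one local query; this sketch uses Python's sqlite3 (SQLite 3.25+) with a few closing prices for two tickers taken from the sample data:

```python
import sqlite3

# Without ORDER BY, the window is the whole partition, so every row of a
# ticker carries the same MIN/MAX/AVG/COUNT values.
rows = [("A", 26.58), ("A", 28.72), ("A", 29.30), ("B", 10.59), ("B", 10.82)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, close real)")
con.executemany("insert into stock_market values (?, ?)", rows)

aggs = con.execute("""
    select stock_ticker,
           min(close)   over (partition by stock_ticker) as minimum,
           max(close)   over (partition by stock_ticker) as maximum,
           avg(close)   over (partition by stock_ticker) as average,
           count(close) over (partition by stock_ticker) as cnt
    from stock_market
    order by stock_ticker, close
""").fetchall()
# Ticker A rows all show min 26.58, max 29.3, count 3; ticker B rows count 2.
```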






Now we will work through the analytic functions.

Rank :
The rank function returns the rank of each value within the result set of the OVER clause. If two values are the same, both receive the same rank, and the subsequent rank is skipped for the next value.

The query below ranks the closing prices of the stock for each stock_ticker.

select stock_ticker, close, rank() over(partition by stock_ticker order by close) as closing_rank from stock_market;






Dense_rank :
It is the same as rank(), except that when duplicate values are present no ranks are skipped for the subsequent rows; each distinct value receives the next rank in sequence.

The query below ranks the closing prices for each stock_ticker using dense_rank().

select stock_ticker, close, dense_rank() over(partition by stock_ticker order by close) as closing_dense_rank from stock_market;
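The sample closes contain no ties, so to see the rank/dense_rank difference this local sketch uses Python's sqlite3 (SQLite 3.25+) with hypothetical prices that include a duplicate:

```python
import sqlite3

# Hypothetical closing prices with a tie, to show RANK skipping a position
# after duplicates while DENSE_RANK does not.
rows = [("A", 10.0), ("A", 20.0), ("A", 20.0), ("A", 30.0)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, close real)")
con.executemany("insert into stock_market values (?, ?)", rows)

ranks = con.execute("""
    select close,
           rank()       over (partition by stock_ticker order by close) as closing_rank,
           dense_rank() over (partition by stock_ticker order by close) as closing_dense_rank
    from stock_market
    order by close
""").fetchall()
# rank:       1, 2, 2, 4  (position 3 is skipped after the tie)
# dense_rank: 1, 2, 2, 3  (no position is skipped)
```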





Row_number :
row_number() returns a continuous sequence of numbers (1, 2, 3, and so on) for the rows in each partition, in the order defined by the OVER clause, even when there are ties.

Using the below query, we will get the stock_ticker, closing price and its row number for each stock_ticker.

select stock_ticker, close, row_number() over(partition by stock_ticker order by close) as row_number from stock_market;




Cume_dist :
It returns the cumulative distribution of a value: the fraction of rows whose value is less than or equal to the current row's value. The result lies in the range (0, 1]. For example, if there are 10 rows with no ties, the first row's cume_dist is 1/10, the second's is 2/10, and so on up to 10/10.
The cume_dist is calculated within the result set returned by the OVER clause. The query below computes the cumulative distribution of each record for every stock_ticker.

select stock_ticker, cume_dist() over(partition by stock_ticker order by close) as cumulative_distribution from stock_market;




Percent_rank :
It returns the percentage rank of each row within the result set of the OVER clause. Percent_rank is calculated from the rank of the row as (rank - 1) / (total_rows_in_group - 1). If the group has only one row, the percent_rank is 0.

The query below calculates the percent_rank for every row in each partition.

select stock_ticker, close, percent_rank() over(partition by stock_ticker order by close) as percent_rank from stock_market;
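cume_dist and percent_rank behave differently around ties, which is easiest to see on a small example; this local sketch uses Python's sqlite3 (SQLite 3.25+) with the same hypothetical tied prices:

```python
import sqlite3

# Hypothetical prices with a tie: cume_dist counts rows <= the current value,
# percent_rank is (rank - 1) / (rows - 1).
rows = [("A", 10.0), ("A", 20.0), ("A", 20.0), ("A", 30.0)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, close real)")
con.executemany("insert into stock_market values (?, ?)", rows)

dist = con.execute("""
    select close,
           cume_dist()    over (partition by stock_ticker order by close) as cumulative_distribution,
           percent_rank() over (partition by stock_ticker order by close) as pct_rank
    from stock_market
    order by close
""").fetchall()
# cume_dist:    0.25, 0.75, 0.75, 1.0  (3 of 4 rows are <= 20)
# percent_rank: 0.0,  1/3,  1/3,  1.0  (both tied rows share rank 2)
```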



Ntile :
It returns the bucket number for each row. For example, NTILE(2) creates 2 buckets over the result set of the OVER clause, placing the first 50% of the rows in the 1st bucket and the remaining 50% in the 2nd bucket; if the rows do not divide evenly, the earlier buckets receive one extra row.

The query below creates 2 buckets for every stock_ticker; the lowest 50% of closing prices for each stock_ticker go into the 1st bucket, and the rest into the 2nd.

select stock_ticker, close, ntile(2) over(partition by stock_ticker order by close ) as ntile_bucket from stock_market;

In the output, we can see two buckets per stock_ticker: the lower closing prices fall in bucket 1 and the higher ones in bucket 2.
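With 5 days per ticker, NTILE(2) cannot split evenly, so the first bucket gets the extra row; this local sketch shows that using Python's sqlite3 (SQLite 3.25+) on five closes for one ticker:

```python
import sqlite3

# Five closes for one ticker; NTILE(2) makes buckets of 3 and 2 rows,
# with the extra row going to the first bucket.
rows = [("A", 26.58), ("A", 28.72), ("A", 28.87), ("A", 29.30), ("A", 29.64)]
con = sqlite3.connect(":memory:")
con.execute("create table stock_market (stock_ticker text, close real)")
con.executemany("insert into stock_market values (?, ?)", rows)

buckets = con.execute("""
    select close,
           ntile(2) over (partition by stock_ticker order by close) as ntile_bucket
    from stock_market
    order by close
""").fetchall()
# Bucket assignments in ascending close order: 1, 1, 1, 2, 2
```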




This is how we can perform window operations in Hive.

I hope this blog helped you understand what window functions are and how to implement them in Hive. If you have any questions or suggestions, kindly leave a comment.


Thank You!

