Window Functions in Hive
Dear friends, today we'll learn about window functions in Hive.
Window functions in Hive let us define a window over a set of rows and perform aggregations and other calculations across the rows surrounding the current row.
Window functions in Hive include the following:
- LEAD
  - The number of rows to lead can optionally be specified. If it is not specified, the lead defaults to one row.
  - Returns NULL when the lead for the current row extends beyond the end of the window.
- LAG
  - The number of rows to lag can optionally be specified. If it is not specified, the lag defaults to one row.
  - Returns NULL when the lag for the current row extends before the beginning of the window.
- FIRST_VALUE
- LAST_VALUE
The OVER clause
- OVER with standard aggregates:
  - SUM
  - MIN
  - MAX
  - AVG
  - COUNT
- OVER with a PARTITION BY clause with one or more partitioning columns.
- OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns.
Analytic functions:
- RANK
- DENSE_RANK
- ROW_NUMBER
- CUME_DIST
- PERCENT_RANK
- NTILE
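Before we jump into the examples, here is a minimal sketch of the general shape of a window function call. The table and column names below (some_table, col_a, col_b, col_c) are placeholders used only for illustration:

-- a window function is an aggregate or analytic function followed by OVER(...)
select col_a,
  sum(col_b) over(partition by col_a order by col_c) as running_sum,
  lead(col_b, 2) over(partition by col_a order by col_c) as two_rows_ahead
from some_table;

The function (an aggregate such as SUM, or an analytic function such as LEAD) is followed by an OVER clause, which may contain a PARTITION BY and an ORDER BY.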
Here I’ll give you a brief idea of these window functions in Hive. We will use stock market data to demonstrate them.
Below is the sample stock data for the stock table (stock_market_data.csv):
20180721,A,26.68,28.2,26.41,26.58,45628
20180722,A,26.95,28.87,26.95,28.72,38494
20180723,A,28.56,29.41,28.76,29.3,32153
20180726,A,29.22,29.67,29.11,29.64,20254
20180726,A,29.73,29.73,28.81,28.87,31415
20180721,B,10.95,11.01,10.53,10.59,250893
20180722,B,10.8,10.99,10.73,10.82,258437
20180723,B,10.87,11.12,10.67,11.05,250988
20180726,B,11.14,11.37,11.01,11.23,264268
20180726,B,11.35,11.42,11.11,11.21,247625
20180721,C,264.99,265.15,254,254.24,423763
20180722,C,257.72,260,255.31,259.024,230470
20180723,C,257.21,260.38,256.28,259.94,190495
20180726,C,259.98,260.0998,257.71,259.28,150196
20180726,C,260.68,264.8,260.3,264.08,208846
20180721,D,32.35,32.44,31.85,32.01,31942
20180722,D,32.76,32.46,29.53,29.63,169784
20180723,D,29.67,30.72,28.95,30.53,180316
20180726,D,30.61,30.9,29.95,30.5,88487
20180726,D,30.61,30.63,26.82,29.14,192412
20180721,E,47.73,49.38,44.59,48.68,141958
20180722,E,49.02,49.54,48.34,49.01,107537
20180723,E,48.98,49.16,48.58,48.93,87209
20180726,E,48.89,49.37,48.82,49.24,59181
20180726,E,49.5,49.5,48.96,49.26,46345
20180721,F,26.94,28.28,26.52,26.67,83223
20180722,F,26.84,28.74,26.8,28.58,61384
20180723,F,28.59,29.35,28.54,29.23,77031
20180726,F,29.2,29.28,28.805,29.17,50617
20180726,F,29.34,29.39,28.83,29.22,51753
20180721,G,53.8,54.01,52.35,52.47,55067
20180722,G,52.98,53.25,52.64,52.91,49081
20180723,G,52.73,53.14,52.5101,52.75,34742
20180726,G,52.65,53.16,52.55,52.82,42832
20180726,G,53,53.55,52.98,53.47,40933
Now we will create a table in Hive and load this stock market data using the code below:
create table stock_market (Date_ string, Stock_Ticker string, Open double, High double,
Low double, Close double, Volume_for_the_day int)
row format delimited fields terminated by ',';
load data local inpath '/home/g.aspiredev.001/stock_market_data.csv' into table stock_market;
Now let’s verify the data in the Hive table by selecting just 10 records:
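A simple query for this check could be:

-- any 10 rows will do, since no ordering is specified
select * from stock_market limit 10;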
Now we will try to deep dive into the window functions in Hive.
Lag :
This function returns the value from a previous row. We can specify an integer offset that designates how many rows back to look; if it is not specified, the default offset of 1 is used.
Below is a sample query using lag:
select stock_ticker, date_, close,
lag(close, 1) over(partition by stock_ticker order by date_) as yesterday_price
from stock_market;
Here, using lag, we can display the previous day’s closing price of the stock_ticker. Lag is used together with the over clause; inside over we can use the partition by and order by clauses. We partition by stock_ticker and order by date_ so that the previous row is the previous trading day.
In the below screenshot, we can see the closing price of the stock for the day and the previous day’s price.
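As a side note, lag (and lead) also take an optional third argument: a default value to return instead of NULL at the window boundary. Assuming that form is available in your Hive version, a sketch that shows 0.0 instead of NULL for the first trading day of each stock_ticker:

-- 0.0 is returned for the first row of each partition instead of NULL
select stock_ticker, date_, close,
lag(close, 1, 0.0) over(partition by stock_ticker order by date_) as yesterday_price_or_zero
from stock_market;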
Lead :
This function returns the value from a following row. We can specify an integer offset that designates how many rows ahead to look; if it is not specified, the default offset of 1 is used.
Now, using the lead function, we will find whether the following day’s closing price is higher or lower than today’s. That can be done with the below query:
select stock_ticker, date_, close,
case (lead(close, 1) over(partition by stock_ticker order by date_) - close) > 0
  when true then "higher"
  when false then "lesser"
end as Changes
from stock_market;
In the below screenshot, we
can see the expected result.
FIRST_VALUE :
It returns the value of the first row in the window. With the below query, we can see the high price of the first row of each stock_ticker's window, repeated for all the days.
select stock_ticker, first_value(high)
over(partition by stock_ticker) as first_high from stock_market;
LAST_VALUE :
It is the reverse of FIRST_VALUE: it returns the value of the last row in the window. With the below query, we can see the high price of the last row of each stock_ticker's window, repeated for all the days.
select stock_ticker, last_value(high)
over(partition by stock_ticker) as last_high from stock_market;
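Note that neither of the two queries above specifies an order by inside over, so "first" and "last" simply follow the order in which Hive reads the rows. If we specifically want the high of the chronologically first and last day, a sketch would be the following (ordering by date_; last_value then also needs an explicit window frame, because with an order by the default frame ends at the current row):

select stock_ticker, date_,
  first_value(high) over(partition by stock_ticker order by date_) as first_day_high,
  -- the explicit frame makes last_value look at the whole partition, not just up to the current row
  last_value(high) over(partition by stock_ticker order by date_
    rows between unbounded preceding and unbounded following) as last_day_high
from stock_market;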
Let us now see the usage of aggregate functions with OVER.
Sum :
It returns the sum of all the values of the expression over the window defined by the over clause. With the below query, we can find the sum of all the closing prices for each stock_ticker.
select stock_ticker, sum(close)
over(partition by stock_ticker) as total_sum from stock_market;
For each stock_ticker, the sum of all the closing prices will be calculated; we can see this in the below screenshot.
Find running total :
Suppose we want the running total of volume_for_the_day across all the days for every stock_ticker; we can get it with the below query:
select stock_ticker, date_,
volume_for_the_day, sum(volume_for_the_day) over(partition by stock_ticker
order by date_) as running_total from stock_market;
In the below screenshot, we can see the volume_for_the_day for each day, and the running total is the sum of the volume_for_the_day values elapsed so far.
Find the percentage of each row value :
Now let’s take a scenario where we need to find the percentage of volume_for_the_day relative to the total volume for that particular stock_ticker. That can be done as follows:
select stock_ticker, date_, volume_for_the_day,
(volume_for_the_day * 100 / (sum(volume_for_the_day) over(partition by stock_ticker))) as percentage
from stock_market;
In the below screenshot, we can see the percentage contribution of each day’s volume to the total volume for that stock_ticker.
Min :
It returns the minimum value of the column over the rows in the window defined by the over clause. We can find the minimum closing price for each stock_ticker using the below query.
select stock_ticker,
min(close) over(partition by stock_ticker) as minimum from stock_market;
Max :
It returns the maximum value of the column over the rows in the window defined by the over clause. We can find the maximum closing price for each stock_ticker using the below query.
select stock_ticker,
max(close) over(partition by stock_ticker) as maximum from stock_market;
Avg :
It returns the average value of the column over the rows that the over clause returns. We can find the average closing price for each stock_ticker using the below query.
select stock_ticker,
avg(close) over(partition by stock_ticker) as average from stock_market;
Count :
It returns the count of all the values of the expression over the window defined by the over clause. We can find the number of rows present for each stock_ticker using the below query.
select stock_ticker, count(stock_ticker)
over(partition by stock_ticker) as count from stock_market;
For each partition, the count of stock_ticker rows will be calculated; we can see this in the below screenshot.
Now we will
work on some Analytic functions.
Rank :
The rank function returns the rank of each value within the result set defined by the over clause. If two values are the same, both get the same rank, and the subsequent rank is skipped for the next value.
The below query will rank the closing prices of the stock for each stock_ticker; we can see the result in the below screenshot.
select stock_ticker, close,
rank() over(partition by stock_ticker order by close) as closing_rank from stock_market;
Dense_rank :
It is the same as the rank() function, except that when duplicate values are present, the rank is not skipped for the subsequent rows; each unique value gets a rank in sequence.
The below query will rank the closing prices of the stock for each stock_ticker; we can see the result in the below screenshot.
select stock_ticker, close,
dense_rank() over(partition by stock_ticker order by close) as closing_dense_rank
from stock_market;
Row_number :
Row_number returns a continuous sequence of numbers for all the rows in the result set defined by the over clause.
Using the below query, we
will get the stock_ticker, closing price and its row number for each stock_ticker.
select stock_ticker, close,
row_number() over(partition by stock_ticker order by close) as row_number from stock_market;
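To make the difference between rank, dense_rank and row_number concrete, here is a sketch that computes all three side by side; duplicate closing prices, if any, would get the same rank and dense_rank but different row numbers:

select stock_ticker, close,
  rank() over(partition by stock_ticker order by close) as closing_rank,
  dense_rank() over(partition by stock_ticker order by close) as closing_dense_rank,
  row_number() over(partition by stock_ticker order by close) as row_num
from stock_market;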
Cume_dist :
It returns the cumulative distribution of a value, ranging from 0 to 1. For example, if the total number of records is 10, the cume_dist for the 1st row will be 1/10, for the second 2/10, and so on up to 10/10 (assuming no ties).
The cume_dist is calculated over the result set returned by the over clause. The below query gives the cumulative distribution of each record for every stock_ticker.
select stock_ticker, cume_dist()
over(partition by stock_ticker order by close) as cumulative_distribution from stock_market;
Percent_rank :
It returns the percentage rank of each row within the result set of the over clause. Percent_rank is calculated from the rank of the row as (rank - 1)/(total_rows_in_group - 1). If the result set has only one row, the percent_rank is 0.
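For example, with the 5 rows per stock_ticker in our sample data, the row ranked 3rd by closing price gets a percent_rank of (3 - 1)/(5 - 1) = 0.5.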
The below query will calculate the percent_rank for every row in each partition; we can see the result in the below screenshot.
select stock_ticker, close,
percent_rank() over(partition by stock_ticker order by close) as percent_rank
from stock_market;
Ntile :
It returns the bucket number of the particular value. For example, if we say ntile(2), it creates 2 buckets based on the result set of the over clause and places the first 50% of the records in the 1st bucket and the remaining 50% in the 2nd bucket.
The below query will create 2 buckets for every stock_ticker; the first 50% of records for each stock_ticker will be in the 1st bucket, and the rest in the 2nd.
select stock_ticker, close,
ntile(2) over(partition by stock_ticker order by close ) as ntile_bucket from
stock_market;
In the below screenshot, we can see that 2 buckets are created for every stock_ticker: the lowest 50% of closing prices fall in the first bucket and the next 50% in the second bucket for each stock_ticker.
This is how we can perform window
operations in Hive.
I hope this blog helped you understand what window functions are and how to use them in Hive. If you have any questions or suggestions, kindly leave a comment.
Thank You!