groupByKey vs reduceByKey vs aggregateByKey in Apache Spark/Scala

Hello Friends,

Today I would like to write about groupByKey vs reduceByKey vs aggregateByKey in Apache Spark/Scala:

groupByKey() simply groups the values in your dataset by key. It shuffles every key-value pair across the network (unless the RDD is already partitioned by that key), because no values are combined locally before the shuffle.

reduceByKey() is grouping plus aggregation in one step. We can say reduceByKey() is roughly equivalent to groupByKey() followed by a reduce over each group. It shuffles less data than groupByKey() because it combines values locally on each partition before the shuffle.
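
As a rough sketch of that equivalence (the rdd variable here is my own illustration, not from the original post), both commands below return Array((k,3)), but the first ships every value across the network while the second pre-sums within each partition:

scala> val rdd = sc.parallelize(Array(("k", 1), ("k", 2)))
scala> rdd.groupByKey().mapValues(_.reduce(_ + _)).collect   // ships all values to the reducer
scala> rdd.reduceByKey(_ + _).collect                        // combines on each partition first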

aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, the input can be of type x while the aggregated result is of type y.
For example, (1,2), (1,4) as input and (1,"six") as output. It also takes a zero value, which is used as the initial accumulator for each key on each partition.
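
To make the type change concrete, here is a minimal sketch (the pairs variable is mine) that takes Int values as input and produces a Set[Int] per key, so the result type differs from the input type:

scala> val pairs = sc.parallelize(Array((1, 2), (1, 4)))
scala> pairs.aggregateByKey(Set.empty[Int])((set, v) => set + v, (s1, s2) => s1 ++ s2).collect

The zero value is the empty Set, the first function adds each Int to the set within a partition, and the second function unions the partial sets, so this should return Array((1,Set(2, 4))).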


Below is an example of each.
First, I create an RDD called pairs_for_test by parallelizing the following Array, and then apply each function to it:
Array(("a", 2), ("a", 3), ("a", 4), ("b", 7), ("b", 5), ("a", 10), ("b", 15))

The aggregated output for keys a and b should be:
a = 2 + 3 + 4 + 10 = 19
b = 7 + 5 + 15 = 27
So the final output should be Array((a,19), (b,27)), which we can verify with the commands below:


So let's start:

scala> val pairs_for_test = sc.parallelize(Array(("a", 2), ("a", 3), ("a", 4), ("b", 7), ("b", 5), ("a", 10), ("b", 15)))



groupByKey:

scala> val res_groupByKey=pairs_for_test.groupByKey().map(x=> (x._1, x._2.sum))
res_groupByKey: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[18] at map at <console>:25
scala> res_groupByKey.collect
res15: Array[(String, Int)] = Array((a,19), (b,27))



reduceByKey:

scala> val res_reduceByKey = pairs_for_test.reduceByKey(_ + _)
res_reduceByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[19] at reduceByKey at <console>:25
scala> res_reduceByKey.collect
res16: Array[(String, Int)] = Array((a,19), (b,27))


aggregateByKey:

In aggregateByKey, 0 is the zero (initial) value, the first _+_ is the seqOp that merges values within each partition, and the second _+_ is the combOp that merges the partial results across partitions:

scala> val res_aggregateByKey = pairs_for_test.aggregateByKey(0)(_+_, _+_)
res_aggregateByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[20] at aggregateByKey at <console>:25
scala> res_aggregateByKey.collect
res17: Array[(String, Int)] = Array((a,19), (b,27))
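
As a follow-up sketch (the sumCount name is mine, not from the original code), here is where the different result type really pays off: computing a per-key average by carrying a (sum, count) pair, something reduceByKey could not return directly from the Int inputs:

scala> val sumCount = pairs_for_test.aggregateByKey((0, 0))(
     |   (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp: add the value to the running sum, bump the count
     |   (a, b) => (a._1 + b._1, a._2 + b._2))    // combOp: merge partial (sum, count) pairs across partitions
scala> sumCount.mapValues { case (sum, count) => sum.toDouble / count }.collect

This should return Array((a,4.75), (b,9.0)), since a = 19 / 4 and b = 27 / 3.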


All three approaches produce the same result here, but as described above they differ in how much data they shuffle and in the result types they can return.



I hope this post helps you. If you like it, please Like and Share, and if you have any queries or suggestions, drop them in the comments.

Thank You!

