groupByKey vs reduceByKey vs aggregateByKey in Apache Spark/Scala
Hello Friends, today I would like to write about groupByKey vs reduceByKey vs aggregateByKey in Apache Spark/Scala.

groupByKey() simply groups your dataset based on a key. It results in data shuffling when the RDD is not already partitioned.

reduceByKey() is something like grouping plus aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...). It shuffles less data than groupByKey() because values are combined within each partition before the shuffle.

aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result as a different type. In other words, it lets you take input of type x and produce an aggregated result of type y. For example, (1,2), (1,4) as input and (1,"six") as output. It also takes a zero value that is applied as the starting value for each key.

Below is an example for each. First I create a variable pairs_for_test by parallelizing an Array: Array(("a", 2), ("a", 3), ("a", 4), ("b", 7), ("b", ...)
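Here is a minimal sketch of that setup, assuming a spark-shell session where sc is the usual SparkContext. The original array is cut off after ("b", so the final value 5 below is an assumed placeholder:

val pairs_for_test = sc.parallelize(Array(("a", 2), ("a", 3), ("a", 4), ("b", 7), ("b", 5)))  // last value 5 is assumed; the original post is truncated here

// groupByKey: every value for a key is shuffled to that key's partition
val grouped = pairs_for_test.groupByKey()
grouped.collect().foreach(println)
// e.g. (a,CompactBuffer(2, 3, 4))
//      (b,CompactBuffer(7, 5))

Notice that groupByKey() sends every single value across the network, even though we only want them collected per key.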
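Next, reduceByKey() with addition as the reduce function. Because values are pre-combined within each partition before the shuffle, less data moves across the network:

// reduceByKey: values are combined map-side before shuffling
val summed = pairs_for_test.reduceByKey(_ + _)
summed.collect().foreach(println)
// e.g. (a,9)
//      (b,12)   // 12 reflects the assumed value 5 above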
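Finally, a sketch of aggregateByKey() that takes Int inputs and returns a String result, using "" as the zero value. The concatenation functions here are illustrative choices of mine, not from the original post:

// aggregateByKey: the zero value "" makes the result type String even though the inputs are Int
val concatenated = pairs_for_test.aggregateByKey("")(
  (acc, v) => acc + v.toString,  // seqOp: fold each value into the per-partition accumulator
  (a, b) => a + b                // combOp: merge accumulators from different partitions
)
concatenated.collect().foreach(println)
// e.g. (a,234)
//      (b,75)   // exact ordering can depend on partitioning

The two functions are what make the type change possible: seqOp merges a value (Int) into an accumulator (String) within a partition, and combOp merges two accumulators across partitions.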