groupByKey vs reduceByKey vs aggregateByKey in Apache Spark/Scala
Hello Friends, today I would like to write about groupByKey vs reduceByKey vs aggregateByKey in Apache Spark/Scala.

groupByKey() simply groups your dataset based on a key. It results in data shuffling when the RDD is not already partitioned.

reduceByKey() is something like grouping plus aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...). It shuffles less data than groupByKey() because values are combined within each partition before the shuffle.

aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result as a different type. In other words, it lets you take input of type x and produce an aggregated result of type y. For example, (1,2), (1,4) as input and (1,"six") as output. It also takes a zero value that is applied as the starting value for each key.

Below is an example for each. First I create a variable pairs_for_test by parallelizing an Array: Array(("a", 2), ("a", 3), ("a", 4), ("b", 7), ("b", ...)
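Here is a minimal sketch of that setup, assuming a spark-shell session where sc is the usual SparkContext. The original array is cut off after ("b", so the final value 5 below is an assumed placeholder:

val pairs_for_test = sc.parallelize(Array(("a", 2), ("a", 3), ("a", 4), ("b", 7), ("b", 5)))  // last value 5 is assumed; the original post is truncated here

// groupByKey: every value for a key is shuffled to that key's partition
val grouped = pairs_for_test.groupByKey()
grouped.collect().foreach(println)
// e.g. (a,CompactBuffer(2, 3, 4))
//      (b,CompactBuffer(7, 5))

Notice that groupByKey() sends every single value across the network, even though we only want them collected per key.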
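Next, reduceByKey() with addition as the reduce function. Because values are pre-combined within each partition before the shuffle, less data moves across the network:

// reduceByKey: values are combined map-side before shuffling
val summed = pairs_for_test.reduceByKey(_ + _)
summed.collect().foreach(println)
// e.g. (a,9)
//      (b,12)   // 12 reflects the assumed value 5 above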
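Finally, a sketch of aggregateByKey() that takes Int inputs and returns a String result, using "" as the zero value. The concatenation functions here are illustrative choices of mine, not from the original post:

// aggregateByKey: the zero value "" makes the result type String even though the inputs are Int
val concatenated = pairs_for_test.aggregateByKey("")(
  (acc, v) => acc + v.toString,  // seqOp: fold each value into the per-partition accumulator
  (a, b) => a + b                // combOp: merge accumulators from different partitions
)
concatenated.collect().foreach(println)
// e.g. (a,234)
//      (b,75)   // exact ordering can depend on partitioning

The two functions are what make the type change possible: seqOp merges a value (Int) into an accumulator (String) within a partition, and combOp merges two accumulators across partitions.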