How does Spark SQL optimize joins? What are the optimization tricks for joins?

Let's try to understand how Spark 2.0 works with the DataFrame API. Because a DataFrame carries a schema, Spark has knowledge of the structure of the data and can use it to plan joins.

Whenever we apply a join, we have to think about which type of join to use and which will be the most efficient way to process the data smoothly.

When joining a big table to a small table, is it a good idea to broadcast the smaller table?

And when joining a big table to another big table, what optimization tricks are there? Does sorting help? Would Spark do the sorting internally? When should I repartition the data?


So the answer is: yes!
We can optimize joins in Spark SQL for better performance and smoother data processing:

As we all know, Spark SQL comes with the JoinSelection execution planning strategy, which translates a logical join into one of the supported physical join operators (per each operator's selection requirements).


JoinSelection checks six cases, covering five physical join operators, that can be applied for optimization:

1. BroadcastHashJoinExec: when the left or right join side can be broadcast, i.e. is smaller than spark.sql.autoBroadcastJoinThreshold, which is 10 MB by default (see the sketch after this list).
2. ShuffledHashJoinExec: when spark.sql.join.preferSortMergeJoin is disabled and it is possible to build hash maps for the left or right join side (among other requirements).
3. SortMergeJoinExec: when the left join keys are "orderable".
4. BroadcastNestedLoopJoinExec: when there are no join keys and the left or right join side could be broadcast.
5. CartesianProductExec: when it is an inner or cross join with no join condition.
6. BroadcastNestedLoopJoinExec: when no other case has matched.
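
To make the first case concrete, here is a minimal sketch of hinting the small side of a join. The DataFrames and the key "id" are placeholders invented for illustration, not from any particular workload:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder.appName("join-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Placeholder data: a large fact-like table and a small lookup table.
    val largeDF = spark.range(1000000).toDF("id")
    val smallDF = Seq((1L, "a"), (2L, "b")).toDF("id", "label")

    // broadcast() hints the planner to ship the small side to every executor,
    // steering join selection toward BroadcastHashJoinExec.
    val joined = largeDF.join(broadcast(smallDF), Seq("id"))
    joined.explain() // the physical plan should show BroadcastHashJoin

Without the hint, the same operator is chosen automatically whenever Spark's size estimate for one side stays under the threshold.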


So there is a lot of theory to digest to answer "what optimization tricks are there":
1. Does sorting help?
Yes. See the SortMergeJoinExec operator, and the first sketch below.
2. Would Spark do the sorting internally?
Yes, it will try, but humans can (still?) do wonders; the bucketing sketch below is one example.
3. When should I repartition the data?
Whenever we can, and whenever we know that pruning can help. Pruning reduces the number of rows to deal with and can effectively allow BroadcastHashJoinExec over ShuffledHashJoinExec or the others; the third sketch below combines both ideas.
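
On sorting (question 1), here is a small sketch, continuing the session above with synthetic data, that disables broadcasting so an equi-join of two large sides falls back to SortMergeJoinExec:

    // Setting the broadcast threshold to -1 disables automatic broadcasting,
    // so the equi-join below falls back to SortMergeJoinExec.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val bigA = spark.range(1000000).toDF("id")
    val bigB = spark.range(1000000).toDF("id")
    bigA.join(bigB, Seq("id")).explain() // look for SortMergeJoin in the plan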
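On doing better than Spark's internal sort (question 2), one "human wonder" is to pre-pay the shuffle and sort cost at write time. A hedged sketch reusing bigA and bigB from above; the bucket count and table names are arbitrary choices for illustration:

    // Writing both sides bucketed and sorted by the join key lets later
    // sort-merge joins on that key skip the shuffle (and, with sorted
    // buckets, much of the sort).
    bigA.write.bucketBy(16, "id").sortBy("id").saveAsTable("big_a_bucketed")
    bigB.write.bucketBy(16, "id").sortBy("id").saveAsTable("big_b_bucketed")

    spark.table("big_a_bucketed")
      .join(spark.table("big_b_bucketed"), Seq("id"))
      .explain() // ideally no Exchange appears around the join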
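And on repartitioning (question 3), a sketch combining pruning with repartitioning by the join key; the filter predicate is made up:

    import org.apache.spark.sql.functions.col

    // Pruning first shrinks one side; repartitioning both sides by the join
    // key then co-partitions them, so the join itself needs no extra shuffle.
    val prunedA = bigA.filter(col("id") < 100000).repartition(col("id"))
    val keyedB = bigB.repartition(col("id"))
    prunedA.join(keyedB, Seq("id")).explain()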

I also think that repartitioning the data can be of particular help with cost-based optimization, where table pruning could reduce the number of columns and rows, and in turn the table size, making one join cheaper than the alternatives.
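
As a hedged sketch of that idea (cost-based optimization landed in Spark 2.2, so this goes slightly beyond the Spark 2.0 discussion above): enable CBO and collect statistics so the planner can estimate table sizes and compare join costs. The table and column names reuse the bucketed table from earlier:

    // Enable cost-based optimization and gather table/column statistics so
    // the planner can estimate sizes and pick the cheaper join.
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.sql("ANALYZE TABLE big_a_bucketed COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE big_a_bucketed COMPUTE STATISTICS FOR COLUMNS id")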

