Some good practices to follow when dealing with DataFrame-based joins in Spark


Here are some good practices to follow when working with DataFrame-based joins in Spark:

1. Split complex joins so that each join step handles a single pair of DataFrames/Datasets at a time.

2. Re-partition each DataFrame on the join columns before joining (see the first sketch after this list).

3. If a DataFrame is going to be used in more than one place, persist it in memory before joining (and un-persist it at the end). Remember that for every action, the DAG is recomputed afresh (see the persist sketch after this list).

4. All derived columns should be calculated at the time the DataFrame is created, not at the time of joining.

5. Broadcast small tables across all worker nodes if necessary (see the broadcast sketch after this list).

6. Since Spark tasks are created based on the number of partitions, check the number of partitions of each DataFrame and reduce it (coalesce) if the count is too high.

7. While creating a DataFrame, select only the columns we are interested in. Selecting everything and then dropping the unwanted columns is not efficient (see the last sketch after this list).
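
A minimal sketch of points 2 and 4, assuming two small made-up DataFrames (orders and customers) joined on a hypothetical customer_id column; the names and values are for illustration only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-repartition-sketch").getOrCreate()
import spark.implicits._

// Made-up data; the column names are assumptions for illustration.
val orders    = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

// Point 4: compute derived columns when the DataFrame is created,
// not at join time.
val ordersWithTax = orders.withColumn("amount_with_tax", $"amount" * 1.1)

// Point 2: re-partition both sides on the join column so rows with the
// same key land in the same partitions before the join runs.
val joined = ordersWithTax
  .repartition($"customer_id")
  .join(customers.repartition($"customer_id"), Seq("customer_id"))

joined.show()
```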
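
A similar sketch for point 3, again with made-up DataFrames, where the customers DataFrame is reused in two joins and is therefore persisted once and un-persisted at the end:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()
import spark.implicits._

// Made-up data; the column names are assumptions for illustration.
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")
val orders    = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val refunds   = Seq((2, 1)).toDF("customer_id", "refunded_items")

// Point 3: customers is used in two joins, so cache it once
// instead of rebuilding it from the source for each action.
customers.persist(StorageLevel.MEMORY_ONLY)

val ordersWithName  = orders.join(customers, Seq("customer_id"))
val refundsWithName = refunds.join(customers, Seq("customer_id"))

ordersWithName.show()
refundsWithName.show()

// Release the cached data once it is no longer needed.
customers.unpersist()
```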
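
For point 5, a sketch using the broadcast hint from org.apache.spark.sql.functions on a made-up lookup table; the table and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
import spark.implicits._

// Made-up data; a large fact table joined to a small lookup table.
val orders    = Seq((1, "US", 100.0), (2, "IN", 250.0)).toDF("order_id", "country_code", "amount")
val countries = Seq(("US", "United States"), ("IN", "India")).toDF("country_code", "country_name")

// Point 5: hint Spark to broadcast the small table to every executor
// so the join avoids shuffling the large side.
val enriched = orders.join(broadcast(countries), Seq("country_code"))

enriched.show()
```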
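
Finally, a sketch for points 6 and 7 that reads only the needed columns and then coalesces the DataFrame; the input path, the column names, and the target partition count of 8 are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-select-sketch").getOrCreate()

// Point 7: select only the columns we actually need while reading,
// instead of loading everything and dropping columns later.
val orders = spark.read.parquet("/data/orders")
  .select("customer_id", "order_id", "amount")

// Point 6: inspect the partition count and shrink it if it is too high.
println(s"partitions before: ${orders.rdd.getNumPartitions}")
val compacted = orders.coalesce(8)
println(s"partitions after: ${compacted.rdd.getNumPartitions}")
```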
