Some good practices to follow when dealing with DataFrame-based joins in Spark


Here are some good practices to follow when working with DataFrame-based joins in Spark:

1. Split complex joins so that each join step handles a single pair of DataFrames/Datasets at a time.

2. Re-partition each DataFrame on the join columns before joining (see the first sketch after this list).

3. If a DataFrame is going to be used in more than one place, persist it in memory before joining (and un-persist it at the end). Remember that for every action, the DAG is recomputed afresh (see the persist sketch after this list).

4. All derived columns should be calculated at the time the DataFrame is created, not at the time of joining.

5. Broadcast small tables across all worker nodes if necessary (see the broadcast sketch after this list).

6. Since Spark tasks are created based on the number of partitions, check the number of partitions of each DataFrame and reduce it (coalesce) if the count is too high.

7. While creating a DataFrame, select only the columns we are interested in. Selecting everything and then dropping the unwanted columns is not efficient (see the last sketch after this list).
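
A minimal sketch of points 2 and 4, assuming two small made-up DataFrames (orders and customers) joined on a hypothetical customer_id column; the names and values are for illustration only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-repartition-sketch").getOrCreate()
import spark.implicits._

// Made-up data; the column names are assumptions for illustration.
val orders    = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

// Point 4: compute derived columns when the DataFrame is created,
// not at join time.
val ordersWithTax = orders.withColumn("amount_with_tax", $"amount" * 1.1)

// Point 2: re-partition both sides on the join column so rows with the
// same key land in the same partitions before the join runs.
val joined = ordersWithTax
  .repartition($"customer_id")
  .join(customers.repartition($"customer_id"), Seq("customer_id"))

joined.show()
```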
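
A similar sketch for point 3, again with made-up DataFrames, where the customers DataFrame is reused in two joins and is therefore persisted once and un-persisted at the end:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()
import spark.implicits._

// Made-up data; the column names are assumptions for illustration.
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")
val orders    = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val refunds   = Seq((2, 1)).toDF("customer_id", "refunded_items")

// Point 3: customers is used in two joins, so cache it once
// instead of rebuilding it from the source for each action.
customers.persist(StorageLevel.MEMORY_ONLY)

val ordersWithName  = orders.join(customers, Seq("customer_id"))
val refundsWithName = refunds.join(customers, Seq("customer_id"))

ordersWithName.show()
refundsWithName.show()

// Release the cached data once it is no longer needed.
customers.unpersist()
```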
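
For point 5, a sketch using the broadcast hint from org.apache.spark.sql.functions on a made-up lookup table; the table and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
import spark.implicits._

// Made-up data; a large fact table joined to a small lookup table.
val orders    = Seq((1, "US", 100.0), (2, "IN", 250.0)).toDF("order_id", "country_code", "amount")
val countries = Seq(("US", "United States"), ("IN", "India")).toDF("country_code", "country_name")

// Point 5: hint Spark to broadcast the small table to every executor
// so the join avoids shuffling the large side.
val enriched = orders.join(broadcast(countries), Seq("country_code"))

enriched.show()
```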
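
Finally, a sketch for points 6 and 7 that reads only the needed columns and then coalesces the DataFrame; the input path, the column names, and the target partition count of 8 are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-select-sketch").getOrCreate()

// Point 7: select only the columns we actually need while reading,
// instead of loading everything and dropping columns later.
val orders = spark.read.parquet("/data/orders")
  .select("customer_id", "order_id", "amount")

// Point 6: inspect the partition count and shrink it if it is too high.
println(s"partitions before: ${orders.rdd.getNumPartitions}")
val compacted = orders.coalesce(8)
println(s"partitions after: ${compacted.rdd.getNumPartitions}")
```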
