Good practices to follow when dealing with DataFrame-based joins in Spark
Here are some practices worth following when joining DataFrames or Datasets in Spark:
1. Split a multi-way join into a series of smaller joins, so that each step handles a single join and produces a single DataFrame/Dataset.
2. Repartition each DataFrame on the join columns before joining (see the repartitioning sketch after this list).
3. If a DataFrame is going to be used in more than one place, persist it in memory before joining and unpersist it at the end. Without persistence, every action recomputes the DAG from scratch (see the persist/unpersist sketch after this list).
4. Calculate all derived columns at the time the DataFrame is created, not at the time of joining (illustrated in the first sketch after this list).
5. Broadcast small tables to all worker nodes where appropriate (see the broadcast sketch after this list).
6. Since Spark creates one task per partition, check the number of partitions on each DataFrame and reduce it with coalesce if the count is too high (see the coalesce sketch after this list).
7. While creating a DataFrame, select only the columns we are interested in; selecting everything and then dropping the unwanted columns is not efficient.
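
A minimal Scala sketch of points 2, 4 and 7 above, assuming hypothetical Parquet inputs under /data and illustrative column names (order_id, customer_id, amount, discount, country_code): read only the needed columns, compute derived columns at creation time, and repartition both sides on the join key before the join.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("join-good-practices").getOrCreate()

// Select only the columns we need and compute derived columns
// at creation time (points 4 and 7); all names are illustrative.
val orders = spark.read.parquet("/data/orders")
  .select("order_id", "customer_id", "amount", "discount")
  .withColumn("net_amount", col("amount") - col("discount"))

val customers = spark.read.parquet("/data/customers")
  .select("customer_id", "customer_name", "country_code")

// Repartition both sides on the join column (point 2) so that
// matching keys land in the same partitions before the join.
val ordersByCustomer    = orders.repartition(col("customer_id"))
val customersByCustomer = customers.repartition(col("customer_id"))

val joined = ordersByCustomer.join(customersByCustomer, Seq("customer_id"))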
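
Continuing the sketch above, a hedged illustration of point 3: persist a DataFrame that is used in more than one join, run the actions, then unpersist it. The returns input is hypothetical.

import org.apache.spark.storage.StorageLevel

// customersByCustomer is used by two joins/actions below, so persist it once
// up front (point 3); otherwise each write action would recompute it from the
// source files, because every action starts the DAG afresh.
customersByCustomer.persist(StorageLevel.MEMORY_ONLY)

// A hypothetical second input that also needs customer attributes.
val returnsByCustomer = spark.read.parquet("/data/returns")
  .select("return_id", "customer_id")
  .repartition(col("customer_id"))

val returnsEnriched = returnsByCustomer.join(customersByCustomer, Seq("customer_id"))

joined.write.parquet("/out/orders_enriched")            // first action
returnsEnriched.write.parquet("/out/returns_enriched")  // second action

// Release the executor memory once the cached DataFrame is no longer needed.
customersByCustomer.unpersist()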
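
Point 5 can be expressed with the broadcast hint from org.apache.spark.sql.functions, which replicates a small table to every worker node so the large side of the join does not have to be shuffled. The country_codes table here is a hypothetical small dimension table.

import org.apache.spark.sql.functions.broadcast

// countryCodes is assumed small enough to fit comfortably in executor memory.
val countryCodes = spark.read.parquet("/data/country_codes")
  .select("country_code", "country_name")

// The broadcast hint ships countryCodes to all worker nodes, so the larger
// customer side stays where it is for this join.
val customersWithCountry =
  customersByCustomer.join(broadcast(countryCodes), Seq("country_code"))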
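
Finally, a sketch of point 6: check the partition count and reduce it with coalesce when it is unnecessarily high. The threshold and target of 50 partitions are arbitrary and depend on the data volume and cluster size.

// Spark schedules one task per partition, so an excessive partition count
// adds scheduling overhead and produces many tiny output files.
println(s"partitions before: ${joined.rdd.getNumPartitions}")

// coalesce reduces the partition count without triggering a full shuffle.
val joinedCompacted =
  if (joined.rdd.getNumPartitions > 50) joined.coalesce(50)
  else joined

println(s"partitions after: ${joinedCompacted.rdd.getNumPartitions}")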