Some best practices for writing Spark-based ETL jobs
Hello Friends,
I would like to share some best practices for writing Spark-based ETL jobs:
Multi-joins:
When joining several Datasets one after the other, keep the largest Dataset on the left, join the smallest Dataset first, and then proceed with the next smallest. This ordering drastically reduces the amount of data shuffled across the nodes, and less shuffling means higher performance.
With two big tables and one small table, first join the appropriate big Dataset with the small table as described above, then join the resulting big Dataset with the other big Dataset using a sort-merge join, as in the sketch below.
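As a hedged illustration, the sketch below uses three hypothetical tables (orders as the largest Dataset, customers as the second big one, country_codes as the small lookup) and the Spark Scala API; the "merge" join hint assumes Spark 3.x.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("multi-join-ordering-sketch")
  .getOrCreate()

val ordersDf    = spark.table("orders")        // largest Dataset, kept on the left
val customersDf = spark.table("customers")     // second big Dataset
val countryDf   = spark.table("country_codes") // small lookup table

// Join the small table first (broadcast avoids a shuffle), then join the two
// big Datasets with an explicit sort-merge join hint (Spark 3.x join hint).
val enriched = ordersDf
  .join(broadcast(countryDf), Seq("country_code"))
  .join(customersDf.hint("merge"), Seq("customer_id"))

enriched.explain() // inspect the physical plan to confirm the join strategies
```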
To get the best performance out of a sort-merge join (see the sketch after this list):
• Make sure the partitions are co-located. Otherwise, Spark will shuffle the data to co-locate it, because a sort-merge join requires all rows with the same join-key value to be stored in the same partition.
• The DataFrames should be distributed uniformly on the join columns.
• To leverage parallelism, the DataFrames should have an adequate number of unique join keys.
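A minimal sketch of the co-location point, assuming two hypothetical large tables (clicks and impressions) joined on user_id; the partition count of 200 is arbitrary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sort-merge-colocation-sketch")
  .getOrCreate()

val clicksDf      = spark.table("clicks")
val impressionsDf = spark.table("impressions")

// Hash-partition both sides on the join key with the same partition count so
// rows with the same user_id land in the same partition on both sides.
val clicksByUser      = clicksDf.repartition(200, clicksDf("user_id"))
val impressionsByUser = impressionsDf.repartition(200, impressionsDf("user_id"))

// Because both sides are now co-partitioned on user_id, the sort-merge join
// should not need to shuffle either side again.
val joined = clicksByUser.join(impressionsByUser, Seq("user_id"))
joined.explain()
```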
Use Broadcast:
Use broadcast hints on the smaller DataFrame so that it is copied to all the nodes and no network shuffle is needed during the join. Broadcasting is enabled by default, and any DataFrame smaller than 10 MB is broadcast automatically for optimal performance.
Always filter the data first and then join the DataFrames, rather than joining and then filtering. This way less data travels across the network and the join performs better. To co-locate the data, use bucketing on the key columns.
Prefer the DataFrame/Dataset API over the lower-level RDD API. With Datasets and DataFrames, Spark's Catalyst optimizer can rewrite your code using its cost-based optimization; with RDDs no such optimization happens, and the code runs exactly in the flow it is written, at the risk of much slower execution.
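The sketch below pulls the broadcast, filter-before-join and bucketing tips together. The table names, the sale_date filter and the bucket count are illustrative assumptions, and the threshold line simply restates the 10 MB default (spark.sql.autoBroadcastJoinThreshold).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

val spark = SparkSession.builder()
  .appName("broadcast-filter-bucketing-sketch")
  .getOrCreate()

// Auto-broadcast is on by default with a 10 MB threshold; shown here only to
// make the setting explicit.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

val salesDf  = spark.table("sales")   // large fact table
val storesDf = spark.table("stores")  // small dimension table

// Filter first, then join, so less data travels across the network.
val recentSales = salesDf.filter(col("sale_date") >= "2023-01-01")

// Explicit broadcast hint on the small side avoids a shuffle for this join.
val joined = recentSales.join(broadcast(storesDf), Seq("store_id"))

// Bucketing on the key column co-locates the data at write time, so later
// joins between tables bucketed the same way on store_id can skip the shuffle.
joined.write
  .bucketBy(8, "store_id")
  .sortBy("store_id")
  .mode("overwrite")
  .saveAsTable("sales_enriched_bucketed")
```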