Posts

Showing posts from August, 2019

Performance Tuning in Apache Spark

Tuning is the process of adjusting the settings for the memory, cores, and instances used by the system. It helps Spark perform well and avoids resource bottlenecks. Each property and setting is adjusted so that resources are used correctly for the specific system setup. Because Apache Spark computes in memory, the resources in the cluster (CPU, memory, etc.) can become bottlenecks. To decrease memory usage, RDDs are sometimes stored in serialized form. Data serialization plays a very important role in network performance and can also help reduce memory usage as part of memory tuning. Done properly, tuning can:
1. Ensure that all resources are used effectively.
2. Eliminate long-running jobs.
3. Improve the performance time of the system.
4. Ensure that jobs run on the correct execution engine.
Now we can
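As a rough illustration of the settings mentioned above (memory, cores, instances, and serialization), the sketch below collects a few real Spark property names in a dictionary and turns them into `spark-submit --conf` arguments. The values, the helper `to_spark_submit_args`, and the script name `my_job.py` are assumptions made for this example, not recommendations from the post.

```python
# Illustrative values only; suitable numbers depend on the cluster.
tuning_conf = {
    "spark.executor.instances": "4",  # number of executors
    "spark.executor.cores": "4",      # cores per executor
    "spark.executor.memory": "8g",    # heap memory per executor
    # Kryo is Spark's alternative to the default Java serialization.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def to_spark_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_spark_submit_args(tuning_conf)
# my_job.py is a hypothetical application script.
print(" ".join(["spark-submit"] + submit_args + ["my_job.py"]))
```

Keeping the properties in one place makes it easier to change a single setting and compare runs.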

How does Spark SQL optimize joins? What are the optimization tricks for joins?

Let's try to understand how Spark 2.0 handles joins in the DataFrame API. Because a DataFrame carries a schema, Spark knows the structure of the data. When applying a join, we always have to consider which type of join to use and which will process the data most efficiently. When joining a big table to a small table, is it a good idea to broadcast the smaller table? When joining a big table to another big table, what optimization tricks are there? Does sorting help here? Would Spark do the sorting internally? When should I repartition the data? The answer is yes: we can optimize joins in Spark SQL for better performance. As we all know, Spark SQL comes with the JoinSelection execution planning strategy, which translates a logical join to one of the supported join physical operators (per join physical operator selection requireme
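To make the big-to-small versus big-to-big question concrete, here is a deliberately simplified model of a size-based choice between two physical join operators. It is not Spark's actual JoinSelection code: the function `choose_join_operator` and the size estimates are hypothetical, and the real strategy also looks at the join type, hints, and other settings. The only value taken from Spark is the 10 MB default of `spark.sql.autoBroadcastJoinThreshold`.

```python
# Simplified model of a size-based join choice; NOT Spark's JoinSelection code.

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_operator(left_bytes, right_bytes,
                         broadcast_threshold=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Return the operator name this toy model would pick.

    left_bytes and right_bytes are hypothetical size estimates per side.
    """
    smaller = min(left_bytes, right_bytes)
    # A negative threshold disables broadcasting, mirroring Spark's -1 setting.
    if broadcast_threshold >= 0 and smaller <= broadcast_threshold:
        return "BroadcastHashJoin"
    # Otherwise use a shuffle-based join; Spark sorts both sides itself.
    return "SortMergeJoin"


big_to_small = choose_join_operator(50 * 1024**3, 2 * 1024**2)
big_to_big = choose_join_operator(50 * 1024**3, 40 * 1024**3)
print(big_to_small)
print(big_to_big)
```

In DataFrame code, the broadcast side can also be requested explicitly with the `broadcast` function from `pyspark.sql.functions`, for example `big_df.join(broadcast(small_df), "key")`, where the DataFrame names and the join column are placeholders.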