Posts

Showing posts from July, 2019

What are DStream and readStream in Spark Streaming?

DStream: A DStream is a sequence of RDDs representing a data stream. A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data (see spark.RDD for more details on RDDs). DStreams can be created either from live data (such as data from HDFS, Kafka, or Flume) or by transforming existing DStreams with operations such as map, window, and reduceByKeyAndWindow.

readStream: readStream is the entry point to Spark Structured Streaming. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express a streaming computation the same way you would express a batch computation on static data; the Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive.

Good practices to follow when dealing with DataFrame-based joins in Spark

Good practices to follow when dealing with DataFrame-based joins in Spark:

1. Split complex joins so that each join is handled through a single DataFrame/Dataset.
2. Repartition each DataFrame on the join columns before joining.
3. If a DataFrame is used in more than one place, persist it in memory before joining (and un-persist it at the end). Remember that for every action, the DAG is recomputed from scratch.
4. Calculate all derived columns when the DataFrame is created, not at join time.
5. Broadcast small tables to all worker nodes if necessary.
6. Since Spark tasks are created per partition, check the number of partitions on each DataFrame and reduce it (coalesce) if the count is high.
7. While creating a DataFrame, select only the columns you are interested in; selecting everything and dropping the unwanted columns is not efficient.