Top 11 Features of Apache Spark RDD
Hello Friends,
In this blog, we will go through the topmost features of Spark RDD. This blog also introduces Apache Spark RDD and its operations, along with the methods to create an RDD.
11 Topmost Features of Spark RDD
Apache Spark RDD:
RDD stands for Resilient Distributed Dataset. RDDs are the fundamental abstraction of Apache Spark. An RDD is an immutable, distributed collection of data. Each dataset in an RDD is divided into logical partitions, and these partitions can be computed on different nodes of the cluster. RDDs are a read-only, partitioned collection of records. We can create an RDD in three ways:
- Parallelizing an already existing collection in the driver program.
- Referencing a dataset in an external storage system (e.g. HDFS, HBase, or a shared file system).
- Creating an RDD from already existing RDDs.
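Here is a minimal sketch of the three creation methods in Scala, assuming a SparkContext named sc is already available (as it is in spark-shell); the HDFS path is just a placeholder.

```scala
// 1. Parallelizing an already existing collection in the driver program
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in an external storage system (placeholder path)
val linesRDD = sc.textFile("hdfs:///data/input.txt")

// 3. Creating an RDD from an already existing RDD
val doubledRDD = numbersRDD.map(_ * 2)
```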
There are two types of operations on an RDD, namely Transformations and Actions.
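To make the distinction concrete, here is a small sketch (again assuming sc from spark-shell): filter is a transformation that returns a new RDD, while count and collect are actions that return results to the driver.

```scala
val numbers = sc.parallelize(1 to 10)

// Transformation: describes a new RDD, nothing is computed yet
val evens = numbers.filter(_ % 2 == 0)

// Actions: trigger the actual computation and return values to the driver
val howMany = evens.count()    // 5
val values  = evens.collect()  // Array(2, 4, 6, 8, 10)
```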
Sparkling Features of Spark RDD:
There are several advantages of using RDDs. Some of them are given below:
1. In-memory Computation:
The data inside an RDD is stored in memory for as long as you want to keep it there. Keeping the data in memory improves performance by an order of magnitude.
2. Lazy Evaluation:
The data inside RDDs is not evaluated on the go. The transformations are computed only after an action is triggered. Thus, Spark limits how much work it has to do.
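A small sketch of how lazy evaluation looks in practice (the path is a placeholder): nothing is read or computed until the action at the end.

```scala
val lines = sc.textFile("hdfs:///data/input.txt")  // not read yet

// These transformations only build up the lineage; no job is launched
val words  = lines.flatMap(_.split(" "))
val longer = words.filter(_.length > 3)

// Only this action triggers the actual computation over the data
val total = longer.count()
```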
3. Fault Tolerance:
Upon the failure of a worker node, we can re-compute the lost partitions of an RDD from the original data using the lineage of operations. Thus, we can easily recover the lost data.
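The lineage that makes this recovery possible can be inspected with toDebugString, as in this small sketch.

```scala
val base    = sc.parallelize(1 to 100, 4)
val derived = base.map(_ * 2).filter(_ > 50)

// Prints the chain of transformations (the lineage) that Spark would
// replay to rebuild any lost partition of `derived`
println(derived.toDebugString)
```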
4. Immutability:
RDDs are immutable in nature, meaning once we create an RDD we cannot modify it. If we perform any transformation, it creates a new RDD. We achieve consistency through immutability.
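A short sketch: the transformation returns a new RDD and leaves the original untouched.

```scala
val original = sc.parallelize(Seq(1, 2, 3))

// map does not modify `original`; it returns a brand-new RDD
val squared = original.map(n => n * n)

println(original.collect().mkString(", "))  // 1, 2, 3  (unchanged)
println(squared.collect().mkString(", "))   // 1, 4, 9
```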
5. Persistence:
We can store a frequently used RDD in memory and retrieve it directly from memory without going to disk, which speeds up execution. We can perform multiple operations on the same data by storing it explicitly in memory with a call to the persist() or cache() function.
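A minimal sketch of caching a reused RDD; StorageLevel.MEMORY_ONLY is the level that cache() uses by default, and the path is a placeholder.

```scala
import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("hdfs:///data/logs")  // placeholder path
val errors = logs.filter(_.contains("ERROR"))

// Keep the filtered RDD in memory so repeated actions can reuse it
errors.persist(StorageLevel.MEMORY_ONLY)  // errors.cache() is equivalent

val errorCount  = errors.count()   // computed and cached here
val firstErrors = errors.take(10)  // served from the cached data
```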
6. Partitioning:
RDDs partition the records logically and distribute the data across various nodes in the cluster. The logical divisions are only for processing; internally the data has no such division. Thus, partitioning provides parallelism.
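For example, we can control the number of logical partitions when creating the RDD and inspect it afterwards, as in this small sketch.

```scala
// Ask for 8 logical partitions of this collection
val data = sc.parallelize(1 to 1000, 8)

println(data.getNumPartitions)  // 8

// Repartitioning returns a new RDD with a different partition count
val repartitioned = data.repartition(4)
println(repartitioned.getNumPartitions)  // 4
```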
7. Parallel:
RDDs process the data in parallel across the cluster.
8. Location-Stickiness:
RDDs are capable of defining a placement preference for computing partitions. Placement preference refers to information about the location of the RDD's data. The DAGScheduler places the tasks in such a way that each task is as close to its data as possible, which speeds up computation.
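A hedged sketch of placement preferences: makeRDD accepts a preferred-host list per element, and preferredLocations exposes the preference for a partition. The hostnames here are only placeholders.

```scala
// Each element is paired with the hosts where its partition should preferably run
val withPrefs = sc.makeRDD(Seq(
  (1, Seq("host-a.example.com")),
  (2, Seq("host-b.example.com"))
))

// Inspect the placement preference of the first partition
println(withPrefs.preferredLocations(withPrefs.partitions(0)))
```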
9. Coarse-grained Operations:
We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element of the dataset.
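For instance, a single map applies the same function to every record of the RDD; there is no RDD API to update one individual element in place.

```scala
val ratings = sc.parallelize(Seq(3, 4, 5, 2))

// Coarse-grained: one operation over the whole dataset...
val normalized = ratings.map(r => r / 5.0)

// ...rather than fine-grained updates such as "set element 2 to 4.5",
// which RDDs intentionally do not support
```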
10. Typed:
We can have RDDs of various types, for example: RDD[Int], RDD[Long], RDD[String].
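In Scala the element type is part of the RDD's type, as in this small sketch.

```scala
import org.apache.spark.rdd.RDD

val ints: RDD[Int]       = sc.parallelize(Seq(1, 2, 3))
val longs: RDD[Long]     = sc.parallelize(Seq(1L, 2L, 3L))
val strings: RDD[String] = sc.parallelize(Seq("a", "b", "c"))
```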
11. No Limitation:
We can have any number of RDDs; there is no hard limit on their number. The practical limit depends on the size of the disk and memory.
Hence, using RDDs we can overcome the shortcomings of Hadoop MapReduce and handle large volumes of data, which reduces the processing time of the system. Thus, the above-mentioned features of Spark RDD make it useful for fast computation and increase the performance of the system.
Thank you so much for reading this post. If you have any suggestions or queries, kindly feel free to leave your valuable comments and feedback.
Thank You!