What is an RDD and how to create it in Spark?
Dear friends, today we'll learn what an RDD is and how to create one in three different ways:
RDD – Resilient Distributed Dataset
RDDs are immutable, partitioned collections of records that can be manipulated only through coarse-grained operations such as map, filter, and groupBy. Coarse-grained means that an operation is applied to all elements of the dataset, rather than to individual records. RDDs can only be created by reading data from stable storage such as HDFS, or by transformations on existing RDDs.
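For instance, here is a minimal sketch (run in spark-shell, where the SparkContext sc is predefined; the numbers are just illustrative) showing that coarse-grained operations like map and filter touch every element of the dataset:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5)) // build an RDD from a local collection
val doubled = nums.map(_ * 2) // map is applied to all elements: 2, 4, 6, 8, 10
doubled.filter(_ > 5).collect // filter is applied to all elements; returns Array(6, 8, 10)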
But Why RDD?
When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as Logistic Regression, K-means clustering, or PageRank, it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared dataset.
There is an underlying problem with data reuse or data sharing in existing distributed computing systems such as MapReduce: you need to store intermediate data in a stable distributed store such as HDFS or Amazon S3. This makes the overall computation of jobs slower, since it involves multiple IO operations, replications, and serializations along the way.
[Figure: Iterative processing in MapReduce]
RDDs try to solve these problems by enabling fault-tolerant, distributed, in-memory computation.
[Figure: Iterative processing in Spark]
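For example, caching lets multiple jobs reuse the same RDD from memory instead of re-reading it from stable storage every time. A minimal sketch (the path and the log contents here are hypothetical):

val logs = sc.textFile("/path/to/logs").cache() // mark the RDD to be kept in memory
logs.filter(_.contains("ERROR")).count() // first action reads from storage and populates the cache
logs.filter(_.contains("WARN")).count() // later actions are served from memory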
Now, let's understand what exactly an RDD is and how it achieves fault tolerance.
How Does That Help with Fault Tolerance?
Since an RDD is created through a set of transformations, Spark logs those transformations rather than the actual data. The graph of transformations that produces an RDD is called its lineage graph.
For example –
val RDD1 = sc.textFile("path")
val RDD2 = RDD1.filter(line => line.nonEmpty) // some filter function
val RDD3 = RDD2.map(line => line.toUpperCase) // some map function
[Figure: Spark RDD lineage graph]
If we lose a partition of an RDD, we can replay the transformations for that partition from the lineage to reproduce the same computation, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus achieves faster computation.
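You can actually inspect the lineage Spark recorded by calling toDebugString on an RDD. Using RDD3 from the example above, the output looks roughly like this (exact RDD ids and locations will differ):

println(RDD3.toDebugString)
// (2) MapPartitionsRDD[3] at map ...
//  |  MapPartitionsRDD[2] at filter ...
//  |  path MapPartitionsRDD[1] at textFile ...
//  |  path HadoopRDD[0] at textFile ...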
How to Create an RDD in Spark?
There are three methods to create an RDD:
1. The first method is used when the data is already available in an external system such as the local filesystem, HDFS, or HBase. For text data in the local filesystem or HDFS, an RDD can be created by calling the textFile method of SparkContext with a path or URL as the argument.
val data = sc.textFile("/path/file_name")
Here sc is the SparkContext object.
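In spark-shell, sc is created for you automatically. Outside the shell you would construct it yourself; a minimal sketch, where the app name and master are example values:

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("RDDExamples").setMaster("local[*]") // example settings
val sc = new SparkContext(conf)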
Create an RDD from a data file in a local and an HDFS location:
For example, suppose I have the file below (test_file.csv), available both in the local filesystem and in HDFS:
test_file.csv:
1,Divyanshu,Laptop,75000,Delhi
2,Himanshu,Laptop,95000,Delhi
3,Sushil,Hard Disk,20000,Delhi
4,Sidharth,Mobile,75000,Kolkata
5,Surendra,Desktop,70000,Kolkata
6,Mohit,Pen Drive,25000,Kolkata
7,Mohit,Laptop,65000,Hyderabad
8,Prabhu,Laptop,50000,Hyderabad
9,Rajesh,Head Phone,10000,Hyderabad
10,Mohit,Mobile,40000,Chennai
11,Sonia,Desktop,25000,Chennai
12,Deepthi,Head Phone,10000,Chennai
13,Priyanka,Desktop,56000,Noida
14,Prashant,Hard Disk,35000,Noida
15,Ayush,Mobile,15000,Noida
Create an RDD from the HDFS file:
val RDD1 = sc.textFile("/user/g.aspiresit.001/Anamika/test_file.csv")
Read the RDD data:
RDD1.collect
Create an RDD from the local file:
val RDD2 = sc.textFile("file:///home/g.aspiresit.001/Anamika/test_file.csv")
Read the RDD data:
RDD2.collect
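Note that collect pulls the entire dataset to the driver, which can be risky for large files. To just peek at a few records, take is safer; for our file it should print the first three lines:

RDD1.take(3).foreach(println)
// 1,Divyanshu,Laptop,75000,Delhi
// 2,Himanshu,Laptop,95000,Delhi
// 3,Sushil,Hard Disk,20000,Delhi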
2. The second approach can be used with existing collections, by calling the parallelize method of SparkContext:
val RDD_Collections = sc.parallelize(List("spark rdd example", "creating a new RDD"))
To see the output:
RDD_Collections.collect
val RDD_Array = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) // a plain Scala array (not yet an RDD)
To see the output:
RDD_Array.foreach(println)
val RDD_Array1 = sc.parallelize(RDD_Array)
To see the output:
RDD_Array1.collect
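parallelize also accepts an optional second argument, the number of partitions (slices) to cut the collection into. A small sketch (RDD_Array2 is just an illustrative name):

val RDD_Array2 = sc.parallelize(RDD_Array, 4) // ask for 4 partitions
RDD_Array2.getNumPartitions // returns 4
RDD_Array2.glom().collect // one inner array per partition, something like Array(Array(1, 2), Array(3, 4, 5), Array(6, 7), Array(8, 9, 10))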
3. The third way is to create a new RDD from an existing one by applying a transformation.
Create a new RDD based on RDD_Array1 that multiplies each of its elements by 2.
val newRDD = RDD_Array1.map(x => x * 2)
To see the output:
newRDD.collect
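One point worth remembering: transformations such as map and filter are lazy. The map call above only records a step in the lineage graph; it is the action (collect) that triggers the actual computation. For example, chaining another transformation onto newRDD:

val bigOnes = newRDD.filter(_ > 10) // lazy: nothing runs yet, only the lineage is extended
bigOnes.collect // the action triggers the computation; returns Array(12, 14, 16, 18, 20)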
Thanks for reading. If you have any questions or suggestions, kindly leave a comment and I'll get back to you at the earliest.
Thank You!