What is an RDD and how to create it in Spark?
Dear friends, today we'll learn what an RDD is and how to create one in three different ways:
RDD – Resilient Distributed Dataset
RDDs are immutable, partitioned collections of records that can be manipulated only through coarse-grained operations such as map, filter, and groupBy. Coarse-grained means that an operation is applied to all elements of the dataset, rather than to individual records. RDDs can only be created by reading data from stable storage such as HDFS, or by transformations on existing RDDs.
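For instance, here is a minimal sketch (run in spark-shell, where the SparkContext sc is predefined; the numbers are just illustrative) showing that coarse-grained operations like map and filter touch every element of the dataset:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5)) // build an RDD from a local collection
val doubled = nums.map(_ * 2) // map is applied to all elements: 2, 4, 6, 8, 10
doubled.filter(_ > 5).collect // filter is applied to all elements; returns Array(6, 8, 10)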
But Why RDD?
When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as Logistic Regression, K-means clustering, or PageRank, it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared dataset.
There is an underlying problem with data reuse or data sharing in existing distributed computing systems such as MapReduce: you need to store intermediate data in a stable distributed store such as HDFS or Amazon S3. This makes the overall computation of jobs slower, since it involves multiple IO operations, replications, and serializations along the way.
[Figure: Iterative processing in MapReduce]
RDDs try to solve these problems by enabling fault-tolerant, distributed, in-memory computation.
[Figure: Iterative processing in Spark]
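For example, caching lets multiple jobs reuse the same RDD from memory instead of re-reading it from stable storage every time. A minimal sketch (the path and the log contents here are hypothetical):

val logs = sc.textFile("/path/to/logs").cache() // mark the RDD to be kept in memory
logs.filter(_.contains("ERROR")).count() // first action reads from storage and populates the cache
logs.filter(_.contains("WARN")).count() // later actions are served from memory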
Now, let's understand what exactly an RDD is and how it achieves fault tolerance.
How Does That Help with Fault Tolerance?
Since an RDD is created through a set of transformations, Spark logs those transformations rather than the actual data. The graph of transformations that produces an RDD is called its lineage graph.
For example –
val RDD1 = sc.textFile("path")
val RDD2 = RDD1.filter(line => line.nonEmpty) // some filter function
val RDD3 = RDD2.map(line => line.toUpperCase) // some map function
[Figure: Spark RDD lineage graph]
If we lose a partition of an RDD, we can replay the transformations for that partition from the lineage to reproduce the same computation, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus achieves faster computation.
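You can actually inspect the lineage Spark recorded by calling toDebugString on an RDD. Using RDD3 from the example above, the output looks roughly like this (exact RDD ids and locations will differ):

println(RDD3.toDebugString)
// (2) MapPartitionsRDD[3] at map ...
//  |  MapPartitionsRDD[2] at filter ...
//  |  path MapPartitionsRDD[1] at textFile ...
//  |  path HadoopRDD[0] at textFile ...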
How to Create an RDD in Spark?
There are three methods to create an RDD:
1. The first method is used when the data is already available in an external system such as the local filesystem, HDFS, or HBase. For text data in the local filesystem or HDFS, an RDD can be created by calling the textFile method of SparkContext with a path or URL as the argument.
val data = sc.textFile("/path/file_name")
Here sc is the SparkContext object.
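In spark-shell, sc is created for you automatically. Outside the shell you would construct it yourself; a minimal sketch, where the app name and master are example values:

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("RDDExamples").setMaster("local[*]") // example settings
val sc = new SparkContext(conf)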
Create an RDD from a data file in a local and an HDFS location:
For example, suppose I have the file below (test_file.csv), available both in the local filesystem and in HDFS:
test_file.csv:
1,Divyanshu,Laptop,75000,Delhi
2,Himanshu,Laptop,95000,Delhi
3,Sushil,Hard Disk,20000,Delhi
4,Sidharth,Mobile,75000,Kolkata
5,Surendra,Desktop,70000,Kolkata
6,Mohit,Pen Drive,25000,Kolkata
7,Mohit,Laptop,65000,Hyderabad
8,Prabhu,Laptop,50000,Hyderabad
9,Rajesh,Head Phone,10000,Hyderabad
10,Mohit,Mobile,40000,Chennai
11,Sonia,Desktop,25000,Chennai
12,Deepthi,Head Phone,10000,Chennai
13,Priyanka,Desktop,56000,Noida
14,Prashant,Hard Disk,35000,Noida
15,Ayush,Mobile,15000,Noida
Create an RDD from the HDFS file:
val RDD1 = sc.textFile("/user/g.aspiresit.001/Anamika/test_file.csv")
Read the RDD data:
RDD1.collect
Create an RDD from the local file:
val RDD2 = sc.textFile("file:///home/g.aspiresit.001/Anamika/test_file.csv")
Read the RDD data:
RDD2.collect
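Note that collect pulls the entire dataset to the driver, which can be risky for large files. To just peek at a few records, take is safer; for our file it should print the first three lines:

RDD1.take(3).foreach(println)
// 1,Divyanshu,Laptop,75000,Delhi
// 2,Himanshu,Laptop,95000,Delhi
// 3,Sushil,Hard Disk,20000,Delhi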
2. The second approach can be used with existing collections, by calling the parallelize method of SparkContext:
val RDD_Collections = sc.parallelize(List("spark rdd example", "creating a new RDD"))
To see the output:
RDD_Collections.collect
val RDD_Array = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) // a plain Scala array (not yet an RDD)
To see the output:
RDD_Array.foreach(println)
val RDD_Array1 = sc.parallelize(RDD_Array)
To see the output:
RDD_Array1.collect
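parallelize also accepts an optional second argument, the number of partitions (slices) to cut the collection into. A small sketch (RDD_Array2 is just an illustrative name):

val RDD_Array2 = sc.parallelize(RDD_Array, 4) // ask for 4 partitions
RDD_Array2.getNumPartitions // returns 4
RDD_Array2.glom().collect // one inner array per partition, something like Array(Array(1, 2), Array(3, 4, 5), Array(6, 7), Array(8, 9, 10))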
3. The third way is to create a new RDD from an existing one by applying a transformation.
Create a new RDD based on RDD_Array1 that multiplies each of its elements by 2.
val newRDD = RDD_Array1.map(x => x * 2)
To see the output:
newRDD.collect
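One point worth remembering: transformations such as map and filter are lazy. The map call above only records a step in the lineage graph; it is the action (collect) that triggers the actual computation. For example, chaining another transformation onto newRDD:

val bigOnes = newRDD.filter(_ > 10) // lazy: nothing runs yet, only the lineage is extended
bigOnes.collect // the action triggers the computation; returns Array(12, 14, 16, 18, 20)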
Thanks for reading. If you have any questions or suggestions, kindly leave a comment and I'll get back to you at the earliest.
Thank You!