Posts

Showing posts from January, 2019

Limitations of Apache Spark

Dear friends, today we will discuss the limitations and disadvantages of Apache Spark. As we all know, Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. Below are a few of its features:

1. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

2. We can write applications quickly in Java, Scala, Python, R, and SQL. Spark offers over 80 high-level operators that make it easy to build parallel apps, and we can use it interactively from the Scala, Python, R, and SQL shells, e.g. df = spark.read.json("json_file_name.json"); df.where("age > 21").select("name.first").show()  // Spark's Python …
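As a minimal runnable sketch of the DataFrame example above (assuming PySpark is installed locally, and using a hypothetical json_file_name.json that holds records with an "age" field and a nested "name.first" field):

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session.
    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    # Read JSON; Spark infers the schema automatically.
    df = spark.read.json("json_file_name.json")

    # Filter rows with a SQL predicate, project a nested column, print results.
    df.where("age > 21").select("name.first").show()

    spark.stop()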

Top 11 Features of Apache Spark RDD

Hello friends, in this blog we will come across the topmost features of Spark RDD. This blog also contains an introduction to Apache Spark RDD and its operations, along with the methods to create an RDD.

11 Topmost Features of Spark RDD

Apache Spark RDD: RDD stands for Resilient Distributed Dataset. RDDs are the fundamental abstraction of Apache Spark: an immutable, distributed collection of data. Each dataset in an RDD is divided into logical partitions, and these partitions can be computed on different nodes of the cluster. RDDs are a read-only, partitioned collection of records. We can create an RDD in three ways (see the sketch below):

1. Parallelizing an already existing collection in the driver program.
2. Referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system).
3. Creating an RDD from already existing RDDs.

There are two kinds of operations on an RDD, namely Transformations and Actions …
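A minimal PySpark sketch of the creation paths and the two operation types described above (assuming a local PySpark install; the HDFS path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # 1. Parallelize an existing collection in the driver program.
    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # 2. Reference a dataset in external storage (path is hypothetical).
    # lines = sc.textFile("hdfs:///data/sample.txt")

    # 3. Create an RDD from an already existing RDD via a transformation.
    squares = numbers.map(lambda x: x * x)   # lazy: nothing runs yet

    # Actions trigger the actual computation.
    print(squares.collect())   # [1, 4, 9, 16, 25]

    spark.stop()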