Limitations of Apache Spark



Dear friends,
Today we will discuss the limitations and disadvantages of Apache Spark.
As we all know, Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute Streaming, Machine Learning, or SQL workloads requiring fast, iterative access to datasets.

Below are a few of its key features:
 1.    Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. 

2.    We can write applications quickly in Java, Scala, Python, R, and SQL.
Spark offers over 80 high-level operators that make it easy to build parallel apps, and we can use it interactively from the Scala, Python, R, and SQL shells. For example, with Spark's Python DataFrame API:

# Read JSON files with automatic schema inference
df = spark.read.json("json_file_name.json")
df.where("age > 21").select("name.first").show()


3.    It combines SQL, Streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for Machine Learning, GraphX, and Spark Streaming. We can combine these libraries seamlessly in the same application (see the sketch after this feature list).





4.    Spark runs everywhere: on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and it can access diverse data sources.
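To make feature 3 concrete, here is a minimal sketch of combining Spark SQL/DataFrames with MLlib in a single application. The table name, column names, and sample values are illustrative assumptions, not taken from any real dataset.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("CombinedStackSketch").getOrCreate()

# DataFrame/SQL side: a tiny in-memory table (schema and values are made up)
spark.createDataFrame([(25, 40000.0, 0.0), (52, 90000.0, 1.0)],
                      ["age", "income", "label"]).createOrReplaceTempView("customers")
adults = spark.sql("SELECT age, income, label FROM customers WHERE age > 21")

# MLlib side: the same DataFrame feeds an estimator with no conversion step
features = VectorAssembler(inputCols=["age", "income"], outputCol="features").transform(adults)
model = LogisticRegression(labelCol="label").fit(features)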

     
     

Limitations of Apache Spark :

As we know, Apache Spark is a next-generation Big Data tool that is widely used across industries. But beyond the advantages above, it also has drawbacks: no support for true real-time processing, problems with small files, no dedicated file management system, high memory cost, and more. Because of these limitations of Apache Spark, some industries have started shifting to Apache Flink – the 4G of Big Data.






So let us now understand Apache Spark's problems and when not to use Spark.


1. No Support for Real-time Processing :
In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like map, flatMap, filter, reduce, join, etc., and the results are returned in batches. Thus, it is not true real-time processing; Spark offers near real-time processing of live data, because micro-batch processing is what takes place in Spark Streaming.
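A minimal sketch of this micro-batch model using the DStream API with a socket source; the host, port, and 5-second batch interval are placeholder values:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchSketch")
ssc = StreamingContext(sc, 5)                       # each batch covers a 5-second interval

lines = ssc.socketTextStream("localhost", 9999)     # the live stream is sliced into 5-second RDDs
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                     # results come out once per batch, not per record

ssc.start()
ssc.awaitTermination()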


2. Problem with Small Files :
If we use Spark with Hadoop, we come across the small-files problem: HDFS is designed for a limited number of large files rather than a large number of small files. Another place where Spark lags behind is when we store data gzipped in S3. This pattern works well except when there are lots of small gzipped files: Spark then has to pull those files over the network and uncompress them, and a gzipped file can only be uncompressed on a single core because gzip is not splittable. So a large span of time is spent burning cores while unzipping files in sequence.
In the resulting RDD, each file becomes a partition, so there will be a large number of tiny partitions within the RDD. If we want efficient processing, the RDD has to be repartitioned into some manageable shape, and this requires extensive shuffling over the network.
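A hedged sketch of that pattern; the S3 path is a placeholder. Reading many small gzipped files yields roughly one tiny partition per file, and consolidating them forces a shuffle:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SmallFilesSketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile("s3a://my-bucket/logs/*.gz")   # gzip is not splittable: one file -> one partition
print(rdd.getNumPartitions())                    # often thousands of tiny partitions

repartitioned = rdd.repartition(200)             # manageable partition count, but a full network shuffle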


3. No File Management System :
Apache Spark does not have its own file management system, so it relies on some other platform like Hadoop (HDFS) or a cloud-based storage platform; this is one of Spark's known issues.


4. Expensive :
The in-memory capability can become a bottleneck when we want cost-efficient processing of big data: keeping data in memory is quite expensive, memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark requires lots of RAM to run in-memory, so the cost of running Spark is quite high.
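To illustrate why the memory bill adds up, here is a hedged sketch of the memory-related settings a Spark job typically carries; the values are arbitrary assumptions, not recommendations:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("MemoryHungryJob")
         .config("spark.executor.memory", "8g")      # RAM reserved per executor
         .config("spark.executor.instances", "20")   # 20 x 8 GB = 160 GB of cluster RAM
         .config("spark.memory.fraction", "0.6")     # share of the heap used for execution and storage
         .getOrCreate())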


5. Fewer Algorithms :
Spark MLlib lags behind in the number of available algorithms; for example, it has no implementation of Tanimoto distance.


6. Manual Optimization :
Spark jobs need to be manually optimized and tuned for specific datasets. If we want partitioning and caching in Spark to be correct, they have to be controlled manually.
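A minimal sketch of the manual tuning Spark expects: the developer chooses the partition count, the partition key, and the caching level by hand. The input path, column name, and numbers here are assumptions for illustration:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ManualTuningSketch").getOrCreate()

df = spark.read.parquet("events.parquet")       # placeholder input
df = df.repartition(48, "user_id")              # partition count and key picked manually for this dataset
df.persist(StorageLevel.MEMORY_AND_DISK)        # caching strategy also picked manually
df.count()                                      # action to materialize the cache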


7. Iterative Processing :
In Spark, the data is iterated over in batches, and each iteration is scheduled and executed separately.
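A small sketch of what that looks like in practice: each pass over the data is a separate Spark job that is scheduled and executed on its own.

from pyspark import SparkContext

sc = SparkContext("local[2]", "IterationSketch")
data = sc.parallelize(range(1_000_000))

for i in range(10):                 # 10 iterations -> 10 separately scheduled jobs
    data = data.map(lambda x: x + 1)
    data.count()                    # the action triggers execution of this iteration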


8. Latency :
Apache Spark has higher latency as compared to Apache Flink.


9. Window Criteria :
Spark does not support record-based window criteria; it only offers time-based window criteria.
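A short sketch of the only style Spark offers, a time-based window (30 seconds wide, sliding every 10 seconds); the socket source is a placeholder:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "TimeWindowSketch")
ssc = StreamingContext(sc, 5)                                     # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)                   # placeholder source

# Only time-based criteria are available: count records seen in the last 30 seconds,
# recomputed every 10 seconds. There is no "last N records" (count-based) window.
lines.window(windowDuration=30, slideDuration=10).count().pprint()

ssc.start()
ssc.awaitTermination()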


10. Back Pressure Handling :
Back pressure is the build-up of data at an input/output buffer when the buffer is full and cannot receive additional incoming data; no data is transferred until the buffer is emptied. Apache Spark is not capable of handling back pressure implicitly; it has to be handled manually by the developer.
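A hedged sketch of how this is done by hand through configuration rather than implicitly; the rate values are placeholders:

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")        # opt-in flag, not implicit behaviour
        .set("spark.streaming.receiver.maxRate", "1000")            # cap records/second per receiver
        .set("spark.streaming.kafka.maxRatePerPartition", "500"))   # cap for direct Kafka streams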
These are some of the major limitations of Apache Spark. We can overcome many of them by using Apache Flink – the 4G of Big Data.





Finally, we can say that though Spark has many drawbacks, it is still popular in the market for Big Data solutions. But there are various technologies overtaking Spark; for example, stream processing is handled much better by Apache Flink than by Apache Spark, since Flink offers true real-time processing.

Hope you enjoyed this post, and thank you so much for reading it. If you have any suggestions or queries, please feel free to leave your valuable comments and feedback.





Thank You!



