Limitations of Apache Spark
Dear friends,
Today we will discuss the limitations and disadvantages of Apache Spark.
As we all know, Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute Streaming, Machine Learning, or SQL workloads requiring fast iterative access to datasets.
Below are a few of its features:
1. Spark offers over 80 high-level operators that make it easy to build parallel apps, and we can use it interactively from the Scala, Python, R, and SQL shells. For example:

# Spark's Python DataFrame API: read JSON files with automatic schema inference
df = spark.read.json("json_file_name.json")
df.where("age > 21").select("name.first").show()
2. Spark combines SQL, Streaming, and Complex Analytics. It powers a stack of libraries including SQL and DataFrames, MLlib for Machine Learning, GraphX, and Spark Streaming, and we can combine these libraries seamlessly in the same application.
3. Spark runs everywhere: on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and it can access diverse data sources.
Limitations of Apache Spark:
As we know, Apache Spark is a next-generation Big Data tool that is widely used across industries, but despite the advantages above it has some drawbacks, such as no support for real-time processing, the small-file problem, no dedicated file management system, high cost, and more. Due to these limitations of Apache Spark, industries have started shifting to Apache Flink – the 4G of Big Data.
So let us now understand the problems with Apache Spark and when not to use it.
1. No Support for Real-time Processing:
In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, each batch of data is treated as a Spark Resilient Distributed Dataset (RDD), and these RDDs are then processed using operations like map, flatMap, filter, reduce, join, etc. The result of these operations is returned in batches. Thus it is not real-time processing; Spark offers near real-time processing of live data. What takes place in Spark Streaming is micro-batch processing.
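As a rough illustration of this micro-batch model, the sketch below counts words arriving on a socket; the host, port, and 5-second interval are assumptions for the example, not values from the article. Each batch interval produces its own RDD and its own batch of results.

# Minimal Spark Streaming sketch of micro-batch processing
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchSketch")
ssc = StreamingContext(sc, 5)  # every 5 seconds of data becomes one RDD (one micro-batch)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # results are emitted per batch, not per record

ssc.start()
ssc.awaitTermination()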
2. Problem with Small Files:
If we use Spark with Hadoop, we come across the problem of small files. HDFS is designed for a small number of large files rather than a large number of small files. Another place where Spark lags behind is when we store the data gzipped in S3. This pattern works well except when there are lots of small gzipped files. The work of Spark is then to fetch those files over the network and uncompress them, and a gzipped file can be uncompressed only on a single core, since gzip is not splittable. So a large span of time will be spent with cores unzipping files in sequence.
In the resulting RDD, each file will become a partition; hence there will be a large number of tiny partitions within the RDD. Now if we want efficiency in our processing, the RDD should be repartitioned into some manageable shape, as in the sketch below. This requires extensive shuffling over the network.
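A minimal sketch of that repartitioning step follows; the S3 path and the target partition count of 64 are illustrative assumptions, not values from the article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SmallFilesSketch").getOrCreate()

# Each small gzipped file becomes its own tiny partition, since gzip is not splittable
logs = spark.read.text("s3a://some-bucket/logs/*.gz")
print(logs.rdd.getNumPartitions())

# Repartition into a manageable number of partitions; this forces a shuffle over the network
logs = logs.repartition(64)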
3. No File Management System:
Apache Spark does not have its own file management system; it relies on some other platform, like Hadoop or a cloud-based storage platform, which is one of Spark's known issues.
4. Expensive:
The in-memory capability can become a bottleneck when we want cost-efficient processing of big data, as keeping data in memory is quite expensive; memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark requires lots of RAM to run in-memory, so the cost of Spark is quite high.
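To give a concrete sense of how memory has to be provisioned by hand, the sketch below sizes executor memory explicitly when building a session; the 8g figure and the memory fraction are illustrative assumptions, not recommendations.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("MemoryCostSketch")
         .config("spark.executor.memory", "8g")    # RAM reserved on each executor
         .config("spark.memory.fraction", "0.6")   # share of the heap used for execution and storage
         .getOrCreate())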
5. Fewer Algorithms:
Spark MLlib lags behind in the number of available algorithms; for example, Tanimoto distance is not available.
6. Manual Optimization:
Spark jobs need to be manually optimized and tuned for specific datasets. If we want partitioning and caching in Spark to be correct, they must be controlled manually, as the sketch below illustrates.
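A small sketch of this manual tuning follows; the dataset paths, the join key user_id, and the partition count of 200 are hypothetical, and an existing SparkSession named spark is assumed.

# Partitioning and caching are chosen by hand; Spark does not infer them for us
events = spark.read.parquet("events.parquet")
users = spark.read.parquet("users.parquet")

events = events.repartition(200, "user_id").cache()   # manual partition count and cache decision
joined = events.join(users, "user_id")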
7. Iterative Processing:
In Spark, the data iterates in batches, and each iteration is scheduled and executed as a separate job.
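A toy sketch of that pattern is below; the ranks data and the update formula are illustrative only, and an existing SparkContext named sc is assumed.

ranks = sc.parallelize([("a", 1.0), ("b", 1.0)])
for i in range(10):
    ranks = ranks.mapValues(lambda r: 0.85 * r + 0.15)
    ranks.count()   # each iteration is scheduled and run as its own job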
8. Latency:
Apache Spark has higher latency as compared to Apache Flink.
9. Window Criteria:
Spark does not support record-based window criteria; it only has time-based window criteria.
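For illustration, the sketch below builds a time-based window with Spark's DataFrame API; the events DataFrame and its event_time column are assumptions. There is no equivalent count-based (record-based) window.

from pyspark.sql import functions as F

# Time-based window: aggregate events into 10-minute buckets by event_time
windowed = (events
            .groupBy(F.window(F.col("event_time"), "10 minutes"))
            .count())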
10. Back Pressure Handling:
Back pressure is the build-up of data at an input/output channel when the buffer is full and unable to receive additional incoming data. No data is transferred until the buffer is emptied. Apache Spark cannot handle this back pressure implicitly; it has to be handled manually.
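As a sketch of that manual handling for the DStream API, back pressure has to be switched on explicitly through configuration; the rate cap of 1000 records per second is an illustrative assumption.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("BackpressureSketch")
        .set("spark.streaming.backpressure.enabled", "true")   # not enabled by default
        .set("spark.streaming.receiver.maxRate", "1000"))      # manual cap on receiver rate (records/sec)
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)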
These are some of the major pros and cons of Apache Spark. We can overcome these limitations of Spark by using Apache Flink – 4G of Big Data.
Finally, we can say that though Spark has many drawbacks, it is still popular in the market for Big Data solutions. But there are various technologies overtaking Spark; for example, stream processing is much better with Apache Flink than with Apache Spark, as Flink processes data in real time.
Thank
You!