Posts

Showing posts from December, 2018

Partitions in Hive

Hello friends, today we will discuss partitioning in Hive and ways to use it. A major problem with Hive is that, on an unpartitioned table, even a simple query with a WHERE clause reads the entire dataset. This reduces efficiency and becomes a bottleneck when we run queries on large tables, but the issue can be overcome by partitioning Hive tables. Partitions in Hive: Hive partitioning is a way to organize a table into parts based on the values of partition keys such as date, city, and department. Partitioning is helpful when the table has one or more partition keys. Partition keys are the basic elements that determine how the data is stored in the table. In the case of tables that are not partitioned, all the files in the table’s data directory are read, and the filters are then applied in a subsequent phase. This becomes a slow and expensi...
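To make the idea of skipping data concrete, here is a small sketch in plain Python. It is not Hive itself: it only imitates the one-directory-per-partition-value layout (for example `city=Pune/`) and counts how many files a filtered read has to open. The helper names `write_partitioned` and `scan`, and the sample `sales` rows, are made up for this illustration.

```python
import csv
import os
import tempfile


def write_partitioned(root, rows, key):
    """Write rows into one sub-directory per partition value, e.g. root/city=Pune/."""
    for row in rows:
        part_dir = os.path.join(root, f"{key}={row[key]}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0.csv"), "a", newline="") as f:
            # The partition column lives in the directory name, not in the file.
            csv.writer(f).writerow([v for k, v in row.items() if k != key])


def scan(root, wanted=None):
    """Return (rows, files_read). With `wanted`, only the matching partition is opened."""
    rows, files_read = [], 0
    for part in sorted(os.listdir(root)):
        value = part.split("=", 1)[1]
        if wanted is not None and value != wanted:
            continue  # partition pruning: this directory is never read
        part_dir = os.path.join(root, part)
        for name in sorted(os.listdir(part_dir)):
            files_read += 1
            with open(os.path.join(part_dir, name), newline="") as f:
                rows.extend([value] + r for r in csv.reader(f))
    return rows, files_read


# Hypothetical sample data, partitioned by city.
sales = [
    {"city": "Pune", "item": "pen", "amount": "10"},
    {"city": "Delhi", "item": "book", "amount": "250"},
    {"city": "Pune", "item": "bag", "amount": "900"},
    {"city": "Mumbai", "item": "lamp", "amount": "400"},
]
root = tempfile.mkdtemp()
write_partitioned(root, sales, "city")

all_rows, full_scan_files = scan(root)              # opens every partition
pune_rows, pruned_files = scan(root, wanted="Pune")  # opens only city=Pune
print(full_scan_files, pruned_files)
```

The filtered read touches a single directory instead of all of them, which is the saving that partitioning is meant to give when the WHERE clause is on a partition key.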

Transformations and Actions in Spark

Hello friends, today I’ll try to explain transformations and actions in Spark. A Spark RDD supports two types of operations: transformations and actions. Transformations: Transformations are operations that transform RDD data from one form to another. Because RDDs in Spark are immutable, applying a transformation to an RDD gives us a new RDD with the transformed data. Operations like map, flatMap, filter, join, and union are transformations. One point to note here is that applying a transformation to an RDD does not perform the operation immediately. Instead, Spark creates a DAG (Directed Acyclic Graph) from the applied operations, the source RDD, and the functions used for the transformation. It keeps building this graph using the references until we apply an action operation on the last lined up R...
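The lazy behaviour described above can be imitated with a toy class in plain Python. This is only a sketch and not the Spark API: `LazyRDD` is a hypothetical stand-in that records each transformation as a step in its lineage, returns a new object every time, and runs the recorded steps only when the action `collect` is called.

```python
class LazyRDD:
    """Toy stand-in for an RDD: immutable, records transformations, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage  # tuple of (kind, function) steps, in order

    def map(self, fn):
        # Transformation: nothing is computed, a new object with a longer lineage is returned.
        return LazyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: same idea as map, only the step is recorded.
        return LazyRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded steps over the source data and return the result.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyRDD([1, 2, 3, 4])
doubled = numbers.map(double)             # no work yet
big = doubled.filter(lambda x: x > 4)     # still no work
calls_before_action = len(calls)

result = big.collect()                    # the action triggers the whole chain
print(calls_before_action, result)
```

Until `collect` runs, `double` is never called, and `numbers` itself is left unchanged by the chained transformations, which mirrors the immutability and lazy evaluation described in the post.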