Interacting Spark with Hive
My dear friends, today we will learn that in Spark
how to interact with hive and do whatever we want!
Sample data to use for Spark-Hive Interaction :
File Name : Hive_Test.csv
111,Raghav,Delhi
222,Hemant,Bangalore
333,Anushka,Chennai
444,John,Mumbai
555,Karan,Delhi
First we need to import below classes :
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import spark.implicits._
import spark.sql
Now we have to create a
variable warehouseLocation.
val warehouseLocation =
"spark-warehouse"
Next, we need to run below
command to get required all configuration :
val Hive_with_Spark =
SparkSession.builder().appName("Spark Hive Example").
config("spark.sql.warehouse.dir",
warehouseLocation).
enableHiveSupport().getOrCreate()
Fianally, We are ready
to work with Hive using Spark as below commands :
Here, first I’ll show you in
hive that there is no database named learning.
But after run the below
command we will be able to find this database learning along with table spark_table
and whatever we did in Hive. Let’s start!
Using below code, I’ll create
the database learning, then using
next command create the hive table spark_table
and load the data into it from local, then see the output of table.
sql("create database
learning")
sql("use learning")
sql("use learning")
sql("CREATE TABLE IF NOT EXISTS spark_table (Emp_No int, Emp_Name string, Emp_City string)row format delimited fields terminated by ','")
sql("LOAD DATA LOCAL INPATH '/home/g.aspiredev.001/hive_test.csv' INTO TABLE spark_table")
sql("SELECT * FROM spark_table ").show()
Finally, we got our desired output, let’s verify
in hive :
**********************************
Same above
output we can achieve using below codes too, where we are using Spark variables
to store all this process, which can be use with another variables for different operations or store to any different locations as per our needs.
Here I’ll show
you the same above code using Spark variables with little highlighted difference, but both will give the same output.
1. Create Spark Variables using above created variable Hive_with_Spark.
val create_db = Hive_with_Spark.sql("create
database learning1")
val use_db = Hive_with_Spark.sql("use
learning1")
val create_tbl = Hive_with_Spark.sql("CREATE TABLE
IF NOT EXISTS spark_table1 (Emp_No int, Emp_Name string, Emp_City string)row
format delimited fields terminated by ','")
val load_data = Hive_with_Spark.sql("LOAD DATA LOCAL INPATH '/home/g.aspiredev.001/hive_test.csv'
INTO TABLE spark_table1")
val select_data = Hive_with_Spark.sql("SELECT * FROM spark_table1
")
val show_data = select_data.show()
Let’s verify in hive (database learning1 and table spark_table1 created and loaded data) :
2. Create Spark Variables without using above created variable Hive_with_Spark.
val create_db = sql("create database
learning2")
val use_db = sql("use learning2")
val create_tbl = sql("CREATE TABLE IF NOT EXISTS
spark_table2 (Emp_No int, Emp_Name string, Emp_City string)row format delimited
fields terminated by ','")
val load_data = sql("LOAD DATA LOCAL INPATH '/home/g.aspiredev.001/hive_test.csv'
INTO TABLE spark_table2")
val select_data = sql("SELECT * FROM spark_table2 ")
val show_data = select_data.show()
Let’s verify in hive (database learning2 and table spark_table2 created and loaded data) :
Thank
You!
Should not it be df=Hive_with_Spark.sql("sql query")
ReplyDeleteWe can also communicate directly sqlContext.sql(sql query)..
ReplyDeleteHive-site.xml needs to be in spark/conf