Create Schema in Spark using inferSchema and StructType
Dear friends,
Today we'll learn how to create a schema in Spark using inferSchema and StructType.
Use Case: Suppose I have a file with a header. Now,
1. I want to use the header of the file as the schema (in this case we will use inferSchema).
2. I don't want to use the header of the file as the schema; instead, we want to create a custom schema for our data (in this case we will use StructType to create the schema).
Note: In real-world work we mostly get files/data that already have a header, but we often need to create a custom schema for our business, so we use StructType to create the custom schema; only rarely do we use the header of the file as the schema.
For example, we will use the dataset below, which has a header, for both of the above scenarios (file name: test_file_with_header.csv).
cust_no,cust_nm,orders,price,city
1,Divyanshu,Laptop,75000,Delhi
2,Himanshu,Laptop,95000,Delhi
3,Sushil,Hard Disk,20000,Delhi
4,Sidharth,Mobile,75000,Kolkata
5,Surendra,Desktop,70000,Kolkata
6,Mohit,Pen Drive,25000,Kolkata
7,Mohit,Laptop,65000,Hyderabad
8,Prabhu,Laptop,50000,Hyderabad
9,Rajesh,Head Phone,10000,Hyderabad
10,Mohit,Mobile,40000,Chennai
11,Sonia,Desktop,25000,Chennai
12,Deepthi,Head Phone,10000,Chennai
13,Priyanka,Desktop,56000,Noida
14,Prashant,Hard Disk,35000,Noida
15,Ayush,Mobile,15000,Noida
Case 1: Using inferSchema we can generate the schema from the file itself, since the file has a header.
Using the command below we can create the schema for the above file. Here we need to set both header and inferSchema to true using .option; the schema will then be picked up as cust_no, cust_nm, orders, price, city, which is the header of the file.
First we need to create an SQLContext using the following command (here sc is the SparkContext object):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Now we'll use the command below to read the file with the inferred schema:

val schema_with_infer = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")
Finally, we can see the output with the schema in a tabular format:
schema_with_infer.show
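As a quick optional check, we can also print the inferred schema to verify the column types; for this dataset, inferSchema should detect cust_no and price as integer columns and the rest as strings:

// Optional: print the inferred schema as a tree to verify the types.
schema_with_infer.printSchema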
Case 2: Using StructType we can create a custom schema for our data even when the file has a header.
In our file, the header is:
cust_no,cust_nm,orders,price,city
And we want to create the schema as:
customer_no, Customer_Name, Customer_Order, Price_of_Order, Customer_City
First we need to import the classes below; then we can create a StructType value, Custom_Schema, which holds our custom schema:
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val Custom_Schema = StructType(
  Array(
    StructField("customer_no", IntegerType, nullable = false),
    StructField("Customer_Name", StringType, nullable = false),
    StructField("Customer_Order", StringType, nullable = false),
    StructField("Price_of_Order", IntegerType, nullable = false),
    StructField("Customer_City", StringType, nullable = false)
  )
)
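Before using it, we can optionally verify the schema object itself; a StructType can print its fields as a tree (assuming your Spark version provides printTreeString, which recent versions do):

// Optional: inspect the custom schema definition we just built.
Custom_Schema.printTreeString()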
Now, as in Case 1, the next step is to create the SQLContext using the following command (here sc is the SparkContext object):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
And now, using the following command, we will apply our custom schema as the column names. Note that we still set header to true so that the header line of the file is skipped rather than read as a data row; the column names come from our custom schema, not from the header:
val use_custom_schema = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  schema(Custom_Schema).
  load("/user/g.aspiresit.001/Anamika/test_file_with_header.csv")
Finally, let's see the output with our custom schema:
use_custom_schema.show
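As a side note: on Spark 2.x and later, both cases can be handled with the built-in CSV reader on SparkSession, with no com.databricks.spark.csv package and no separate SQLContext. Below is a minimal sketch, assuming spark is the SparkSession object that spark-shell provides, and reusing the file paths and Custom_Schema from above:

// Case 1 on Spark 2.x+: infer the schema from the header.
val inferred = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")

// Case 2 on Spark 2.x+: skip the header row and apply our custom schema.
val custom = spark.read.
  option("header", "true").
  schema(Custom_Schema).
  csv("/user/g.aspiresit.001/Anamika/test_file_with_header.csv")

inferred.show
custom.show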
Finally, we got our solutions! Thank you so much, friends, for reading this. If you have any questions or suggestions, kindly leave a comment, and I'll get back to you soon.
Let's meet again with another topic and amazing solutions in the next blog.
Thank You!