Create Schema in Spark using inferSchema and StructType



Dear friends,
Today we’ll learn to create a schema in Spark using inferSchema and StructType.

Use Case: Suppose I have a file with a header. Now,
    1. I want to use the header of the file as the schema (in this case we will use inferSchema).

    2. I don’t want to use the header of the file as the schema; I want to create a custom schema for my data. (In this case we will use StructType to create the schema.)


Note: In real-time projects we mostly get files/data with a header, but we usually need to create a custom schema for our business, so we use StructType to create the custom schema; rarely do we choose the header of the file as the schema.

For example, we will use the below dataset, which has a header, for both of the above scenarios. (File name: test_file_with_header.csv)
cust_no,cust_nm,orders,price,city
1,Divyanshu,Laptop,75000,Delhi
2,Himanshu,Laptop,95000,Delhi
3,Sushil,Hard Disk,20000,Delhi
4,Sidharth,Mobile,75000,Kolkata
5,Surendra,Desktop,70000,Kolkata
6,Mohit,Pen Drive,25000,Kolkata
7,Mohit,Laptop,65000,Hyderabad
8,Prabhu,Laptop,50000,Hyderabad
9,Rajesh,Head Phone,10000,Hyderabad
10,Mohit,Mobile,40000,Chennai
11,Sonia,Desktop,25000,Chennai
12,Deepthi,Head Phone,10000,Chennai
13,Priyanka,Desktop,56000,Noida
14,Prashant,Hard Disk,35000,Noida
15,Ayush,Mobile,15000,Noida




Case 1: Using inferSchema we can generate the schema from the file itself, as the file has a header.

Using the below command we can create the schema for the above file: we need to enable both header and inferSchema as true using .option. Spark will then pick up cust_no, cust_nm, orders, price, city (the header of the file) as the column names and infer each column’s data type from the data.



So first we need to create a SQLContext using the following command (here sc is the SparkContext object provided by the Spark shell):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
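
If you are on Spark 2.x or later, the shell already provides a SparkSession named spark, and the CSV reader is built in, so no SQLContext or external package is needed. A minimal sketch of the same read, assuming the same file path as above:

// Spark 2.x+ alternative (sketch): `spark` is the SparkSession
// that the spark-shell creates for you.
val dfInfer = spark.read.
  option("header", "true").        // use the first line as column names
  option("inferSchema", "true").   // let Spark infer each column's type
  csv("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")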


Now we’ll use the below command to create the DataFrame with the inferred schema:

val schema_with_infer = sqlContext.read.
  format("com.databricks.spark.csv").   // spark-csv package (Spark 1.x)
  option("header", "true").             // use the first line as column names
  option("inferSchema", "true").        // infer each column's data type
  load("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")





Finally, we can see the output with the inferred schema in a tabular format:

schema_with_infer.show
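
To confirm what inferSchema actually deduced, we can also print the schema itself. With the data above, the numeric-looking columns should come back as integers and the rest as strings, roughly like this:

schema_with_infer.printSchema
// Expected output (roughly):
// root
//  |-- cust_no: integer (nullable = true)
//  |-- cust_nm: string (nullable = true)
//  |-- orders: string (nullable = true)
//  |-- price: integer (nullable = true)
//  |-- city: string (nullable = true)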



Case 2: Using StructType we can create a custom schema for our data even when the file has a header.

In our file, the header is:
cust_no,cust_nm,orders,price,city
And we want the schema to be:
customer_no, Customer_Name, Customer_Order, Price_of_Order, Customer_City

First we need to import the Spark SQL types below; then we can create Custom_Schema, a StructType value describing our custom schema (note: this is a schema object, not an RDD):



import org.apache.spark.sql.types._   // StructType, StructField, IntegerType, StringType

val Custom_Schema = StructType(
  Array(
    StructField("customer_no", IntegerType, nullable = false),
    StructField("Customer_Name", StringType, nullable = false),
    StructField("Customer_Order", StringType, nullable = false),
    StructField("Price_of_Order", IntegerType, nullable = false),
    StructField("Customer_City", StringType, nullable = false)
  )
)
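
As a side note, the same schema can also be built field by field with StructType’s add method, which some people find easier to read. A minimal sketch (Custom_Schema_Alt is just an illustrative name):

// Equivalent schema built incrementally (sketch).
val Custom_Schema_Alt = new StructType().
  add("customer_no", IntegerType, nullable = false).
  add("Customer_Name", StringType, nullable = false).
  add("Customer_Order", StringType, nullable = false).
  add("Price_of_Order", IntegerType, nullable = false).
  add("Customer_City", StringType, nullable = false)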




The next step is to create a SQLContext using the following command (skip this if you already created sqlContext in Case 1; here sc is the SparkContext object):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

And now, using the following command, we apply our custom schema as the column names. Note that we still set header to true: the file’s own header line must be skipped, otherwise it would be read as a data row.

val use_custom_schema = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").   // skip the file's header line; our schema supplies the names
  schema(Custom_Schema).      // apply the custom column names and types
  load("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")



Finally, let’s see the output with our custom schema:

use_custom_schema.show
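
Once the custom names are in place, we can use them directly in queries. For example, a quick aggregation by city (a small illustration using the columns we just defined):

// Total order price per city, using our custom column names.
use_custom_schema.
  groupBy("Customer_City").
  sum("Price_of_Order").
  show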



Finally, we got our solutions! Thank you so much, friends, for reading this. If you have any questions or suggestions, kindly leave a comment and I’ll get back to you soon.
Let’s meet again with another topic and amazing solutions in the next blog.
Thank You!




