Create Schema in Spark using inferSchema and StructType



Dear friends,
Today we’ll learn to create a schema in Spark using inferSchema and StructType.

Use Case: Suppose I have a file with a header. Now,
    1. I want to use the header of the file as the schema (in this case we will use inferSchema).

    2. I don’t want to use the header of the file as the schema; I want to create a custom schema for my data. (In this case we will use StructType to create the schema.)


Note: In real-time projects we mostly get files/data with a header, but we usually need to create a custom schema for our business, so we use StructType to create the custom schema; rarely do we choose the header of the file as the schema.

For example, we will use the below dataset, which has a header, for both of the above scenarios. (File name: test_file_with_header.csv)
cust_no,cust_nm,orders,price,city
1,Divyanshu,Laptop,75000,Delhi
2,Himanshu,Laptop,95000,Delhi
3,Sushil,Hard Disk,20000,Delhi
4,Sidharth,Mobile,75000,Kolkata
5,Surendra,Desktop,70000,Kolkata
6,Mohit,Pen Drive,25000,Kolkata
7,Mohit,Laptop,65000,Hyderabad
8,Prabhu,Laptop,50000,Hyderabad
9,Rajesh,Head Phone,10000,Hyderabad
10,Mohit,Mobile,40000,Chennai
11,Sonia,Desktop,25000,Chennai
12,Deepthi,Head Phone,10000,Chennai
13,Priyanka,Desktop,56000,Noida
14,Prashant,Hard Disk,35000,Noida
15,Ayush,Mobile,15000,Noida




Case 1: Using inferSchema we can generate the schema from the file itself, as the file has a header.

Using the below command we can create the schema for the above file: we need to enable both header and inferSchema as true using .option. Spark will then pick up cust_no, cust_nm, orders, price, city (the header of the file) as the column names and infer each column’s data type from the data.



So first we need to create a SQLContext using the following command (here sc is the SparkContext object provided by the Spark shell):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
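
If you are on Spark 2.x or later, the shell already provides a SparkSession named spark, and the CSV reader is built in, so no SQLContext or external package is needed. A minimal sketch of the same read, assuming the same file path as above:

// Spark 2.x+ alternative (sketch): `spark` is the SparkSession
// that the spark-shell creates for you.
val dfInfer = spark.read.
  option("header", "true").        // use the first line as column names
  option("inferSchema", "true").   // let Spark infer each column's type
  csv("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")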


Now we’ll use the below command to create the DataFrame with the inferred schema:

val schema_with_infer = sqlContext.read.
  format("com.databricks.spark.csv").   // spark-csv package (Spark 1.x)
  option("header", "true").             // use the first line as column names
  option("inferSchema", "true").        // infer each column's data type
  load("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")





Finally, we can see the output with the inferred schema in a tabular format:

schema_with_infer.show
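
To confirm what inferSchema actually deduced, we can also print the schema itself. With the data above, the numeric-looking columns should come back as integers and the rest as strings, roughly like this:

schema_with_infer.printSchema
// Expected output (roughly):
// root
//  |-- cust_no: integer (nullable = true)
//  |-- cust_nm: string (nullable = true)
//  |-- orders: string (nullable = true)
//  |-- price: integer (nullable = true)
//  |-- city: string (nullable = true)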



Case 2: Using StructType we can create a custom schema for our data even when the file has a header.

In our file, the header is:
cust_no,cust_nm,orders,price,city
And we want the schema to be:
customer_no, Customer_Name, Customer_Order, Price_of_Order, Customer_City

First we need to import the Spark SQL types below; then we can create Custom_Schema, a StructType value describing our custom schema (note: this is a schema object, not an RDD):



import org.apache.spark.sql.types._   // StructType, StructField, IntegerType, StringType

val Custom_Schema = StructType(
  Array(
    StructField("customer_no", IntegerType, nullable = false),
    StructField("Customer_Name", StringType, nullable = false),
    StructField("Customer_Order", StringType, nullable = false),
    StructField("Price_of_Order", IntegerType, nullable = false),
    StructField("Customer_City", StringType, nullable = false)
  )
)
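
As a side note, the same schema can also be built field by field with StructType’s add method, which some people find easier to read. A minimal sketch (Custom_Schema_Alt is just an illustrative name):

// Equivalent schema built incrementally (sketch).
val Custom_Schema_Alt = new StructType().
  add("customer_no", IntegerType, nullable = false).
  add("Customer_Name", StringType, nullable = false).
  add("Customer_Order", StringType, nullable = false).
  add("Price_of_Order", IntegerType, nullable = false).
  add("Customer_City", StringType, nullable = false)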




The next step is to create a SQLContext using the following command (skip this if you already created sqlContext in Case 1; here sc is the SparkContext object):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

And now, using the following command, we apply our custom schema as the column names. Note that we still set header to true: the file’s own header line must be skipped, otherwise it would be read as a data row.

val use_custom_schema = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").   // skip the file's header line; our schema supplies the names
  schema(Custom_Schema).      // apply the custom column names and types
  load("/user/g.aspiredev.001/Anamika/test_file_with_header.csv")



Finally, let’s see the output with our custom schema:

use_custom_schema.show
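
Once the custom names are in place, we can use them directly in queries. For example, a quick aggregation by city (a small illustration using the columns we just defined):

// Total order price per city, using our custom column names.
use_custom_schema.
  groupBy("Customer_City").
  sum("Price_of_Order").
  show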



Finally, we got our solutions! Thank you so much, friends, for reading this. If you have any questions or suggestions, kindly leave a comment and I’ll get back to you soon.
Let’s meet again with another topic and amazing solutions in the next blog.
Thank You!




