How to Convert a Spark DataFrame String Type Column to Array Type and Split All the JSON Strings in This Column into Rows : Part - 2
Hi Friends,
In this post, I'd like to explore a project scenario involving JSON data.
Suppose we receive a DataFrame from a source with a column named ArrayOfJsonStrings. This column actually holds an array of JSON documents, but its data type is String.
We need to split all the JSON documents in this ArrayOfJsonStrings column into as many rows as there are documents.
This use case was already explained in detail in the previous post. In this post, I'll walk through another approach to solve the same use case and get the expected output.
Below are the Input and Output DataFrames:
Input DataFrame :
Output DataFrame :
Below is the code with explanation and output.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SplitArrayOfJsonsToRows {

  def main(args: Array[String]): Unit = {

    // Creating the SparkSession
    lazy val conf = new SparkConf()
      .setAppName("split-array-of-json-to-row")
      .set("spark.default.parallelism", "1")
      .setIfMissing("spark.master", "local[*]")
    lazy val sparkSession = SparkSession.builder().config(conf).getOrCreate()

    // Creating the raw DataFrame: the ArrayOfJsonStrings column actually holds an array of JSONs,
    // but its data type is string, and it needs to be split into one row per JSON document.
    val rawDF = sparkSession.sql(""" select string ("1") as id """)
      .withColumn("ArrayOfJsonStrings", lit("""[{"First":{"Info":"ABCD123","Res":"1.0"}},{"Second":{"Info":"EFGH456","Res":"2.0"}},{"Third":{"Info":"IJKL789","Res":"3.0"}}]"""))
    rawDF.show(false)

    // Printing the schema to show that the column is of string type
    rawDF.printSchema()

    // To turn the string column into a valid array, we first replace "}," with "}},",
    // then remove the surrounding "[" and "]", and finally split on "},".
    // The split yields an array, and explode() turns that array into one row per JSON document.
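    // For illustration, this is how the sample value changes at each step
    // (writing the inner {"Info":...,"Res":...} object as {...} for brevity):
    //   original string:               [{"First":{...}},{"Second":{...}},{"Third":{...}}]
    //   after replacing "}," -> "}},": [{"First":{...}}},{"Second":{...}}},{"Third":{...}}]
    //   after removing "[" and "]":    {"First":{...}}},{"Second":{...}}},{"Third":{...}}
    //   after splitting on "},":       three elements: {"First":{...}}  {"Second":{...}}  {"Third":{...}}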
    val splitJsontoRow = rawDF.selectExpr("id", """explode(split(regexp_replace(regexp_replace(ArrayOfJsonStrings,'(\\\},)','}},'),'(\\\[|\\\])',''),"},")) as splittedJson""")
    splitJsontoRow.show(false)
  }
}
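If you prefer the DataFrame API over a SQL expression string, the same three steps can also be written with regexp_replace, split and explode from org.apache.spark.sql.functions, which avoids the nested escaping inside selectExpr. Below is a minimal sketch of that equivalent form; it assumes the rawDF and imports defined above are in scope, and splitJsontoRowApi is just an illustrative name.

    val splitJsontoRowApi = rawDF.select(
      col("id"),
      explode(
        split(
          regexp_replace(
            regexp_replace(col("ArrayOfJsonStrings"), "\\},", "}},"), // keep the closing brace of each document
            "\\[|\\]", ""),                                           // drop the surrounding array brackets
          "\\},")                                                     // split on "}," to get one element per document
      ).as("splittedJson"))
    splitJsontoRowApi.show(false)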
Output for the above steps in the code:
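As an optional next step, each value of splittedJson now holds one complete JSON document, so it can be parsed further. Below is a minimal sketch that reads each document into a map of maps with from_json, assuming every document has the {"Key":{"Info":...,"Res":...}} shape of the sample data; parsedDF and parsedJson are just illustrative names, and splitJsontoRow from the code above is assumed to be in scope.

    import org.apache.spark.sql.types.{MapType, StringType}

    // Parse each splitted JSON string into a map<string, map<string, string>> column
    val parsedDF = splitJsontoRow.withColumn(
      "parsedJson",
      from_json(col("splittedJson"), MapType(StringType, MapType(StringType, StringType))))
    parsedDF.printSchema()
    parsedDF.show(false)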
I hope this post was helpful. Please do like, comment, and share.
Thank You!