How to read JSON with a schema in Spark DataFrames/Spark SQL?

Question

Spark SQL/DataFrames: please help me out, or suggest a good way to read this JSON.

{
    "billdate":"2016-08-08",
    "accountid":"xxx"
    "accountdetails":{
        "total":"1.1"
        "category":[
        {
            "desc":"one",
            "currentinfo":{
            "value":"10"
        },
            "subcategory":[
            {
                "categoryDesc":"sub",
                "value":"10",
                "currentinfo":{
                    "value":"10"
                }
            }]
        }]
    }
}

Thanks,

Stephen Rauch · Accepted Answer · 2018-08-07 00:42:19Z

20

You can use the following code to read JSON files against a schema in Spark 2.2:

import org.apache.spark.sql.types.{DataType, StructType}

// Infer the schema from a representative file and serialize it to a JSON string
val schema_json = spark.read.json("/user/Files/ActualJson.json").schema.json

// Rebuild a StructType from the serialized schema
val newSchema = DataType.fromJson(schema_json).asInstanceOf[StructType]

// Read the JSON files using that schema instead of inferring it again
val df = spark.read.schema(newSchema).json("Json_Files/Folder Path")
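If the goal is to keep only some fields and flatten them, the schema can also be written by hand instead of inferred. The following is a minimal sketch, assuming Spark 2.2 or later; the names `billSchema`, `sample`, `bills` and `flat` are made up for illustration, and the sample record is a valid single-line version of the question's JSON (with the missing commas added) standing in for a real file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Local session for the sketch; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-schema-sketch").getOrCreate()
import spark.implicits._

// Hand-written schema naming only the fields to keep.
// JSON keys that are not listed here ("total", the category-level "currentinfo") are dropped.
val subcategory = StructType(Seq(
  StructField("categoryDesc", StringType),
  StructField("value", StringType)))
val category = StructType(Seq(
  StructField("desc", StringType),
  StructField("subcategory", ArrayType(subcategory))))
val billSchema = StructType(Seq(
  StructField("billdate", StringType),
  StructField("accountid", StringType),
  StructField("accountdetails", StructType(Seq(
    StructField("category", ArrayType(category)))))))

// Assumed sample data: one JSON document per string.
val sample = Seq(
  """{"billdate":"2016-08-08","accountid":"xxx","accountdetails":{"total":"1.1","category":[{"desc":"one","currentinfo":{"value":"10"},"subcategory":[{"categoryDesc":"sub","value":"10","currentinfo":{"value":"10"}}]}]}}"""
).toDS()

// Apply the schema while reading; no inference pass is needed.
val bills = spark.read.schema(billSchema).json(sample)

// Flatten with one explode per select: categories first, then their subcategories.
val flat = bills
  .select(col("billdate"), col("accountid"), explode(col("accountdetails.category")).as("cat"))
  .select(col("billdate"), col("accountid"), col("cat.desc").as("desc"),
    explode(col("cat.subcategory")).as("sub"))
  .select(col("billdate"), col("accountid"), col("desc"),
    col("sub.categoryDesc").as("subDesc"), col("sub.value").as("subValue"))

flat.show()
```

For a pretty-printed file where one document spans several lines, the same schema can be passed to `spark.read.schema(billSchema).option("multiLine", true).json(path)`; without that option Spark treats every line as a separate record.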

edited Aug 7, 2018 at 0:42

Stephen Rauch♦

answered Aug 7, 2018 at 0:22

Raghavan

Ram Ghadiyaram · Accepted Answer · 2016-09-06 19:11:37Z

11

It seems your JSON is not valid. Please check it with http://www.jsoneditoronline.org/

Please see an-introduction-to-json-support-in-spark-sql.html
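Spark itself can also hint at invalid JSON: in the default permissive mode, a record that cannot be parsed ends up in a `_corrupt_record` column instead of being split into fields. The sketch below assumes Spark 2.2 or later (for reading JSON from a `Dataset[String]`); the two shortened records and all variable names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-validity-sketch").getOrCreate()
import spark.implicits._

// Assumed sample records: the first is missing the comma after "xxx", as in the question.
val broken = Seq("""{"billdate":"2016-08-08", "accountid":"xxx" "accountdetails":{"total":"1.1"}}""").toDS()
val fixed = Seq("""{"billdate":"2016-08-08", "accountid":"xxx", "accountdetails":{"total":"1.1"}}""").toDS()

val brokenDf = spark.read.json(broken)
val fixedDf = spark.read.json(fixed)

// A `_corrupt_record` column in the inferred schema signals unparseable input.
val brokenIsCorrupt = brokenDf.columns.contains("_corrupt_record")
val fixedIsCorrupt = fixedDf.columns.contains("_corrupt_record")

brokenDf.printSchema()
fixedDf.printSchema()
```

Checking the inferred schema this way is only a quick diagnostic; a JSON validator still gives a more precise error location.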

If you want to query the data as a table, you can register the DataFrame as a temporary table and print its schema, as shown below.

// Load the (valid) JSON file, register it as a temp table, and print the inferred schema
DataFrame df = sqlContext.read().json("/path/to/validjsonfile");
df.registerTempTable("df");
df.printSchema();

Below is a sample code snippet that selects a top-level field and then a field nested inside it:

// Select a top-level struct column
DataFrame app = df.select("toplevel");
app.registerTempTable("toplevel");
app.printSchema();
app.show();

// Select a field nested inside that struct
DataFrame appName = app.select("toplevel.sublevel");
appName.registerTempTable("sublevel");
appName.printSchema();
appName.show();

Example with Scala, using the following people.json:

{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}
{"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}
{"name":"Justin", "cities":["portland"], "schools":[{"sname":"berkeley", "year":2014}]}

val people = sqlContext.read.json("people.json")
people: org.apache.spark.sql.DataFrame

Reading top level field

val names = people.select('name).collect()
names: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])

names.map(row => row.getString(0))
res88: Array[String] = Array(Michael, Andy, Justin)

Use the select() method to specify the top-level field, collect() to collect it into an Array[Row], and the getString() method to access a column inside each Row.

Flatten and Read a JSON Array

Each person has an array of "cities". Let's flatten these arrays and read out all their elements.

val flattened = people.explode("cities", "city"){c: List[String] => c}
flattened: org.apache.spark.sql.DataFrame

val allCities = flattened.select('city).collect()
allCities: Array[org.apache.spark.sql.Row]

allCities.map(row => row.getString(0))
res92: Array[String] = Array(palo alto, menlo park, santa cruz, portland)

The explode() method explodes, or flattens, the cities array into a new column named "city". We then use select() to select the new column, collect() to collect it into an Array[Row], and getString() to access the data inside each Row.
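`DataFrame.explode` has been deprecated since Spark 2.0, so on newer versions the same flattening is usually written with the `explode` column function. The following is a minimal sketch, assuming Spark 2.2 or later and a `SparkSession`; the records are a trimmed copy of the people.json sample, and the variable names mirror the ones above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Local session for the sketch; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Trimmed copy of the people.json sample, one record per string.
val people = spark.read.json(Seq(
  """{"name":"Michael", "cities":["palo alto", "menlo park"]}""",
  """{"name":"Andy", "cities":["santa cruz"]}""",
  """{"name":"Justin", "cities":["portland"]}"""
).toDS())

// explode() emits one output row per array element, here in a new column "city".
val flattened = people.select(col("name"), explode(col("cities")).as("city"))

// Read the single string column straight into an Array[String].
val allCities = flattened.select("city").as[String].collect()

flattened.show()
```

Using `.as[String]` replaces the manual `row.getString(0)` mapping, at the cost of requiring `spark.implicits._` for the encoder.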

Read an Array of Nested JSON Objects, Unflattened

Next, read out the "schools" data, which is an array of nested JSON objects. Each element of the array holds the school name and year:

val schools = people.select('schools).collect()
schools: Array[org.apache.spark.sql.Row]

val schoolsArr = schools.map(row => row.getSeq[org.apache.spark.sql.Row](0))
schoolsArr: Array[Seq[org.apache.spark.sql.Row]]

schoolsArr.foreach(schools => {
    // Print each (name, year) pair; foreach is used because only the side effect matters
    schools.foreach(row => print((row.getString(0), row.getLong(1))))
    print("\n")
})
(stanford,2010)(berkeley,2012) 
(ucsb,2011) 
(berkeley,2014)

Use select() and collect() to select the "schools" array and collect it into an Array[Row]. Each "schools" array is of type Seq[Row], so we read it out with the getSeq[Row]() method. Finally, we read the information for each individual school by calling getString() for the school name and getLong() for the school year.
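If a flat table is wanted rather than nested rows on the driver, the array of structs can be exploded and its fields selected by dotted path. This is a sketch under the same assumptions as before (Spark 2.2 or later, a `SparkSession`); the records are a trimmed copy of the people.json sample, and `schoolRows` is a name made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Local session for the sketch; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("nested-array-sketch").getOrCreate()
import spark.implicits._

// Trimmed copy of the people.json sample, one record per string.
val people = spark.read.json(Seq(
  """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}""",
  """{"name":"Andy", "schools":[{"sname":"ucsb", "year":2011}]}""",
  """{"name":"Justin", "schools":[{"sname":"berkeley", "year":2014}]}"""
).toDS())

// One row per (person, school); struct fields are reached with a dotted path.
val schoolRows = people
  .select(col("name"), explode(col("schools")).as("school"))
  .select(col("name"), col("school.sname").as("sname"), col("school.year").as("year"))

// JSON integers are inferred as LongType, hence the Long in the tuple type.
val collected = schoolRows.as[(String, String, Long)].collect()

schoolRows.show()
```

The resulting DataFrame can be registered as a temporary view and queried with SQL like any other table.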

edited Sep 6, 2016 at 19:11

answered Sep 6, 2016 at 18:40

Ram Ghadiyaram

2 Comments

raj kumar Over a year ago

Hi RamPrasad, thanks for the prompt reply; I will try out the examples provided. By the way, the JSON I provided is valid: { "billdate":"2016-08-08", "accountid":"xxx", "accountdetails":{ "total":"1.1", "category":[ { "desc":"one", "currentinfo":{"value":"10"}, "subcategory":[ { "categoryDesc":"sub", "value":"10", "currentinfo":{ "value":"10" }}] }]} } Do you have any insight on how to read the JSON with a predefined schema? If so, please let me know. Thanks, I appreciate your help!

raj kumar Over a year ago

I have more fields in the JSON than I have mentioned here, so I want to set my schema while reading the JSON, extract only those fields, and flatten them into tables.
