11

sql/dataframes, please help me out or provide some good suggestion on how to read this json

{
    "billdate":"2016-08-08",
    "accountid":"xxx"
    "accountdetails":{
        "total":"1.1"
        "category":[
        {
            "desc":"one",
            "currentinfo":{
            "value":"10"
        },
            "subcategory":[
            {
                "categoryDesc":"sub",
                "value":"10",
                "currentinfo":{
                    "value":"10"
                }
            }]
        }]
    }
}

Thanks,

0

2 Answers 2

20

You can try the following code to read the JSON file based on Schema in Spark 2.2

import org.apache.spark.sql.types.{DataType, StructType}

//Read Json Schema and Create Schema_Json
val schema_json=spark.read.json("/user/Files/ActualJson.json").schema.json

//add the schema 
val newSchema=DataType.fromJson(schema_json).asInstanceOf[StructType]

//read the json files based on schema
val df=spark.read.schema(newSchema).json("Json_Files/Folder Path")
Sign up to request clarification or add additional context in comments.

Comments

11

Seems like your json is not valid. pls check with http://www.jsoneditoronline.org/

Please see an-introduction-to-json-support-in-spark-sql.html

if you want to register as the table you can register like below and print the schema.

DataFrame df = sqlContext.read().json("/path/to/validjsonfile").toDF();
    df.registerTempTable("df");
    df.printSchema();

Below is sample code snippet

DataFrame app = df.select("toplevel");
        app.registerTempTable("toplevel");
        app.printSchema();
        app.show();
DataFrame appName = app.select("toplevel.sublevel");
        appName.registerTempTable("sublevel");
        appName.printSchema();
        appName.show();

Example with scala :

{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}
{"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}
{"name":"Justin", "cities":["portland"], "schools":[{"sname":"berkeley", "year":2014}]}

 val people = sqlContext.read.json("people.json")
people: org.apache.spark.sql.DataFrame

Reading top level field

val names = people.select('name).collect()
names: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])

 names.map(row => row.getString(0))
res88: Array[String] = Array(Michael, Andy, Justin)

Use the select() method to specify the top-level field, collect() to collect it into an Array[Row], and the getString() method to access a column inside each Row.

Flatten and Read a JSON Array

each Person has an array of "cities". Let's flatten these arrays and read out all their elements.

val flattened = people.explode("cities", "city"){c: List[String] => c}
flattened: org.apache.spark.sql.DataFrame

val allCities = flattened.select('city).collect()
allCities: Array[org.apache.spark.sql.Row]

 allCities.map(row => row.getString(0))
res92: Array[String] = Array(palo alto, menlo park, santa cruz, portland)

The explode() method explodes, or flattens, the cities array into a new column named "city". We then use select() to select the new column, collect() to collect it into an Array[Row], and getString() to access the data inside each Row.

Read an Array of Nested JSON Objects, Unflattened

read out the "schools" data, which is an array of nested JSON objects. Each element of the array holds the school name and year:

 val schools = people.select('schools).collect()
schools: Array[org.apache.spark.sql.Row]


val schoolsArr = schools.map(row => row.getSeq[org.apache.spark.sql.Row](0))
schoolsArr: Array[Seq[org.apache.spark.sql.Row]]

 schoolsArr.foreach(schools => {
    schools.map(row => print(row.getString(0), row.getLong(1)))
    print("\n")
 })
(stanford,2010)(berkeley,2012) 
(ucsb,2011) 
(berkeley,2014)

Use select() and collect() to select the "schools" array and collect it into an Array[Row]. Now, each "schools" array is of type List[Row], so we read it out with the getSeq[Row]() method. Finally, we can read the information for each individual school, by calling getString() for the school name and getLong() for the school year.

2 Comments

Hi RamPrasad,thanks for the prompt reply, I will try out the examples provided. btw the json i provided is a valid one. { "billdate":"2016-08-08", "accountid":"xxx", "accountdetails":{ "total":"1.1", "category":[ { "desc":"one", "currentinfo":{"value":"10"}, "subcategory":[ { "categoryDesc":"sub", "value":"10", "currentinfo":{ "value":"10" }}] }]} } do you have any insight on how to read the json with predefined schema? if yes pls let me know, Thanks and Appreciate your help!!
i have more fields in the json than what i have mentioned here, so I want to set my schema while reading the json and extract only those filed and flattern to tables.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.