Avoid parsing a JSON subfield in Spark

Question

I have JSON files with a complex schema (see below) that I am reading with Spark. Some of the field names are duplicated in the source data, so Spark throws an error while reading (as expected). The duplicate names are under the storageidlist field. I would like to load the storageidlist field as an unparsed string into a string-type column and parse it manually afterwards. Is this possible in Spark?

root
 |-- errorcode: string (nullable = true)
 |-- errormessage: string (nullable = true)
 |-- ip: string (nullable = true)
 |-- label: string (nullable = true)
 |-- status: string (nullable = true)
 |-- storageidlist: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- errorcode: string (nullable = true)
 |    |    |-- errormessage: string (nullable = true)
 |    |    |-- fedirectorList: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- directorId: string (nullable = true)
 |    |    |    |    |-- errorcode: string (nullable = true)
 |    |    |    |    |-- errordesc: string (nullable = true)
 |    |    |    |    |-- metrics: string (nullable = true)
 |    |    |    |    |-- portMetricDataList: array (nullable = true)
 |    |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- data: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- ts: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |    |    |    |-- errorcode: string (nullable = true)
 |    |    |    |    |    |    |    |-- errordesc: string (nullable = true)
 |    |    |    |    |    |    |    |-- metricid: string (nullable = true)
 |    |    |    |    |    |    |    |-- portid: string (nullable = true)
 |    |    |    |    |    |    |    |-- status: string (nullable = true)
 |    |    |    |    |-- status: string (nullable = true)
 |    |    |-- metrics: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- storageGroupList: string (nullable = true)
 |    |    |-- storageid: string (nullable = true)
 |-- sublabel: string (nullable = true)
 |-- ts: string (nullable = true)

Neethu Lalitha · Accepted Answer · 2021-11-17 11:03:32Z


One option is to create a Java class (a bean) for this JSON object and pass its schema to the reader. That way you can read the input JSON and Spark won't throw an error during reading. Duplicates are allowed as long as the schema you have defined matches the input schema.

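    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    // `spark` is an existing SparkSession; YourPOJO is a bean mirroring the JSON.
    Dataset<YourPOJO> ds = spark.read()
            .schema(Encoders.bean(YourPOJO.class).schema())  // explicit schema derived from the bean
            .option("encoding", "UTF-8")
            .option("mode", "FAILFAST")                      // fail on malformed records
            .json("data.json")
            .as(Encoders.bean(YourPOJO.class));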
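As a rough sketch of what `YourPOJO` could look like for the schema in the question: the bean below declares only the top-level fields and types `storageidlist` as a plain `String` instead of a nested list. The field layout is an assumption taken from the printed schema, and the approach relies on Spark's JSON reader returning the raw JSON text when the supplied schema declares a nested value as a string, which is worth verifying on your Spark version.

```java
import java.io.Serializable;

// Hypothetical bean for the top-level JSON object from the question.
// storageidlist is deliberately a String so its nested content is not
// mapped to columns (assumption: Spark keeps the raw JSON text there).
public class YourPOJO implements Serializable {
    private String errorcode;
    private String errormessage;
    private String ip;
    private String label;
    private String status;
    private String storageidlist; // raw JSON, to be parsed manually later
    private String sublabel;
    private String ts;

    // Bean encoders need a public no-argument constructor.
    public YourPOJO() {}

    public String getErrorcode() { return errorcode; }
    public void setErrorcode(String errorcode) { this.errorcode = errorcode; }

    public String getErrormessage() { return errormessage; }
    public void setErrormessage(String errormessage) { this.errormessage = errormessage; }

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }

    public String getLabel() { return label; }
    public void setLabel(String label) { this.label = label; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }

    public String getStorageidlist() { return storageidlist; }
    public void setStorageidlist(String storageidlist) { this.storageidlist = storageidlist; }

    public String getSublabel() { return sublabel; }
    public void setSublabel(String sublabel) { this.sublabel = sublabel; }

    public String getTs() { return ts; }
    public void setTs(String ts) { this.ts = ts; }
}
```

With a class shaped like this, `Encoders.bean(YourPOJO.class).schema()` should yield a flat schema in which `storageidlist` is a string column, so the duplicated names underneath it never become columns and the string can be parsed manually afterwards.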

