
I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field in these JSON objects is a JSON-escaped string. Example:

{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}

When I create a DataFrame by reading the JSON file, it comes out like this:

val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]

As we can see, "escapedJsonPayload" is a string, but I need it to be a struct.

Note: I found a similar question on Stack Overflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?), but it gives me "[_corrupt_record: string]".

I have tried the steps below:

  1. val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") (works fine)

  2. val escapedJsons: RDD[String] = sc.parallelize(Seq("""df""")) (see the note after this list)

  3. val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))

  4. val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) (This results in [_corrupt_record: string])
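
(A note on step 2: as written, sc.parallelize(Seq("""df""")) parallelizes the literal string "df", not the DataFrame's contents. What I actually want to feed in is the payload column as plain strings; a minimal sketch of that extraction, assuming the DataFrame from step 1:)

import org.apache.spark.rdd.RDD

// Pull the payload column out of the step-1 DataFrame as raw strings;
// Spark has already unescaped the \" sequences while reading the outer JSON
val escapedJsons: RDD[String] = df.select("escapedJsonPayload").rdd.map(_.getString(0))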

Any help would be appreciated.

1 Answer


First of all, the JSON you have provided is syntactically invalid: the escaped payload never closes its items array or its outer object. The corrected JSON is as follows:

{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}

Next, to parse the JSON correctly from the above file, you have to use the following code:

val rdd = spark.read.textFile("file:///home/akaspate/sample.json")
  .toJSON                         // wraps each line as {"value":"<line, with quotes escaped>"}
  .map(value => value
    .replace("\\", "")            // strip every backslash escape
    .replace("{\"value\":\"", "") // strip the {"value":" prefix added by .toJSON
    .replace("}\"}", "}"))        // collapse the wrapper's trailing }"} back into }
  .rdd

val df = spark.read.json(rdd)

The above code will give you the following output:

df.show(false)

+----------------+-------------------------------------+
|clientAttributes|escapedJsonPayload                   |
+----------------+-------------------------------------+
|[null,abc]      |[WrappedArray([abc,xyz]),Akash,Patel]|
+----------------+-------------------------------------+

With the following schema:

df.printSchema

root
 |-- clientAttributes: struct (nullable = true)
 |    |-- backfillId: string (nullable = true)
 |    |-- clientPrimaryKey: string (nullable = true)
 |-- escapedJsonPayload: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- itemId: string (nullable = true)
 |    |    |    |-- itemName: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- surname: string (nullable = true)
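
If the upstream file really keeps the payload as a JSON-escaped string (the case raised in the comments below, e.g. "escapedJsonPayload":"{\"name\":\"Akash\"}"), note that Spark already unescapes it while reading the outer record, so you can parse the column directly with from_json instead of rewriting the raw text. A minimal sketch, assuming the payload schema from the sample above:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Explicit schema for the embedded payload (taken from the sample record)
val payloadSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("surname", StringType),
  StructField("items", ArrayType(StructType(Seq(
    StructField("itemId", StringType),
    StructField("itemName", StringType)))))))

val raw = spark.read.json("file:///home/akaspate/sample.json")
// Replace the string column with a parsed struct column
val parsed = raw.withColumn("escapedJsonPayload", from_json(raw("escapedJsonPayload"), payloadSchema))

from_json (available since Spark 2.1) returns null for rows whose payload is malformed, such as the truncated sample at the top of the question, instead of failing the whole read.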

I hope this helps!


4 Comments

Thanks @himanshuIIITian for the detailed answer. Regarding "the JSON you have provided is syntactically invalid": we do not have control over this input file; we are getting this format from an upstream service, so the field arrives like this: "escapedJsonPayload":"{\"name\":\"Akash\"}". Can you please let us know how to handle this in Spark?
Simple... just apply replace("\\", "") on the JSON value, as I mentioned in my answer.
@AkashPatel Please accept the answer or provide feedback.
Accepted the answer. Thanks for your help.
