I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field on these JSON objects is a JSON-escaped String. Example
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}
As I create a data frame by reading json file, it is creating data frame as below
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]
As we can see "escapedJsonPayload" is String and I need it to be Struct.
Note: I got similar question in StackOverflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?) but it is giving me "[_corrupt_record: string]"
I have tried below steps
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") (Work file)val escapedJsons: RDD[String] = sc.parallelize(Seq("""df"""))val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) (This results in [_corrupt_record: string])
Any help would be appreciated