
I have a DataFrame that contains a plain column and a column holding a JSON string:

val df = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp": 1475600496}"""),
  (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp": 1475600498}""")
).toDF("id", "json")

I want to save this as JSON, with the JSON column written as a raw nested object rather than an escaped string. When I call df.write.json("path"), the json column is saved as a string:

{"id":0,"json":"{​\"device_id\": 0, \"device_type\": \"sensor-ipad\", \"ip\": \"68.161.225.1\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 25, \"signal\": 23, \"battery_level\": 8, \"c02_level\": 917, \"timestamp\" :1475600496 }​"}

And what I need is:

{"id": 0,"json": {"device_id": 0,"device_type": "sensor-ipad","ip": "68.161.225.1","cca3": "USA","cn": "United States","temp": 25,"signal": 23,"battery_level": 8,"c02_level": 917,"timestamp": 1475600496}}

How can I achieve this? Please note that the JSON structure can differ from row to row; some rows may contain additional fields.

1 Answer


You can use the from_json function to parse the JSON string column into a struct column:

// infer the schema from the JSON data itself
// (you can also define your own schema instead)
import org.apache.spark.sql.functions._
val json_schema = spark.read.json(df.select("json").as[String]).schema

// replace the string column with a parsed struct column
val resultDf = df.withColumn("json", from_json($"json", json_schema))

Output:

{"id":0,"json":{"battery_level":8,"c02_level":917,"cca3":"USA","cn":"United States","device_id":0,"device_type":"sensor-ipad","ip":"68.161.225.1","signal":23,"temp":25,"timestamp":1475600496}}
{"id":1,"json":{"battery_level":6,"c02_level":1413,"cca3":"NOR","cn":"Norway","device_id":1,"device_type":"sensor-igauge","ip":"213.161.254.1","signal":18,"temp":30,"timestamp":1475600498}}

2 Comments

This is not really what I'm looking for. The first issue is that it will not work if the JSON structure is different in each row, i.e. if row 2 has additional fields. Also, it flattens the structure, which I don't want. Finally, the solution doesn't seem to work in its current form; I'm getting a DataFrame with a _corrupt_record column.
This should work even if the JSON structure differs between rows, and you can write it out without flattening. What do you mean by "doesn't work"? Can you add more detail? I used the DataFrame you provided and it works fine.
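
Regarding rows with differing structures: spark.read.json infers a schema that is the union of the fields seen across all rows, so from_json fills missing fields with null instead of failing. A minimal sketch, assuming a hypothetical extra owner field on one row only:

import org.apache.spark.sql.functions._

val mixedDf = Seq(
  (0, """{"device_id": 0, "temp": 25}"""),
  (1, """{"device_id": 1, "temp": 30, "owner": "acme"}""")
).toDF("id", "json")

// the inferred schema contains device_id, owner and temp;
// after parsing, row 0 has owner = null
val mixedSchema = spark.read.json(mixedDf.select("json").as[String]).schema
val parsedMixed = mixedDf.withColumn("json", from_json($"json", mixedSchema))
parsedMixed.write.json("/tmp/devices_mixed")  // hypothetical output path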
