
I have a DataFrame that contains a plain column and a column holding a JSON string:

val df = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp": 1475600496}"""),
  (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp": 1475600498}""")
).toDF("id", "json")

I want to save this as JSON, with the JSON column written as a raw nested object rather than an escaped string. When I call df.write.json("path"), the json column is saved as a string:

{"id":0,"json":"{​\"device_id\": 0, \"device_type\": \"sensor-ipad\", \"ip\": \"68.161.225.1\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 25, \"signal\": 23, \"battery_level\": 8, \"c02_level\": 917, \"timestamp\" :1475600496 }​"}

And what I need is:

{"id": 0,"json": {"device_id": 0,"device_type": "sensor-ipad","ip": "68.161.225.1","cca3": "USA","cn": "United States","temp": 25,"signal": 23,"battery_level": 8,"c02_level": 917,"timestamp": 1475600496}}

How can I achieve this? Please note that the JSON structure can differ from row to row; some rows may contain additional fields.

1 Answer


You can use the from_json function to parse the JSON string column into a struct column:

// infer the schema from the JSON data itself
// (you can also define your own schema instead)
import org.apache.spark.sql.functions._
val json_schema = spark.read.json(df.select("json").as[String]).schema

// replace the string column with a parsed struct column
val resultDf = df.withColumn("json", from_json($"json", json_schema))

Output:

{"id":0,"json":{"battery_level":8,"c02_level":917,"cca3":"USA","cn":"United States","device_id":0,"device_type":"sensor-ipad","ip":"68.161.225.1","signal":23,"temp":25,"timestamp":1475600496}}
{"id":1,"json":{"battery_level":6,"c02_level":1413,"cca3":"NOR","cn":"Norway","device_id":1,"device_type":"sensor-igauge","ip":"213.161.254.1","signal":18,"temp":30,"timestamp":1475600498}}

2 Comments

This is not really what I'm looking for. The first issue is that it will not work if the JSON structure is different in each row, i.e. if row 2 has additional fields. Also, it flattens the structure, which I don't want. Finally, the solution doesn't seem to work in its current form; I'm getting a DataFrame with a _corrupt_record column.
This should work even if the JSON structure differs between rows, and you can write it out without flattening. What do you mean by "doesn't work"? Can you add more detail? I used the DataFrame you provided and it works fine.
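
Regarding rows with differing structures: spark.read.json infers a schema that is the union of the fields seen across all rows, so from_json fills missing fields with null instead of failing. A minimal sketch, assuming a hypothetical extra owner field on one row only:

import org.apache.spark.sql.functions._

val mixedDf = Seq(
  (0, """{"device_id": 0, "temp": 25}"""),
  (1, """{"device_id": 1, "temp": 30, "owner": "acme"}""")
).toDF("id", "json")

// the inferred schema contains device_id, owner and temp;
// after parsing, row 0 has owner = null
val mixedSchema = spark.read.json(mixedDf.select("json").as[String]).schema
val parsedMixed = mixedDf.withColumn("json", from_json($"json", mixedSchema))
parsedMixed.write.json("/tmp/devices_mixed")  // hypothetical output path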
