I have many JSON documents with a structure like this:

{
    "parent_id": "parent_id1",
    "devices" : "HERE_IS_STRUCT_SERIALIZED_AS_STRING_SEE BELOW"
}

{
    "0x0034" : { "id": "0x0034", "p1": "p1v1", "p2": "p2v1" },
    "0xAB34" : { "id": "0xAB34", "p1": "p1v2", "p2": "p2v2" },
    "0xCC34" : { "id": "0xCC34", "p1": "p1v3", "p2": "p2v3" },
    "0xFFFF" : { "id": "0xFFFF", "p1": "p1v4", "p2": "p2v4" },
    ....
    "0x0023" : { "id": "0x0023", "p1": "p1vN", "p2": "p2vN" },
}

As you can see, instead of building an array of objects, the telemetry developers serialized every element as a property of an object, and the property names vary with the id.

Using the Spark DataFrame or RDD API, I want to transform it into a table like this:

parent_id1, 0x0034, p1v1, p2v1
parent_id1, 0xAB34, p1v2, p2v2
parent_id1, 0xCC34, p1v3, p2v3
parent_id1, 0xFFFF, p1v4, p2v4
parent_id1, 0x0023, p1vN, p2vN

Here is sample data:

{
    "parent_1": "parent_v1",
    "devices" : "{ \"0x0034\" : { \"id\": \"0x0034\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xAB34\" : { \"id\": \"0xAB34\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }, \"0xCC34\" : { \"id\": \"0xCC34\", \"p1\": \"p1v3\", \"p2\": \"p2v3\" }, \"0xFFFF\" : { \"id\": \"0xFFFF\", \"p1\": \"p1v4\", \"p2\": \"p2v4\" }, \"0x0023\" : { \"id\": \"0x0023\", \"p1\": \"p1vN\", \"p2\": \"p2vN\" }}"
}

{
    "parent_2": "parent_v1",
    "devices" : "{ \"0x0045\" : { \"id\": \"0x0045\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xC5C1\" : { \"id\": \"0xC5C1\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }}"
}

Desired output

parent_id1, 0x0034, p1v1, p2v1
parent_id1, 0xAB34, p1v2, p2v2
parent_id1, 0xCC34, p1v3, p2v3
parent_id1, 0xFFFF, p1v4, p2v4
parent_id1, 0x0023, p1vN, p2vN

parent_id2, 0x0045, p1v1, p2v1
parent_id2, 0xC5C1, p1v2, p2v2

I thought about passing devices to the from_json function and then somehow transforming the returned object into a JSON array that I could explode... But from_json wants a schema as input, and the schema varies from record to record.
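
A map schema can actually absorb those varying property names, so from_json remains usable. A minimal sketch of that route (assuming Spark 2.2+; telemetry.json is a placeholder path, and the p1/p2 column names are taken from the sample):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder path; add multiLine=True if records span multiple lines.
raw = spark.read.json("telemetry.json")

# The map keys (device ids) need not be known ahead of time;
# only the value type (a string-to-string map) is fixed.
devices_schema = MapType(StringType(), MapType(StringType(), StringType()))

result = (
    raw.withColumn("devices", from_json(col("devices"), devices_schema))
       .select("parent_id", explode("devices").alias("device", "props"))
       .select(
           "parent_id",
           "device",
           col("props")["p1"].alias("p1"),
           col("props")["p2"].alias("p2"),
       )
)

explode on a map column yields one row per key/value pair, which is exactly the flattening that the varying property names were blocking.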

1 Answer

There is probably a more Pythonic or Spark-idiomatic way to do this, but this worked for me:

Input Data

data = {
    "parent_id": "parent_v1",
    "devices" : "{ \"0x0034\" : { \"id\": \"0x0034\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xAB34\" : { \"id\": \"0xAB34\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }, \"0xCC34\" : { \"id\": \"0xCC34\", \"p1\": \"p1v3\", \"p2\": \"p2v3\" }, \"0xFFFF\" : { \"id\": \"0xFFFF\", \"p1\": \"p1v4\", \"p2\": \"p2v4\" }, \"0x0023\" : { \"id\": \"0x0023\", \"p1\": \"p1vN\", \"p2\": \"p2vN\" }}"
}

Get DataFrame

import json

def get_df_from_json(json_data):
    # Deserialize the nested "devices" string into a dict.
    devices = json.loads(json_data['devices'])
    list_of_dicts = []
    for device_name, device_details in devices.items():
        # One flat row per device, carrying the parent_id along.
        row = {
            "parent_id": json_data['parent_id'],
            "device": device_name
        }
        row.update(device_details)
        list_of_dicts.append(row)
    # Re-serialize each row so spark.read.json receives JSON strings
    # and infers the schema (multiLine only applies to file input).
    return spark.read.json(sc.parallelize([json.dumps(r) for r in list_of_dicts]))
display(get_df_from_json(data))

Output

+--------+--------+------+------+-----------+
| device |   id   |  p1  |  p2  | parent_id |
+--------+--------+------+------+-----------+
| 0x0034 | 0x0034 | p1v1 | p2v1 | parent_v1 |
| 0x0023 | 0x0023 | p1vN | p2vN | parent_v1 |
| 0xFFFF | 0xFFFF | p1v4 | p2v4 | parent_v1 |
| 0xCC34 | 0xCC34 | p1v3 | p2v3 | parent_v1 |
| 0xAB34 | 0xAB34 | p1v2 | p2v2 | parent_v1 |
+--------+--------+------+------+-----------+

1 Comment

Thank you. I actually started using the RDD API to read the source data in parallel and parse it with the json library... (I have a few tens of TBs of it)
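
A sketch of that RDD route, reusing the same flattening idea (assuming JSON Lines input, one record per line; telemetry.jsonl is a placeholder path):

import json

def to_rows(line):
    # Parse one raw record, then flatten its serialized devices
    # map into one dict per device.
    record = json.loads(line)
    for device_name, details in json.loads(record["devices"]).items():
        row = {"parent_id": record["parent_id"], "device": device_name}
        row.update(details)
        yield row

rows = sc.textFile("telemetry.jsonl").flatMap(to_rows)
df = spark.read.json(rows.map(json.dumps))

The parsing then happens per line across the cluster, which scales to the tens-of-TB range mentioned above.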
