
I am trying to access the inner attributes of the following JSON using PySpark:

[
 {
    "432": [
        {
            "atttr1": null,
            "atttr2": "7DG6",
            "id":432,
            "score": 100
        }
    ]
},
 {
    "238": [
        {
            "atttr1": null,
            "atttr2": "7SS8",
            "id":432,
            "score": 100
        }
    ]
}
]

In the output, I am looking for something like the below, in the form of a CSV:

atttr1,atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100

I understand I can get these details like below, but I don't want to pass 432 or 238 in the lambda expression, as in a bigger JSON these keys will vary. I want to iterate over all available values.

print(inputDF.rdd.map(lambda x: (x['432'])).first())
print(inputDF.rdd.map(lambda x: (x['238'])).first())

I also tried registering a temp table with the name "test", but it gave an error with the message element._id doesn't exist.

inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")

Any help will be highly appreciated. I am using Spark 2.4.

  • What is peopleDF? Could you show the output of peopleDF.show()? Commented Apr 8, 2021 at 12:40
  • That's the input df; I renamed it. Also, the output of .show() is:

    +--------------------+--------------------+
    |                 238|                 432|
    +--------------------+--------------------+
    |                null|[[, 7DG6, 432, 100]]|
    |[[, 7SS8, 432, 100]]|                null|
    +--------------------+--------------------+

    Commented Apr 9, 2021 at 5:49

1 Answer

Without using pyspark features, you can do it like this:

import json

data = json.loads(json_str)  # or whatever way you're getting the data

columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns))  # headers

for item in data:
    for obj in list(item.values())[0]:  # since each list has only one object
        print(','.join(str(obj[col]) for col in columns))

Output:

atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100

Or

for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in list
    print(','.join(str(obj[col]) for col in columns))

FYI, you can store those rows in a variable or write them out to a CSV file instead of (or in addition to) printing them.
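
For example, here is a minimal sketch of writing those same rows to a file with the standard csv module; it reuses data and columns from the snippet above, and the filename output.csv is just a placeholder:

import csv

# Write the header plus one row per top-level key to a CSV file.
# Uses `data` and `columns` from the snippet above; "output.csv" is a placeholder name.
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    for item in data:
        obj = list(item.values())[0][0]  # the single object inside each key's list
        writer.writerow([obj[col] for col in columns])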

And if you're just looking to dump that to csv, see this answer.
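
If you do want to stay in the DataFrame API, here is a rough sketch (assuming inputDF is the DataFrame shown in the comments, with one array-of-structs column per key); it explodes every column and unions the results, so the keys never need to be hardcoded:

from functools import reduce
from pyspark.sql import functions as F

# Explode each top-level key column ("238", "432", ...) into rows of structs;
# exploding a null array simply produces no rows for that column.
per_key = [
    inputDF.select(F.explode(inputDF[c]).alias("element"))
    for c in inputDF.columns
]

# Union the per-key DataFrames and flatten the struct into ordinary columns.
flat = reduce(lambda a, b: a.unionByName(b), per_key)
flat.select("element.*").show()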
