I am trying to convert a nested JSON to a flattened DataFrame.

I have read in the JSON as follows:

df = spark.read.json("/mnt/ins/duedil/combined.json")

The resulting dataframe looks like the following:

[screenshot of the resulting DataFrame]
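
For reference, the nested layout can also be inspected with df.printSchema(); based on the field path mentioned below, the relevant part looks roughly like this (the exact field types are an assumption):

df.printSchema()
# root
#  |-- companyId: string (nullable = true)
#  |-- countryCode: string (nullable = true)
#  |-- financials: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- amortisationOfIntangibles: struct (nullable = true)
#  |    |    |    |-- fiveYearCAGR: double (nullable = true)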

I have made a start on flattening the DataFrame as follows:

display(df.select("companyId", "countryCode"))

The above will display the following:

[screenshot of the companyId and countryCode columns]

I would like to select "fiveYearCAGR", which is nested under the following path: "financials:element:amortisationOfIntangibles:fiveYearCAGR"

Can someone let me know what to add to the select statement to retrieve fiveYearCAGR?

1 Answer

Your financials column is an array, so if you want to extract something from within it, you need an array transformation.

One example is to use transform.

from pyspark.sql import functions as F

# transform applies the lambda to each element of the financials array,
# extracting amortisationOfIntangibles.fiveYearCAGR from every struct
df.select(
    "companyId",
    "countryCode",
    F.transform("financials", lambda x: x["amortisationOfIntangibles"]["fiveYearCAGR"]).alias("fiveYearCAGR")
)

This will return the fiveYearCAGR values as an array. If you need to flatten it further, you can use explode/explode_outer.
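
A minimal sketch of that flattening step, assuming the same df as above (the flattened name is just illustrative):

from pyspark.sql import functions as F

# explode_outer turns each element of the array into its own row,
# and keeps a null row when the array is null or empty
flattened = df.select(
    "companyId",
    "countryCode",
    F.explode_outer(
        F.transform("financials", lambda x: x["amortisationOfIntangibles"]["fiveYearCAGR"])
    ).alias("fiveYearCAGR"),
)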


3 Comments

Hi Emma, thanks for reaching out. I'm getting an error that F is not defined. Should that be a function?
Yes, it is PySpark's functions module. I have added the import line to the answer.
Thank you, Emma. You're a star.
