I am trying to convert a nested JSON to a flattened DataFrame.

I have read in the JSON as follows:

df = spark.read.json("/mnt/ins/duedil/combined.json")

The resulting dataframe looks like the following:

[screenshot of the resulting DataFrame]
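
For reference, the nested layout can also be inspected with df.printSchema(); based on the field path mentioned below, the relevant part looks roughly like this (the exact field types are an assumption):

df.printSchema()
# root
#  |-- companyId: string (nullable = true)
#  |-- countryCode: string (nullable = true)
#  |-- financials: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- amortisationOfIntangibles: struct (nullable = true)
#  |    |    |    |-- fiveYearCAGR: double (nullable = true)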

I have made a start on flattening the DataFrame as follows:

display(df.select("companyId", "countryCode"))

The above will display the following:

[screenshot of the companyId and countryCode columns]

I would like to select "fiveYearCAGR", which is nested under the following path: "financials:element:amortisationOfIntangibles:fiveYearCAGR"

Can someone let me know what to add to the select statement to retrieve fiveYearCAGR?

1 Answer

Your financials column is an array, so if you want to extract something from within it, you need an array transformation.

One example is to use transform.

from pyspark.sql import functions as F

# transform applies the lambda to each element of the financials array,
# extracting amortisationOfIntangibles.fiveYearCAGR from every struct
df.select(
    "companyId",
    "countryCode",
    F.transform("financials", lambda x: x["amortisationOfIntangibles"]["fiveYearCAGR"]).alias("fiveYearCAGR")
)

This will return the fiveYearCAGR values as an array. If you need to flatten it further, you can use explode/explode_outer.
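
A minimal sketch of that flattening step, assuming the same df as above (the flattened name is just illustrative):

from pyspark.sql import functions as F

# explode_outer turns each element of the array into its own row,
# and keeps a null row when the array is null or empty
flattened = df.select(
    "companyId",
    "countryCode",
    F.explode_outer(
        F.transform("financials", lambda x: x["amortisationOfIntangibles"]["fiveYearCAGR"])
    ).alias("fiveYearCAGR"),
)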


3 Comments

Hi Emma, thanks for reaching out. I'm getting an error that F is not defined. Should that be a function?
Yes, it is PySpark's functions module. I have added the import line to the answer.
Thank you, Emma. You're a star.
