4

I'm struggling to flatten a JSON schema that has structs inside structs. The problem is that the inner structs names are dynamic so I can't access them easily using "." notion

The schema is similar to :

    root
 |-- A: string (nullable = true)
 |-- Plugins: struct (nullable = true)
 |    |-- RfS: struct (nullable = true)
 |        |-- A
 |        |-- B
 |    |-- RtW: struct (nullable = true)
 |        |-- A
 |        |-- B

so As and Bs are fixed, but every JSON file has different number structs with different names (RfS,RtW) .. could be 2 .. could be 5 .. with dyanmic names that I don't know.

How can I flatten this structure easily in a dynamic way ?

2
  • could u provide required output in printSchema() or tabular form? one way to go about this could be to first identify ur structs and fields df.schema and then use struct.* to dynamically select all.. Commented Jun 15, 2020 at 0:54
  • @Morano88, can you please let us know, how you created such dynamic column with sample data? Commented Jun 15, 2020 at 3:24

1 Answer 1

2

The next solution is using a single select and chain function for flattening the final columns:

from pyspark.sql.functions import col
from itertools import chain

jsonData = """{
  "A" : "some A",
  "Plugins": {
    "RfS": {
      "A" : "RfSA",
      "B" : "RfSB"
    },
    "RtW" : {
      "A" : "RtWA",
      "B" : "RtWA"
    }
  }
}"""

df = spark.read.json(sc.parallelize([jsonData]))

no_plug_cols = ["A"] # cols not in Plugins i.e A
plug_df = df.select("A", "Plugins.*")

# plug_df.printSchema()
# root
#  |-- A: string (nullable = true)
#  |-- RfS: struct (nullable = true)
#  |    |-- A: string (nullable = true)
#  |    |-- B: string (nullable = true)
#  |-- RtW: struct (nullable = true)
#  |    |-- A: string (nullable = true)
#  |    |-- B: string (nullable = true)

# note that we use sets i.e set(plug_df.columns) - set(no_plug_cols) to retrieve cols in Plugins only
icols = [(col(f"{c}.A").alias(f"{c}.A"), col(f"{c}.B").alias(f"{c}.B")) 
         for c in (set(plug_df.columns) - set(no_plug_cols))]

# we use chain to flatten icols which is a list of tuples
plug_df.select(no_plug_cols + list(chain(*icols))).show()

# +------+-----+-----+-----+-----+
# |     A|RfS.A|RfS.B|RtW.A|RtW.B|
# +------+-----+-----+-----+-----+
# |some A| RfSA| RfSB| RtWA| RtWA|
# +------+-----+-----+-----+-----+
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.